Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi

Authors

  • Kale SD, Computer Engineering Department, Smt. Kashibai Navale College of Engineering, Pune, Maharashtra, India
  • Prasad RS, Computer Engineering Department, Smt. Kashibai Navale College of Engineering, Pune, Maharashtra, India

DOI:

https://doi.org/10.26438/ijcse/v6i11.542547

Keywords:

Author Identification, Text Mining, Machine Learning, Marathi Language, Stylometry

Abstract

Author identification is an application of text mining: the task of determining the author of an anonymous text document. Its applications include digital forensics, plagiarism detection, and copyright issues. A large body of work already exists for English, whereas work on author identification for Indian regional languages is limited. This paper presents author identification for Marathi, an Indian regional language. It proposes a technique for identifying probable authors via linguistic stylometry, i.e., the statistical analysis of variations in literary style between one author or genre and another. In total, 11 features are extracted: 8 lexical and syntactic features and 3 word n-gram features. Experiments are performed with the 8 lexical and syntactic features using three machine learning algorithms: k-nearest neighbor, Naïve Bayes, and Support Vector Machine. Results based on word n-grams (unigram, bigram, and trigram) are also presented. The experiments show better results with the word n-gram method.
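
As an illustration of the word n-gram features mentioned in the abstract, the following is a minimal sketch of counting unigrams, bigrams, and trigrams in a text. It is not the authors' implementation: the whitespace tokenization, the function names `word_ngrams` and `ngram_profile`, and the sample sentence are all assumptions made for this example.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams in `text`.

    Assumption: tokens are obtained by simple whitespace splitting;
    the paper does not state its tokenization scheme.
    """
    tokens = text.split()
    # Slide a window of n tokens over the text; each window is one n-gram.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_profile(text, max_n=3):
    """Return a dict mapping n (1..max_n) to the n-gram counts for `text`."""
    return {n: word_ngrams(text, n) for n in range(1, max_n + 1)}


# Hypothetical sample sentence, made up for illustration only.
sample = "मी शाळेत जातो आणि मी घरी येतो"
profile = ngram_profile(sample)

for n, counts in profile.items():
    print(n, counts.most_common(2))
```

Counts of this kind, computed per document, could then serve as the input vectors for a classifier such as k-nearest neighbor, Naïve Bayes, or a Support Vector Machine.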

Published

2025-11-18

How to Cite

[1]
S. D. Kale and R. S. Prasad, “Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi”, Int. J. Comp. Sci. Eng., vol. 6, no. 11, pp. 542–547, Nov. 2025.

Section

Research Article