Author Identification on Imbalanced Class Dataset of Indian Literature in Marathi
DOI:
https://doi.org/10.26438/ijcse/v6i11.542547
Keywords:
Author Identification, Text Mining, Machine Learning, Marathi Language, Stylometry
Abstract
Author identification is an application of text mining: the task of determining the author of an anonymous text document. Its applications include digital forensics, plagiarism detection, and copyright disputes. A large amount of work has already been done on the English language, whereas work on author identification for Indian regional languages is limited. This research paper presents author identification for Marathi, an Indian regional language. The paper proposes a technique for identifying the probable author via linguistic stylometry, i.e. the statistical analysis of variations in literary style between one author or genre and another. In total, 11 features are extracted: 8 lexical and syntactic features and 3 word n-gram features. Experiments are performed with the 8 lexical and syntactic features and three machine learning algorithms: k-nearest neighbor, Naïve Bayes, and Support Vector Machine. Results based on word n-grams, i.e. unigrams, bigrams, and trigrams, are also presented. The experimental results show that the word n-gram method performs better.
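To make the two feature families named in the abstract concrete, the following is a minimal Python sketch of word n-gram counting and a few simple lexical measurements. It is not the authors' implementation: the abstract does not list the 8 lexical and syntactic features, so the four computed here (word count, average word length, average sentence length, type-token ratio) are assumptions chosen for illustration, and the function names `word_ngrams` and `lexical_features` are hypothetical. Tokenization is plain whitespace splitting, which leaves punctuation attached to words.

```python
from collections import Counter


def word_ngrams(text, n):
    """Count word n-grams (as tuples) in whitespace-tokenized text."""
    tokens = text.split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def lexical_features(text):
    """Compute a few illustrative lexical features (assumed, not from the paper)."""
    tokens = text.split()
    # Treat '.', '?', '!' and the danda as sentence terminators (a simplification).
    normalized = text
    for mark in ("?", "!", "\u0964"):
        normalized = normalized.replace(mark, ".")
    sentences = [s for s in normalized.split(".") if s.strip()]
    return {
        "word_count": len(tokens),
        # len() counts Unicode code points, not visible Devanagari characters.
        "avg_word_length": sum(len(t) for t in tokens) / len(tokens) if tokens else 0.0,
        "avg_sentence_length": len(tokens) / len(sentences) if sentences else 0.0,
        "type_token_ratio": len(set(tokens)) / len(tokens) if tokens else 0.0,
    }


if __name__ == "__main__":
    sample = "मी शाळेत जातो. मी घरी जातो."  # illustrative sample text only
    for n in (1, 2, 3):  # unigram, bigram, trigram
        print(n, word_ngrams(sample, n).most_common(3))
    print(lexical_features(sample))
```

Counts such as these would typically be turned into fixed-length vectors, one per document, before being passed to a classifier such as k-nearest neighbor, Naïve Bayes, or a Support Vector Machine.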
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
