Evaluating Techniques for Pre-Processing of Unstructured Text For Text Classification
DOI: https://doi.org/10.26438/ijcse/v6i8.151160

Keywords: Tokenisation, stemming, parts of speech tagging, document representation, vector space model

Abstract
The vast amount of digital information available over the internet can be analyzed for knowledge discovery and intelligent decision making. Text categorization is an important and extensively studied problem in machine learning. Text classification, the grouping of text into appropriate categories, requires pre-processing techniques followed by machine learning algorithms. Pre-processing, or data cleaning, involves removal of HTML characters, tokenization, stop-word removal, stemming, lemmatization, and advanced steps such as parts-of-speech tagging, followed by representation of the text in a form suitable for machine learning. This paper experimentally evaluates the impact of stemming and tokenization techniques on text classification across five text datasets.
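As an illustration only (not the pipeline evaluated in the paper), the following Python sketch chains the pre-processing steps named in the abstract: tokenisation, stop-word removal, Porter stemming, and a TF-IDF vector space representation feeding a simple classifier. It assumes NLTK and scikit-learn are available; the toy documents and labels are hypothetical.

```python
# Minimal pre-processing sketch using NLTK and scikit-learn.
# Illustrative only; not the exact configuration evaluated in the paper.
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# One-time resource downloads for the tokeniser and stop-word list.
nltk.download('punkt', quiet=True)
nltk.download('stopwords', quiet=True)

stemmer = PorterStemmer()
stop_words = set(stopwords.words('english'))

def preprocess(text):
    """Tokenise, drop punctuation and stop words, then apply Porter stemming."""
    tokens = word_tokenize(text.lower())                 # tokenisation
    tokens = [t for t in tokens if t.isalpha()]          # remove punctuation/numbers
    tokens = [t for t in tokens if t not in stop_words]  # stop-word removal
    return ' '.join(stemmer.stem(t) for t in tokens)     # stemming

# Hypothetical toy corpus; the paper evaluates five real datasets instead.
docs = [
    "Great hotel, friendly staff and clean rooms!",
    "Win a free prize now, click this link",
    "The room was dirty and the service was slow",
    "Congratulations, you have been selected for a cash reward",
]
labels = ["review", "spam", "review", "spam"]

# Vector space model (TF-IDF weights) followed by a Naive Bayes classifier.
model = Pipeline([
    ("tfidf", TfidfVectorizer(preprocessor=preprocess)),
    ("clf", MultinomialNB()),
])
model.fit(docs, labels)
print(model.predict(["Free cash prize, click here"]))    # likely: ['spam']
```

Lemmatization or parts-of-speech tagging could be slotted into preprocess() in the same way; the point is only that each cleaning step is applied before the vector space representation is built.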
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, on the condition that the authors are given credit and that, in the event of reuse or distribution, the terms of this license are made clear.
