Evaluating Techniques for Pre-Processing of Unstructured Text For Text Classification

Authors

  • S. Koshy, Bharathiar University, Coimbatore, Tamil Nadu, India
  • R. Padmajavalli, Department of Computer Applications, Bhaktavatsalam Memorial College for Women, Chennai, India

DOI:

https://doi.org/10.26438/ijcse/v6i8.151160

Keywords:

Tokenisation, stemming, part-of-speech tagging, document representation, vector space model

Abstract

The wealth of digital information available over the internet can be analyzed for knowledge discovery and intelligent decision making. Text categorization is an important and extensively studied problem in machine learning. Text classification, the grouping of text into appropriate categories, requires pre-processing techniques followed by machine learning algorithms. Pre-processing, or data cleaning, involves the removal of HTML tags, tokenization, stop-word removal, stemming, lemmatization and advanced processes such as part-of-speech tagging, followed by representation of the documents in a form suitable for machine learning. This paper experimentally evaluates the impact of stemming and tokenization techniques on text classification across five text datasets.
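
The sketch below illustrates the kind of pre-processing pipeline the abstract describes: HTML removal, tokenisation, stop-word removal, stemming or lemmatisation, optional part-of-speech tagging, and a TF-IDF vector space representation. Python with NLTK and scikit-learn, the helper function preprocess, and the two sample documents are assumptions made here for illustration only; the paper does not prescribe these tools, and its own experiments and datasets stand apart from this example.

```python
# Illustrative pre-processing pipeline (a hypothetical sketch, not the paper's code).
import re

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer

# One-time downloads of the NLTK resources used below.
nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
nltk.download("wordnet", quiet=True)
nltk.download("averaged_perceptron_tagger", quiet=True)

STOP_WORDS = set(stopwords.words("english"))
stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()


def preprocess(text, use_stemming=True):
    """Clean one document and return its processed token list."""
    text = re.sub(r"<[^>]+>", " ", text)                 # remove HTML tags
    tokens = nltk.word_tokenize(text.lower())            # tokenisation
    tokens = [t for t in tokens if t.isalpha()]          # drop punctuation/numbers
    tokens = [t for t in tokens if t not in STOP_WORDS]  # stop-word removal
    if use_stemming:
        return [stemmer.stem(t) for t in tokens]         # stemming (Porter)
    return [lemmatizer.lemmatize(t) for t in tokens]     # lemmatisation


# Two hypothetical documents standing in for a labelled corpus.
docs = [
    "<p>The hotel rooms were clean and the staff was helpful.</p>",
    "WINNER!! Claim your free prize now by texting back.",
]

# Optional advanced step: part-of-speech tagging of the raw tokens.
print(nltk.pos_tag(nltk.word_tokenize(docs[0])))

# Vector space model: a TF-IDF matrix over the pre-processed documents,
# ready to be fed to a text classifier.
vectorizer = TfidfVectorizer(analyzer=preprocess)
X = vectorizer.fit_transform(docs)
print(X.shape)  # (number of documents, vocabulary size)
```

Switching the stemmer for the lemmatiser (use_stemming=False) or changing the tokeniser alters the vocabulary of the resulting term matrix, which is the kind of effect on classification that the paper evaluates.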

Published

2018-08-31

How to Cite

[1] S. Koshy and R. Padmajavalli, “Evaluating Techniques for Pre-Processing of Unstructured Text For Text Classification”, Int. J. Comp. Sci. Eng., vol. 6, no. 8, pp. 151–160, Aug. 2018.

Issue

Vol. 6, No. 8 (August 2018)

Section

Research Article