A Survey on Text Pre-processing Techniques and Tools

Authors

Lourdusamy R Dept. of Computer Science, Sacred Heart College, Thiruvalluvar University, Tirupattur, 635 601, Tamil Nadu, INDIA
Abraham S Dept. of Computer Science, Sacred Heart College, Thiruvalluvar University, Tirupattur, 635 601, Tamil Nadu, INDIA

DOI:

https://doi.org/10.26438/ijcse/v6si3.148157

Keywords:

Text Mining, Pre-processing techniques, Pre-processing Tools, Natural Language Processing

Abstract

We live in an era of explosive growth of digital data on the Internet. Data warehouses deal more with numerical databases than with textual sources, although nearly eighty percent of digital data is in semi-structured or unstructured textual form. Several knowledge mining techniques, both those developed over the past decade and those being developed now, aim to transform such textual data into useful information and knowledge. This knowledge and information benefits many application fields, such as social networks, business management, customer care management systems, market analysis, search engines, and fraud detection, to name a few. Text Mining (TM) is needed to obtain the desired information from such voluminous data. TM is multi-disciplinary in nature, and several TM techniques are deployed in the process of extracting knowledge from textual sources. Input text for such techniques needs to be pre-processed and cleaned. This survey briefly presents pre-processing tools for TM in general and Natural Language Processing (NLP) in particular. It also presents the broad categories of TM techniques used. The focus of this paper is to explore and analyze several features of text pre-processing techniques and tools that would interest researchers in the area of TM.
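As a rough illustration of what "pre-processed and cleaned" can mean in practice, the following minimal sketch chains three commonly used steps: tokenization, stop-word removal, and stemming. It is not taken from the paper. The stop-word list, the suffix list, and all function names are hypothetical, and the suffix stripping is a crude stand-in for a real stemmer such as Porter's algorithm.

```python
import re

# Hypothetical, deliberately tiny stop-word list; real tools ship much larger ones.
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "the", "to", "we"}

# Crude suffix stripping used as a stand-in for a real stemmer.
# This is NOT Porter's algorithm; it only illustrates the idea.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def naive_stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run tokenization, stop-word removal, and naive stemming in order."""
    return [naive_stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("The tools cleaned the texts"))
```

Each step is kept as a separate function so that any one of them can be swapped for a dedicated tool, for example a library tokenizer or a full stemmer, without changing the rest of the pipeline.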

Published

2025-11-13

How to Cite

[1]

R. Lourdusamy and S. Abraham, “A Survey on Text Pre-processing Techniques and Tools”, Int. J. Comp. Sci. Eng., vol. 6, no. 3, pp. 148–157, Nov. 2025.

Issue

Vol. 6 No. 3 (2018): IJCSE Special Issue April Edition

Section

Survey Article

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
