Algorithm for Removal of Semantically Insignificant Content Words

Authors

Barman A, Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
Saha D, Department of Computer Science and Engineering, Jadavpur University, Kolkata, India

Keywords:

Machine Learning (ML), Natural Language Processing (NLP), Information Retrieval (IR), Term Document Matrix, Inverse Document Frequency (IDF), Inverse Class Frequency (ICF), Stop Words, Content Words

Abstract

This paper describes how context-specific, semantically insignificant content words are extracted using the Inverse Document Frequency (IDF) and Inverse Class Frequency (ICF) measures. We were able to remove around 42% of the total corpus volume as irrelevant information, which includes textual noise, function words, and context-specific semantically insignificant content words. We ran several Machine Learning (ML) algorithms used for text classification on a corpus, before and after removal of the textual noise. We found no significant change in the accuracy of those ML algorithms before and after the removal.
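The abstract does not give the exact formulas or thresholds used, so the following is only a minimal sketch of how IDF and ICF could flag candidate words. It assumes the common textbook definitions IDF(t) = log(N / df(t)) and ICF(t) = log(C / cf(t)), where N is the number of documents, df(t) the number of documents containing t, C the number of classes, and cf(t) the number of classes in which t occurs. The function names, the toy corpus, and the cut-off values `idf_max` and `icf_max` are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def idf_icf_scores(docs, labels):
    """Return {term: (idf, icf)} for tokenized docs, one class label per doc.

    Assumed definitions (not confirmed by the paper):
      idf(t) = log(N / df(t)),  icf(t) = log(C / cf(t))
    """
    n_docs = len(docs)
    n_classes = len(set(labels))
    doc_freq = defaultdict(int)      # df(t): documents containing t
    term_classes = defaultdict(set)  # classes in which t occurs
    for tokens, label in zip(docs, labels):
        for term in set(tokens):     # count each term once per document
            doc_freq[term] += 1
            term_classes[term].add(label)
    return {
        term: (
            math.log(n_docs / doc_freq[term]),
            math.log(n_classes / len(term_classes[term])),
        )
        for term in doc_freq
    }


def insignificant_terms(scores, idf_max=0.5, icf_max=0.0):
    """Terms whose IDF and ICF are both at or below the assumed thresholds."""
    return sorted(
        term
        for term, (idf, icf) in scores.items()
        if idf <= idf_max and icf <= icf_max
    )


# Toy corpus, for illustration only.
docs = [
    ["the", "match", "was", "reported", "today"],
    ["the", "team", "reported", "a", "win"],
    ["the", "market", "was", "reported", "down"],
    ["the", "bank", "reported", "a", "loss"],
]
labels = ["sports", "sports", "finance", "finance"]

scores = idf_icf_scores(docs, labels)
candidates = insignificant_terms(scores)
print(candidates)
```

In this toy corpus, "the" (a function word) and "reported" (a content word) both occur in every document and every class, so both measures are zero and both words are flagged. Class-specific words such as "match" or "bank" keep a positive ICF and are retained.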

Published

2025-11-24

How to Cite

A. Barman and D. Saha, “Algorithm for Removal of Semantically Insignificant Content Words”, Int. J. Comp. Sci. Eng., vol. 7, no. 1, pp. 53–56, Nov. 2025.

Issue

Vol. 7 No. 1 (2019): IJCSE Special Issue Jan Edition

Section

Research Article

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
