Algorithm for Removal of Semantically Insignificant Content Words

Authors

  • Barman A Department of Computer Science and Engineering, Jadavpur University, Kolkata, India
  • Saha D Department of Computer Science and Engineering, Jadavpur University, Kolkata, India

Keywords:

Machine Learning (ML), Natural Language Processing (NLP), Information Retrieval (IR), Term Document Matrix, Inverse Document Frequency (IDF), Inverse Class Frequency (ICF), Stop Words, Content Words

Abstract

This paper describes how context-specific, semantically insignificant content words are extracted using the Inverse Document Frequency (IDF) and Inverse Class Frequency (ICF) measures. We were able to remove around 42% of the total corpus volume as irrelevant information, which includes textual noise, function words, and context-specific semantically insignificant content words. We ran several Machine Learning (ML) algorithms used for text classification on a corpus, before and after the removal of the textual noise. We found no significant change in the accuracy of those ML algorithms after the removal.
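
The abstract does not give the exact procedure, so the following is only a minimal Python sketch of how IDF and ICF could be combined to flag candidate words. It assumes the common definitions IDF(t) = log(N / df(t)) and ICF(t) = log(C / cf(t)), where N is the number of documents, C the number of classes, and df(t) and cf(t) the number of documents and classes containing term t. The function names `idf_icf` and `insignificant_terms`, the thresholds `idf_max` and `icf_max`, and the toy corpus are all hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict


def idf_icf(docs, labels):
    """Compute IDF and ICF for every term in a labelled corpus.

    docs   : list of token lists, one per document
    labels : list of class labels, one per document
    """
    n_docs = len(docs)
    n_classes = len(set(labels))
    doc_freq = defaultdict(int)      # number of documents containing the term
    term_classes = defaultdict(set)  # classes in which the term occurs

    for tokens, label in zip(docs, labels):
        for term in set(tokens):     # count each term once per document
            doc_freq[term] += 1
            term_classes[term].add(label)

    idf = {t: math.log(n_docs / df) for t, df in doc_freq.items()}
    icf = {t: math.log(n_classes / len(cs)) for t, cs in term_classes.items()}
    return idf, icf


def insignificant_terms(docs, labels, idf_max=0.5, icf_max=0.0):
    """Flag terms whose IDF and ICF are both at or below the given thresholds.

    The threshold values are illustrative assumptions, not values from the paper.
    """
    idf, icf = idf_icf(docs, labels)
    return {t for t in idf if idf[t] <= idf_max and icf[t] <= icf_max}


# Toy corpus (hypothetical): two classes, four tokenized documents.
docs = [
    ["the", "match", "was", "won"],
    ["the", "election", "was", "held"],
    ["the", "team", "won", "match"],
    ["the", "vote", "was", "counted"],
]
labels = ["sports", "politics", "sports", "politics"]

print(sorted(insignificant_terms(docs, labels)))  # prints ['the', 'was']
```

In this toy corpus, "the" and "was" occur in most documents and in both classes, so both scores are low and the words are flagged. Words such as "match" are confined to one class, so their ICF is high and they are kept.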

Published

2025-11-24

How to Cite

[1]
A. Barman and D. Saha, “Algorithm for Removal of Semantically Insignificant Content Words”, Int. J. Comp. Sci. Eng., vol. 7, no. 1, pp. 53–56, Nov. 2025.