Pre-processing Phase of Automatic Text Summarization for the Assamese Language

Authors

  • Chetia G, Centre for Computer Science and Applications, Dibrugarh University, Dibrugarh, India
  • Hazarika GC, Dept. of Mathematics, Dibrugarh University, Dibrugarh, India

DOI:

https://doi.org/10.26438/ijcse/v6i10.159163

Keywords:

Pre-processing, Summarization, Stemming, Lemmatization, n-gram

Abstract

Pre-processing is the first phase of automatic text summarization and an important one: it normalizes a text document and generates a structured representation of the text. The major pre-processing tasks are segmentation, tokenization, stop-word removal, stemming and lemmatization. In this paper, we discuss these tasks as required for automatically summarizing Assamese text documents. Both stemming and lemmatization play an important role in pre-processing a morphologically rich, highly inflected language like Assamese. We present a corpus-based approach for stemming Assamese words using an n-gram similarity matching technique. We also propose a hybrid method for lemmatizing Assamese verbs to obtain the grammatically correct root of a verb. Assamese verbs are more heavily inflected than the other word categories, and stemming alone is not sufficient to find their original roots. So, after segmentation, tokenization and stop-word removal, we first apply stemming to all the words in the text document irrespective of their grammatical categories and then apply lemmatization to the Assamese verbs only. To identify the Assamese verbs, we use a look-up dictionary that contains a list of possible stems along with the corresponding lemma of each verb.
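
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not state which n-gram measure, threshold or stem-selection rule the paper uses, so the sketch assumes character bigrams compared with the Dice coefficient, a hypothetical threshold of 0.6, and a hypothetical rule that picks the shortest sufficiently similar corpus word that is a prefix of the input. The romanized word forms, the stop-word list and the look-up dictionary are toy data chosen only for illustration.

```python
def char_ngrams(word, n=2):
    """Return the set of unique character n-grams of a word."""
    return {word[i:i + n] for i in range(len(word) - n + 1)}


def dice_similarity(a, b, n=2):
    """Dice coefficient over the unique character n-grams of two words."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def stem(word, corpus_words, threshold=0.6):
    """Assumed rule: the shortest corpus word that is a prefix of `word`
    and whose n-gram similarity to `word` reaches the threshold."""
    candidates = [
        w for w in corpus_words
        if word.startswith(w) and dice_similarity(word, w) >= threshold
    ]
    # Fall back to the word itself when no corpus word qualifies.
    return min(candidates, key=lambda w: (len(w), w)) if candidates else word


def preprocess(text, corpus_words, stop_words, verb_lookup):
    """Tokenize, drop stop words, stem every word, then lemmatize verbs."""
    tokens = text.split()  # naive whitespace tokenization (assumption)
    content = [t for t in tokens if t not in stop_words]
    stems = [stem(t, corpus_words) for t in content]
    # A stem found in the look-up dictionary is treated as a verb stem
    # and replaced by its lemma; every other stem is kept unchanged.
    return [verb_lookup.get(s, s) for s in stems]


# Toy, romanized data (illustrative only, not taken from the paper).
CORPUS = {"manuh", "manuhe", "manuhor", "kor", "kori", "korise"}
STOP_WORDS = {"aru"}
VERB_LOOKUP = {"kori": "kor"}  # possible stem -> lemma

result = preprocess("manuhor aru korise", CORPUS, STOP_WORDS, VERB_LOOKUP)
print(result)
```

In this toy run the stemmer alone reduces "korise" only to "kori", because "kor" falls below the assumed similarity threshold. The dictionary look-up then maps that stem to the lemma "kor", which mirrors the abstract's point that stemming alone is not sufficient for verbs.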

Published

2025-11-17

How to Cite

[1]
G. Chetia and G. C. Hazarika, “Pre-processing Phase of Automatic Text Summarization for the Assamese Language”, Int. J. Comp. Sci. Eng., vol. 6, no. 10, pp. 159–163, Nov. 2025.

Section

Research Article