Document Categorization for Probabilistic Redundant Documents
DOI:
https://doi.org/10.26438/ijcse/v7i1.5155
Keywords:
Duplicate-detection, text categorization, information retrieval, similarity measure
Abstract
Text categorization is an active research area in information retrieval and machine learning. A major issue when preprocessing documents for categorization is redundancy: redundant documents slow down the learning steps of classification and reduce its efficiency and scalability. To resolve this issue, the duplicates are identified first and the classification is performed afterwards. This paper proposes applying a similarity measure for duplicate detection and Random Forest for classification. The results are evaluated on the ‘20 newsgroups’ data set with generated duplicate documents. The proposed method shows better accuracy and running time than the existing text categorization model.
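The duplicate-detection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure the paper uses, so the example below assumes cosine similarity over raw term counts. The function names and the 0.9 threshold are hypothetical and chosen only for illustration; this is not the authors' implementation.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors (0.0 if either is empty)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_near_duplicates(docs, threshold=0.9):
    """Return indices of documents kept after removing near-duplicates.

    A document is dropped when its similarity to any already-kept
    document is at least `threshold` (an assumed, illustrative value).
    """
    kept = []          # indices of documents retained so far
    kept_vectors = []  # term-count vectors of the retained documents
    for i, doc in enumerate(docs):
        vec = term_counts(doc)
        if any(cosine_similarity(vec, kv) >= threshold for kv in kept_vectors):
            continue  # too similar to an earlier document, treat as duplicate
        kept.append(i)
        kept_vectors.append(vec)
    return kept


# Toy corpus: the second document is a near-duplicate of the first.
docs = [
    "The space shuttle launch was delayed by weather.",
    "The space shuttle launch was delayed by the weather.",
    "Hockey playoffs start next week in Toronto.",
]
kept = drop_near_duplicates(docs)
print(kept)
```

In a pipeline like the one the abstract outlines, only the documents at the kept indices would be passed on to the classifier (Random Forest in the paper). The pairwise comparison here is quadratic in the number of documents, which is acceptable for a sketch but would likely need indexing or hashing on a corpus the size of 20 newsgroups.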
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
