Document Categorization for Probabilistic Redundant Documents

Authors

S. Singh, Dept. of Computer Science and Engineering, College of Technology and Engineering, MPUAT, Udaipur, India
K. Jain, Dept. of Computer Science and Engineering, College of Technology and Engineering, MPUAT, Udaipur, India

DOI:

https://doi.org/10.26438/ijcse/v7i1.5155

Keywords:

Duplicate-detection, text categorization, information retrieval, similarity measure

Abstract

Text categorization is an active research area in information retrieval and machine learning. A major issue when preprocessing documents for categorization is redundancy: redundant documents slow down the learning steps of classification and reduce its efficiency and scalability. To resolve this issue, it is preferable to first identify the duplicates and then perform the classification. This paper proposes applying a similarity measure for duplicate detection and a random forest for classification. The results are evaluated using ‘20 newsgroups’ data sets with generated duplicate documents. The proposed method shows better results for the accuracy and time parameters than the existing text categorization model.
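The duplicate-detection step described in the abstract can be illustrated with a minimal sketch. The abstract does not say which similarity measure is used, so cosine similarity over raw term counts is an assumption here, as are the function names and the 0.9 threshold. The classification stage (a random forest in the paper) would then be trained only on the documents that survive this filter.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def drop_duplicates(docs, threshold=0.9):
    """Return indices of documents kept after removing near-duplicates.

    A document is dropped when its similarity to an already kept document
    reaches `threshold` (a hypothetical value, not taken from the paper).
    """
    kept, kept_vectors = [], []
    for index, doc in enumerate(docs):
        vector = term_counts(doc)
        if all(cosine_similarity(vector, seen) < threshold for seen in kept_vectors):
            kept.append(index)
            kept_vectors.append(vector)
    return kept


# Toy corpus: the second document differs from the first only in punctuation.
docs = [
    "The graphics card driver crashes on startup.",
    "The graphics card driver crashes on startup!",
    "NASA launched a new probe to study the outer planets.",
]
unique_indices = drop_duplicates(docs)
```

This pairwise comparison is quadratic in the number of documents, which is acceptable for a small illustration but would likely need an indexing or hashing scheme on a corpus the size of 20 newsgroups.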

Published

2019-01-31

How to Cite

S. Singh and K. Jain, “Document Categorization for Probabilistic Redundant Documents”, Int. J. Comp. Sci. Eng., vol. 7, no. 1, pp. 51–55, Jan. 2019.

Issue

Vol. 7 No. 1 (2019): IJCSE January Edition

Section

Research Article

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
