Document Categorization for Probabilistic Redundant Documents

Authors

  • Singh S Dept. of Computer Science and Engineering College of Technology and Engineering, MPUAT Udaipur, India
  • Jain K Dept. of Computer Science and Engineering College of Technology and Engineering, MPUAT Udaipur, India

DOI:

https://doi.org/10.26438/ijcse/v7i1.5155

Keywords:

Duplicate-detection, text categorization, information retrieval, similarity measure

Abstract

Text categorization is an active research area in information retrieval and machine learning. A major preprocessing issue for categorization is redundancy: redundant documents slow down the learning steps of classification and degrade its efficiency and scalability. To resolve this issue, it is preferable to first identify the duplicates and then perform the classification. This paper proposes applying a similarity measure for duplicate detection and a Random Forest for classification. The results are evaluated on the '20 Newsgroups' dataset augmented with generated duplicate documents. On both accuracy and runtime, the proposed method outperforms the existing text categorization model.
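The paper does not publish code; the sketch below is a hypothetical illustration of the pipeline the abstract describes, using scikit-learn equivalents (TF-IDF cosine similarity for duplicate detection, then a Random Forest classifier). The similarity threshold and all document data are assumptions for demonstration only, not the paper's actual settings.

```python
# Hypothetical sketch of the described pipeline, NOT the authors' code:
# 1) detect near-duplicates with a cosine-similarity threshold,
# 2) train a Random Forest on the deduplicated corpus.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.ensemble import RandomForestClassifier

docs = [
    "free money offer now",
    "free money offer now!!",      # near-duplicate of the first document
    "meeting scheduled for monday",
    "python code review notes",
]
labels = ["spam", "spam", "work", "work"]

# Vectorize the corpus and compute pairwise cosine similarities.
vec = TfidfVectorizer()
X = vec.fit_transform(docs)
sim = cosine_similarity(X)

# Keep a document only if it is not too similar to any already-kept one.
THRESHOLD = 0.9  # assumed value; the paper's exact threshold is not stated here
keep = []
for i in range(len(docs)):
    if all(sim[i, j] < THRESHOLD for j in keep):
        keep.append(i)

X_dedup = X[keep]
y_dedup = [labels[i] for i in keep]

# Train a Random Forest on the deduplicated corpus.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_dedup, y_dedup)
```

Here the second document (identical apart from punctuation) is dropped before training, so the classifier never sees the redundant copy.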

References

[1] D. Xue, F. Li, “Research of Text Categorization Model based on Random Forests,” IEEE International Conference on Computational Intelligence & Communication Technology, pp. 173-176, 2015.

[2] G. Gao, S. Guan, “Text Categorization Based on Improved Rocchio Algorithm,” International Conference on Systems and Informatics, pp. 2247-2250, 2012.

[3] S.S. Thamarai, P. Kartikeyan, A. Vincent, V. Abinaya, G. Neeraja, R. Deepika, "Text Categorization using Rocchio Algorithm and Random Forest Algorithm," IEEE Eighth International Conference on Advanced Computing (ICoAC), Chennai, India, pp. 7-12, 2017.

[4] J.Y. Jiang, S.C. Tsai, S.J. Lee, “FSKNN: Multi-label text categorization based on fuzzy similarity and k nearest neighbors,” Expert Systems with Applications, Vol. 39, Issue. 3, pp. 2813-2821, 2012.

[5] M.L. Zhang, Z.H. Zhou, "A lazy learning approach to multi-label learning," Pattern Recognition, Vol. 40, Issue. 7, pp. 2038-2048, 2007.

[6] S. Seshasai, "Efficient near duplicate document detection for specialized corpora," Massachusetts Institute of Technology, 2008.

[7] W. Zong, F. Wu, L.K. Chu, D. Schulli, "A discriminative and semantic feature selection method for text categorization," School of Management, Xi'an Jiaotong University, China, Int. J. Production Economics, Vol. 165, pp. 215-222, 2015.

[8] M. Bilenko, R.J. Mooney, "Adaptive Duplicate Detection Using Learnable String Similarity Measures," Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 39-48, 2003.

[9] G.S. Manku, A.D. Sarma, A. Jain, "Detecting Near Duplicates for Web Crawling," International World Wide Web Conference Committee (IW3C2), pp. 141-149, 2007.

[10] E.P. Sim, "Classification & Detection of Near Duplicate Web Pages using Five Stage Algorithm," Online International Conference on Green Engineering and Technologies (IC-GET), 2015.

Published

2019-01-31

How to Cite

[1]
S. Singh and K. Jain, “Document Categorization for Probabilistic Redundant Documents”, Int. J. Comp. Sci. Eng., vol. 7, no. 1, pp. 51–55, Jan. 2019.

Section

Research Article