Efficient Clustering of Text Documents for Feature Selection on the use of side Information

Authors

  • Sonal S Deshmukh, Department of Computer Engineering, JSPM’s Imperial College of Engineering & Research, Wagholi, Pune, India
  • RN Phursule, Department of Computer Engineering, JSPM’s Imperial College of Engineering & Research, Wagholi, Pune, India

Keywords:

Text Mining, LSI, Side information, Clustering, Probabilistic Latent Semantic Indexing

Abstract

This paper presents an efficient approach to clustering text documents with side information using Probabilistic Latent Semantic Indexing (PLSI). Side information (meta-information) is available in many text mining applications. Adding it to the clustering process can be useful, but it can also be risky. This work addresses the clustering problem for text mining settings in which such auxiliary information is available, with the aim of enhancing the extraction of text documents. The proposed approach, based on Probabilistic Latent Semantic Indexing, gains efficiency by taking class labels into account and is applicable to a large number of clusters. The goal is to use the side information available with the documents during clustering, both to improve the efficiency of the clusters and to reduce the time required to form them.
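The abstract does not spell out the model, so the following is only a minimal sketch of the general idea, assuming plain PLSI fitted with EM. Folding side information in as extra pseudo-term columns of the count matrix is an assumption made for this illustration, not necessarily the authors' method. The function names `normalize` and `plsa` and the toy matrix `counts` are hypothetical.

```python
import random


def normalize(row, eps=1e-12):
    """Scale a row to sum to 1; eps keeps every entry strictly positive."""
    smoothed = [v + eps for v in row]
    total = sum(smoothed)
    return [v / total for v in smoothed]


def plsa(counts, n_topics, n_iter=100, seed=0):
    """Fit PLSI with EM on a count matrix; return P(z|d) and P(w|z)."""
    rng = random.Random(seed)
    n_docs, n_cols = len(counts), len(counts[0])
    # Random initialisation of the two conditional distributions.
    p_z_d = [normalize([rng.random() for _ in range(n_topics)])
             for _ in range(n_docs)]
    p_w_z = [normalize([rng.random() for _ in range(n_cols)])
             for _ in range(n_topics)]
    for _ in range(n_iter):
        acc_z_d = [[0.0] * n_topics for _ in range(n_docs)]
        acc_w_z = [[0.0] * n_cols for _ in range(n_topics)]
        for d in range(n_docs):
            for w in range(n_cols):
                n = counts[d][w]
                if n == 0:
                    continue
                # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
                post = [p_z_d[d][z] * p_w_z[z][w] for z in range(n_topics)]
                total = sum(post)
                for z in range(n_topics):
                    resp = n * post[z] / total
                    acc_z_d[d][z] += resp
                    acc_w_z[z][w] += resp
        # M-step: renormalise the expected counts.
        p_z_d = [normalize(row) for row in acc_z_d]
        p_w_z = [normalize(row) for row in acc_w_z]
    return p_z_d, p_w_z


# Toy data (hypothetical): columns 0-3 are word counts, columns 4-5 are
# side-information indicators (for example, two author IDs).
counts = [
    [3, 2, 0, 0, 1, 0],
    [2, 3, 0, 0, 1, 0],
    [0, 0, 4, 1, 0, 1],
    [0, 0, 1, 4, 0, 1],
]

p_z_d, p_w_z = plsa(counts, n_topics=2)
# Hard cluster label per document: the topic with the largest P(z|d).
labels = [row.index(max(row)) for row in p_z_d]
print(labels)
```

In this sketch the side-information columns take part in EM exactly like words, so an uninformative attribute can pull documents toward the wrong cluster. Scaling those columns up or down before fitting is one simple way to control their influence.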

Published

2025-11-10

How to Cite

[1]
S. S. Deshmukh and R. N. Phursule, “Efficient Clustering of Text Documents for Feature Selection on the use of side Information”, Int. J. Comp. Sci. Eng., vol. 3, no. 10, pp. 10–16, Nov. 2025.

Section

Research Article