A Clustering Framework for Large Document Datasets
Keywords:
Large Document Set, Similarity Measurement, Term Extraction, Dendrogram
Abstract
A document set is a collection of documents of different types, each containing information of value to its readers. Such documents may hold blog content, website access patterns, transaction records, or plain text, and they need to be clustered by similarity. Clustering similar documents can help anticipate future trends among people, which is also useful from a business point of view. In this paper, we propose a clustering approach for large document sets. The proposed approach immediately assigns each document to an appropriate cluster. Experiments are conducted on the Twenty Newsgroups dataset using Java and MATLAB, and the approach is compared with existing methods. Experimental results show the effectiveness of the proposed approach for large document sets.
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
