On Applying Document Similarity Measures for Template based Clustering of Web Documents
DOI:
https://doi.org/10.26438/ijcse/v6si1.3742
Keywords:
Template, Clustering, Cosine, Jaccard, Agglomerative Hierarchical Clustering
Abstract
The World Wide Web is a useful and easy way to find information on the Internet. To reduce content generation and publishing time, templates are used to populate the contents of web documents. A template gives readers easy access to a document's contents through its layout and structure. For search engines, however, the irrelevant terms that templates introduce degrade accuracy and performance. Templates are also used by the wrapper induction tools in information extractors, which extract and integrate information from various e-commerce sites. Templates have therefore received a lot of attention in efforts to improve search engine performance and content integration. In this paper we discuss how heterogeneous web documents, i.e. web documents generated from different templates, can be clustered. We apply document similarity measures to cluster heterogeneous web documents generated from templates. Our experimental results on real data sets show that the cosine similarity measure is more suitable for template-based clustering of heterogeneous web documents.
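To make the idea concrete, the following is a minimal sketch of how cosine and Jaccard similarity could drive an agglomerative hierarchical clustering of pages by template. It is not the authors' implementation: the abstract does not say how documents are represented, so the `tag_tokens` feature extractor (opening tag names), the average-linkage rule, the stopping threshold of 0.6, and the sample pages are all assumptions chosen for illustration.

```python
import math
import re
from collections import Counter


def tag_tokens(html):
    """Hypothetical feature extractor: names of the opening tags, in order."""
    return [t.lower() for t in re.findall(r"<\s*([a-zA-Z][a-zA-Z0-9]*)", html)]


def cosine_similarity(a, b):
    """Cosine of the angle between the term-frequency vectors of a and b."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard_similarity(a, b):
    """Size of the intersection of the token sets divided by their union."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def agglomerative(docs, similarity, threshold):
    """Average-linkage agglomerative clustering (an assumed linkage rule).

    Starts with one cluster per document and repeatedly merges the most
    similar pair, stopping once the best pair falls below `threshold`.
    Returns clusters as lists of document indices.
    """
    clusters = [[i] for i in range(len(docs))]

    def link(c1, c2):
        total = sum(similarity(docs[i], docs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                s = link(clusters[x], clusters[y])
                if s > best:
                    best, pair = s, (x, y)
        if best < threshold:
            break
        x, y = pair
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Illustrative pages: two from a "div" layout, two from a "table" layout.
pages = [
    "<html><body><div><h1>A</h1><p>x</p></div></body></html>",
    "<html><body><div><h1>B</h1><p>y</p><p>z</p></div></body></html>",
    "<html><body><table><tr><td>1</td></tr></table></body></html>",
    "<html><body><table><tr><td>1</td><td>2</td></tr><tr><td>3</td></tr></table></body></html>",
]
docs = [tag_tokens(p) for p in pages]

print(agglomerative(docs, cosine_similarity, threshold=0.6))
print(agglomerative(docs, jaccard_similarity, threshold=0.6))
```

Cosine similarity works on term frequencies, so repeated tags change the score, whereas Jaccard only looks at which tags are present. On real pages the choice of features and threshold would need tuning, and the values above should not be read as the paper's settings.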
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
