On Applying Document Similarity Measures for Template based Clustering of Web Documents

Authors

  • TI Bagban Department of Information Technology, D.K.T.E’S T.E.I, Ichalkaranji, India
  • PJ Kulkarni Department of Computer Science and Engineering, Walchand College of Engineering, Sangli, India

DOI:

https://doi.org/10.26438/ijcse/v6si1.3742

Keywords:

Template, Clustering, Cosine, Jaccard, Agglomerative Hierarchical Clustering

Abstract

The World Wide Web is a convenient source of information on the Internet. To reduce content generation and publishing time, templates are used to populate the contents of web documents. Templates give readers easy access to web document contents through a consistent layout and structure. For search engines, however, the irrelevant terms that templates contain degrade accuracy and performance. Templates are also used by the wrapper induction tools in information extractors, which extract and integrate information from various e-commerce sites. Template detection has therefore received considerable attention as a way to improve search engine performance and content integration. In this paper we discuss how heterogeneous web documents, i.e. web documents generated from different templates, can be clustered. We apply document similarity measures to cluster heterogeneous web documents generated from templates. Our experimental results on real data sets show that the cosine similarity measure is more suitable for template-based clustering of heterogeneous web documents.
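
To make the approach named in the abstract and keywords concrete, the following is a minimal Python sketch of cosine and Jaccard similarity feeding an agglomerative hierarchical clustering loop. It is an illustration under stated assumptions, not the authors' implementation: representing each page as a list of tag-path tokens, the average-linkage rule, the threshold of 0.5, and the toy pages are all hypothetical choices made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two token lists, using term frequencies."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Jaccard similarity between two token lists, treated as sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def agglomerative(docs, similarity, threshold):
    """Average-linkage agglomerative clustering (an assumed linkage rule).

    Starts with one cluster per document and repeatedly merges the most
    similar pair until no pair reaches the threshold.
    Returns clusters as lists of document indices.
    """
    clusters = [[i] for i in range(len(docs))]

    def linkage(c1, c2):
        total = sum(similarity(docs[i], docs[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        x, y = max(combinations(range(len(clusters)), 2),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[x], clusters[y]) < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]  # safe because x < y
    return clusters


# Hypothetical pages: two from a product template, two from a table template.
docs = [
    ["html/body/div.header", "html/body/div.nav",
     "html/body/div.product", "html/body/div.footer"],
    ["html/body/div.header", "html/body/div.nav", "html/body/div.product",
     "html/body/div.product", "html/body/div.footer"],
    ["html/body/table", "html/body/table/tr",
     "html/body/table/tr/td", "html/body/div.footer"],
    ["html/body/table", "html/body/table/tr", "html/body/table/tr",
     "html/body/table/tr/td", "html/body/div.footer"],
]

print(agglomerative(docs, cosine_similarity, threshold=0.5))
print(agglomerative(docs, jaccard_similarity, threshold=0.5))
```

In this toy input both measures group the pages by template. The practical difference is that cosine similarity takes term frequency into account, so repeated elements influence the score, whereas Jaccard similarity only records whether a token is present.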

Published

2025-11-12

How to Cite

[1]
T. Bagban and P. Kulkarni, “On Applying Document Similarity Measures for Template based Clustering of Web Documents”, Int. J. Comp. Sci. Eng., vol. 6, no. 1, pp. 37–42, Nov. 2025.