HTML Tag Structure Based Content Retrieval from Web Pages

Authors

Bhamare SS, School of Computer Sciences, Kavayitri Bahinabai Chaudhari North Maharashtra University, Jalgaon (M.S.), India

DOI:

https://doi.org/10.26438/ijcse/v10i11.3539

Keywords:

WWW, Web Page, HTML Tags, Text Density.

Abstract

The World Wide Web (WWW) contains an enormous number of web pages that are accessible to users. These pages are formatted in HTML (HyperText Markup Language), and web pages, pictures, videos and other online content can all be accessed through a web browser, which makes the Web a very useful means of collecting information. Information retrieval systems help to retrieve the relevant information from web documents. The retrieval process involves three stages: identifying the documents to be processed, writing a query, and using a search mechanism to retrieve the relevant web document information. This paper discusses how the HTML tag structure of a web page is useful for retrieving the main, or informative, content from web pages for efficient web mining operations.
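
As an illustration only, the idea of using tag structure and text density to separate informative content from page noise can be sketched as follows. This is not the method of the paper, whose details are not given in the abstract. The tag sets, the thresholds `min_chars` and `max_link_density`, and the names `BlockCollector` and `extract_main_content` are all assumptions made for this sketch, which relies only on Python's standard `html.parser` module.

```python
from html.parser import HTMLParser

# Assumption: text inside these tags is treated as noise and ignored.
SKIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}
# Assumption: these block-level tags delimit candidate content blocks.
BLOCK_TAGS = {"p", "div", "article", "section", "td", "li"}


class BlockCollector(HTMLParser):
    """Splits a page into text blocks and counts the anchor text in each."""

    def __init__(self):
        super().__init__()
        self.blocks = []        # finished blocks as (text, link_chars)
        self._pieces = []       # text pieces of the block being read
        self._link_chars = 0    # characters of the current block inside <a>
        self._skip_depth = 0    # > 0 while inside a SKIP_TAGS element
        self._link_depth = 0    # > 0 while inside an <a> element

    def flush(self):
        """Closes the current block, storing it if it holds any text."""
        text = " ".join(" ".join(self._pieces).split())
        if text:
            self.blocks.append((text, self._link_chars))
        self._pieces, self._link_chars = [], 0

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._link_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif tag in BLOCK_TAGS:
            self.flush()
        elif tag == "a":
            self._link_depth = max(0, self._link_depth - 1)

    def handle_data(self, data):
        if self._skip_depth:
            return
        self._pieces.append(data)
        if self._link_depth:
            self._link_chars += len(" ".join(data.split()))


def extract_main_content(html, min_chars=40, max_link_density=0.3):
    """Keeps blocks that are long enough and not dominated by link text."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    parser.flush()
    kept = []
    for text, link_chars in parser.blocks:
        link_density = link_chars / len(text)
        if len(text) >= min_chars and link_density <= max_link_density:
            kept.append(text)
    return "\n".join(kept)


# Hypothetical page: one navigation bar, one link-only block, one paragraph.
page = """
<html><body>
<nav><a href="/">Home</a> <a href="/issues">Current Issue</a></nav>
<div><a href="/a">Make a Submission</a> <a href="/b">Join Editorial Board</a></div>
<p>Web pages contain navigation, advertisements and other noise around the informative content.</p>
<script>var x = 1;</script>
</body></html>
"""
print(extract_main_content(page))
```

In this sketch a block is kept only when it has enough text and most of that text lies outside hyperlinks, so menus and link lists are dropped while running paragraphs survive. Both thresholds are arbitrary starting points and would need tuning on real pages.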

Published

2022-11-30

How to Cite

S. S. Bhamare, “HTML Tag Structure Based Content Retrieval from Web Pages”, Int. J. Comp. Sci. Eng., vol. 10, no. 11, pp. 35–39, Nov. 2022.

Issue

Vol. 10 No. 11 (2022): IJCSE November Edition

Section

Research Article

License

This work is licensed under a Creative Commons Attribution 4.0 International License.

Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
