A Supervised Forum Crawler
Keywords:
Forum Crawling, URL Type, Page Classification, Crawling Strategy, JavaScript-based URLs
Abstract
Web Forums, also called Internet Forums, provide a space for users to share, discuss, and request information. They are sources of a huge amount of structured information that changes rapidly, so crawling them requires specialized software; a Generic Deep Web Crawler or a Focused Crawler cannot be used for this purpose. In this paper, we propose an effective Web Crawler designed specifically for Internet Forums. This Forum Crawler overcomes the drawbacks of many existing Forum Crawlers. Given any page of a Forum site, it can detect the site's Entry URL (Uniform Resource Locator), and starting the crawl from the Entry URL increases coverage. The URLs found in Web Forums are classified into four categories. The crawling process is divided into a learning part and an online crawling part: the learning part creates regular expressions based on URLs, and the online crawling part crawls the Web pages.
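The split between a learning part and an online crawling part can be illustrated with a minimal sketch. It is not the implementation described in the paper: the abstract does not name the four URL categories or say how the regular expressions are derived, so the labels `index`, `thread`, and `other`, the digit-generalization rule, the helper names, and the `forum.example.com` URLs below are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit


def _path_and_query(url: str) -> str:
    """Return the part of the URL that the patterns are matched against."""
    parts = urlsplit(url)
    return parts.path + ("?" + parts.query if parts.query else "")


def generalize(url: str) -> str:
    """Turn one example URL into a regex source string.

    Assumed rule (not from the paper): literal text is escaped and every
    run of digits is replaced by \\d+, so numeric IDs are wildcarded.
    """
    pieces = re.split(r"\d+", _path_and_query(url))
    return r"\d+".join(re.escape(piece) for piece in pieces)


def learn_patterns(labelled_urls):
    """Learning part: build compiled patterns per (hypothetical) URL type."""
    sources = {}
    for url, url_type in labelled_urls:
        sources.setdefault(url_type, set()).add(generalize(url))
    return {
        url_type: [re.compile(src + r"\Z") for src in sorted(srcs)]
        for url_type, srcs in sources.items()
    }


def classify(url: str, patterns, default: str = "other") -> str:
    """Online part: label a newly discovered URL with the learned patterns."""
    target = _path_and_query(url)
    for url_type, regexes in patterns.items():
        if any(rx.match(target) for rx in regexes):
            return url_type
    return default


# Hypothetical training examples; a real learner would gather these itself.
training = [
    ("http://forum.example.com/forumdisplay.php?f=12", "index"),
    ("http://forum.example.com/showthread.php?t=345", "thread"),
]
patterns = learn_patterns(training)
print(classify("http://forum.example.com/showthread.php?t=99871", patterns))
```

In this sketch the online part only consults the compiled patterns, so a crawler built this way could decide whether to follow a URL without downloading and classifying the page behind it.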
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
