Heuristic Approach for Designing a Focused Web Crawler using Cuckoo Search
Keywords:
Cuckoo search, DNS, meta-heuristic, optimization, pattern recognition, web crawling
Abstract
To find a geographical location on the globe, we usually consult a map. By analogy, to find a Web page on the World Wide Web (WWW), we usually use a Web search engine. Millions of searches are performed around the globe every minute, and a search engine depends on a Web crawler to collect the resources it serves, so crawler design is an important task: better resources lead to better search engine performance. The WWW is a huge source of information, but that information is spread across the Internet over many Web servers and hosts. People publish new Web pages every day, and as a result the traffic overhead increases exponentially. To produce more accurate results, I follow a heuristic approach to designing a Web crawler that aims to return an optimized search result in minimal time. This paper develops an approach that uses Cuckoo Search to generate the best result in the least time. The approach has two parts. The first part is the implementation of the crawler, which covers “what to search for” and “from where to search”, and also filters out unwanted data. The second part proposes a string matching algorithm for producing the search result.
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
