Crawling Hidden Objects with KNN Queries
Keywords:
Knn, Hidden Objects, Location Based Services, Crawling, DensityAbstract
Many websites offering Location Based Services (LBS) provide a k NN search interface that returns the topk nearest-neighbor objects (e.g., nearest restaurants) for a given query location. This paper addresses the problem of crawling all objects efficiently from an LBS website, through the public k NN web search interface it provides. Specifically, we develop crawling algorithm for 2D and higher-dimensional spaces, respectively, and demonstrate through theoretical analysis that the overhead of our algorithms can be bounded by a function of the number of dimensions and the number of crawled objects, regardless of the underlying distributions of the objects. We also extend the algorithms to leverage scenarios where certain auxiliary information about the underlying data distribution, e.g., the population density of an area which is often positively correlated with the density of LBS objects, is available. Extensive experiments on real-world datasets demonstrate the superiority of our algorithms over the state-of-the-art competitors in the literature
References
[1] Mcdonalds, “Mcdonalds page, http://www.mcdonalds.com/,” [Accessed: Aug. 6, 2014] [Online] Available: url{http://www.mcdonalds.com/us/en/restaurant locator.html}
[2] S. Byers, J. Freire, and C. T. Silva, “Efficient acquisition of web data through restricted query interfaces,” in Poster Proceedingsof the Tenth International World Wide Web Conference, WWW 10,Hong Kong, China, May 1-5, 2001, 2001. [Online]Available:http://www10.org/cdrom/posters/1051.pdf
[3] W. D. Bae, S. Alkobaisi, S. H. Kim, S. Narayanappa, and C. Shahabi, “Web data retrieval: solving spatial range queries using k-nearest neighbor searches,” Geoinformatica, vol. 13, no. 4, pp. 483–514, 2009.
[4] G. E. Glasses, “Great eye glasses page, http://www.greateyeglasses.com/shop/search.php,” [Ac cessed: Jan. 20, 2014] [Online] Available: url{http://www.greateyeglasses.com/shop/search.php}
[5] Yahoo, “Yahoo local page, https://local.yahoo.com/,”[Accessed: Dec. 2012] [Online] Available: url{https://local.yahoo.com/}
[6] U. Census, “Us census, http://www.census.gov/cgibin/geo/shapefiles2013/layers.cgi,” [Accessed: Dec. 2013][Online] Available: url{http://www.census.gov/cgi-bin/geo/shapefiles2013/layers.cgi}
[7] L. Devroye, “Sample-based non-uniform random variate generation,” in Proceedings of the 18th conference on Winter simulation. ACM, 1986, pp. 260–265.
[8] L. Barbosa and J. Freire, “Siphoning hidden-web data throughkeyword-based interfaces,” in SBBD, 2004, pp. 309–321.
[9] A. Ntoulas, P. Pzerfos, and J. Cho, “Downloading textualhidden web content through keyword queries,” in DigitalLibraries, 2005.JCDL’05.Proceedings of the 5th ACM/IEEE-CSJoint Conference on. IEEE, 2005, pp. 100–109.
[10] K. Vieira, L. Barbosa, J. Freire, and A. Silva, “Siphon++: ahidden-webcrawler for keyword-based interfaces,” in Proceedings of the 17th ACM conference on Information and knowledgemanagement. ACM, 2008, pp. 1361–1362.
[11] L. Jiang, Z. Wu, Q. Feng, J. Liu, and Q. Zheng, “Efficient deep web crawling using reinforcement learning,” in Advances inKnowledge Discovery and Data Mining. Springer, 2010, pp. 428–439.
[12] S. Raghavan and H. Garcia-Molina, “Crawling the hidden web,” in VLDB 2001, Proceedings of 27th InternationalConference on Very Large Data Bases, September 11-14, 2001,Roma, Italy, 2001, pp. 129–138. [Online] Available: http://www.vldb.org/conf/2001/P129.pdf
[13] S. W. Liddle, D. W. Embley, D. T. Scott, and S. H. Yau, “Extracting data behind web forms,” in ConceptualModeling - ER 2002, 21st International Conference on ConceptualModeling, Tampere, Finland, October 7-11, 2002, Proceedings,
2002, pp. 402–413. [Online] Available: http://dx.doi.org/10.1007/978-3-540-45275-1 35
[14] P. Wu, J. Wen, H. Liu, and W. Ma, “Query selection techniques for efficient crawling of structured web sources,” in Proceedingsof the 22nd International Conference on Data Engineering, ICDE2006, 3-8 April 2006, Atlanta, GA, USA, 2006, p. 47. [Online] Available: http://dx.doi.org/10.1109/ICDE.2006.124
[15] M. Alvarez, J. Raposo, A. Pan, F. Cacheda, F. Bellas, and ´ V. Carneiro, “Crawling the content hidden behind web forms,” in Computational Science and Its Applications–ICCSA 2007. Springer, 2007, pp. 322–333.
Downloads
Published
How to Cite
Issue
Section
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
