A Review on Duplicate and Near Duplicate Documents Detection Technique
Keywords:
Web crawling, web pages, web mining, web content mining, duplicate document, near duplicate detection
Abstract
Duplicated web pages that consist of identical structure but different data can be regarded as clones. The identification of similar and near-duplicate pairs in a large collection is a significant problem with widespread applications. The problem has been deliberated for diverse data types in diverse settings. A contemporary materialization of the problem is the efficient identification of near-duplicate web pages. This is challenging at web scale because of the voluminous data and the high dimensionality of the documents. This review aims to present an up-to-date survey of the existing literature on duplicate and near-duplicate detection of general documents and of web documents in web crawling. A classification of the existing duplicate and near-duplicate detection techniques and a detailed description of each are presented to make the review more comprehensible.
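To make the notion of a near-duplicate pair concrete, the following is a minimal sketch of one common approach: split each document into overlapping word sequences (shingles) and compare the resulting sets with the Jaccard coefficient. It is an illustration only, not the method of any particular paper covered by the review. The function names, the shingle width `w=3`, and the similarity threshold of 0.5 are assumptions chosen for the example.

```python
import re


def shingles(text, w=3):
    """Return the set of w-word shingles (contiguous word sequences) in text."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < w:
        # Documents shorter than w words yield a single short shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + w]) for i in range(len(words) - w + 1)}


def jaccard(a, b):
    """Jaccard coefficient |a & b| / |a | b| of two sets (1.0 if both are empty)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def near_duplicate_pairs(docs, threshold=0.5, w=3):
    """Compare every pair of documents; keep pairs whose similarity >= threshold.

    docs maps a document name to its text. The threshold is an assumed,
    illustrative value, not one prescribed by the review.
    """
    sets = {name: shingles(text, w) for name, text in docs.items()}
    names = sorted(sets)
    pairs = []
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            score = jaccard(sets[x], sets[y])
            if score >= threshold:
                pairs.append((x, y, score))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy cat",
    "c": "web crawling at scale needs compact fingerprints",
}
print(near_duplicate_pairs(docs))
```

This all-pairs comparison grows quadratically with the number of documents, which is one reason the problem is hard at web scale and why compact fingerprints or sketches are typically used in place of full shingle sets.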
License

This work is licensed under a Creative Commons Attribution 4.0 International License.
Authors contributing to this journal agree to publish their articles under the Creative Commons Attribution 4.0 International License, allowing third parties to share their work (copy, distribute, transmit) and to adapt it, under the condition that the authors are given credit and that in the event of reuse or distribution, the terms of this license are made clear.
