A Review on Duplicate and Near Duplicate Documents Detection Technique

Authors

  • Patil Deepali E Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, Kolhapur, Maharashtra, India
  • Ghatage Trupti B Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, Kolhapur, Maharashtra, India
  • Takmare Sachin B Department of Computer Science and Engineering, Bharati Vidyapeeth’s College of Engineering, Kolhapur, Maharashtra, India
  • Patil Sushama A DC Branch, Dept of Digital Communication, SSSIST Sehore.

Keywords:

Web crawling, web pages, web mining, web content mining, duplicate document, near duplicate detection

Abstract

Duplicate web pages that share an identical structure but contain different data can be regarded as clones. The identification of similar and near-duplicate pairs in a large collection is a significant problem with widespread applications. The problem has been studied for diverse data types in diverse settings. Its most recent manifestation is the efficient identification of near-duplicate web pages, which is challenging at web scale because of the voluminous data and the high dimensionality of the documents. The main intention of this review is to present an up-to-date survey of the existing literature on duplicate and near-duplicate detection, both for general documents and for web documents in web crawling. A classification of the existing duplicate and near-duplicate detection techniques is presented, together with a detailed description of each, to make the review more comprehensible.
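
To make the notion of a near-duplicate pair concrete, the following is a minimal sketch of one well-known family of techniques: word shingling combined with Jaccard similarity. It is an illustration under stated assumptions, not necessarily the formulation used in the review. The function names, the shingle size k = 3, and the threshold of 0.5 are hypothetical choices made for this example only.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (contiguous word windows) in text."""
    words = text.lower().split()
    if len(words) < k:
        # Text shorter than k words: treat the whole text as one shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity: size of intersection divided by size of union."""
    if not a and not b:
        return 1.0  # Two empty sets are treated as identical (assumption).
    return len(a & b) / len(a | b)


def is_near_duplicate(doc1, doc2, k=3, threshold=0.5):
    """Flag a pair as near-duplicate when shingle overlap meets the threshold.

    The values of k and threshold are illustrative assumptions.
    """
    return jaccard(shingles(doc1, k), shingles(doc2, k)) >= threshold


page_a = "the quick brown fox jumps over the lazy dog"
page_b = "the quick brown fox jumps over the lazy cat"
print(jaccard(shingles(page_a), shingles(page_b)))  # 6 shared of 8 total -> 0.75
print(is_near_duplicate(page_a, page_b))            # True
```

Comparing every pair of shingle sets in this way does not scale to a web-sized collection, which is the difficulty the abstract points to; practical systems typically replace the full sets with compact sketches or fingerprints.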

Published

2025-11-11

How to Cite

[1]
E. Patil Deepali, B. Ghatage Trupti, B. Takmare Sachin, and A. Patil Sushama, “A Review on Duplicate and Near Duplicate Documents Detection Technique”, Int. J. Comp. Sci. Eng., vol. 4, no. 3, pp. 59–63, Nov. 2025.

Section

Review Article