Abstract
Background: With the surge in the volume of collected data, deduplication has undoubtedly become one of the problems researchers must face. Removing coarse-grained redundant data offers significant advantages in reducing storage consumption and network bandwidth and in improving system scalability. Conventional methods for deleting duplicate data, such as hash comparison and binary differential incremental techniques, lead to several bottlenecks when processing large scale data. Moreover, the traditional Simhash similarity method gives little consideration to the natural similarity of text in specific fields and cannot efficiently process large scale text data in parallel. This paper examines several of the most important patents in the area of duplicate data detection and then focuses on large scale data deduplication based on MapReduce and HDFS.
Methods: We propose a duplicate data detection approach based on MapReduce and HDFS that uses the Simhash similarity computing algorithm and the Shared Nearest Neighbor (SNN) clustering algorithm, and we describe the corresponding distributed duplicate detection workflow. The key technical advantages of the invention include generating a checksum for each processed record and comparing the generated checksums to detect duplicate records. The approach produces Simhash fingerprints of short texts, clusters the resulting fingerprints with the SNN algorithm, and implements the whole parallel process with the MapReduce programming model (a minimal fingerprinting sketch is given after the keywords).
Results: The experimental results show that the proposed approach obtains MapReduce job schedules with significantly shorter execution times, making it suitable for processing large scale datasets in real applications, and that it achieves better performance and efficiency than conventional methods.
Conclusion: In this patent, we propose a duplicate data detection approach based on MapReduce and HDFS that uses the Simhash similarity computing algorithm and the SNN clustering algorithm. The results show that the new approach, applied on MapReduce, is suitable for document similarity calculation over large scale datasets: it greatly reduces the time overhead, achieves higher precision and recall, and provides a reference for solving similar problems at large scale. The invention also applies to large scale duplicate data detection and offers a practical solution for large scale data processing. In future work, we plan to design and implement a scheduler for MapReduce jobs and a new similarity algorithm focused primarily on large scale duplicate data detection.
Keywords: Large scale data sets, deduplication, MapReduce, HDFS, Simhash, shared nearest neighbor.
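For illustration only, the following Python sketch shows the general idea behind the fingerprinting step described in the Methods: computing a Simhash fingerprint for a short text and comparing two fingerprints by Hamming distance. It is not the patented implementation; the 64-bit fingerprint width, MD5 token hashing, and whitespace tokenization are assumptions, and in the proposed approach this computation would run inside MapReduce tasks with the fingerprints then clustered by SNN.

```python
# Minimal Simhash fingerprinting sketch (illustrative only; not the patented
# implementation). Assumes 64-bit fingerprints and whitespace tokenization.
import hashlib


def simhash64(text):
    """Compute a 64-bit Simhash fingerprint for a short text."""
    weights = [0] * 64
    for token in text.lower().split():
        # Hash each token to 64 bits (MD5 truncated; any stable hash works).
        h = int(hashlib.md5(token.encode("utf-8")).hexdigest()[:16], 16)
        for bit in range(64):
            weights[bit] += 1 if (h >> bit) & 1 else -1
    # Collapse the weighted bit vector back into a single fingerprint.
    return sum(1 << bit for bit in range(64) if weights[bit] > 0)


def hamming_distance(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    f1 = simhash64("large scale duplicate data detection with mapreduce")
    f2 = simhash64("duplicate data detection at large scale using mapreduce")
    # Near-duplicate texts yield fingerprints with a small Hamming distance.
    print(hamming_distance(f1, f2))
```

Because near-duplicate texts map to fingerprints that differ in only a few bits, pairwise Hamming distance gives the similarity signal that the SNN clustering step can then group at scale.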
Graphical Abstract