Although considerable effort has been devoted to duplicate document detection (DDD) and its applications, little research has addressed the optimization of its time-consuming functions. An experimental analysis conducted on one million grant proposal documents from nsfc.gov.cn shows that even with clustering and sampling methods, DDD remains quite slow. By profiling our system with the Intel VTune Performance Analyzer, we find that shingle comparison is the most time-consuming part of the system, accounting for 58% of CPU time. Based on an analysis of the whole algorithm and the data statistics, we propose and implement an optimized shingle comparison algorithm using Intel SIMD technology. Experiments demonstrate that the proposed optimization brings an 11.6%-38.5% performance gain across various instruction sets and parameter settings. Further performance gains could be achieved by trading accuracy for speed.