String Similarity Search: A Hash-Based Approach

Hao Wei; Jeffrey Xu Yu; Can Lu

doi:10.1109/TKDE.2017.2756932

String similarity search is a fundamental query that is widely used in DNA sequencing, error-tolerant query auto-completion, and the data cleaning needed in databases, data warehouses, and data mining. In this paper, we study string similarity search based on edit distance, which is supported by many database management systems such as Oracle and PostgreSQL. Given the edit distance ${\mathsf {ed}}(s,t)$ between two strings $s$ and $t$, string similarity search finds every string $t$ in a string database $D$ that is similar to a query string $s$, that is, every $t$ with ${\mathsf {ed}}(s,t) \leq \tau$ for a given threshold $\tau$. In the literature, most existing work takes a filter-and-verify approach, where the filter step reduces the high cost of verifying pairs of strings by using an index built offline for $D$. The two most recent approaches are prefix filtering and local filtering. We consider strings that can be either short or long. Our approach supports long strings, which existing approaches do not support well because of the size of the index they build and the time needed to build it.
We propose two new hash-based labeling techniques, the $\mathsf {OX}$ label and the $\mathsf {XX}$ label, for string similarity search. We assign a hash-label ${\mathsf {H}}_s$ to a string $s$ and, in the filter step, prune dissimilar strings by comparing the two hash-labels ${\mathsf {H}}_s$ and ${\mathsf {H}}_t$ of two strings $s$ and $t$. The key idea is to exploit the dissimilar bit-patterns between two hash-labels. We discuss our hash-based approaches, analyze their pruning power, and give the algorithms. In our experiments, our hash-based approaches achieve high efficiency while keeping index size and index construction time one order of magnitude smaller than those of the existing approaches.
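
The filter-and-verify idea described above can be illustrated with a minimal Python sketch. This is not the $\mathsf {OX}$ or $\mathsf {XX}$ label, whose construction is not given in this abstract. It is a simple stand-in: each string gets a bit-label with one bit set per distinct character, and because a single edit operation changes at most two bits of such a label, two strings whose labels differ in more than $2\tau$ bits cannot satisfy ${\mathsf {ed}}(s,t) \leq \tau$. The names `hash_label`, `may_be_similar`, `build_index`, and `similarity_search`, as well as the 64-bit label width, are assumptions made for illustration.

```python
from typing import List, Tuple

LABEL_BITS = 64  # hypothetical label width, chosen only for this sketch


def edit_distance(s: str, t: str) -> int:
    """Classic dynamic-programming edit distance, O(|s| * |t|) time."""
    prev = list(range(len(t) + 1))
    for i, a in enumerate(s, start=1):
        curr = [i]
        for j, b in enumerate(t, start=1):
            curr.append(min(prev[j] + 1,                # delete a
                            curr[j - 1] + 1,            # insert b
                            prev[j - 1] + (a != b)))    # substitute or match
        prev = curr
    return prev[-1]


def hash_label(s: str, bits: int = LABEL_BITS) -> int:
    """Illustrative label (NOT the paper's): one bit per hashed character."""
    label = 0
    for c in s:
        label |= 1 << (ord(c) % bits)
    return label


def may_be_similar(hs: int, ht: int, len_s: int, len_t: int, tau: int) -> bool:
    """Filter step: return False only when ed(s, t) > tau is guaranteed."""
    if abs(len_s - len_t) > tau:          # length filter
        return False
    # One edit flips at most 2 label bits, so more than 2 * tau
    # differing bits rules the pair out.
    return bin(hs ^ ht).count("1") <= 2 * tau


def build_index(database: List[str]) -> List[Tuple[str, int]]:
    """Offline step: store each string together with its label."""
    return [(t, hash_label(t)) for t in database]


def similarity_search(query: str, index: List[Tuple[str, int]], tau: int) -> List[str]:
    """Filter with labels, then verify survivors with the exact edit distance."""
    hq = hash_label(query)
    return [t for t, ht in index
            if may_be_similar(hq, ht, len(query), len(t), tau)
            and edit_distance(query, t) <= tau]


if __name__ == "__main__":
    index = build_index(["sitting", "kitchen", "mitten", "banana"])
    print(similarity_search("kitten", index, tau=2))  # ['kitchen', 'mitten']
```

The filter never discards a true answer, so the result is exact. Its pruning power is far weaker than what a purpose-built label would offer, and the sketch scans the labels linearly rather than using an index structure.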

Journal ISSN: 1041-4347
