Resemblance and mergence based indexing for high performance data deduplication

Panfeng Zhang; Ping Huang; Xubin He; Hua Wang; Ke Zhou

doi:10.1016/j.jss.2017.02.039

Source

Journal of Systems and Software, 2017, vol. 128 (C), pp. 11-24

Abstract

Data deduplication, a data redundancy elimination technique, has been widely employed in many application environments to reduce data storage space. However, it is challenging to provide a fast and scalable key-value fingerprint index, particularly for large datasets, even though index performance is critical to overall deduplication performance. This paper proposes RMD, a resemblance and mergence based deduplication scheme, which aims to provide quick responses to fingerprint queries. The key idea of RMD is to leverage a Bloom filter array and a data resemblance algorithm to dramatically reduce the query range. At ingestion time, RMD uses the resemblance algorithm to detect similar data segments and places them in the same bin. As a result, at query time, it only needs to search the corresponding bin to detect duplicate content, which significantly speeds up the query process. Moreover, RMD uses a mergence strategy to accumulate similar segments in the relevant bins, and applies a frequency-based fingerprint retention policy to cap bin capacity, improving both query throughput and the deduplication ratio. Extensive experiments with real-world datasets show that RMD achieves high query performance and outperforms several well-known deduplication schemes.
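
The abstract does not spell out RMD's algorithms, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' implementation. It assumes that a segment is a list of chunks fingerprinted with SHA-1, that resemblance is approximated by the k smallest fingerprints of a segment, that each bin carries one Bloom filter over those representatives, and that the retention policy keeps the most frequently seen fingerprints. All class names, method names, and parameters (BloomFilter, Bin, ResemblanceIndex, bin_capacity, num_representatives) are hypothetical.

```python
import hashlib
from collections import Counter


class BloomFilter:
    """Small Bloom filter over byte strings; the sizes are illustrative."""

    def __init__(self, num_bits=1 << 16, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)

    def _positions(self, item):
        # Derive num_hashes bit positions from salted SHA-1 digests.
        for salt in range(self.num_hashes):
            digest = hashlib.sha1(bytes([salt]) + item).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


class Bin:
    """Fingerprints of resembling segments, capped by hit frequency."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.hits = Counter()          # fingerprint -> times seen
        self.summary = BloomFilter()   # representatives merged so far

    def merge(self, fingerprints, representatives):
        """Merge one segment; return how many chunks were already stored."""
        duplicates = 0
        for fp in fingerprints:
            if fp in self.hits:
                duplicates += 1
            self.hits[fp] += 1
        for rep in representatives:
            self.summary.add(rep)
        if len(self.hits) > self.capacity:
            # Assumed retention policy: keep the most frequently seen entries.
            self.hits = Counter(dict(self.hits.most_common(self.capacity)))
        return duplicates


class ResemblanceIndex:
    """Routes each segment to one bin via an array of per-bin Bloom filters."""

    def __init__(self, bin_capacity=1024, num_representatives=4):
        self.bin_capacity = bin_capacity
        self.num_representatives = num_representatives
        self.bins = []

    def _route(self, representatives):
        """Return the bin whose filter matches most representatives, or None."""
        best_id, best_score = None, 0
        for bin_id, candidate in enumerate(self.bins):
            score = sum(rep in candidate.summary for rep in representatives)
            if score > best_score:
                best_id, best_score = bin_id, score
        return best_id

    def ingest(self, chunks):
        """Index one segment (a list of bytes); return (bin_id, duplicates)."""
        fingerprints = [hashlib.sha1(chunk).digest() for chunk in chunks]
        # Assumed resemblance test: the k smallest fingerprints represent
        # the segment, so segments sharing most chunks share representatives.
        representatives = sorted(set(fingerprints))[: self.num_representatives]
        bin_id = self._route(representatives)
        if bin_id is None:
            self.bins.append(Bin(self.bin_capacity))
            bin_id = len(self.bins) - 1
        duplicates = self.bins[bin_id].merge(fingerprints, representatives)
        return bin_id, duplicates
```

Calling ResemblanceIndex().ingest(chunks) on two segments that share most of their chunks sends the second segment to the first segment's bin, so duplicate detection only touches that one bin rather than a global index. In this sketch the per-bin Bloom filter is never pruned when fingerprints are evicted, which keeps routing simple at the cost of occasionally routing to a bin that no longer holds the matching fingerprints.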

Identifiers

Journal ISSN: 0164-1212
DOI: 10.1016/j.jss.2017.02.039

Authors

Panfeng Zhang

School of Computer, Huazhong University of Science and Technology, Wuhan, China
Wuhan National Laboratory for Optoelectronics, Wuhan, China

Ping Huang

Department of Computer and Information Sciences, Temple University, USA

Xubin He

Department of Computer and Information Sciences, Temple University, USA
Department of Electrical and Computer Engineering, Virginia Commonwealth University, USA

Hua Wang

School of Computer, Huazhong University of Science and Technology, Wuhan, China
Wuhan National Laboratory for Optoelectronics, Wuhan, China

Keywords

Fast index; Deduplication; Resemblance mergence; Fingerprint retrieval; Key value index

Additional information

Publication languages: English

Data set: Elsevier

Publisher

Elsevier Science
