On the speedup of single-disk failure recovery in XOR-coded storage systems: Theory and practice

Yunfeng Zhu; Patrick P. C. Lee; Yuchong Hu; Liping Xiang; Yinlong Xu

doi:10.1109/MSST.2012.6232371

On the speedup of single-disk failure recovery in XOR-coded storage systems: Theory and practice

Zhu, Yunfeng, Lee, Patrick P. C., Hu, Yuchong, Xiang, Liping, Xu, Yinlong

Source

12 IEEE 28th Symposium on Mass Storage Systems and Technologies (MSST) > 1 - 12

Abstract

Modern storage systems stripe redundant data across multiple disks to provide availability guarantees against disk failures. One form of data redundancy is based on XOR-based erasure codes, which use only XOR operations for encoding and decoding. In addition to providing failure tolerance, a storage system must also provide fast failure recovery to avoid data unavailability. We consider the problem of speeding up the recovery of a single-disk failure for arbitrary XOR-based erasure codes. We address this problem from both theoretical and practical perspectives. We propose a replace recovery algorithm, which uses a hill-climbing technique to search for a fast recovery solution, such that the solution search can be completed within a short time period. We further implement our replace recovery algorithm atop a parallelized architecture to justify its practicality. We experiment our replace recovery algorithm and its parallelized implementation on a networked storage system testbed, and demonstrate that our replace recovery algorithm uses less recovery time than the conventional approach.