A highly parallel next-generation DNA sequencing data analysis pipeline in Hadoop

Kareem S. Aggour; Vijay S. Kumar; Dipen P. Sangurdekar; Lee A. Newberg; Chinnappa D. Kodira

doi:10.1109/BIBM.2015.7359781

Source

2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 756-763

Abstract

The era of precision medicine is best exemplified by the growing reliance on next-generation sequencing (NGS) technologies to provide improved disease diagnosis and targeted therapeutic selection. Well-established NGS data analysis software tools, in their unmodified form, can take days to identify and interpret single nucleotide and structural variations in DNA for a single patient. To improve sample analysis throughput, we developed a highly parallel end-to-end next-generation DNA sequencing data analysis pipeline in Hadoop. In our pipeline, each step is parallelized not only across samples but also within each individual sample, achieving a 30× speedup over a single-server workflow execution. Furthermore, we extensively evaluate the viability of including our Hadoop-based pipeline in a larger commercial genomic services offering: we demonstrate that the pipeline's execution time scales sub-linearly both with the number of samples being analyzed and with the depth of coverage of those samples. In particular, on our commodity cluster, 10× as many samples resulted in only a 2.24× increase in execution time, and a 4× increase in coverage depth resulted in only a 2.53× increase in execution time. We anticipate that such improvements will allow large cohort populations to be analyzed in parallel, and that they can fundamentally change the way DNA sequencing analyses are used by both researchers and clinicians.
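
To make the sub-linear scaling figures above concrete, the following minimal sketch derives two quantities from the reported factors: the effective throughput gain (load factor divided by time factor) and an implied scaling exponent. The exponent assumes a simple power-law model, time ∝ load^k, which is an illustrative assumption and not a model stated in the abstract; the function name `scaling_summary` is likewise hypothetical.

```python
import math


def scaling_summary(load_factor, time_factor):
    """Summarize one scaling observation.

    load_factor: how much the workload grew (e.g. 10 for 10x as many samples).
    time_factor: how much the execution time grew (e.g. 2.24).

    Returns (throughput_gain, exponent), where exponent is k in the ASSUMED
    power-law model time ~ load**k. A value of k below 1 means sub-linear scaling.
    """
    throughput_gain = load_factor / time_factor
    exponent = math.log(time_factor) / math.log(load_factor)
    return throughput_gain, exponent


# Factors reported in the abstract.
samples_gain, samples_k = scaling_summary(10.0, 2.24)   # 10x samples -> 2.24x time
coverage_gain, coverage_k = scaling_summary(4.0, 2.53)  # 4x coverage -> 2.53x time

print(f"samples:  throughput gain {samples_gain:.2f}x, exponent {samples_k:.2f}")
print(f"coverage: throughput gain {coverage_gain:.2f}x, exponent {coverage_k:.2f}")
```

Under this assumed model, the reported factors correspond to roughly a 4.5× gain in samples processed per unit time (exponent about 0.35) and roughly a 1.6× gain per unit of coverage depth (exponent about 0.67). Both exponents are below 1, consistent with the sub-linear scaling the abstract describes. Because each exponent is computed from a single pair of data points, it should be read as a rough summary and not as a fitted scaling law.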