Processing Terabyte-Scale Genomics Datasets with ADAM

Download Slides

The detection and analysis of rare genomic events requires integrative analysis across large cohorts with terabytes to petabytes of genomic data. Contemporary genomic analysis tools have not been designed for this scale of data-intensive computing. This talk presents ADAM, an Apache 2 licensed library built on top of the popular Apache Spark distributed computing framework. ADAM is designed to allow genomic analyses to be seamlessly distributed across large clusters, and presents a clean API for writing parallel genomic analysis algorithms. In this talk, we’ll look at how we’ve used ADAM to achieve a 3.5× improvement in end-to-end variant calling latency and a 66% cost improvement over current toolkits, without sacrificing accuracy. We will talk about a recent recompute effort where we have used ADAM to recall the Simons Genome Diversity Dataset against GRCh38. We will also talk about using ADAM alongside Apache Hbase to interactively explore large variant datasets.

About Frank Austin Nothaft

Frank Austin Nothaft is a PhD candidate in Computer Science at UC Berkeley. Frank holds a Masters of Science in Computer Science from UC Berkeley, and a Bachelors of Science with Honors in Electrical Engineering from Stanford University. Prior to joining UC Berkeley, Frank worked at Broadcom Corporation on design automation techniques for industrial scale wireless communication chips.