Modern genome sequencing projects capture hundreds of gigabytes of data per individual. In this talk, we discuss recent work in which we used the Spark-based ADAM tool to recompute genomic variants from 70 TB of reads from the Simons Genome Diversity dataset. ADAM provides a drop-in, Spark-based replacement for conventional genomics pipelines like the GATK. We ran this computation across hundreds of nodes on Amazon EC2 using Toil, a novel cluster orchestration tool. Toil automatically scaled the number of nodes in use and seamlessly ran large single-node jobs and Spark clusters in a single workflow. By combining ADAM and Toil, we were able to improve end-to-end pipeline runtime while taking advantage of the EC2 Spot Instances market. Additionally, Toil is designed for scientific reproducibility, and we ran our entire workflow in Docker containers so that a static set of binaries can be used to reproduce the pipeline at a later date. ADAM and Toil are both freely available, Apache 2-licensed tools.
Frank is the Technical Director for the Healthcare and Life Sciences vertical at Databricks. Prior to joining Databricks, Frank was a lead developer on the Big Data Genomics/ADAM and Toil projects at UC Berkeley, and worked at Broadcom Corporation on design automation techniques for industrial-scale wireless communication chips. Frank holds a PhD and a Master of Science in Computer Science from UC Berkeley, and a Bachelor of Science with Honors in Electrical Engineering from Stanford University.