Next-generation sequencing is becoming cheaper and more accessible, and the volume of sequenced data is growing faster than Moore’s Law. However, it is still expensive and slow to go from raw reads to variant calls, and to produce annotated variants that can then be analyzed downstream. In this talk, we will discuss the first state-of-the-art, scalable, and simple DNA sequencing workflow built on top of Apache Spark and the Databricks APIs. The pipeline is simple to set up, is easy to scale out, and can sequence a 30x coverage genome cost-efficiently in the cloud.
We’ll introduce the problem of alignment and variant calling on whole genomes, discuss the challenges of building a simple yet scalable pipeline, and demonstrate our solution. This talk should be of interest to developers who want to build ETL pipelines on top of Apache Spark, as well as biochemists and molecular biologists who want to learn how to develop cheap and fast DNA sequencing pipelines.
Session hashtag: #DevSAIS10
I am the Product Manager for open source efforts at Databricks. My prior experience includes serving as Spark and Data Science Architect at Hortonworks and as Principal Research Scientist at Yahoo, where I focused on large-scale data mining and machine learning for search and display advertising. I am an Apache Spark PMC Member and Committer.
Frank is the Technical Director for the Healthcare and Life Sciences vertical at Databricks. Prior to joining Databricks, Frank was a lead developer on the Big Data Genomics/ADAM and Toil projects at UC Berkeley, and worked at Broadcom Corporation on design automation techniques for industrial-scale wireless communication chips. Frank holds a PhD and a Master of Science in Computer Science from UC Berkeley, and a Bachelor of Science with Honors in Electrical Engineering from Stanford University.