
Scaling Genomics on Apache Spark by 100x

As the growth of genomic data continues to accelerate, practitioners are turning to big data technologies like Apache Spark to scale their analyses to millions of samples and thousands of CPU cores. However, despite the flexibility and broad capabilities of Spark, many organizations still struggle to make the switch from single-node tools running on premises to massively scalable platforms in the cloud. With the Databricks Unified Analytics Platform for Genomics, we simplify the end-to-end analytics workflow by providing users with a single platform that handles everything from ingestion to visualization.

To ingest data, we’ve built a variant calling pipeline that can process a 30x-coverage whole-genome sample in under 30 minutes and match single-node best-practice pipelines by leveraging Spark SQL to efficiently shard work across the cluster. In addition to common genomics file formats like VCF and BAM, we make the output available in Parquet and Databricks Delta tables to simplify downstream analyses.
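As a rough illustration of that workflow, the sketch below reads pipeline output from VCF into a Spark DataFrame and persists it as a Delta table for downstream queries. It assumes a VCF data source is available (for example, the one provided by the open-source Glow library); the paths and table name are placeholders, not part of the pipeline described in the talk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest-variants").getOrCreate()

# Read called variants into a DataFrame; each row is a variant site with
# an array of per-sample genotypes (assumes a registered VCF data source).
variants = spark.read.format("vcf").load("/mnt/genomics/sample_30x.vcf.gz")

# Persist as a Delta table so downstream analyses can query it with Spark SQL.
(variants.write
    .format("delta")
    .mode("overwrite")
    .save("/mnt/genomics/delta/variants"))

spark.sql(
    "CREATE TABLE IF NOT EXISTS variants USING DELTA "
    "LOCATION '/mnt/genomics/delta/variants'"
)
```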

Proprietary extensions to Spark SQL’s query optimizer improve the performance of common genomic query patterns such as region joins by 100x and enable interactive exploration and visualization. By leveraging Spark and proprietary connectors, users can join genomic data with clinical data such as medical images and electronic medical records to find correlations between genetic variants and high-level outcomes. Together, the components of this platform bring best-practice big data tools and techniques to the genomic setting to accelerate innovation in the health and life sciences.
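A region join can be written in Spark's DataFrame API as an interval-overlap condition, as in the sketch below. The column names (contigName, start, end) and Delta paths are assumptions for illustration; overlap joins like this are the query pattern that the optimizer extensions described above accelerate, since a naive plan compares every variant against every region.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("region-join").getOrCreate()

# Variants produced by the ingestion step and a table of gene regions
# (placeholder Delta paths; gene columns renamed to avoid ambiguity).
variants = spark.read.format("delta").load("/mnt/genomics/delta/variants")
genes = (spark.read.format("delta").load("/mnt/genomics/delta/gene_regions")
         .withColumnRenamed("contigName", "gene_contig")
         .withColumnRenamed("start", "gene_start")
         .withColumnRenamed("end", "gene_end"))

# Region join: keep variant/gene pairs on the same contig whose
# half-open intervals overlap.
annotated = variants.join(
    genes,
    (variants.contigName == genes.gene_contig)
    & (variants.start < genes.gene_end)
    & (genes.gene_start < variants.end),
)
```

The same overlap condition could equally be expressed in SQL against the registered Delta tables; the DataFrame form is shown only because it keeps the example self-contained.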

Session hashtag: #SAISExp3

About Henry Davidge

Henry Davidge is a software engineer at Databricks, where he focuses on building the cluster management infrastructure. Before joining Databricks, he earned a BS in computer science from Yale University.