Insights from Building the Future of Drug Discovery with Apache Spark - Databricks

Human genetics holds the key to understanding the pathogenesis of many devastating diseases, such as type 2 diabetes and Alzheimer’s disease. The discovery, development, and commercialization of new classes of drugs can take 10-15 years and more than $5 billion in R&D investment, only to see fewer than 5% of the drugs make it to market. Committed to creating therapeutic innovations, Regeneron has built one of the world’s most comprehensive genetics databases to supplement our state-of-the-art drug development pipeline. While these massive volumes of data provide an unprecedented opportunity to gain novel therapeutic insights, Regeneron has encountered a number of challenges on the road to delivering on the promises of big data and genomics in drug discovery. For example, how do you enable fast and accurate queries over more than 80 billion data points? And how do you expedite novel statistical tests on terabyte-scale data?

This presentation will share Regeneron’s vision for building a scalable and performant informatics infrastructure to accelerate genetics-driven drug development. Specifically, we highlight key challenges in establishing the world’s largest clinical genetics databases, provide an overview of how Regeneron leverages Databricks’ Unified Analytics Platform and Apache Spark, and discuss in detail key engineering innovations that have already come out of this collaborative effort.

Session hashtag: #EntSAIS14

About Dr. Lukas Habegger

Dr. Lukas Habegger is the Associate Director of Bioinformatics at the Regeneron Genetics Center (RGC), one of the most productive sequencing efforts in the world. Lukas manages the Genome Informatics R&D Team, which develops new algorithms to analyze genomic and clinical data. He is spearheading a project to build out the RGC’s big data infrastructure and create a cutting-edge Apache Spark data analysis platform that integrates clinical and genomic data and provides advanced query and analytical capabilities. He received his undergraduate degrees in Bioinformatics and Statistics from the Rochester Institute of Technology and obtained his Ph.D. in Computational Biology & Bioinformatics from Yale University.