Regeneron customer story: Discovering new treatments with AI

Regeneron accelerates drug discovery with genomic insights at population scale

INDUSTRY: Life sciences

SOLUTION: Genomic sequencing

TECHNICAL USE CASE: Data ingest and ETL, machine learning

Regeneron’s mission is to tap into the power of genomic data to bring new medicines to patients in need. Yet transforming this data into life-changing discoveries and targeted treatments has never been more challenging. Held back by poor processing performance and scalability limitations, their data teams lacked what they needed to analyze petabytes of genomic and clinical data. Databricks now empowers them to quickly analyze entire genomic data sets and accelerate the discovery of new therapeutics.

Decentralized Genomic Data Blocks Machine Learning

More than 95% of all experimental medicines that are currently in the drug development pipeline are expected to fail. To improve these efforts, the Regeneron Genetics Center built one of the most comprehensive genetics databases by pairing the sequenced exomes and electronic health records of more than 400,000 people. However, they faced numerous challenges analyzing this massive set of data:

  • Genomic and clinical data is highly decentralized, making it very difficult to analyze their entire 10TB dataset and train models against it.
  • Their legacy architecture was difficult and costly to scale to support analytics on over 80 billion data points.
  • Data teams were spending days just trying to ETL the data so that it could be used for analytics.

Databricks Simplified Infrastructure and ML at Scale

Databricks provides Regeneron with a Unified Data Analytics Platform running on Amazon Web Services that simplifies operations and accelerates drug discovery through improved data science productivity. This is empowering them to analyze the data in new ways that were previously impossible.

  • Automated cluster management: simplifies the provisioning of clusters, reducing time spent on DevOps work so engineers and data scientists can spend more time on high-value tasks.
  • Interactive workspaces: allow data scientists to share data and insights, fostering an environment of transparency and collaboration across the entire drug development lifecycle.
  • Performant Spark-powered pipelines: significantly improved the reliability and speed of the ETL pipelines used to process their 10TB of EHR + DNAseq data.

Faster Discovery of New Drugs and Therapies

With Databricks, the team at Regeneron no longer needs to spend excessive resources on DevOps work, setting up and maintaining infrastructure to support their analytics. Today, bioinformatics teams, data scientists and computational biologists can spend more time on higher-value tasks such as developing novel treatments.

  • Accelerated drug target identification: reduced the time it takes data scientists and computational biologists to run queries on their entire dataset from 30 minutes down to 3 seconds, a 600x improvement.
  • Increased productivity: improved collaboration, automated DevOps and accelerated pipelines (ETL in 2 days vs 3 weeks) have enabled their teams to support a broader range of studies.
  • 600x improvement in query runtime on entire data sets
  • 10x faster data pipeline, enabling the team to support more studies
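The headline multipliers follow directly from the before-and-after times quoted in the bullets above. A minimal Python sketch of that arithmetic, assuming "3 weeks" means 21 calendar days (the story does not say whether calendar or working days are meant) and using a hypothetical helper named speedup:

```python
# Figures come from the story; the unit conversions below are assumptions.
SECONDS_PER_MINUTE = 60
DAYS_PER_WEEK = 7  # assumption: calendar days, not working days


def speedup(before, after):
    """Return how many times shorter `after` is than `before` (same units)."""
    return before / after


# Query runtime: 30 minutes before, 3 seconds after.
query_speedup = speedup(30 * SECONDS_PER_MINUTE, 3)

# ETL pipeline: 3 weeks before, 2 days after.
etl_speedup = speedup(3 * DAYS_PER_WEEK, 2)

print(f"Query speedup: {query_speedup:.0f}x")
print(f"ETL speedup: {etl_speedup:.1f}x")
```

Under these assumptions the query figure works out to exactly 600x and the ETL figure to 10.5x, which is consistent with the rounded "10x" headline.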

“The Databricks Unified Analytics Platform is enabling everyone in our integrated drug development process – from physician-scientists to computational biologists – to easily access, analyze, and extract insights from all of our data.”

– Jeffrey Reid, PhD, Head of Genome Informatics at Regeneron
