Massive genomics data sets are transforming how pharmaceutical companies like Biogen pinpoint new therapeutic targets and enhance the efficacy of existing treatments. But as Biogen’s portfolio of research programs grew, their infrastructure and analytics capabilities couldn’t keep pace with genomics data sets comprising billions of data points for neurological disorders. Biogen turned to Databricks to move their on-premises data infrastructure into the AWS cloud, drastically reducing data processing time and increasing bandwidth across collaborating teams. With greater scalability and speed, disease biologists can now deepen their understanding of genetic variants, human longevity, and neurological status to develop therapies and treatments for patients around the world.
Biogen uses human genetic evidence to rank their drug portfolio, discover new gene targets, and better understand neurological disease biology. But distilling petabytes of genomics data into clear connections between genotype and phenotype requires data technologies built to scale and adapt — something that legacy solutions are not equipped to handle.
With a huge amount of health and well-being data to process from the UK Biobank’s 500,000 volunteer participants, Biogen faced major informatics challenges. Insufficient storage capacity in their existing data center made collecting and analyzing data at scale impossible. Their network bandwidth couldn’t handle transferring so much information and, in 2018, these issues led to a one-week outage of Biogen’s high-performance compute cluster.
“We really needed a new data paradigm for Biogen,” said David Sexton, senior director of genome technology and informatics at Biogen. “Moving to Databricks and the cloud has helped us visualize and analyze our genomic data at petabyte scale.”
In 2018, Databricks introduced Databricks for Genomics, a runtime specifically targeted at genomic data workflows and a component of the Databricks Unified Data Analytics Platform. It supports the full spectrum of Biogen’s needs, from initial data processing to large-scale statistical analysis. It has also helped their data team move to an architecture where open source technologies accelerate the ingestion and analysis of large data sets.
Collaborating with DNAnexus and Databricks, Biogen migrated their on-premises data infrastructure into the Amazon Web Services (AWS) cloud. These combined efforts simplified operations and reduced average data processing time. And with Delta Lake, Biogen took an annotation pipeline that previously required two weeks to process 700,000 variants and optimized it to annotate 2 million variants in approximately 15 minutes.
“The UK Biobank data set is challenging because of its sheer size and complexity. There are 500,000 participants, we are dealing with millions of variants and data points that we need to understand,” said Sexton. “To build a high-quality data set, we have to process those variants, combine them with health and assessment data, and combine everything into a large corpus of data that scientists can then easily query.”
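The “combine” step Sexton describes can be sketched in miniature: join each participant’s variant calls with their health and assessment records to produce one flat, queryable corpus. This is a hypothetical illustration with toy data, not Biogen’s actual code; at UK Biobank scale this join runs on Spark and Delta Lake, but plain Python keeps the sketch self-contained.

```python
# Hypothetical sketch of combining variant calls with assessment data.
# All participant IDs, genes, and records below are invented examples.

# Variant calls keyed by participant ID
variant_calls = {
    "p001": [{"gene": "APOE", "variant": "rs429358", "genotype": "C/T"}],
    "p002": [{"gene": "LRRK2", "variant": "rs34637584", "genotype": "G/A"}],
}

# Health and assessment data keyed by the same participant ID
assessments = {
    "p001": {"age": 67, "diagnosis": "Alzheimer's"},
    "p002": {"age": 59, "diagnosis": "Parkinson's"},
}

def build_corpus(calls, health):
    """Inner-join variant calls with assessments on participant ID."""
    corpus = []
    for pid, variants in calls.items():
        record = health.get(pid)
        if record is None:
            continue  # drop participants lacking assessment data
        for v in variants:
            corpus.append({"participant": pid, **v, **record})
    return corpus

corpus = build_corpus(variant_calls, assessments)

# Scientists can then query the flat corpus easily, e.g. all APOE carriers:
apoe = [r for r in corpus if r["gene"] == "APOE"]
```

The key design point is the inner join on participant ID: a variant only enters the corpus once it is paired with phenotype data, which is what makes downstream genotype-to-phenotype queries a single lookup rather than a multi-source stitch.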
With the storage and bandwidth they needed now in place, Biogen could focus on data science productivity and targeting new therapies. By combining the DNAnexus platform with Databricks for Genomics, Biogen was able to use UK Biobank data to identify genes containing protein-truncating variants that impact human longevity and neurological status. These discoveries led to the identification of two new drug targets and unearthed insights into neurodegenerative diseases like Alzheimer’s and Parkinson’s.
“What’s really important about this data is that it needs to be high quality and consistent,” said Sexton. “Databricks allows us to stay focused on the science of aligning specific genetic variations with particular diseases — not waste time and bandwidth on cloud optimizations.”
To ensure a highly accurate, queryable database, Biogen needed to partition data heavily by genetic location. With so much metadata spread across thousands of columns, vertical partitioning was critical. So was security: it was essential to protect the integrity of the data as the system was being built and researchers gained access. Migrating to the Databricks environment allowed Biogen to slice complicated data in numerous ways and to integrate the Spark Hive metastore into their platform access control model for hands-on oversight of data security.
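Partitioning by genetic location, as described above, can be sketched by deriving a partition key from each variant’s chromosome (contig) and a fixed-size position bin, so that a query over a genomic region only touches a few partitions. This is an assumed, illustrative scheme (the bin size and key format are inventions for the example); on Databricks the same idea maps to Delta Lake partition columns.

```python
# Illustrative sketch (not Biogen's actual code) of partitioning variant
# records by genetic location: contig plus a 1 Mb position bin.
from collections import defaultdict

BIN_SIZE = 1_000_000  # assumed, tunable bin granularity

def partition_key(contig: str, position: int) -> str:
    """Derive a partition key like 'chr17/bin_0043' from a variant's locus."""
    return f"{contig}/bin_{position // BIN_SIZE:04d}"

def partition_variants(variants):
    """Group variant records into partitions keyed by genetic location."""
    parts = defaultdict(list)
    for v in variants:
        parts[partition_key(v["contig"], v["pos"])].append(v)
    return parts

# Toy variant records; coordinates are examples only
variants = [
    {"contig": "chr17", "pos": 43_044_295, "ref": "G", "alt": "A"},
    {"contig": "chr17", "pos": 43_100_000, "ref": "T", "alt": "C"},
    {"contig": "chr19", "pos": 44_908_684, "ref": "C", "alt": "T"},
]

parts = partition_variants(variants)
# The two chr17 variants share a partition; the chr19 variant gets its own,
# so a region query on chr17 never scans chr19 data.
```

Binning by position keeps partitions bounded in size regardless of how unevenly variants cluster along a chromosome, which is what makes region-scoped queries predictable at petabyte scale.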
“Databricks has enabled us to find a number of variants in about six different genes, all of which have a significant impact on the human lifespan,” said Sexton. “We’ve been able to build out ML models that allow us to understand how genomic variants impact the functionality and the possible success of other drugs we’re developing. With drastically improved data efficiency and discovery, we now have a unique opportunity to better understand the biology of complex diseases and develop targeted therapies to treat them.”