Genomics data has exploded in recent years, especially as some datasets, such as the UK Biobank, become freely available to researchers anywhere. Genomics data is leveraged for high-impact use cases: gene discovery, research and development prioritization, and the conduct of randomized controlled trials. These use cases will help in developing the next generation of therapeutics.
The catch: deriving insights from this data requires data teams to scale their analytics. And scaling requires data scientists and engineers with deep technical skill sets. That’s why we’re excited to announce the release of Glow version 1.0.0, an open source library that solves key challenges of applying distributed computing to genomics data in the cloud.
As genetic data has grown, processing, storing and analyzing it has become a major bottleneck. Challenges include:
Glow is an open-source toolkit for working with genomic data at population-level scale. The toolkit is natively built on Apache Spark™, a unified analytics engine for large-scale data processing and machine learning.
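As a rough illustration of what "natively built on Apache Spark" means in practice, the sketch below loads a VCF file into a Spark DataFrame through Glow's Python package. It assumes `glow.py` and PySpark are installed and that a Spark session already exists (as it does in a Databricks notebook). The helper name `load_variants` and the file path in the usage note are hypothetical, chosen only for this example.

```python
def load_variants(spark, vcf_path):
    """Return a Spark DataFrame of variants read from a VCF file.

    `load_variants` is a hypothetical helper for illustration; `spark` is an
    existing SparkSession and `vcf_path` is any path Spark can read.
    """
    # Imported here so the helper can be defined even where Glow is absent.
    import glow

    # Register Glow's functions with the session; use the returned session.
    session = glow.register(spark)

    # "vcf" is the data source name Glow provides for VCF files.
    return session.read.format("vcf").load(vcf_path)
```

In a notebook this might be called as `df = load_variants(spark, "/path/to/cohort.vcf")`, after which `df.printSchema()` shows the columns Glow produced. From that point the variants are an ordinary DataFrame, so the usual Spark operations for filtering, joining and aggregating apply.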
Figure 1. The Glow library can be run on Databricks on any of the three major clouds. Starter notebooks can be found in the documentation.
Figure 2. Glow’s whole genome regression (GloWGR) is orders of magnitude more scalable than existing methods.
We have collaborated with the Regeneron Genetics Center to solve key scaling challenges in genomics through project Glow. Bioinformaticians, computational biologists, statistical geneticists and research scientists can work together on the Databricks analytics platform, on any cloud, to scale their genomics data analytics and downstream machine learning applications. The first application of Apache Spark™ and Delta Lake to genomics has been population genetic association studies, and we are now seeing new use cases emerge for cancer and childhood developmental disorders.
Try out Glow version 1.0.0 on Databricks or learn more at projectglow.io.