The key to solving some of today’s most challenging medical problems lies in the analysis of genomics data. Understanding how minor changes in an individual’s genome affect their overall health is fundamentally a data-driven challenge that requires integrating data across hundreds of thousands of individuals. By analyzing genomes across large cohorts, researchers can build highly accurate models that predict disease risk, and can then interrogate those models to understand the prospects of targeting a single gene for therapeutic development. However, aggregating data from many individuals in order to build these models and run them at population scale entails significant data engineering effort.
Today, we are excited to introduce Glow, an open-source collaboration between the Regeneron Genetics Center® and Databricks. Glow is a toolkit built on Apache Spark™ that makes it easy to aggregate genomic and phenotypic data, with accelerated algorithms for genomic data preparation, statistical analysis, and machine learning at biobank scale. Over the last few years, we have seen researchers struggle when trying to aggregate insights across large genomic cohorts. In the Glow project, we have jointly developed an industrial-quality framework that accelerates the data engineering processes required to build high-quality pipelines for storing and analyzing genomic data at scale. At the same time, Glow provides a bridge out of niche bioinformatics toolkits into modern data analytics environments, where machine learning can be fully leveraged on multifaceted population-scale health datasets that include genomic data and a rapidly expanding universe of -omics and phenotypes.
In the last few years, our organizations have worked with a wide variety of projects and collaborators, and we found areas where we could be more effective and efficient when working with large genomic variation datasets, such as the UK Biobank cohort. These areas for improvement include:
As genomic datasets grow larger, these problems become more challenging. While single-node command-line tools may have been sufficient to preprocess and conduct quality control on cohorts of hundreds of samples, they are far too slow and cumbersome to use when merging hundreds of thousands of samples. Traditional GWAS tools may have been sufficient when studying a single phenotype, but their throughput becomes too low when working with high-dimensional phenotype data or running PheWAS. While several tools aim to solve these problems today, they have complex and proprietary APIs that make them both hard to learn and difficult to use alongside phenotypic data culled from an electronic medical record system or generated by transcriptomic or imaging studies.
In Glow, we aspire to solve these problems by building an easy-to-learn and easy-to-use genomics library that builds on top of the widely used Apache Spark open-source project, and is natively optimized to benefit from the scale of cloud computing. We approach the problem with the following three guidelines:
Our approach abstracts away complexity, leading to a framework that is powerful but lightweight. Most functions in Glow are implemented directly in Spark SQL and can be called with a single line of code. Native integration with Spark SQL also provides a unified set of APIs for working with both genomic and phenotype data, and allows users to flow directly between traditional genomic data processing and machine learning. Ultimately, a complex GWAS analysis can be reduced to tens of lines of code and run in minutes.
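To make concrete the kind of computation a GWAS repeats, here is a minimal pure-Python sketch of a per-variant linear regression scan. It is not Glow's API and does not use Spark: the function names `linear_regression_test` and `association_scan`, the variant and trait identifiers, and the toy data are all hypothetical, chosen only for illustration. The sketch assumes genotypes are encoded as alternate-allele counts (0, 1, or 2) and ignores covariates.

```python
from math import sqrt


def linear_regression_test(genotypes, phenotype):
    """Fit phenotype = alpha + beta * genotype by ordinary least squares.

    Returns (beta, standard_error, t_statistic) for the genotype effect.
    """
    n = len(genotypes)
    if n != len(phenotype) or n < 3:
        raise ValueError("need at least 3 samples, with matching lengths")

    mean_g = sum(genotypes) / n
    mean_p = sum(phenotype) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        raise ValueError("variant is monomorphic; no effect can be estimated")
    sxy = sum((g - mean_g) * (p - mean_p) for g, p in zip(genotypes, phenotype))

    beta = sxy / sxx
    alpha = mean_p - beta * mean_g
    # Residual sum of squares around the fitted line.
    rss = sum((p - (alpha + beta * g)) ** 2 for g, p in zip(genotypes, phenotype))
    std_err = sqrt(rss / (n - 2) / sxx)
    t_stat = beta / std_err if std_err > 0 else float("inf")
    return beta, std_err, t_stat


def association_scan(variants, phenotypes):
    """Test every (variant, trait) pair independently.

    Each test depends only on its own inputs, which is why this workload
    distributes naturally across a cluster.
    """
    results = []
    for variant_id, genotypes in variants.items():
        for trait, values in phenotypes.items():
            beta, std_err, t_stat = linear_regression_test(genotypes, values)
            results.append((variant_id, trait, beta, std_err, t_stat))
    return results


# Hypothetical toy cohort: 6 samples, 2 variants, 2 traits.
toy_variants = {
    "variant_1": [0, 1, 2, 0, 1, 2],
    "variant_2": [0, 0, 1, 1, 2, 2],
}
toy_phenotypes = {
    "trait_a": [1.0, 2.1, 2.9, 1.2, 1.9, 3.1],
    "trait_b": [0.5, 0.4, 0.6, 0.5, 0.7, 0.3],
}

if __name__ == "__main__":
    for row in association_scan(toy_variants, toy_phenotypes):
        print(row)
```

The number of tests is the number of variants multiplied by the number of traits, so the work grows quickly with high-dimensional phenotype data. Because each test is independent, the nested loop is the part that a distributed engine can spread across many machines.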
We are excited to release Glow into the wild, and we hope that you are excited by our vision too! Glow is an open-source project hosted on GitHub under the Apache 2.0 license. You can get started by reading our project docs, or fork the repository to start contributing code today. We hope to grow Glow into a project where diverse researchers with varied interests and skills, all working on large-scale genomics, can come together to collaborate on new architectures and methods.