The key to solving some of today’s most challenging medical problems lies in the analysis of genomics data. Understanding how minor changes in an individual’s genome affect their overall health is fundamentally a data-driven challenge that requires integration across hundreds of thousands of individuals. By analyzing genomes across large cohorts, researchers can build highly accurate models that predict disease risk, which can then be interrogated to understand the prospects of targeting a single gene for therapeutic development. However, aggregating data from many individuals to build these models and run them at population scale entails significant data engineering effort.
Today, we are excited to introduce Glow, an open-source collaboration between the Regeneron Genetics Center® and Databricks. Glow is a toolkit built on Apache Spark™ that makes it easy to aggregate genomic and phenotypic data, with accelerated algorithms for genomic data preparation, statistical analysis, and machine learning at biobank scale. Over the last few years, we have seen researchers struggle to aggregate insights across large genomic cohorts. With Glow, we have jointly developed an industrial-quality framework that accelerates the data engineering required to build high-quality pipelines for storing and analyzing genomic data at scale. At the same time, Glow provides a bridge out of niche bioinformatics toolkits into modern data analytics environments, where machine learning can be fully leveraged on multifaceted, population-scale health datasets that span genomic data and a rapidly expanding universe of other -omics and phenotypes.
Problems with analyzing large genomic datasets
In the last few years, our organizations have worked on a wide variety of projects with many collaborators, and we have found areas where we could be more effective and efficient when working with large genomic variation datasets, such as the UK Biobank cohort. These areas for improvement include:
- Lack of scalable workflows: As incredibly valuable as the legacy bioinformatics tools for genomics have been (e.g. GATK, PLINK, BCFtools, tabix, Picard, SAIGE, BOLT-LMM, VEP, SnpEff), they were primarily designed to run on single-node machines and do not scale to population-wide analyses. Teams spend long hours splitting up datasets to parallelize workflows in hopes of improving processing speeds, and on top of that they typically create a large number of interconnected jobs to run these complex workflows. Not only is this time-consuming, it is also hard to manage: a single job failure can bring down an entire pipeline, causing hours or days of lost work.
- Rigid tools that are hard to use: Traditional bioinformatics tools have a steep learning curve and little flexibility, making them hard to adopt and use. They are typically rigid command-line tools that require users to learn a proprietary query language and API structure. Although there have been efforts to create tools for massive datasets, these require specialized APIs and file formats, and they support neither user-defined functions nor integration with common phenotypic data sources such as an EHR system or imaging study, preventing teams from tailoring their genome analysis workflows to their unique needs.
- Limited support for tertiary analytics and ML: Genomic analysis tools rely on file formats that lack explicit data schemas and were designed for a limited set of genomic analyses. Integrating novel genomic methods and machine learning is not an option, preventing teams from building powerful predictive models. For example, applying ML across genotypes from many samples for use cases like polygenic risk scoring is difficult because existing genome-wide association study (GWAS) tools do not integrate efficiently with large-scale ML frameworks. As a result, scientists typically pre-filter variants before building a risk model, reducing the quality of the model, especially in the presence of high-impact rare variants.
As genomic datasets grow larger and larger, these problems become more challenging. While single-node command-line tools may have been sufficient to preprocess and run quality control on cohorts of hundreds of samples, they are far too slow and cumbersome when merging hundreds of thousands of samples. Traditional GWAS tools may have been sufficient when studying a single phenotype, but their throughput becomes too low when working with high-dimensional phenotype data or running phenome-wide association studies (PheWAS). While several tools aim to solve these problems today, they have complex, proprietary APIs that make them both hard to learn and difficult to use alongside phenotypic data culled from an electronic medical record system or generated by transcriptomic or imaging studies.
Glow integrates bioinformatics tools with best-of-breed big data processing engines
With Glow, we aspire to solve these problems with an easy-to-learn, easy-to-use genomics library that builds on top of the widely used Apache Spark open-source project and is natively optimized to benefit from the scale of cloud computing. We approach the problem with the following three guidelines:
- Build for scale on industry-trusted tooling: Glow is developed on top of leading open-source technologies for distributed computing in the cloud, including Apache Spark SQL and the high-performance Delta Lake storage layer. These tools transparently manage, cache, and process large volumes of data, making it possible both to query petabytes of genomic data in near real time and to run thousands of data processing tasks with high reliability and scalability.
- Simplify use with prebuilt genomic analyses and integration with common tools: We provide built-in, single-line commands in Python, R, Scala, and SQL for common genomic analyses (quality control functions, variant normalization, GWAS, and more) that make it easy to get your workflows running in no time. We work with common array (BGEN) and sequencing (VCF) file formats, and provide a bridge to run existing command-line tools in parallel through Glow (see the sketch after this list). This eliminates time spent slicing and dicing your genotype data and lets you focus on doing science.
- Empower downstream workflows with open-source integrations: Glow lets you take advantage of machine learning through native integrations with popular open-source ML technologies (e.g. TensorFlow through Horovod, pandas, and scikit-learn) and with tracking frameworks like MLflow that enable reproducible analyses. Glow is built with open-source APIs, uses the open and widely adopted Delta Lake format, and has clear project documentation and source code.
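To make the ingest-and-prepare pattern above concrete, here is a minimal sketch of reading a cohort VCF into a Spark DataFrame, normalizing variants, computing per-variant quality control statistics, and persisting the result to Delta Lake. The file paths and reference genome are hypothetical, and exact transformer option and function names can vary slightly between Glow releases, so treat this as an illustration rather than a canonical recipe.

```python
import glow
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
glow.register(spark)  # register Glow's data sources and SQL functions

# Read a multi-sample VCF into a Spark DataFrame with an explicit schema
# (BGEN files work the same way with format("bgen")). Path is hypothetical.
vcf = spark.read.format("vcf").load("/data/cohort/*.vcf.gz")

# Normalize variants against a reference genome with a built-in transformer.
normalized = glow.transform(
    "normalize_variants",
    vcf,
    reference_genome_path="/data/reference/GRCh38.fa",
)

# Single-line, per-variant quality control summaries computed in Spark SQL.
qc = normalized.withColumn("qc_stats", expr("call_summary_stats(genotypes)"))

# Persist the prepared cohort to Delta Lake for fast downstream queries.
qc.write.format("delta").mode("overwrite").save("/delta/cohort_variants")
```

Because the output is an ordinary Delta table, the same data can be queried with SQL or picked up directly by the GWAS and machine learning steps described below, with no intermediate file conversion.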
Our approach abstracts away complexity, leading to a framework that is powerful but lightweight. Most functions in Glow are implemented directly in Spark SQL and can be called with a single line of code. Native integration with Spark SQL also provides a unified set of APIs for working with both genomic and phenotypic data, and allows users to flow directly between traditional genomic data processing and machine learning. Ultimately, a complex GWAS analysis can be simplified down to tens of lines of code and run in minutes, as sketched below.
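As one way to picture the "tens of lines" claim, here is a hedged sketch of a minimal GWAS built from Glow's SQL functions. The paths, sample values, and covariates are hypothetical placeholders, and the argument order of linear_regression_gwas should be confirmed against the Glow documentation for the release you use.

```python
import glow
from pyspark.ml.linalg import DenseMatrix
from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.getOrCreate()
glow.register(spark)

# Prepared genotypes, e.g. the Delta table written in the previous sketch.
genotypes = spark.read.format("delta").load("/delta/cohort_variants")

# One row holding the per-sample phenotype vector and covariate matrix
# (here an intercept column plus age), in the same sample order as the
# genotype arrays -- an ordering this sketch assumes but does not verify.
n_samples, n_covariates = 3, 2
pheno_cov = spark.createDataFrame([Row(
    phenotypes=[1.2, 0.7, 2.3],
    covariates=DenseMatrix(n_samples, n_covariates,
                           [1.0, 1.0, 1.0, 54.0, 61.0, 47.0]),
)])

# Run a per-variant linear-regression association test with Glow's SQL
# functions; linear_regression_gwas argument order is an assumption here.
results = (
    genotypes.crossJoin(pheno_cov)
    .withColumn("gt", expr("genotype_states(genotypes)"))
    .withColumn("stats",
                expr("linear_regression_gwas(gt, phenotypes, covariates)"))
    .select("contigName", "start", "names", "stats.*")
)

results.write.format("delta").mode("overwrite").save("/delta/gwas_results")
```

Because the results are a plain Spark DataFrame, the same pattern flows straight into pandas, scikit-learn, or an MLflow-tracked training run without an export step.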
Join us and try Glow!
We are excited to release Glow into the wild, and we hope that you are excited by our vision too! Glow is an open-source project hosted on GitHub under the Apache 2.0 license. You can get started by reading our project docs, or fork the repository and start contributing code today. Our hope is to grow Glow into a project where researchers with diverse interests and skills, working across large-scale genomics, can come together to collaborate on new architectures and methods.