Frank is the Technical Director for the Healthcare and Life Sciences vertical at Databricks. Prior to joining Databricks, Frank was a lead developer on the Big Data Genomics/ADAM and Toil projects at UC Berkeley, and worked at Broadcom Corporation on design automation techniques for industrial scale wireless communication chips. Frank holds a PhD and Masters of Science in Computer Science from UC Berkeley, and a Bachelor’s of Science with Honors in Electrical Engineering from Stanford University.
Large scale genomics datasets like the UK Biobank are revolutionizing how pharmaceutical companies identify targets for therapeutic development. However, turning petabytes of genomics data into actionable links between genotype and phenotype is out of reach for companies using legacy genomic data technologies. In this talk, Biogen will describe how they collaborated with DNAnexus and Databricks to move their on-premises data infrastructure into the AWS cloud. By combining the DNAnexus platform with the Databricks Genomics Runtime, Biogen was able to use the UK Biobank dataset to identify genes containing protein-truncating variants that impact human longevity and neurological status.
- Good evening to those of you who are joining the session. My name is David Sexton. I'm the head of Genome Technology and Informatics at Biogen.
I am joined today by John Ellithorpe, who is the Chief Product Officer at DNAnexus, and Frank Nothaft, who is the Technical Director of Healthcare and Life Sciences at Databricks.
So the agenda for today: I will be speaking about what the UK Biobank actually is and why Biogen was not prepared for data at this scale. John will be speaking about large-scale analysis of the UK Biobank data using DNAnexus, and how DNAnexus helped us scale up the analysis of this data. And Frank Nothaft will be speaking about combining best-of-breed cloud architectures to accelerate UK Biobank analysis, including how Databricks has helped Biogen scale out the solution.
So what is the UK Biobank, and how has Biogen used it to discover new therapeutics?
So the UK Biobank is the premier data set in the world right now for associating disease with genetics. It is a long-term study of genetic predisposition and environmental exposure in the development of disease. The data has been collected from participants aged 40 to 69, spanning over 30 years of phenotypic data, and these participants are being followed over the long term, measuring their health across that 30-year span. There are 22 centers across the United Kingdom with over 500,000 volunteers, making this one of the largest and most detailed population studies ever undertaken. And you can see that a large number of papers have come out of the UK Biobank. So what genetic data is collected by the UK Biobank? In 2017, eight partner companies came together to form a life sciences genetics consortium. As part of this consortium, we will be sequencing the exomes of 500,000 of the UK Biobank participants. The exome is the protein-coding region of the genome, and as part of the consortium, Biogen will be probing that genomic region across all protein-coding genes in the 500,000 participants. Two participants in the UK Biobank consortium, Regeneron and GSK, sequenced the first 50,000 samples, and all 500,000 participants are to be sequenced in 2020. Industry partners will have exclusivity until 2021, and we currently have 300,000 exomes in-house.
So how is Biogen leveraging the UK Biobank data?
We are using human genetic evidence to rank our drug portfolio. We are using the data to find new gene targets, and we're using it to understand neurological disease biology.
So Biogen had some informatics challenges in using this data.
Our data infrastructure did not have enough storage capacity in our data center. The UK Biobank data will be approximately one petabyte, and we did not have that storage available. We also did not have enough network bandwidth to transfer all of this data to our data center. And when I came into Biogen in 2018, we had just had a one-week outage of our high-performance compute cluster.
So we really needed a new data paradigm for Biogen, and this is where DNAnexus and Databricks have helped us. We needed to scale our infrastructure to deal with petabyte data set sizes, we needed to store and visualize our genomic data, we needed to analyze this data at scale, and we were required by Biogen IT to be cloud-first in storing and using our data.
And so now I will turn it over to John to talk about scaling the UK Biobank data using DNAnexus's Titan and Apollo products. - Thank you, David. So when we look at the UK Biobank and what Biogen needed to do, it helps to look at the steps that are required in processing a large genomic data set.
We really break this up into two different pieces. The first is an upfront piece, which is preparing the high-quality data set. The data coming off a sequencing machine is raw reads, and from those alone you don't get an understanding of exactly what the variants are. So there is a processing step that needs to happen across the entire 500,000-exome data set to build that high-quality genomic data set. The second piece is combining that with the health and assessment data, such as demographic data and other types of data that get brought in as well. That all needs to get combined into a large corpus of data that scientists can actually use by querying the data, asking questions of the data, and doing statistical analysis to get to the final result. When we look at the UK Biobank data, that data set is challenging because of its sheer size and complexity. If you look on the right, you see on the genomic side that we have 500,000 participants in the data set. Each participant can have up to millions of variants being tracked, which gives you trillions of data points needed to understand what that looks like purely on the genomic side. On the so-called phenotypic side, which might be demographic information, health information, or clinical information that comes from medical records, that is a wide data set. There are over 3,500 different phenotypic fields, and they are also quite complex: you might have significant encodings of the values, and you might have hierarchies in terms of what the values could be. You also have a longitudinal aspect to the data, because people came into the assessment centers multiple times to measure things like blood pressure.
And so once you've combined all of these together, you have a very large data set that you have to manage to do the things that David mentioned. Let's focus now on the different stages of this process, and we can talk a little bit about how DNAnexus helped. In this first stage, with 500,000 samples, you end up with about 2 million files, as you have the alignment files as well as the outputs, the so-called gVCF files, along with the index files that go with them. We computed that it's around one and a half petabytes of data that you have to process.
And what's really important about this data is that you need to process it in a high-quality, consistent manner. If we look at the stages within it, you take the raw reads and then do a so-called alignment step, which aligns them to a reference genome to identify where the different fragments stack up. Then you identify, for every point along the exome, what exactly the call is at that SNP, along with a variety of steps during data ingestion.
This data set was processed at the Regeneron Genetics Center, where the samples were processed. They have a cost-optimized pipeline that takes about four hours to run per sample, so across the 500,000 samples we're looking at millions of machine hours. That's a large-scale problem. If we look at the technology they use, they use the Titan product to process those samples. When we look at why this is hard: if you're processing a few samples, it's actually not that hard to do yourself in the cloud. But once you get to the level of thousands, tens of thousands, or hundreds of thousands of samples, and you want to do it consistently and efficiently, then fault tolerance in the cloud is very important. The ability to focus only on exceptions and on the science, rather than dealing with cloud optimizations and such, is also quite important.
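To put that scale in context, here is a back-of-the-envelope calculation of the compute involved; the instance count is an illustrative assumption, not a figure from the talk:

```python
# Rough scale estimate for the exome processing stage described above.
SAMPLES = 500_000          # UK Biobank exomes
HOURS_PER_SAMPLE = 4       # cost-optimized pipeline runtime per sample

total_machine_hours = SAMPLES * HOURS_PER_SAMPLE
print(total_machine_hours)  # 2,000,000 machine hours

# With a hypothetical fleet of 4,000 instances running in parallel,
# the wall-clock time collapses from centuries to weeks.
instances = 4_000
wall_clock_days = total_machine_hours / instances / 24
print(round(wall_clock_days, 1))  # ~20.8 days
```

This is exactly why fault tolerance matters: at two million machine hours, transient instance failures are a statistical certainty rather than an exception.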
Moving data around while maintaining consistency and data integrity is also incredibly important. And for a research environment, it's quite important that you can actually use the tools that are particularly relevant to you. These are all the pieces of what Titan can do.
For example, we had another pharmaceutical company as part of the consortium, and they re-ran 100,000 samples, processing about 1,000 exomes an hour. In three and a half days, they were able to process 100,000 exomes. That capability is really required at scale as you go and reprocess the data in a secure environment.
I'll move to the next stage of the process. Now we have to combine that large corpus of genomic data with the clinical data. While there are 3,500 or more fields, it ended up being something like 11,000-plus columns that needed to be stored in the system and then combined to effectively query the data. As part of the consortium, we provided a cohort browser that allowed researchers to visually go through the data, interact with it, query it, ask questions of it, and explore it. I'll show that in a second. What was quite important is that the data is both wide and deep. We ended up with something like 50 billion rows across the entire genomic data set, down to the variant level, as well as needing hundreds of tables to manage the columns of data. And those made real-time interaction with the data quite difficult.
This is built using the Apollo technology, which takes the high-quality data sets processed out of Titan, combines them with structured data, that is, the health assessment data, and provides the ability to interact with it in a variety of ways. It's important that with data sets this large, you can't move the data around to the researchers and to the tools. You really need to bring the tools and the researchers to the data. And the core engine powering Apollo is a Spark-based engine.
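One way to make such an engine responsive is to partition the variant data by genomic location so that a region query only touches a few partitions. A minimal Python sketch of that bucketing idea follows; the bucket width and record layout here are hypothetical, not the actual Apollo scheme:

```python
from collections import defaultdict

BUCKET_SIZE = 1_000_000  # assumed bucket width in base pairs

def partition_key(chrom, pos):
    """Map a variant to a (chromosome, position-bucket) partition."""
    return (chrom, pos // BUCKET_SIZE)

# Toy variant records: (chromosome, position, ref, alt)
variants = [
    ("chr1", 123_456, "A", "G"),
    ("chr1", 1_200_000, "C", "T"),
    ("chr2", 50_000, "G", "A"),
]

partitions = defaultdict(list)
for chrom, pos, ref, alt in variants:
    partitions[partition_key(chrom, pos)].append((pos, ref, alt))

# A range query only has to scan the buckets overlapping the region,
# rather than the full 50-billion-row data set.
print(sorted(partitions))  # [('chr1', 0), ('chr1', 1), ('chr2', 0)]
```

In a Spark setting the same key would drive physical partitioning, so that predicate pushdown prunes irrelevant files before any rows are read.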
That Spark-based engine uses Spark SQL to query down into that data. But in order to do this, we needed to do a few things. One is that we needed to partition the data heavily based on genomic location to query it quickly, as well as do vertical table splitting, essentially vertical partitioning, to be able to query fast enough, because there is so much metadata across 11,000 columns. You have to be very intelligent about how you structure the query to get a response time in seconds over this data set, and to be able to explore it visually and rapidly. We also needed to provide data scientists that capability through Jupyter notebooks and other mechanisms, so they could access the data and do statistical analysis with the tools, Python libraries, or R scripts they wanted to use. Another important aspect was keeping this all secure in our environment: we integrated the Spark Hive metastore into our platform's access control model to really control the security of this massive data set. That was incredibly important as we built this entire system and provided it to the authorized researchers that were part of the UK Biobank consortium. - Excellent, thank you very much, John and David, for leading into this section.
I'm Frank Nothaft. I'm the Technical Director for Healthcare and Life Sciences at Databricks. I manage our worldwide technical efforts, both on the product development side with our Genomics Runtime and in our Solution Architecture team that works closely with customers, as well as in some of our functions working with partners like DNAnexus. What I'll talk about is how the Databricks platform comes into this, and how we worked collaboratively between the DNAnexus, Biogen, and Databricks teams to achieve success in analyzing this large-scale UK Biobank data.
So this slide summarizes what our Databricks platform looks like in the genomics space. If you're familiar with the Databricks platform, we have a cloud infrastructure layer that optimizes the machines you work with, and a top layer that provides notebook functionality, making it easy to use notebooks in a reproducible and shareable way. In the middle layer, we provide a number of different runtimes with optimized software stacks for the various tasks a customer is working on, be it processing large-scale streaming data sets or doing machine learning; and, as we introduced in 2018 and made generally available last year, we have a runtime specifically targeted at genomic data workflows. Our workflow covers the whole gamut of tasks, from initial data processing through large-scale statistical analysis of variation data. We've been able to use all of these workflows at Biogen, but to focus on a couple of points: on the upfront execution side, what we've done is take the GATK's best practices pipeline. For those of you who are familiar with genomics, the GATK is a standard set of pipelines for taking a single individual's raw DNA sequencing reads and turning them into either germline variant calls or, if you're looking at cancer data, mutation calls. We've taken these pipelines and made them easy to use through a one-click interface that sets them up and runs them; it takes about five minutes to get the pipeline set up, making it a lot easier to access these tools. We've also extensively performance-optimized them and made them work well with Spark. Ultimately, we've been able to do things like reduce the latency of running the GATK germline pipeline on a high-coverage whole genome from 30 hours to under 40 minutes, by getting about a 2x performance improvement from a CPU-efficiency perspective and then using the power of Spark to parallelize the work across many cores.
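The speedup described here comes from scatter-gather parallelism: shard the genome into regions, process shards concurrently, and merge the results. A toy sketch of that pattern follows; the shard size and the per-shard "calling" step are stand-ins, not the actual GATK or Databricks implementation:

```python
from concurrent.futures import ProcessPoolExecutor

GENOME_LENGTH = 3_000_000  # toy genome size in base pairs
SHARD_SIZE = 500_000       # assumed shard width

def make_shards(length, size):
    """Split [0, length) into contiguous (start, end) regions."""
    return [(s, min(s + size, length)) for s in range(0, length, size)]

def call_variants(shard):
    """Stand-in for per-region variant calling; emits toy 'calls'."""
    start, end = shard
    return [f"variant@{pos}" for pos in range(start, end, 250_000)]

def run_pipeline(length, size):
    shards = make_shards(length, size)
    with ProcessPoolExecutor() as pool:
        per_shard = pool.map(call_variants, shards)       # scatter
    return [c for calls in per_shard for c in calls]      # gather

if __name__ == "__main__":
    print(len(run_pipeline(GENOME_LENGTH, SHARD_SIZE)))
```

Because each region is independent, the same pattern maps naturally onto Spark tasks, which is how the work spreads across many cores.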
We've then taken a very big focus on working with population-scale data. We have extended support for the GATK's joint genotyping pipeline; this is the pipeline that takes data from many single samples and blends it together into a single population call set, and we've accelerated and parallelized it using Spark. We've also worked to package up a couple of open-source libraries: Hail, which comes from the Broad Institute, and Glow, a project we've developed here at Databricks in conjunction with the Regeneron Genetics Center, which allow people to merge these data sets together while quality-controlling them, and ultimately run large-scale statistical analyses on top of that data. Our ultimate ambition here is to move people to an architecture where they're able to use open-source technologies like Glow, which make it easy to use many different languages, be it Python or SQL, on top of genomic data, coupled with efficient, optimized, and open-source file formats like our Delta Lake format. That way they can accelerate the process of taking large data sets, wrangling and cleaning them, joining them with a variety of different data types, be it clinical data, images, or other lab measures, and ultimately producing GWAS results or other statistical results that they can do machine learning on top of to generate scores, which they can serve directly to research and clinical audiences.
When we look at some of the work we've done on top of UK Biobank data at Biogen, I'll now highlight some of the work we've collaborated on with David's team around their genome-wide association pipelines. A GWAS pipeline is essentially a statistical kernel that takes every single genomic variant in the data set and the phenotypes we're interested in, and performs some sort of statistical test to see if the two are associated.
If I'm looking at, let's say, a common, continuously distributed variable like height, this could be something like a linear association between every single genomic variant and every single phenotype. Or if I'm looking at something more complex, I might use more complex tests, like a Cox proportional hazards model. With the UK Biobank data set, this is particularly challenging because the amount of data we're dealing with is very large: UK Biobank has over 2,000 phenotypes.
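As an illustration of that linear-association kernel, here is a minimal NumPy sketch that regresses a phenotype on each variant independently and ranks variants by their t-statistic. The simulated genotypes and effect size are invented for the example, and a real GWAS would also adjust for covariates such as ancestry:

```python
import numpy as np

rng = np.random.default_rng(42)
n_samples, n_variants = 500, 100

# Simulated genotype dosages: 0/1/2 copies of the alternate allele.
genotypes = rng.binomial(2, 0.3, size=(n_samples, n_variants)).astype(float)

# Simulated phenotype: variant 0 has a true effect, the rest are noise.
phenotype = 0.5 * genotypes[:, 0] + rng.normal(size=n_samples)

def association_scan(G, y):
    """Per-variant simple linear regression, vectorized across variants."""
    Gc = G - G.mean(axis=0)
    yc = y - y.mean()
    beta = Gc.T @ yc / (Gc**2).sum(axis=0)            # slope per variant
    r = Gc.T @ yc / np.sqrt((Gc**2).sum(axis=0) * (yc**2).sum())
    t = r * np.sqrt((len(y) - 2) / (1 - r**2))        # t-statistic per variant
    return beta, t

beta, t = association_scan(genotypes, phenotype)
print(int(np.argmax(np.abs(t))))  # variant 0 should top the scan
```

Scaling this kernel from 100 simulated variants to tens of millions of real ones, crossed with thousands of phenotypes, is exactly the workload that gets distributed across a cluster.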
With the exome sequencing data, there are tens of millions of variants. So when you do that full cross, you can be looking at running tens of billions of regression tests to associate this data together. With the pipeline we were able to build out, we first used the open-source Hail tool to ingest this data very rapidly and start generating our first results. Once those results were generated, the Biogen team was able to bring in some of their traditional annotation pipelines; these are tools that take the variants we found that seem to have some association with a disease of interest, and add functional consequences to them: is this a variant that truncates a protein? Is this a variant we've seen in other diseases? Is this a variant that we know changes how a gene is expressed? They took a pipeline that previously needed two weeks to process 700,000 variants and accelerated it tremendously: they were able to annotate 2 million variants in 15 minutes, orders of magnitude of acceleration. Ultimately, this gave them a rapidly queryable database of genotype/phenotype associations joined with functional consequences, which allowed them to really understand how these variants function and what they did. This was really exciting. Just earlier this month, the Biogen team released a preprint on some of this work, summarizing the effects they found for protein-truncating variants. These are genetic changes that cause a gene to be truncated, so you don't get the full copy of the protein; you instead get a shortened copy that doesn't produce the correct product. They've been able to find a number of variants in about six different genes that have a significant impact on human lifespan, and through that they've been able to understand the biology of complex diseases a bit better.
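The annotation step amounts to joining significant association hits against a table of functional consequences. A toy sketch of that join in plain Python follows; the variant IDs, p-values, and consequence labels are invented for illustration, and the real pipeline works at millions-of-variants scale:

```python
# Toy association results: variant id -> p-value from the GWAS scan.
associations = {
    "1:55516888:G:GA": 3e-9,
    "19:44908684:T:C": 8e-12,
    "2:21012603:C:T": 4e-7,
}

# Toy functional annotations: variant id -> consequence label.
annotations = {
    "1:55516888:G:GA": "protein_truncating",
    "19:44908684:T:C": "missense",
}

SIGNIFICANCE = 5e-8  # conventional genome-wide significance threshold

# Keep only significant hits and attach their consequences;
# variants without an annotation get a default label.
hits = {
    vid: annotations.get(vid, "unannotated")
    for vid, p in associations.items()
    if p < SIGNIFICANCE
}
print(hits)
# {'1:55516888:G:GA': 'protein_truncating', '19:44908684:T:C': 'missense'}
```

Materializing this joined table once, rather than re-annotating on every query, is what makes the resulting database rapidly queryable.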
Ultimately, when we look at how all of this pairs together, the great thing the Biogen team has achieved is an architecture where they use their own cloud environment, with data coming from the Regeneron Genetics Center for this UK Biobank cohort, and blend the DNAnexus and Databricks platforms together. This gives them a best-of-breed solution: they have access to the DNAnexus platform, with many best-practice pipelines and visualization tools across both the Apollo and Titan products, so they can quickly spin those up, run the pipelines they need, and generate the visualizations they need for their bench science and clinical teams. And they've been able to use the Databricks platform to do a really deep dive into these workflows. Ultimately, by combining the capabilities of these three teams, Biogen's expertise in the data and the science, and the great tooling available on both the DNAnexus and Databricks platforms, we've been able to take on a large set of challenges.
Taking this massive amount of raw data, over a petabyte from 500,000 individuals, along with the complexities of the traditional ecosystem Biogen had and its need to move to the cloud, the Biogen team has been able to deliver a lot of success. With the findings from this very large-scale database of comprehensive variation linked to comprehensive phenotypes, they've been able to identify new drug targets. They've also been able to build models that help them understand how genomic variants impact the functionality and possible success of other drugs they've been developing, so they've been able to reposition and reprioritize their drug portfolio. For the complex neurodegenerative diseases Biogen is working on, data sets like this give them a lot more insight, a lot more precision, and a lot more ability to interrogate the complex biology of neurodegenerative disease. As every month passes and we mine this data set more, we are all, as a community, developing a better understanding of complex human disease through the power of this combined genotype and phenotype data.
I've been very excited to see the collaboration between the DNAnexus and Databricks teams. We see a tremendous amount of overlap where customers can benefit from both of these technologies, be it using the Titan product for many of their data processing needs, Databricks for much of their ML, or the technologies for visualizing, understanding, and interrogating genomic and phenotypic data that are available in the Apollo product. We see tight integration coming up, as Spark underpins many of the different technologies in this space, and we're looking forward to collaborating a lot more. We're very interested in talking to anyone out in the field who is interested in using these products together, and who would like to influence our future roadmap on how these products integrate. Thank you again very much, David and John, for joining us on this adventure. I think everyone in the genomics community has been really thrilled to see how the UK Biobank data is giving us tremendous insight into the complex biology of human disease, and I really greatly appreciate it.