Healthcare, life sciences, and agricultural companies are generating petabytes of data, whether through genome sequencing, electronic health records, imaging systems, or the Internet of Medical Things. The value of these datasets grows when we are able to blend them together, such as integrating genomics and EHR-derived phenotypes for target discovery, or blending IoMT data with medical images to predict patient disease severity. In this session, we will look at the challenges customers face when blending these data types together. We will then present an architecture that uses the Databricks Unified Data Analytics Platform to unify these data types into a single data lake, and discuss the use cases this architecture can empower. We will then dive into a workload that uses the whole genome regression method from Project Glow to accelerate the joint analysis of genotype and phenotype data. Afterwards, Frank Austin Nothaft, Technical Director for Healthcare and Life Sciences, will be available to answer questions about this solution or any other use case questions you may have across healthcare, the life sciences, or agriculture.
Speaker: Amir Kermany
– Hello everyone, my name is Amir R. Kermany. I’m a solutions architect at Databricks, focusing on health and life sciences. Today, I would like to talk about how researchers in genomics are leveraging Databricks to combine different data sources with genomics data, and also to perform statistical analysis on their datasets at scale. As you know, data is being generated from a multitude of sources, and more and more of those sources are being leveraged in the genomics space. For example, we have data from wearable devices, electronic health records, and electronic medical records, combined with genomics datasets to inform our science. Genomics data by itself is valuable to a certain degree. However, the combination of genomics data and phenotypic data is where the real value is. And what we are seeing across the board is that more and more people want the same APIs that we use for analyzing other large-scale datasets to be used with genomics data, and so to bring that unification to the field.
On the Databricks platform, we can actually start from the raw data from sequencers. Databricks has a genomics runtime, which comes with, for example, optimized versions of GATK pipelines that run the variant calling process on your dataset at scale, so that you can get from FASTQ to variant calls in 20 minutes at a comparable cost. So if you really want to get results fast, you can use this pipeline. The added value is that when you run it, the results are automatically written into a data lake, and then you can directly ingest and start analyzing your data using Spark. However, if you don’t have that and you are already starting from VCF, you can leverage Glow, which I will be talking about, to ingest your data and integrate it within your data lake.
Also on the Databricks platform, with the genomics runtime, you can run joint genotyping. With joint genotyping, you can have thousands of genomes, and you pool information across genomes, integrating population-level information to increase the accuracy of your variant calls. This is a really good example of how you can use the distributed power of Spark to run such workloads. Also, as I mentioned, because we are seeing that more and more people want to work at scale and combine datasets, it would be good to bring these kinds of standard APIs to the genomics field. With this motivation, we partnered with the Regeneron Genetics Center to create an open source library called Glow.
Glow is a library designed to manage tertiary analysis on your datasets, from ingest through to downstream GWAS, using Spark-native APIs. For example, you can start from VCF, BGEN, or PLINK files, ingest them, and write them directly into a Delta Lake. Delta Lake is a library that Databricks originally created; you can think of it as a data lake that has the reliability, performance, and data versioning you would expect from a data warehouse. So once you have your data in a Delta Lake, you can, for example, query a specific version of your data, which is very important in science for reproducibility. As an example of how we use Delta and Glow for ingest, you can start from a VCF file, the standard genomics variant dataset, and with one line of code load this VCF into a Spark DataFrame.
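To make the idea of loading a VCF into a table concrete, here is a minimal pure-Python sketch of the flattening step. It is not Glow's implementation or API: the sample VCF text, the function name `parse_vcf`, and the column names are assumptions chosen for illustration, and the sketch only handles the `GT` genotype field.

```python
# Illustrative only: a tiny stand-in for what a VCF reader produces.
# VCF_TEXT, parse_vcf, and the column names are made up for this sketch.

VCF_TEXT = """\
##fileformat=VCFv4.2
#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2
1\t1000\trs1\tA\tG\t50\tPASS\tAC=1;AN=4\tGT\t0|1\t0|0
1\t2000\trs2\tC\tT\t60\tPASS\tAC=3;AN=4\tGT\t1/1\t0/1
"""


def parse_vcf(text):
    """Turn VCF text into a list of flat rows (one dict per variant)."""
    rows = []
    samples = []
    for line in text.splitlines():
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample IDs follow the 9 fixed columns
            continue
        chrom, pos, vid, ref, alt, _qual, _flt, info, _fmt = fields[:9]
        row = {
            "contig": chrom,
            "pos": int(pos),
            "id": vid,
            "ref": ref,
            "alts": alt.split(","),
        }
        # Expand each INFO key into its own column (values kept as strings).
        for key_value in info.split(";"):
            key, _, value = key_value.partition("=")
            row["INFO_" + key] = value
        # Keep all genotypes for the variant together in a single column.
        genotypes = []
        for sample_id, gt_field in zip(samples, fields[9:]):
            gt = gt_field.split(":")[0]
            calls = [
                int(a) if a != "." else -1  # -1 marks a missing call
                for a in gt.replace("|", "/").split("/")
            ]
            genotypes.append(
                {"sampleId": sample_id, "phased": "|" in gt, "calls": calls}
            )
        row["genotypes"] = genotypes
        rows.append(row)
    return rows


rows = parse_vcf(VCF_TEXT)
for row in rows:
    print(row["id"], row["ref"], row["alts"], row["INFO_AC"], row["genotypes"])
```

Each variant becomes one row, the INFO keys become separate columns, and the per-sample calls and phasing sit in one nested column, which is the shape described in the talk.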
Now, after loading into a Spark DataFrame, you can also display it in a Databricks notebook: you display the DataFrame and you see that it already has a tabular structure, with columns associated with the different levels of information that you have in the VCF, and it also expands the INFO fields for you as separate columns. You also have all the genotypes, with sample IDs, phasing information, the variant calls, et cetera, in one column. You can also look at the schema of this table. Now that you have this as a Spark DataFrame, you can interact with this dataset the same way that you interact with any other dataset that you have ingested in your day-to-day work. As a matter of fact, you can write SQL queries against this dataset. For example, I may want to see how many variants have reference allele A. I can also run queries at the sample and variant level.
As an example of per-sample summary statistics, you have specific APIs for that. Let’s say I want to look at the summary statistics at the sample level. I’m just passing these queries in Spark SQL, and the results that I get contain the information for each sample: their call rates, et cetera. And with the Databricks notebook environment, I can also visualize these results. For example, if I want to plot a correlation involving the number of deletions for each individual, I can quickly and interactively visualize that. Similarly, I can look at per-variant summary statistics. In this case, similar to the sample-level summary statistics, I pass a variant-level summary statistics query and I can quickly get, for example, allele frequencies for each of these variants, along with other information. Similarly, you can do other sorts of analysis.
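The two kinds of summary statistics mentioned above can be sketched in plain Python. This is only an illustration of what a per-sample call rate and a per-variant allele frequency mean; it does not reproduce Glow's own summary-statistics functions, and the toy data and function names are assumptions made for this example.

```python
# Illustrative toy data: genotype calls per variant and sample.
# 0 = reference allele, 1 = alternate allele, -1 = missing call.
VARIANTS = {
    "rs1": {"S1": [0, 1], "S2": [0, 0], "S3": [-1, -1]},
    "rs2": {"S1": [1, 1], "S2": [0, 1], "S3": [0, 0]},
}


def sample_call_rates(variants):
    """Per-sample statistic: fraction of variants with a complete call."""
    called = {}
    total = {}
    for calls_by_sample in variants.values():
        for sample, calls in calls_by_sample.items():
            total[sample] = total.get(sample, 0) + 1
            if all(c >= 0 for c in calls):
                called[sample] = called.get(sample, 0) + 1
    return {s: called.get(s, 0) / total[s] for s in total}


def alt_allele_frequencies(variants):
    """Per-variant statistic: alternate allele share among non-missing alleles."""
    freqs = {}
    for variant_id, calls_by_sample in variants.items():
        alleles = [
            c for calls in calls_by_sample.values() for c in calls if c >= 0
        ]
        if alleles:
            freqs[variant_id] = sum(1 for a in alleles if a > 0) / len(alleles)
        else:
            freqs[variant_id] = None  # no usable calls for this variant
    return freqs


print(sample_call_rates(VARIANTS))
print(alt_allele_frequencies(VARIANTS))
```

The per-sample pass aggregates across variants for each sample, while the per-variant pass aggregates across samples for each variant; on a cluster these are the same two groupings, just distributed.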
For example, Hardy-Weinberg equilibrium: these are queries that I initially want to run on my variants so that I can filter some of those variants out before downstream analysis. So, in short, you can use Glow to ingest your genomics datasets at scale. It doesn’t matter how large your dataset is; you just provision a cluster with a bigger number of cores to do the ingest. You can write it back into a Delta Lake and optimize that Delta Lake so that you have better performance if you are running queries often, which is usually the case. And you can also have data versioning in your Delta Lake.
But usually, when you are doing all of these analyses, what you want to do downstream is some canonical analysis such as GWAS, which is genome-wide association studies. Recently we added new functionality to Glow, which is called GloWGR. This is built on top of Regeneron’s regenie method, which is the fastest whole genome regression method. The advantage of putting this method in Glow is that you can now scale your workload, again, by the number of cores that you have. And as this graph shows, there is really an orders-of-magnitude decrease in the runtime. What that translates to is that if a GWAS on a trait takes days, you can bring it down to hours. So that naturally translates into getting to your results faster. Without going into details, the basic idea is that this is done in two stages. You can think of the first stage as training thousands of models to predict a trait. This is then used to correct for confounding factors in your analysis, and then you can perform your GWAS. The key here is that all of these steps are linearly scalable, which translates to very good performance. On that note, I will stop here.
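As a small illustration of the Hardy-Weinberg filtering step mentioned in the talk, the sketch below computes the textbook chi-square statistic for a biallelic variant from its genotype counts. This is not necessarily how Glow computes its Hardy-Weinberg statistics; the function names, the example counts, and the use of 3.841 (the 5% critical value for one degree of freedom) as a cutoff are assumptions made for this example.

```python
def hardy_weinberg_chi_square(n_hom_ref, n_het, n_hom_alt):
    """Return (expected genotype counts, chi-square statistic) for one variant."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        raise ValueError("at least one genotype is required")
    # Reference allele frequency estimated from the observed genotypes.
    p = (2 * n_hom_ref + n_het) / (2 * n)
    q = 1 - p
    # Expected counts under Hardy-Weinberg equilibrium: p^2, 2pq, q^2.
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_hom_ref, n_het, n_hom_alt)
    chi2 = sum(
        (o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0
    )
    return expected, chi2


def passes_hwe_filter(n_hom_ref, n_het, n_hom_alt, cutoff=3.841):
    """Keep a variant when its chi-square statistic stays below the cutoff."""
    _, chi2 = hardy_weinberg_chi_square(n_hom_ref, n_het, n_hom_alt)
    return chi2 < cutoff


# Hypothetical genotype counts (hom-ref, het, hom-alt) for two variants.
for counts in [(25, 50, 25), (40, 20, 40)]:
    expected, chi2 = hardy_weinberg_chi_square(*counts)
    print(counts, expected, chi2, passes_hwe_filter(*counts))
```

A variant whose observed counts match the expected proportions yields a statistic near zero and is kept; one with a strong deficit of heterozygotes yields a large statistic and would be filtered out before downstream analysis.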
Thank you so much for joining and I would be happy to answer any of your questions.
Amir Kermany is a Health and Life Sciences Solutions Architect at Databricks, where he leverages his expertise in genomics and machine learning to help companies in the space to solve their problems in generating actionable insights from vast amounts of health related datasets. Amir’s past positions include Sr. Staff Scientist at AncestryDNA, Sr. Data Scientist at Shopify, Postdoctoral Scholar at the Howard Hughes Medical Institute and the University of Montreal. He holds a PhD in Mathematical Biology, MSc in Electrical Engineering and BSc. in Physics.