Extending Machine Learning Algorithms with PySpark

May 26, 2021 04:25 PM (PT)


Machine learning practitioners are most comfortable using high-level programming languages such as Python. This is a barrier to parallelizing algorithms with big data frameworks such as Apache Spark, which are written in lower-level languages. Databricks partnered with the Regeneron Genetics Center to create the Glow library for population-scale genomics data storage and analytics. Glow V1.0.0 includes PySpark-based implementations for both existing and novel machine learning algorithms. We will discuss how leveraging tooling for Python users, especially Pandas UDFs, accelerated our development velocity and impacted our algorithms’ computational performance.

In this session watch:
Karen Feng, Developer, Databricks
Kiavash Kianfar, Developer, Databricks

 

Transcript

Karen Feng: Hi, everyone. Thanks for attending our session. By way of introduction, your presenters today will be me, Karen, and Kiavash. Both of us are software engineers at Databricks with backgrounds in bioinformatics. We spent the last couple years working on Glow, an open-source genomics library built on top of Apache Spark. Today, we’ll be discussing some of the lessons we’ve learned en route to our Glow v1 release.
We’ll be highlighting how our pivot to PySpark improved our experience as developers, as well as our users’ experience. Even if you’re not a bioinformatician, we believe that these lessons are widely applicable to anyone who is in the data science space, as well as to anyone who builds tools for data scientists. In our discussion of how we redesigned the Glow library around PySpark, we’ll be taking you through a deep dive on a genomics application, specifically, the GloWGR algorithm.
Given that the community at the Data and AI Summit consists of a large variety of personas and applications, you may not be a genomics expert. Don't worry too much about that. I have a mini genomics lecture coming up so that you can follow along by the time we get to the fun part: algorithms. And the important takeaways aren't genomics specifics at all. They're more about how you can make your machine learning algorithms usable and scalable.
The PySpark-based redesign for Glow v1 was a major decision made over time as we identified three key problems facing our user base: the bioinformatics community. The first problem drove us to create the Glow project in the first place. Over the past decade, genomics data has seen massive growth, and can now be categorized as big data.
To enable bioinformaticians to analyze industry-scale data, we created Glow as a bridge between genomic data and a general purpose big data engine, Spark. The second problem was one that we learned the importance of over time. This will likely be a familiar story to you if you’re in the Spark community. We built Glow from the beginning as software developers, with software developers in mind as our end users.
As you can guess, this means that we built our library predominantly in Scala. As our user base expanded, however, we discovered that our users are largely data engineers and data scientists writing in higher-level languages like Python. When we discovered this, we decided to refocus our efforts to improve the Python user experience.
Around the same time, we discovered that it would be equally meaningful to reimagine the Python developer experience by making Python a first-class language in our library from both a user and a developer perspective. Our users are now better served, and can even contribute their expertise to the library in the form of new features. With recent developments in PySpark, bioinformaticians can now write their algorithms in Python and then link them to Spark with minimal effort. This has shaped how we design our new algorithms, as well as how we have redesigned our existing algorithms.
I covered those design decisions pretty quickly, so let’s break them down one by one. First problem up, genomics data are growing too fast for existing tools. In fact, genomic data are now big data. The increasing volume of genomic data has exciting implications for all of us, as it accelerates the discovery of valuable insights about human health. As any statistician will tell you, more data equals greater power. One particularly exciting field today is the analysis of rare genetic variants, which can have an outsized impact on disease risk. With the volume of data that we now have, we can begin to link these variants to diseases in a statistically significant manner.
One company that has a front-row seat to the growing pace of genomic data sets is Illumina. Illumina is the biggest sequencing platform today. And last January, they noted three things. First, ongoing programs, such as Genomics England, plan to genotype or sequence more than 10 million people. Second, the amount of data has grown 50x in the last decade, driven by improvements in cost effectiveness.
And finally, customers generated 150 petabases of genomic data in 2019, which represents a 50% year-over-year increase. A petabase, for context, is 1,000 trillion base pairs of DNA sequence. To provide a concrete example, Canada’s Michael Smith Genome Sciences Center spent 15 years sequencing a single petabase, and then sequenced a second petabase in less than two and a half years.
The growth of genomic data is due to a variety of scaling factors. One factor is the amount of data we are extracting from the genetic material of a single individual. From the beginning, geneticists have used something called a genotyping array. These arrays are often what you would get from a consumer-facing service like 23andMe. They target specific known genetic markers of interest.
Over time, geneticists have started moving toward whole exome sequencing. This involves sequencing the protein encoding part of your DNA. However, the exome actually only represents a small portion of your DNA. The rise of next generation sequencing has enabled whole-genome sequencing.
Another factor is the sheer number of individuals who have been sequenced. One of the first major studies was the 1000 Genomes Project, which involved sequencing thousands of individuals. Today, biobanks often include hundreds of thousands of individuals, with some even approaching the million-individual mark.
A final factor is the phenotypic information biobanks now contain. These represent the measurement of traits and can be quantitative, think body mass index or BMI, or qualitative, such as whether or not an individual has diabetes. Biobanks are capturing more and more information about the individuals that are sequenced, representing a gold mine of material.
As you can guess, the single-node tools that are the bread and butter of bioinformatics cannot operate at these scales. Rather than choosing to reinvent the wheel, we decided to use off-the-shelf tools that have proven themselves at this massive scale. In our case, this is why we built Glow on Apache Spark.
Some genomic single-node tools do allow for some level of parallelism, often on on-premises clusters. However, this is done in an ad hoc fashion. Every library and every language does it a little bit differently, which makes both developers' and users' lives difficult. With Glow, we chose to perform all of our parallelization in Spark, which has proven itself for a variety of applications in the big data space.
In addition, single-node tools are hard to wrangle when it comes to building production pipelines. When working with text input and output, we introduce significant amounts of overhead. Using Spark, we can keep everything in memory and minimize serialization costs; in Java using Kryo, and in Python using Arrow.
Once we decided to build our tool on Spark, however, we faced a new hurdle. Spark is built largely on Scala. However, data scientists and engineers, such as bioinformaticians, are often more comfortable in higher-level languages. As demonstrated by GitHub's language statistics, Scala and Java make up over 80% of the Spark codebase. And in the roughly 20% remaining, you'll see languages that demonstrate how developers have responded to the diversity of users within the Spark community.
In particular, you'll notice that the third most popular language in the Spark codebase is Python. This represents the backbone of PySpark. Although Python code makes up only 8% of the Spark codebase, the majority of notebook commands written by Databricks users are Python commands. In fact, Scala commands are outnumbered by Python commands three to one. This reflects a trend across the Spark user base at large.
In terms of year-over-year growth, PySpark and Koalas downloads from PyPI have significantly outpaced Spark downloads from Maven Central, meaning that the number of Spark users on Python is growing over time.
When we look at the usage patterns of Glow users on Databricks, we see an even more significant bias towards Python usage. And if we take a look at Biostars, a Q&A website for bioinformaticians along the lines of Stack Overflow, you'll see that this fits into broader trends.
Bioinformatics is a field that traditionally does not write code in Scala. And to be fair, very few fields actually do. Instead, bioinformaticians are more comfortable in higher-level languages like R and Python. This is representative of the fact that bioinformaticians are not traditional software engineers, and are better characterized as data engineers and data scientists.
So how do we accommodate users that would prefer to interface with Python? The answer was fairly obvious: write a Python client. In our case, we wrote a thin layer on top of our mostly Scala codebase that allowed users to interact with the same APIs using Py4J as the middleman. This is similar to how PySpark works. By using Py4J under the hood, Python users do not need to learn a new language in order to achieve similar performance with minimal overhead.
In addition, efforts like Project Zen in PySpark have made documentation more readable, made errors more understandable, and added type hints, decreasing user frustration. The final hurdle in our redesign was one that we faced as developers of the library. Even as software engineers who use Scala day-to-day, we find Spark and Scala hard to use when writing machine learning algorithms. This was especially problematic given that our partners on Project Glow were not Scala developers, and therefore could not contribute code to the project easily.
To write a custom language-agnostic algorithm, we had to create a Spark SQL expression. This was painful for a variety of reasons. First of all, Spark SQL expressions are meant to process records individually, while our algorithms often require some sort of state to be retained. Secondly, transforming Spark SQL rows to a shape that is compatible with a linear algebra library requires custom logic that adds significant overhead. And finally, the linear algebra library selection in Scala is quite limited, and even the libraries that do exist are often missing major features.
We struggled with this for a while and eventually realized that we did not need to force ourselves into this corner. Due to Python’s many user-friendly linear algebra libraries, we wrote many of our new machine learning algorithms in PySpark. Using PySpark, Python developers can write algorithms with their favorite libraries, like pandas, and link them to Spark with minimal effort, thereby achieving the same scalability and performance without the headache of learning an entirely new language.
These algorithms are embedded in user-defined functions, or UDFs, which use Arrow under the hood. Arrow allows for conversions between Spark DataFrames and pandas DataFrames with minimal overhead. These UDFs are incredibly powerful: they can be used as scalar functions, grouped maps and applies, window functions, and aggregates. By allowing pandas users to run their functions natively on Spark, pandas power users can become Spark power users.
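As a minimal sketch of the scalar flavor (the DataFrame and column names here are hypothetical, not from the talk), a pandas UDF looks something like this:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# Scalar pandas UDF: each batch of rows arrives as a pandas Series via Arrow,
# and the returned Series becomes a column of the Spark DataFrame.
@pandas_udf("double")
def standardize(s: pd.Series) -> pd.Series:
    return (s - s.mean()) / s.std()

# Hypothetical usage on a Spark DataFrame df with a numeric "bmi" column:
# df.withColumn("bmi_std", standardize("bmi"))
```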
Although PySpark and pandas UDFs were introduced a while back, there have been many key improvements since their initial release. In particular, I want to highlight a feature introduced in Spark 3.0 that was key for our Glow development: mapInPandas. Previously, we wrote our algorithms from scratch as Spark SQL expressions. By forcing all development to be done in Scala, we were unable to leverage our power users, who are largely Python developers, to contribute to the library.
With mapInPandas, they now can. Now, Python developers can perform the majority of their development locally with pandas. They simply need to create a function that transforms a pandas DataFrame to another pandas DataFrame. Here, f of x. When plugged into Spark, Arrow is used to convert a Spark DataFrame partition to an iterator of pandas DataFrames.
Each pandas DataFrame is processed via the Python function, and the output iterator is serialized as a partition of a new Spark DataFrame. Using pandas UDFs, we were able to achieve better parallelism and performance than with our Spark SQL expressions, which were constrained by the limitations of Scala's libraries. Best of all, we were able to do so with the guidance of our users.
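To make the mapInPandas pattern concrete, here is a minimal sketch (the schema and column names are made up for illustration):

```python
from typing import Iterator
import pandas as pd

# The developer writes an ordinary pandas-to-pandas function; Spark feeds it
# an iterator of pandas DataFrames, one per Arrow batch of the partition.
def double_dosage(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for pdf in batches:
        pdf["dosage"] = pdf["dosage"] * 2.0
        yield pdf

# Hypothetical usage on a Spark DataFrame df:
# result = df.mapInPandas(double_dosage, schema="sample_id string, dosage double")
```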
Now that we've covered the big picture ideas that drove our redesign with PySpark, it's time for our deep dive into the genomics use case. Fair warning, there are a couple of important acronyms, so I'll define them for you first. The first acronym we'll be using regularly is SNP, or single nucleotide polymorphism. SNPs are the simplest type of genetic variation, although there are other types, such as insertions and deletions, also called indels, copy number variants, also called CNVs, and so on.
For the sake of simplicity, we’ll only be talking about SNPs today. A SNP is a mutation of a single base pair in your DNA. This results in a mutation in the mRNA or messenger RNA during transcription. When this mRNA is translated into a protein, the resulting amino acid chain can be mutated resulting in a different shape, and therefore, a different function.
One simple example of an important SNP occurs in sickle cell disease. Sickle cell disease or sickle cell anemia is characterized by abnormalities in the oxygen-carrying protein hemoglobin, which is found inside red blood cells. Under certain conditions, sickle cell disease changes the shape of a red blood cell from a round shape to a sickle shape, which can result in severe pain for individuals with the disease.
A SNP that causes sickle cell disease changes a codon in the beta-globin gene that results in an amino acid substitution, which then changes the shape of the protein under a low-oxygen concentration. If all diseases were triggered by one single nucleotide polymorphism, bioinformatics would be a much simpler field. Unfortunately, the reality is much more complicated.
Generally speaking, disease risk is conferred by multiple variations across an individual's genome. In the common model that we'll be discussing today, each variant may add or subtract to the amount of risk an individual has for a given disease. As a side note, this model greatly oversimplifies the complexity of risk in the real world. For example, genes often impact each other.
The second acronym I’ll be introducing today is GWAS or genome-wide association study. GWAS is a method of identifying associations between genetic variants and a trait of interest. Historically, these studies have focused on common variants as the relatively small size of datasets meant that we had insufficient power to analyze variants that occur on a low frequency across the population. As the scale of data has increased, we have been able to investigate rare variants in a statistically significant manner.
Although you likely do not have any given rare variant, most people do have some number of rare variants. And rare variants are interesting because they can have greater effect sizes compared to common variants whose effect sizes are limited by purifying selection. This is a trait of natural selection. If a genetic variant is very harmful, it tends to get selected out of the gene pool.
On the right, I've attached an image of a Manhattan plot. This is a common way to visualize the results of a GWAS. The x-axis represents the location of our genetic markers, usually SNPs. The numbers on the bottom represent the autosomal chromosomes, of which you have 22. The y-axis represents the significance of the genetic variant in relation to the trait of interest. If a genetic variant is significantly associated with the trait, then we see a point high up on the scale.
In order for a GWAS to yield meaningful results, we have to remove false positives and false negatives. These can arise due to a variety of causes, such as population structure. At a big-picture level, these cause individuals to have SNPs at frequencies that do not match the regression model used in GWAS, which assumes that genetic variations are present in individuals in an IID, or independently and identically distributed, manner.
This is where whole-genome regression or WGR comes in. WGR is used to calculate a phenotypic predictor, which estimates an individual’s propensity to a trait given a set of covariates under the null model. These covariates can capture important characteristics like population structure, and the predictor can further capture polygenic effects. During the GWAS regression step, we use these phenotypic predictors as an offset.
Empirically, this reduces spurious associations and increases our power to detect more subtle associations. In these comparative Manhattan plots, we can see that using WGR boosted the signal for a set of known genetic markers. Next, we’ll discuss the WGR and GWAS implementations in the Glow v1 release. I’ll hand it over to Kiavash to discuss our WGR implementation.

Kiavash Kianfar: Thank you, Karen. So let's talk a little bit about Glow and the principles behind its development, and then dive into how Glow does WGR and GWAS, as Karen mentioned. As we discussed, we developed Glow to industrialize genomics, essentially bringing bioinformatics under the umbrella of data science so that, with the ever-growing size of genomics data, statistical geneticists working at industrial and biobank scale can use Glow to do the kinds of studies, like the GWAS study, that Karen introduced.
The core principle of Glow was to build on top of Apache Spark, that is, to support industrial scale on a well-established tool, as mentioned before. And we definitely want it to be used by the genomics community, so it's very natural that we want it to flexibly and natively support the tools and file formats of the genomics world, including variant file formats and annotation file formats.
We also want to keep it as simple as possible for users, so we provide single-line APIs and functions for genomics workloads: variant quality control and manipulation, sample quality control, other variant-related operations, and genome-wide association studies. And of course, we made it an open-source project to bring in contributions from the community and involve both industry and academia in making it better and better every day.
Glow v1 was recently released, and it encapsulates all of these principles by offering a broad set of features for genomic data ingest and genomic workload analysis. More specifically, Glow v1 offers data sources for reading genomics formats like VCF, BGEN, and Plink, plus GFF for annotations, into Spark DataFrames.
When coupled with the Delta project, it's very reasonable for industrial-scale genomics users to import these formats into much more flexible storage like Delta Lake. As Karen mentioned, we have developed SQL expressions that perform variant-handling operations, callable from several languages, including Python, the most popular one. These cover variant quality control and a lot of the manipulations that are customary in genomic workloads.
We have another concept called transformers for more complex genomic transformations, like splitting multiallelic variants or, for those familiar with bioinformatics work, normalizing variants. The difference between these and SQL expressions is that they usually need some kind of state across the whole DataFrame, and that is what we handle with transformers.
And the one we are most excited about is GloWGR, a whole-genome-regression-boosted genome-wide association algorithm. It's a great example of the power of PySpark, mixed with pandas UDFs, to achieve very large-scale linear-regression-based learning, as I will discuss shortly.
So just to recap what Karen mentioned about GWAS: we do genome-wide association studies to answer the question of whether a specific phenotype, like a disease or another trait of interest, is associated in individuals of a particular species, like humans, with particular genotypes, like the SNPs that we mentioned.
GWAS is done using generalized linear models, and there are different ways of doing it. The method implemented in Glow is based on an algorithm developed by our collaborators at the Regeneron Genetics Center. The algorithm is called REGENIE, and the single-node C++ tool built on it is also called regenie.
Being a single-node C++ tool, although the method is revolutionary in speed and accuracy compared to its predecessors, it can still hit scalability issues as biobank data gets larger and larger. We therefore took it upon ourselves to develop GloWGR, in collaboration with our colleagues at Regeneron, to make it scalable on Spark.
The whole of GloWGR is developed in the Python environment using PySpark, and that is what we're going to discuss a bit more today. Just to give you a hint of the size of the problem we are dealing with: as I said, GloWGR is essentially a combination of what we call whole-genome regression and the genome-wide association study.
I'm going to talk about the whole-genome regression part; after we get the polygenic predictors, we pass them to the GWAS stage, which Karen will talk about. But let's just look at the dimensions. In current biobanks, as mentioned, you can easily have hundreds of thousands, approaching a million, individuals or samples in your database. And the phenotypes you may be interested in studying can number in the tens or even hundreds.
In this example, I've assumed that I am interested in 50 phenotypes. So the number of rows in this linear regression model is going to be, say, 500,000, the number of phenotypes is going to be 50, and the dimension of my phenotype matrix is going to be 500,000 by 50.
Then I have a bunch of covariates that capture non-genotypic features of an individual; there may be hundreds of them that I'm interested in for a study. And for each SNP across the genome, I have one variable in my generalized linear model, so depending on how many SNPs I pick, I may easily get to a million or even millions of SNPs, or variants, in my model.
Therefore, doing linear regression or logistic regression at this scale is simply impossible without distributing the process, and that is what WGR enables algorithmically. Instead of dealing with this huge matrix of genomic effects, through blocking the matrix and some rounds of reduction, we replace it with a single column, a predictor of the polygenic effects, and then pass the result as an offset to the GWAS model, which proceeds as other GWAS studies do.
So the innovation in this algorithm is essentially this reduction from a huge dimension down to a very small one. In our code, this is done through two steps: one called reduction and one called regression. During reduction, the huge matrix is blocked across the rows and columns into blocks whose sizes can be set.
For example, with the customary blocking that we do, such a matrix can be split into 5,000 blocks, and then we do multivariate linear regression on each one of these blocks. As you can see, there is a lot of computation that has to go into this, because these blocks are still quite large, but the blocking at least makes it possible using the pandas linear algebra tooling.
After we do that, we do another round of regression on these results to pick the best ridge regression parameters and make the model smaller still. This second round of regression is either linear or logistic, depending on whether your phenotype is quantitative or binary.
After that, the fitted null model is used to do the predictions in a LOCO manner, where LOCO stands for leave-one-chromosome-out. That means that to get the effects for a SNP in one particular chromosome, we use the model fitted over all chromosomes except that one, and then use it to do the prediction for that particular chromosome.
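The LOCO idea itself is simple enough to sketch in a few lines of plain NumPy (an illustration of the concept, not GloWGR's actual code):

```python
import numpy as np

def loco_predictions(X_by_chrom, beta_by_chrom):
    # For each held-out chromosome, sum the fitted effects of every *other*
    # chromosome to get that chromosome's leave-one-out phenotype prediction.
    chroms = list(X_by_chrom)
    return {
        held_out: sum(X_by_chrom[c] @ beta_by_chrom[c] for c in chroms if c != held_out)
        for held_out in chroms
    }

# Toy example: 5 samples, 3 chromosomes with 4 variants each
rng = np.random.default_rng(0)
X = {c: rng.integers(0, 3, size=(5, 4)).astype(float) for c in ("chr1", "chr2", "chr3")}
beta = {c: rng.normal(size=4) for c in X}
offsets = loco_predictions(X, beta)  # one prediction vector per held-out chromosome
```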
This is just to give you a little idea of what we are doing. Even putting the genomic context aside, this is obviously a big learning problem: doing linear and logistic regression in a blockwise fashion on huge dimensions. Let me talk briefly about how we tackle it. As we saw, the genotype data, this huge X matrix, is quite large and needs to be stored in a distributed manner. That's why we have it as a Spark DataFrame.
But the phenotype data and covariate data are relatively small, and we use pandas DataFrames to capture them. Inside Glow v1, GloWGR has some functionality to manipulate this genotype data and prepare it for blocking. If you're familiar with bioinformatics, you are familiar with these types of functions, but we won't worry about them much here, because my emphasis is on how pandas UDFs answer the challenge we have.
For the blocking, we used a transformer that we actually developed in Scala and then connected to a Python function. We did this because the genotype DataFrame is a Spark DataFrame, and the manipulation we are doing is easy to express with Spark SQL operations.
Essentially, you have this matrix: rows are samples, columns are variants. As I explained, it is blocked across the rows and across the columns, and each block is flattened and the blocks are stacked on top of each other, so the whole thing can be represented in columnar form inside a Spark DataFrame. All of this can be done with this single transformer.
After this, though, we enter the world of heavy linear algebra operations, which, as Karen mentioned, is quite a hurdle if you just want to do it inside a Spark DataFrame, because most things are not in the shape you require. And the linear algebra primitives offered outside the Python and pandas universe are quite inferior in speed and performance compared to what you can find there.
So essentially, what we do here is use pandas UDFs. Their power is that you can filter and group your Spark DataFrame however you want, take the piece that you require, and pass it, along with other parameters, including other pandas DataFrames, into your function. You then do the operations you want in the pandas world and convert the results back to a Spark DataFrame, as Karen mentioned.
For example, in the reduction stage, we have two functions: fit and transform. Fit itself consists of several operations, each done with a particular pandas UDF. One UDF gets the particular blocks that we need from the Spark DataFrame, along with the pandas DataFrames, and uses the linear algebra operators available in pandas to calculate X transpose X and X transpose Y.
Then we use another UDF to sum these values over the particular blocks we need, based on the linear algebra of the algorithm, and another UDF to assemble things together and solve for the coefficient matrix using the closed-form formula for linear ridge regression. In transform, we have another pandas UDF that takes this coefficient matrix and calculates the response variable by multiplying the blocks by the coefficients.
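In NumPy terms, the linear algebra flowing through those UDFs looks roughly like this (a sketch of the blockwise closed-form ridge solve, not Glow's actual implementation):

```python
import numpy as np

def ridge_from_row_blocks(X_blocks, Y_blocks, alpha):
    # Per-block work (done inside pandas UDFs): partial Gram matrices.
    xtx = sum(X.T @ X for X in X_blocks)                       # sum of X_i^T X_i
    xty = sum(X.T @ Y for X, Y in zip(X_blocks, Y_blocks))     # sum of X_i^T Y_i
    # Assembly step: solve the ridge normal equations
    # (X^T X + alpha * I) B = X^T Y for the coefficient matrix B.
    p = xtx.shape[0]
    return np.linalg.solve(xtx + alpha * np.eye(p), xty)
```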
Almost the same thing happens in the regression stage, except that we do cross-validation as well. We have functions that use these UDFs again to do the cross-validation, and then another function that does the transformation in a LOCO manner, using another UDF.
So, to recap: in the reduction stage, we do a fit followed by a transform. All of this is done in a class called RidgeReduction, where the fit calls several UDFs to do the needed manipulations and linear algebra, and the transform does the same.
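In code, the flow looks roughly like this (RidgeReduction is the class named in the talk; the import path, constructor arguments, and fit_transform method here are assumptions based on Glow's documented usage pattern):

```python
from glow.wgr import RidgeReduction  # class name per the talk; import path assumed

# block_df and sample_blocks come from the blocking transformer (Spark side);
# label_df and covariate_df are pandas DataFrames of phenotypes and covariates.
reduction = RidgeReduction(block_df, label_df, sample_blocks, covariate_df)
reduced_block_df = reduction.fit_transform()
```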
The result, which is a model, is then used in the regression stage, using the same UDFs plus a cross-validation function, again written in Python, followed by the transform in a LOCO manner. The final result is essentially the WGR offset predictors that are passed to GWAS association testing. Similar ideas are used in the GWAS part, and to elaborate a little more on them, I'll pass back to Karen.

Karen Feng: Thanks, Kiavash, for the deep dive into whole-genome regression. I’ll be taking over for the final part of the talk, which will focus on an algorithm that is a little bit simpler and less novel, but represents the step of actually identifying associations between genetic variants and traits.
In the traditional GWAS regression model, the phenotype Y is modeled as accounting for two major components, as you already saw on Kiavash's slides: the impact of the genetic variants, where the genotype matrix is represented as G and the impact of each variant by the vector beta-G, and the impact of the covariates, which account for population structure, represented by the matrix C, where the impact is the vector beta-C. There's also the error term, eta.
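Written out (reconstructing the notation from the slides), the traditional model is:

```latex
Y = G\,\beta_G + C\,\beta_C + \eta
```

where Y is the phenotype matrix, G the genotype matrix, and C the covariate matrix.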
Under the GloWGR model, we residualize our outcome, Y with the Y-hat phenotype predictor to account for polygenic effects. But before I get into the GWAS functionality that we wrote for GloWGR and PySpark, I’m going to reflect a bit on our original GWAS implementation. As I mentioned at the beginning, we wrote everything as native Spark SQL expressions. In this setting, we had to group the original data by trait and by variant for our fitted model. And this was due to limitations in Scala’s linear algebra functionality.
Specifically, the primary linear algebra library in Scala, called Breeze, only works on data with up to two dimensions. Because of further limitations in Spark's linear algebra APIs, we also incurred the cost of conversions between Spark DataFrames, Spark ML matrices, and Breeze matrices. As you can guess, this resulted in a non-negligible performance hit.
So why did we go through all of this effort? As I mentioned earlier, we assumed that our user base would be like us and very Scala-focused, and since Spark SQL expressions are portable to every language, we maximized for user flexibility. But as our user base expanded, we realized that Python was actually our users' language of choice. As a result, implementing algorithms as Spark SQL expressions imposed unnecessary limitations on our performance, as well as our development velocity.
These limitations affected both our user base and us as developers. The first major barrier was learning how to write Spark SQL expressions at all. In our case, the Spark SQL expressions for GWAS were especially complicated because our model was stateful. As you probably saw earlier, we had to calculate a null model and store it in thread-local state because it had to be shared.
As our algorithm required a complicated series of linear algebra operations, we also needed to use Breeze, which is the linear algebra library in Scala. Unfortunately, to get from a Spark DataFrame to a Breeze matrix, we had to perform expensive customized collections and transformations. And even when we got to the level of a Breeze matrix, there were major limitations, including the limitation of two dimensions.
Spark SQL expressions also force the users to work with Spark DataFrames as input and as output. The required input format here was unnatural and inefficient, so no one, not even us, stored Spark DataFrames as needed by the GWAS expressions. Therefore, a prerequisite to using our GWAS functionality was a willingness to perform extensive amounts of data wrangling.
That's why in Glow v1, we rewrote our GWAS regression algorithms using PySpark, and pandas UDFs in particular. At Databricks, we're not really a genomics company, so for the GWAS part of GloWGR, we also depended on our partnership with the Regeneron Genetics Center. Migrating to pandas UDFs for GWAS was driven by their domain expertise.
With the addition of mapInPandas to PySpark in Spark 3.0, the developers at the RGC could write and test the regression algorithms in Python, and link them up to Spark when they were ready. This allowed us to share the burden of development with domain experts who interface directly with the users day-to-day.
You'll also note here that we no longer have to split the data by both trait and variant when processing the fitted model. Rather, we can parallelize along all dimensions, as pandas works even with high-dimensional data, leveraging the batched, minimal-overhead conversions with Arrow.
Finally, you'll see that the input is no longer a single Spark DataFrame, which, as I mentioned earlier, felt inefficient and unnatural. Rather, we allow users to bring their data to the model with pandas, minimizing the amount of data wrangling. By going from rolling our own Spark SQL expressions to using native Python functions, we were able to improve developer velocity, algorithmic performance, and user experience. The only downside was that users had to access the API via Python.
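As a sketch of what this looks like from the user's side (the function lives in Glow's Python package; treat the exact argument names as assumptions based on Glow's documentation):

```python
import glow.gwas as gwas

# genotype_df is a Spark DataFrame; the phenotypes, covariates, and
# WGR offsets are plain pandas DataFrames brought straight to the model.
results = gwas.linear_regression(
    genotype_df,
    phenotype_df,        # pandas DataFrame of traits
    covariate_df,        # pandas DataFrame of covariates
    offset_df=offset_df  # pandas DataFrame of GloWGR offsets
)
```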
But as I mentioned earlier, based on usage patterns, this was actually a non-issue. And we decided that the pros outweighed the cons. So to cap this all off, here’s a quick summary of what we’ve covered in the GWAS section. In Spark SQL, all input and output must be Spark DataFrames. In PySpark, we can also use pandas DataFrames. This reduces the amount of data wrangling and can feel more natural from a user perspective.
In Scala land, the linear algebra libraries are quite limited, and data transfers can be expensive. In Python, there are a large number of linear algebra libraries that are intercompatible. And when paired with Arrow, we have batched, optimized data transfers between Spark and pandas.
For GloWGR, we also used a library called Einsum, which allowed us to write high-dimensional linear algebra expressions in a compact, readable fashion. Einsum here stands for Einstein summation convention. Also, Einsum has a cool feature where we can cache intermediate results.
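For a flavor of why einsum is so handy (a toy example, not Glow's actual expressions), here is covariate residualization written as two contractions:

```python
import numpy as np

# Index letters: s = samples, v = variants, c = covariates
rng = np.random.default_rng(0)
G = rng.random((100, 50))                  # genotype matrix (samples x variants)
Q, _ = np.linalg.qr(rng.random((100, 3)))  # orthonormal covariate basis

# Project the covariates out of G: G_resid = G - Q (Q^T G)
QtG = np.einsum('sc,sv->cv', Q, G)
G_resid = G - np.einsum('sc,cv->sv', Q, QtG)
```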
Of course, all of these benefits do come with a downside. PySpark only works from Python. But if your user base is as Python-focused as ours, this is an acceptable trade-off. We’ve spent the meat of this talk discussing GloWGR, but it’s only part of the bigger story of how bioinformaticians interact with their data.
Through our journey to the v1 release, we learned that an important principle was letting the open-source community shape the library to fit their specific needs. By building Glow directly on top of a general purpose engine, the Glow community benefits from work done by the broader open-source community. This has been key as we expand the breadth of the Glow project, as we can now leverage recent developments in the Spark space.
Because Glow is so lightweight, it can also work out of the box with other open-source libraries like Delta. The flexible nature of Glow also makes it amenable to custom, ad hoc extensions. By avoiding the use of custom wrappers, we create a minimal barrier of entry for our users to work with PySpark DataFrames that are ingested and processed with Glow.
One extension that we've already seen users implement is gene burden tests. These tests can be thought of as an extension of traditional GWAS. While traditional genome-wide association studies process a single variant at a time, a gene burden test combines multiple variants to increase power. Using native PySpark functions and pandas UDFs, Glow users can extend existing algorithms for their use case.
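As a sketch of how such an extension might look (the column names and the simple sum aggregation are hypothetical; real burden tests weight variants more carefully):

```python
import pandas as pd

# Collapse the rare-variant genotypes within each gene into one burden score
# per sample, using a grouped pandas UDF.
def gene_burden(pdf: pd.DataFrame) -> pd.DataFrame:
    return (pdf.groupby("sample_id", as_index=False)["genotype"].sum()
               .assign(gene=pdf["gene"].iloc[0]))

# Hypothetical usage on a Spark DataFrame of per-gene, per-sample genotypes:
# burden_df = genotypes_df.groupBy("gene").applyInPandas(
#     gene_burden, schema="sample_id string, genotype double, gene string")
```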
This framework is an example of how Glow fits into the Big Data ecosystem. By combining Glow’s functionality with Delta, we can store genomic data at different stages of processing, while naturally addressing the N + 1 problem. At the end of the day, we hope that Glow helps bioinformaticians extract, transform and load, or ETL, their data, just like any other big data.
So what are the big takeaways here? The lessons that we learned as library developers are not limited to just the genomics space. In fact, I’d say they’re not even limited to machine learning or PySpark users.
First of all, usability trumps flexibility. Proactively listen to your users. If they don’t want something, don’t build it. In our particular case, we assumed that our users wanted multi-language flexibility. As we listened to our users more closely, we realized that we had made usability trade-offs, and we should have gone the other way around.
Second, don't roll your own. If there's a popular, well-established, general-purpose tool that works for you, use it. As a small developer team, we would not have been able to create, test, and distribute our own data lake format like Delta, big data engine like Spark, or linear algebra library like pandas. And don't get stuck on old dependency versions: if there is a new version, take advantage of it.
Third, if you missed points one or two, don’t keep digging yourself deeper into the hole. The earlier you change direction, the easier it is for you as a developer, as well as for your users. In the long term, you’ll minimize maintenance costs and increase user satisfaction. Well, that’s all we have for today. Thank you so much for coming to our talk. If you have any questions, we’ll be answering them in the Q&A.

Karen Feng

Karen Feng is a software engineer at Databricks. She works on Spark SQL and genomics applications on Spark, including Project Glow. Before Databricks, she developed statistical algorithms for genomics...

Kiavash Kianfar

Kiavash Kianfar, Ph.D., is a Sr. Software Engineer at Databricks. He develops algorithms and software for the Delta and streaming project as well as genomics applications (including Project Glow). Kia...