Extending Machine Learning Algorithms with PySpark

May 26, 2021 04:25 PM (PT)

Machine learning practitioners are most comfortable working in high-level programming languages such as Python. That preference becomes a barrier when they need to parallelize algorithms with big data frameworks such as Apache Spark, which are written in lower-level languages. Databricks partnered with the Regeneron Genetics Center to create Glow, a library for population-scale genomics data storage and analytics. Glow V1.0.0 includes PySpark-based implementations of both existing and novel machine learning algorithms. We will discuss how tooling for Python users, especially Pandas UDFs, accelerated our development and affected our algorithms’ computational performance.

In this session watch:
Karen Feng, Developer, Databricks
Kiavash Kianfar, Developer, Databricks

 

Karen Feng

Karen Feng is a software engineer at Databricks. She works on Spark SQL and genomics applications on Spark, including Project Glow. Before Databricks, she developed statistical algorithms for genomics...

Kiavash Kianfar

Kiavash Kianfar, Ph.D., is a Sr. Software Engineer at Databricks. He develops algorithms and software for the Delta and streaming project as well as genomics applications (including Project Glow). Kia...