Spark SQL provides a convenient layer of abstraction for users to express their query’s intent while letting Spark handle the more difficult task of query optimization. Since spark 2.3, the addition of pandas UDFs allows the user to define arbitrary functions in python that can be executed in batches, allowing the user the flexibility required to write queries that suit very niche cases. At Quantcast, we have developed a model training pipeline that collects the training data for tens of thousands of models from petabytes of logs. Due to the scale of data that this pipeline deals with we spent considerable effort trying to optimize spark SQL to make our queries as efficient as possible. This resulted in several techniques that use pandas UDFs to run highly specialized batch processing jobs that speed up our data processing pipelines by over an order of magnitude. This talk will go over the learnings we gained from this process, focusing mainly on how we were able to leverage our custom UDFs to provide significant performance gains. The main takeaways of this talk are:
Michael Tong is a Machine Learning Engineer at Quantcast. His current projects at Quantcast focus on developing model training pipelines to process petabytes of data to train tens of thousands of models. In order to accomplish this, he has heavily utilized spark sql by writing queries that utilize specialized UDFs to achieve performance orders of magnitude over the naive spark sql solutions. Michael has a Master's degree from UC Berkeley in Electrical Engineering and Computer Science where he focused on physics simulators and machine learning.