Michael Tong

Machine Learning Engineer, Quantcast

Michael Tong is a Machine Learning Engineer at Quantcast. His current projects at Quantcast focus on developing model training pipelines to process petabytes of data to train tens of thousands of models. In order to accomplish this, he has heavily utilized spark sql by writing queries that utilize specialized UDFs to achieve performance orders of magnitude over the naive spark sql solutions. Michael has a Master’s degree from UC Berkeley in Electrical Engineering and Computer Science where he focused on physics simulators and machine learning.

UPCOMING SESSIONS

PAST SESSIONS

Accelerating Data Processing in Spark SQL with Pandas UDFsSummit 2020

Spark SQL provides a convenient layer of abstraction for users to express their query's intent while letting Spark handle the more difficult task of query optimization. Since spark 2.3, the addition of pandas UDFs allows the user to define arbitrary functions in python that can be executed in batches, allowing the user the flexibility required to write queries that suit very niche cases. At Quantcast, we have developed a model training pipeline that collects the training data for tens of thousands of models from petabytes of logs. Due to the scale of data that this pipeline deals with we spent considerable effort trying to optimize spark SQL to make our queries as efficient as possible. This resulted in several techniques that use pandas UDFs to run highly specialized batch processing jobs that speed up our data processing pipelines by over an order of magnitude. This talk will go over the learnings we gained from this process, focusing mainly on how we were able to leverage our custom UDFs to provide significant performance gains. The main takeaways of this talk are:

  1. Learning what spark SQL tends to do well and what it tends to do poorly.
  2. Some ideas that you can implement in UDFs that can potentially speed up queries by over an order of magnitude.
  3. Ways to profile your spark SQL jobs quickly to check if your ideas are working as intended.