PandasUDFs: One Weird Trick to Scaled Ensembles

When I was tasked with improving our predictions of when customers were likely to purchase in a category, I ran into a problem: we had one model trying to predict everything from milk and eggs to batteries and tea. Category-specific models improved our predictions, but how could I possibly build one for every category we had?


Turns out, PandasUDFs were my One Weird Trick for solving this problem and many others. By using them, I was able to take already-written development code, add a function decorator, and scale my analysis to every category with minimal effort. Runs that took 10 hours finished in 30 minutes. You too can use this One Weird Trick to scale from one model to whole ensembles of models.
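
As a rough illustration of the pattern (not the code from the talk), here is a minimal sketch of a per-category model written as an ordinary pandas function. The column names (`category`, `weeks_since_purchase`, `units`), the straight-line "model", and the helper `fit_all_categories` are all hypothetical stand-ins. The Spark hand-off assumes Spark 3.0 or later, where `groupBy(...).applyInPandas(...)` is the grouped-map entry point; on Spark 2.3/2.4 the same function would instead be decorated with `@pandas_udf(schema, PandasUDFType.GROUPED_MAP)` and passed to `.apply(...)`.

```python
import numpy as np
import pandas as pd


def fit_category(pdf: pd.DataFrame) -> pd.DataFrame:
    """Fit one tiny model (a least-squares line) to a single category's rows.

    Spark hands this function every row of one group as a plain pandas
    DataFrame, so it can be developed and debugged without a cluster.
    """
    x = pdf["weeks_since_purchase"].to_numpy(dtype=float)
    y = pdf["units"].to_numpy(dtype=float)
    slope, intercept = np.polyfit(x, y, deg=1)
    return pd.DataFrame(
        {
            "category": [pdf["category"].iloc[0]],
            "slope": [float(slope)],
            "intercept": [float(intercept)],
        }
    )


# Output schema Spark needs to know up front (DDL-style string).
RESULT_SCHEMA = "category string, slope double, intercept double"


def fit_all_categories(spark_df):
    """Hypothetical helper: run fit_category once per category on a Spark DataFrame."""
    return spark_df.groupBy("category").applyInPandas(fit_category, schema=RESULT_SCHEMA)


# Local check with plain pandas, using made-up data for two categories.
sample = pd.DataFrame(
    {
        "category": ["milk", "milk", "milk", "tea", "tea", "tea"],
        "weeks_since_purchase": [1, 2, 3, 1, 2, 3],
        "units": [1, 3, 5, 3, 2, 1],
    }
)
local_result = pd.concat(
    [fit_category(group) for _, group in sample.groupby("category")],
    ignore_index=True,
)
print(local_result)
```

The appeal of this shape is that the modeling code stays pure pandas: the same `fit_category` runs in a notebook on a sample and, unchanged, on every category in parallel once Spark does the grouping.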


Topics covered will include:


  • General outline of use and how they fit into your workflows
  • Types of PandasUDFs
  • The Ser/De limit and how to work around it
  • Equivalents in R and Koalas

About Paul Anzel

Paul Anzel is a data engineer (and former data scientist) who has focused on taking one-off analyses and turning them into production data products. He has worked on statistical process control for data workflows, product substitutions, insurance claim fraud, assessing vehicle crashes, and price-demand estimation. Prior to working in data science, he was a grad student in Applied Physics working in acoustic non-destructive evaluation. He is an instructor for Software Carpentry. Outside of work, Paul likes repairing bikes (and sometimes riding them too), playing accordion and piano, and keeping up with his very energetic toddler.