Scaling MLOps to Retrain 50k Weekly Models in Parallel Using UDFs
OVERVIEW
| EXPERIENCE | In Person |
| --- | --- |
| TYPE | Breakout |
| TRACK | Data Science and Machine Learning |
| INDUSTRY | Enterprise Technology |
| TECHNOLOGIES | AI/Machine Learning, Apache Spark, MLflow |
| SKILL LEVEL | Intermediate |
| DURATION | 40 min |
At data.ai, our machine learning team uses the Databricks Platform to adopt MLOps best practices for high-frequency retraining. We rely on Databricks and MLflow to track experiments, improve code consistency, and safeguard model retraining against data volatility. However, as a global data provider delivering insights for the entire mobile marketplace, we face specific constraints when parallelizing model training across the tremendous combinatorics required: we train ~6 models each for >60 categories in >150 countries. Here, I will describe the framework our team has created to incorporate MLOps into weekly retraining for ~50k sklearn models in parallel. I will demonstrate how arbitrary code can be applied per group using Pandas UDFs and, therefore, how MLflow logging and model registration can be applied at scale to any grouped data. Finally, I will discuss the limitations of this approach and how it might be adapted for a more time-sensitive use case.
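To make the grouped-training pattern concrete, the following is a minimal sketch, not the speaker's actual code: it assumes a Spark DataFrame with illustrative `country`/`category` grouping keys and toy feature columns, trains one scikit-learn model per group via a grouped Pandas UDF (`applyInPandas`), and logs each model to MLflow from inside the grouped function. It also assumes the executors can reach the MLflow tracking store (true for a local run, where MLflow falls back to a local `./mlruns` directory).

```python
# A minimal sketch of the pattern described in the abstract: train one
# scikit-learn model per (country, category) group with a grouped Pandas
# UDF, logging each model to MLflow. The column names, toy data, and
# Ridge model are illustrative assumptions, not data.ai's code.
import mlflow
import mlflow.sklearn
import pandas as pd
from pyspark.sql import SparkSession
from sklearn.linear_model import Ridge

spark = SparkSession.builder.getOrCreate()

# Toy training data: one row per observation, keyed by the grouping columns.
training_df = spark.createDataFrame(
    [
        ("US", "Games", 1.0, 2.0, 3.0),
        ("US", "Games", 2.0, 1.0, 4.0),
        ("JP", "Finance", 0.5, 0.5, 1.0),
        ("JP", "Finance", 1.5, 0.5, 2.0),
    ],
    "country string, category string, x1 double, x2 double, y double",
)

# Each group returns one summary row matching this schema.
result_schema = "country string, category string, n_rows long, run_id string"


def train_group(pdf: pd.DataFrame) -> pd.DataFrame:
    """Train and log one model for a single (country, category) slice.

    Spark invokes this once per group on an executor, so every group
    trains independently and in parallel across the cluster.
    """
    country = pdf["country"].iloc[0]
    category = pdf["category"].iloc[0]

    model = Ridge()
    model.fit(pdf[["x1", "x2"]], pdf["y"])

    # One MLflow run per group, so each model is tracked (and can later
    # be registered) individually. This assumes the executors can reach
    # the tracking store; a local run writes to ./mlruns.
    with mlflow.start_run(run_name=f"{country}-{category}") as run:
        mlflow.log_params({"country": country, "category": category})
        mlflow.sklearn.log_model(model, artifact_path="model")

    return pd.DataFrame(
        [
            {
                "country": country,
                "category": category,
                "n_rows": len(pdf),
                "run_id": run.info.run_id,
            }
        ]
    )


# applyInPandas fans the grouped function out across the cluster; the
# result is a small DataFrame with one row (and one MLflow run) per model.
results = training_df.groupBy("country", "category").applyInPandas(
    train_group, schema=result_schema
)
results.show()
```

From the returned `run_id` column, each model could then be registered individually (for example with `mlflow.register_model` on a `runs:/<run_id>/model` URI), which is how logging and registration extend to any grouped data.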
SESSION SPEAKERS
Kaleb Lowe
Staff Machine Learning Engineer
Data.AI