To get good results from Machine Learning (ML) models, data scientists almost always tune hyperparameters such as learning rate and regularization strength. This tuning can be critical for performance and accuracy, but it is also routine and laborious to do manually. This talk discusses automation for tuning, scaling via Apache Spark, and best practices for tuning workflows and architecture. We will use a running demo of Hyperopt, one of the most popular open-source tools for tuning ML in Python. Our team contributed a Spark-powered backend for scaling out Hyperopt, and we will use this tool to discuss challenges and demonstrate best practices.
After a quick introduction to hyperparameter tuning and Hyperopt, we will discuss workflows for tuning. How should a data scientist begin, selecting what to tune and how? How should they track their work, evaluate progress, and iterate? We will demo using MLflow for tracking and visualization. We will then discuss architectural patterns for tuning. How can a data scientist tune single-machine ML workflows vs. distributed ones? How can data ingestion be optimized with Spark, and how should the Spark cluster be configured?
We will wrap up with mentions of other efforts around scaling out tuning in the Spark and AI ecosystem. Our team’s recent release of joblib-spark, a Joblib Apache Spark Backend, simplifies distributing scikit-learn tuning jobs across a Spark cluster. This talk will be generally accessible, though knowledge of ML and Spark will help.
Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.