Viswesh Periyasamy is a Software Engineer on the Machine Learning team at Databricks, primarily focused on building infrastructure for model training and tuning. Previously, he was a Software Engineer at Confluent and received his MS in Machine Learning and Bioinformatics from the University of Wisconsin-Madison.
May 27, 2021 03:50 PM PT
Hyperparameter tuning is a key step in achieving and maintaining optimal performance from Machine Learning (ML) models. Today, many open-source frameworks help automate the process and employ statistical algorithms to search the parameter space efficiently. However, optimizing these parameters over a sufficiently large dataset or search space can be computationally infeasible on a single machine. Apache Spark is a natural candidate to accelerate such workloads, but naive parallelization can actually degrade both the speed and the accuracy of the search.
In this talk, we’ll discuss how to efficiently leverage Spark to distribute a tuning workload and go over some common pitfalls. Specifically, we’ll start with a brief introduction to tuning and the motivation for moving to a distributed workflow. Next, we’ll cover best practices for using Spark with Hyperopt, a popular, flexible, open-source tool for hyperparameter tuning. This will include topics such as how to distribute the training data and how to size the cluster appropriately for the problem at hand. We’ll also touch on the tension between parallel computation and Sequential Model-Based Optimization methods, such as the Tree-structured Parzen Estimator (TPE) algorithm implemented in Hyperopt. Afterwards, we’ll demonstrate these practices with Hyperopt using the SparkTrials API. Additionally, we’ll showcase joblib-spark, an extension our team recently developed, which uses Spark as a distributed backend for scikit-learn to accelerate tuning and training.
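To give a flavor of the workflow, here is a minimal sketch of a Hyperopt search distributed with SparkTrials. It is an illustration under stated assumptions rather than the code from the talk: the toy quadratic objective, the function names `objective`, `feedback_rounds` and `run_search`, and the default values for `max_evals` and `parallelism` are all hypothetical. The `feedback_rounds` helper is only a rough heuristic for the tension mentioned above: the more trials run at once, the fewer times an adaptive algorithm such as TPE gets to learn from completed results.

```python
import math


def objective(params):
    """Toy loss with its minimum at x = 3; a stand-in for training a real model."""
    x = params["x"]
    # Hyperopt accepts a dict with "loss" and "status" keys;
    # "ok" is the value of hyperopt.STATUS_OK.
    return {"loss": (x - 3.0) ** 2, "status": "ok"}


def feedback_rounds(max_evals, parallelism):
    """Rough heuristic (an assumption, not a Hyperopt quantity): the number of
    sequential batches from which an adaptive search can learn."""
    return math.ceil(max_evals / parallelism)


def run_search(max_evals=32, parallelism=4):
    """Run the search on a Spark cluster; requires hyperopt and pyspark."""
    # Imported here so the helpers above stay usable without Spark installed.
    from hyperopt import SparkTrials, fmin, hp, tpe

    # Hypothetical one-dimensional search space.
    space = {"x": hp.uniform("x", -10.0, 10.0)}
    # SparkTrials sends each trial to a Spark worker; parallelism caps
    # how many trials are evaluated concurrently.
    trials = SparkTrials(parallelism=parallelism)
    return fmin(
        fn=objective,
        space=space,
        algo=tpe.suggest,
        max_evals=max_evals,
        trials=trials,
    )
```

Under this heuristic, 32 evaluations with a parallelism of 4 leave roughly 8 rounds of feedback, whereas a parallelism of 32 leaves a single round, which makes the search behave much like random search.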
This talk will be generally accessible to those familiar with ML and particularly useful for those looking to scale up their training with Spark.