Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.
To get good results from Machine Learning (ML) models, data scientists almost always tune hyperparameters---learning rate, regularization, etc. This tuning can be critical for performance and accuracy, but it is also routine and laborious to do manually. This talk discusses automation for tuning, scaling via Apache Spark, and best practices for tuning workflows and architecture. We will use a running demo of Hyperopt, one of the most popular open-source tools for tuning ML in Python. Our team contributed a Spark-powered backend for scaling out Hyperopt, and we will use this tool to discuss challenges and demonstrate best practices. After a quick introduction to hyperparameter tuning and Hyperopt, we will discuss workflows for tuning.
How should a data scientist begin, selecting what to tune and how? How should they track their work, evaluate progress, and iterate? We will demo using MLflow for tracking and visualization. We will then discuss architectural patterns for tuning. How can a data scientist tune single-machine ML workflows vs. distributed ones? How can data ingest be optimized with Spark, and how should the Spark cluster be configured? We will wrap up by mentioning other efforts around scaling out tuning in the Spark and AI ecosystem. Our team's recent release of joblib-spark, a Joblib Apache Spark Backend, simplifies distributing scikit-learn tuning jobs across a Spark cluster. This talk will be generally accessible, though knowledge of ML and Spark will help.
Hyperparameter tuning and optimization are powerful tools in the area of AutoML, for both traditional statistical learning models and deep learning. There are many existing tools to help drive this process, including both blackbox and whitebox tuning. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, Bayesian optimization, and Tree-structured Parzen Estimators) and then discuss the open-source tools which implement each of these techniques. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze how our search is performing and to productionize the best models.
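To make the contrast between techniques concrete, here is a dependency-free sketch of the two simplest ones, grid search and random search, on an invented two-parameter objective; the objective and evaluation budget are purely illustrative.

```python
import itertools
import random

def objective(lr, reg):
    # Stand-in for the validation error of a model trained with these settings.
    return (lr - 0.1) ** 2 + (reg - 0.01) ** 2

# Grid search: exhaustively evaluate every combination of candidate values.
lrs = [0.001, 0.01, 0.1, 1.0]
regs = [0.0, 0.01, 0.1]
grid_best = min(itertools.product(lrs, regs), key=lambda p: objective(*p))

# Random search: spend the same budget on points sampled from the space.
random.seed(0)
samples = [(10 ** random.uniform(-3, 0), random.uniform(0, 0.1))
           for _ in range(len(lrs) * len(regs))]
rand_best = min(samples, key=lambda p: objective(*p))
```

Bayesian optimization and Tree-structured Parzen Estimators improve on both by using past evaluations to choose where to sample next.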
This talk discusses developments within Apache Spark to allow deployment of MLlib models and pipelines within Structured Streaming jobs. MLlib has a proven record of success and wide adoption for fitting Machine Learning (ML) models on big data. Scalability, expressive Pipeline APIs, and Spark DataFrame integration are key strengths. Separately, the development of Structured Streaming has provided Spark users with intuitive, performant tools for building Continuous Applications. The smooth integration of batch and streaming APIs and workflows greatly simplifies many production use cases. Given the adoption of MLlib and Structured Streaming in production systems, a natural next step is to combine them: deploying MLlib models and Pipelines for scoring (prediction) in Structured Streaming. However, before Apache Spark 2.3, many ML Pipelines could not be deployed in streaming. This talk discusses key improvements within MLlib to support streaming prediction. We will discuss currently supported functionality and opportunities for future improvements. With Spark 2.3, almost all MLlib workflows can be deployed for scoring in streaming, and we will demonstrate this live. The ability to deploy full ML Pipelines which include featurization greatly simplifies moving complex ML workflows from development to production. We will also discuss technical challenges, such as featurization via Estimators vs. Transformers and DataFrame column metadata. Session hashtag: #ML2SAIS
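The deployment pattern described above, scoring a saved Pipeline inside a Structured Streaming query, can be sketched as below. This is a hedged outline rather than a runnable demo: it assumes a live SparkSession plus a persisted PipelineModel and a streaming JSON source at caller-supplied (illustrative) paths, so the logic is wrapped in a function instead of being executed.

```python
def score_stream(spark, model_path, source_path, schema):
    """Sketch: load a fitted MLlib PipelineModel and apply it to a stream.

    `spark` is a live SparkSession; `model_path`, `source_path`, and
    `schema` are illustrative, caller-supplied inputs.
    """
    from pyspark.ml import PipelineModel

    # A full Pipeline (featurization + model) saved during development.
    model = PipelineModel.load(model_path)

    # A Structured Streaming DataFrame; transform() applies to it the same
    # way it applies to a batch DataFrame.
    stream_df = (spark.readStream
                 .schema(schema)
                 .json(source_path))
    predictions = model.transform(stream_df)

    # Write predictions out continuously (console sink, for demonstration).
    return (predictions.writeStream
            .format("console")
            .outputMode("append")
            .start())
```

The key point is that the same model.transform() call works unchanged on batch and streaming DataFrames.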
Machine Learning workflows involve complex sequences of data transformations, learning algorithms, and parameter tuning. Spark ML Pipelines, introduced in Spark 1.2, have grown into a powerful framework for developing ML workflows. This talk will cover basic Pipeline concepts and then demonstrate their usage: (1) Building: Pipelines simplify the process of specifying an ML workflow. (2) Debugging: Pipelines and DataFrames permit users to inspect and debug the workflow. (3) Tuning: Built-in support for parameter tuning helps users optimize ML performance.
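The three steps above can be sketched with a small text-classification Pipeline. This is an outline only, wrapped in a function because it assumes a live SparkSession and a DataFrame with 'text' and 'label' columns (illustrative names).

```python
def build_and_tune(train_df):
    """Sketch of building, debugging, and tuning a Spark ML Pipeline."""
    from pyspark.ml import Pipeline
    from pyspark.ml.feature import Tokenizer, HashingTF
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
    from pyspark.ml.evaluation import BinaryClassificationEvaluator

    # (1) Building: specify the workflow as a sequence of stages.
    tokenizer = Tokenizer(inputCol="text", outputCol="words")
    hashing_tf = HashingTF(inputCol="words", outputCol="features")
    lr = LogisticRegression(featuresCol="features", labelCol="label")
    pipeline = Pipeline(stages=[tokenizer, hashing_tf, lr])

    # (2) Debugging: any stage can be applied on its own and its output
    # DataFrame inspected, e.g. tokenizer.transform(train_df).show().

    # (3) Tuning: built-in cross-validation over a parameter grid.
    grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
    cv = CrossValidator(estimator=pipeline,
                        estimatorParamMaps=grid,
                        evaluator=BinaryClassificationEvaluator(),
                        numFolds=3)
    return cv.fit(train_df)
```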
This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark's distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib's integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.
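The single-machine starting point described above might look like the following scikit-learn snippet (the dataset and grid are illustrative); this tuning loop is the first piece of the workflow that hits a scaling bottleneck and gets parallelized with Spark.

```python
# Single-machine baseline: scikit-learn hyperparameter tuning on a small dataset.
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# An illustrative grid; this inner loop is what becomes a bottleneck
# as the data and the grid grow.
grid = {"C": [0.01, 0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), grid, cv=5)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```

Because each grid point is independent, this loop parallelizes naturally, which is what tools like joblib-spark exploit.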
This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science, for both casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science. (1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models---and entire Pipelines including feature transformations---can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving. (2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models. (3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms. Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.
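The persistence workflow in point (1) can be sketched as follows. This is a hedged outline wrapped in a function rather than executed, since it assumes an already-fitted PipelineModel and a caller-supplied (illustrative) storage path reachable from both Spark deployments.

```python
def save_and_reload(fitted_pipeline, path):
    """Sketch of MLlib pipeline persistence: save a fitted PipelineModel
    on one Spark deployment and load it on another.

    `fitted_pipeline` is an already-fitted PipelineModel; `path` is an
    illustrative location on a shared filesystem or object store.
    """
    from pyspark.ml import PipelineModel

    # Persist the entire pipeline, feature transformations included.
    fitted_pipeline.write().overwrite().save(path)

    # On the serving deployment, reload the full pipeline for scoring.
    return PipelineModel.load(path)
```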