Joseph Bradley - Databricks

Joseph Bradley

Software Engineer, Databricks

Joseph Bradley is a Software Engineer and Apache Spark PMC member working on Machine Learning at Databricks. Previously, he was a postdoc at UC Berkeley after receiving his Ph.D. in Machine Learning from Carnegie Mellon in 2013.


Deploying MLlib for Scoring in Structured Streaming

This talk discusses developments within Apache Spark to allow deployment of MLlib models and pipelines within Structured Streaming jobs. MLlib has proven success and wide adoption for fitting Machine Learning (ML) models on big data. Scalability, expressive Pipeline APIs, and Spark DataFrame integration are key strengths. Separately, the development of Structured Streaming has provided Spark users with intuitive, performant tools for building Continuous Applications. The smooth integration of batch and streaming APIs and workflows greatly simplifies many production use cases. Given the adoption of MLlib and Structured Streaming in production systems, a natural next step is to combine them: deploy MLlib models and Pipelines for scoring (prediction) in Structured Streaming. However, before Apache Spark 2.3, many ML Pipelines could not be deployed in streaming. This talk discusses key improvements within MLlib to support streaming prediction. We will discuss currently supported functionality and opportunities for future improvements. With Spark 2.3, almost all MLlib workflows can be deployed for scoring in streaming, and we will demonstrate this live. The ability to deploy full ML Pipelines which include featurization greatly simplifies moving complex ML workflows from development to production. We will also include some discussion of technical challenges, such as featurization via Estimators vs. Transformers and DataFrame column metadata.

Building, Debugging, and Tuning Spark Machine Learning Pipelines

Machine Learning workflows involve complex sequences of data transformations, learning algorithms, and parameter tuning. Spark ML Pipelines, introduced in Spark 1.2, have grown into a powerful framework for developing ML workflows. This talk will cover basic Pipeline concepts and then demonstrate their usage: (1) Building: Pipelines simplify the process of specifying a ML workflow. (2) Debugging: Pipelines and DataFrames permit users to inspect and debug the workflow. (3) Tuning: Built-in support for parameter tuning helps users optimize ML performance.

Combining the Strengths of MLlib, scikit-learn, and R

This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark's distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib's integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.

Additional Reading:
  • MLlib and Machine Learning

  • Apache Spark MLlib 2.0 Preview: Data Science and Production

    This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science---both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science. (1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models---and entire Pipelines including feature transformations---can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving. (2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models. (3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms. Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.