Joseph Bradley works as a Sr. Solutions Architect at Databricks, specializing in Machine Learning, and is an Apache Spark committer and PMC member. Previously, he was a Staff Software Engineer at Databricks and a postdoc at UC Berkeley, after receiving his Ph.D. in Machine Learning from Carnegie Mellon.
May 26, 2021 11:30 AM PT
We will share our experiences in building Data Science and Machine Learning (DS/ML) into organizations. As new DS/ML teams are created, many wrestle with questions such as: How can we most efficiently achieve short-term goals while planning for scale and production long-term? How should DS/ML be incorporated into a company?
We will bring unique perspectives: one as a previous Databricks customer leading a DS team, one as the second ML engineer at Databricks, and both as current Solutions Architects guiding customers through their DS/ML journeys.We will cover best practices through the crawl-walk-run journey of DS/ML: how to immediately become more productive with an initial team, how to scale and move towards production when needed, and how to integrate effectively with the broader organization.
This talk is meant for technical leaders who are building new DS/ML teams or helping to spread DS/ML practices across their organizations. Technology discussion will focus on Databricks, but the lessons apply to any tech platforms in this space.
May 27, 2021 03:15 PM PT
In this session, you will learn how to scale their exploratory data analysis and data science workflows with Databricks. You will learn how you can collaborate with team members writing code in different languages (Python, R, Scala, SQL) using Databricks Workspace, explore data with interactive visualizations, and discover new insights, securely share code with co-authoring, commenting, automatic versioning, Git integrations, and role-based access controls. You will learn best practices for managing experiments, projects, and models using MLflow. Attendees will build a pipeline to log and deploy machine learning models to production.
This session will be "follow along" - you are welcome to try running the notebooks yourself 'live', but it is not required. They can be re-run later as well. If you want to follow along, download the notebooks from https://files.training.databricks.com/classes/data-science-on-databricks/ . We recommend downloading the version with solutions.
For access to Databricks, sign up for free at https://community.cloud.databricks.com/ . Import the notebooks and provision a cluster using Databricks runtime 7.3 ML.
June 23, 2020 05:00 PM PT
To get good results from Machine Learning (ML) models, data scientists almost always tune hyperparameters---learning rate, regularization, etc. This tuning can be critical for performance and accuracy, but it is also routine and laborious to do manually. This talk discusses automation for tuning, scaling via Apache Spark, and best practices for tuning workflows and architecture. We will use a running demo of Hyperopt, one of the most popular open-source tools for tuning ML in Python. Our team contributed a Spark-powered backend for scaling out Hyperopt, and we will use this tool to discuss challenges and demonstrate best practices. After a quick introduction to hyperparameter tuning and Hyperopt, we will discuss workflows for tuning.
How should a data scientist begin, selecting what to tune and how? How should they track their work, evaluate progress, and iterate? We will demo using MLflow for tracking and visualization. We will then discuss architectural patterns for tuning. How can a data scientist tune single-machine ML workflows vs. distributed? How can data ingest be optimized with Spark, and how should the Spark cluster be configured? We will wrap up with mentions of other efforts around scaling out tuning in the Spark and AI ecosystem. Our team's recent release of joblib-spark, a Joblib Apache Spark Backend, simplifies distributing scikit-learn tuning jobs across a Spark cluster. This talk will be generally accessible, though knowledge of ML and Spark will help.
April 23, 2019 05:00 PM PT
Hyperparameter tuning and optimization is a powerful tool in the area of AutoML, for both traditional statistical learning models as well as for deep learning. There are many existing tools to help drive this process, including both blackbox and whitebox tuning. In this talk, we'll start with a brief survey of the most popular techniques for hyperparameter tuning (e.g., grid search, random search, Bayesian optimization, and parzen estimators) and then discuss the open source tools which implement each of these techniques. Finally, we will discuss how we can leverage MLflow with these tools and techniques to analyze how our search is performing and to productionize the best models.
June 4, 2018 05:00 PM PT
This talk discusses developments within Apache Spark to allow deployment of MLlib models and pipelines within Structured Streaming jobs. MLlib has proven success and wide adoption for fitting Machine Learning (ML) models on big data. Scalability, expressive Pipeline APIs, and Spark DataFrame integration are key strengths.
Separately, the development of Structured Streaming has provided Spark users with intuitive, performant tools for building Continuous Applications. The smooth integration of batch and streaming APIs and workflows greatly simplifies many production use cases. Given the adoption of MLlib and Structured Streaming in production systems, a natural next step is to combine them: deploy MLlib models and Pipelines for scoring (prediction) in Structured Streaming.
However, before Apache Spark 2.3, many ML Pipelines could not be deployed in streaming. This talk discusses key improvements within MLlib to support streaming prediction. We will discuss currently supported functionality and opportunities for future improvements. With Spark 2.3, almost all MLlib workflows can be deployed for scoring in streaming, and we will demonstrate this live. The ability to deploy full ML Pipelines which include featurization greatly simplifies moving complex ML workflows from development to production. We will also include some discussion of technical challenges, such as featurization via Estimators vs. Transformers and DataFrame column metadata.
Session hashtag: #ML2SAIS
March 17, 2015 05:00 PM PT
June 14, 2015 05:00 PM PT
Machine Learning workflows involve complex sequences of data transformations, learning algorithms, and parameter tuning. Spark ML Pipelines, introduced in Spark 1.2, have grown into a powerful framework for developing ML workflows. This talk will cover basic Pipeline concepts and then demonstrate their usage:
(1) Building: Pipelines simplify the process of specifying a ML workflow.
(2) Debugging: Pipelines and DataFrames permit users to inspect and debug the workflow.
(3) Tuning: Built-in support for parameter tuning helps users optimize ML performance.
October 28, 2015 05:00 PM PT
This talk discusses integrating common data science tools like Python pandas, scikit-learn, and R with MLlib, Spark's distributed Machine Learning (ML) library. Integration is simple; migration to distributed ML can be done lazily; and scaling to big data can significantly improve accuracy. We demonstrate integration with a simple data science workflow. Data scientists often encounter scaling bottlenecks with single-machine ML tools. Yet the overhead in migrating to a distributed workflow can seem daunting. In this talk, we demonstrate such a migration, taking advantage of Spark and MLlib's integration with common ML libraries. We begin with a small dataset which runs on a single machine. Increasing the size, we hit bottlenecks in various parts of the workflow: hyperparameter tuning, then ETL, and eventually the core learning algorithm. As we hit each bottleneck, we parallelize that part of the workflow using Spark and MLlib. As we increase the dataset and model size, we can see significant gains in accuracy. We end with results demonstrating the impressive scalability of MLlib algorithms. With accuracy comparable to traditional ML libraries, combined with state-of-the-art distributed scalability, MLlib is a valuable new tool for the modern data scientist.
June 7, 2016 05:00 PM PT
This talk highlights major improvements in Machine Learning (ML) targeted for Apache Spark 2.0. The MLlib 2.0 release focuses on ease of use for data science---both for casual and power users. We will discuss 3 key improvements: persisting models for production, customizing Pipelines, and improvements to models and APIs critical to data science.
(1) MLlib simplifies moving ML models to production by adding full support for model and Pipeline persistence. Individual models---and entire Pipelines including feature transformations---can be built on one Spark deployment, saved, and loaded onto other Spark deployments for production and serving.
(2) Users will find it much easier to implement custom feature transformers and models. Abstractions automatically handle input schema validation, as well as persistence for saving and loading models.
(3) For statisticians and data scientists, MLlib has doubled down on Generalized Linear Models (GLMs), which are key algorithms for many use cases. MLlib now supports more GLM families and link functions, handles corner cases more gracefully, and provides more model statistics. Also, expanded language APIs allow data scientists using Python and R to call many more algorithms.
Finally, we will demonstrate these improvements live and show how they facilitate getting started with ML on Spark, customizing implementations, and moving to production.