Automating and Productionizing Machine Learning Pipelines for Real-Time Scoring - Databricks

You’ve fit your machine learning pipeline…now what? As a data scientist, taking a model to production is the single largest barrier to making an impact. As an engineer, how do you integrate machine learning into production applications? This talk will explore how we generalize the data science workflow through three stages: data collection, machine learning, and machine learning pipeline deployment.

First, we’ll talk through how we leverage Spark Structured Streaming to generate consistent and up-to-date data that is available at training and scoring time. Next, we’ll discuss how we built repeatable, scalable, data-agnostic machine learning pipelines that consider a host of algorithms, objective functions, and feature selection and extraction methods to scale the impact of our data scientists. Finally, we’ll show you how to use MLeap to serialize these fitted Spark ML pipelines so they can be evaluated in real time, in tens of milliseconds.
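
To make the last step concrete, here is a minimal sketch of the general idea behind serializing a fitted pipeline for low-latency scoring: export the fitted parameters to a portable format, then evaluate them in a lightweight runtime that does not need a Spark session. This is not MLeap's API or bundle format, and it is not the speakers' implementation. The pipeline stages (a standard scaler followed by logistic regression), the feature names, the parameter values, and the helper names `serialize` and `load_scorer` are all hypothetical and chosen only for illustration.

```python
import json
import math

# Hypothetical fitted parameters. In a real system these would be exported
# from a trained pipeline rather than written by hand.
fitted_pipeline = {
    "features": ["age", "visits"],
    "scaler": {"mean": [40.0, 3.0], "std": [10.0, 2.0]},
    "logistic_regression": {"coefficients": [0.5, 1.0], "intercept": -0.25},
}


def serialize(pipeline: dict) -> str:
    """Write the fitted parameters to a portable JSON string."""
    return json.dumps(pipeline)


def load_scorer(payload: str):
    """Rebuild a single-row scoring function from the serialized parameters."""
    spec = json.loads(payload)
    names = spec["features"]
    mean = spec["scaler"]["mean"]
    std = spec["scaler"]["std"]
    coef = spec["logistic_regression"]["coefficients"]
    intercept = spec["logistic_regression"]["intercept"]

    def score(row: dict) -> float:
        # Stage 1: standardize each feature with the fitted mean and std.
        scaled = [(row[n] - m) / s for n, m, s in zip(names, mean, std)]
        # Stage 2: logistic regression on the scaled features.
        margin = intercept + sum(c * x for c, x in zip(coef, scaled))
        return 1.0 / (1.0 + math.exp(-margin))

    return score


# Training side: serialize once. Serving side: load once, score per request.
payload = serialize(fitted_pipeline)
score = load_scorer(payload)
probability = score({"age": 40.0, "visits": 3.0})
```

Because the scorer is plain in-process arithmetic over one row, each request avoids the overhead of starting a distributed job, which is the property that makes millisecond-scale evaluation plausible.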

Session hashtag: #Ent2SAIS

About David Crespi

David Crespi is a Data Scientist at Red Ventures, where he focuses on optimizing a customer’s journey and experience in the digital marketing space. David is passionate about generalizing data science software infrastructure to model many problems quickly at scale. David graduated from Wake Forest University in 2014 with degrees in Mathematics and Computer Science.

About Jared Piedt

Jared Piedt is a Software Engineer on the data science team at Red Ventures, where he works on their automated machine learning and data management platforms. He graduated from the University of South Carolina in 2016 with a B.S. in Computer Information Systems.