MLeap: Productionize Data Science Workflows Using Spark

If you have worked on real-world data science deployments, this should be all too familiar: Data Scientists use myriad tools to analyze datasets, clean them, build offline models, and validate their performance ad hoc. The resulting scripts are thrown over the wall to Data Engineers and Architects, whose job is to productionize the workflow. The Engineers are left with the unenviable job of not only reproducing the Data Scientists’ conclusions but also scaling the resulting pipeline, both of which require a deep understanding of Data Science itself. As a result, most if not all Data Science deployments in the wild either end up too simplistic or take too long to productionize. Spark itself solves some of the challenges of productionizing data science workflows. However, a large gap remains: how do you take workflows that were trained offline and produce models that can be scored online? MLeap is an open source Spark package designed to serialize your Spark-trained pipelines and transformers, deploy them to a JVM-based API server, and execute real-time, one-off requests. In this talk we motivate the need for such a library, outline the programming time saved by using MLeap, show benchmarks of several online models, and provide a demo as well as examples of how to use MLeap in practice.

Additional Reading:

  • Building Data Science Applications on Databricks

    About Hollin Wilkins

    Hollin Wilkins is a founder of Combust, an ML/AI start-up in the Bay Area. He has been working on machine learning infrastructure since 2015, focusing on platforms that let data scientists and engineers rapidly iterate on ML algorithms and pipeline deployments. Previously he worked in the games industry at Linden Lab on Blocksworld and Versu, helping to build everything from game UI to servers to the custom logic languages that drive user experiences. He holds a degree in Biology from Cornell University and spends his free time hiking with his dog and snowboarding.

    About Mikhail Semeniuk

    Mikhail heads up pricing and data ops at Shift, where he focuses on algorithm-driven pricing systems. Prior to Shift, he was a Director of Data Products at TrueCar and a Statistician at UnitedHealth Group. Mikhail studied Mathematics and Economics at the University of Minnesota, which inspired his mission to bridge the gap between data science and engineering. He grew up in Minneapolis, lived for 6 years in Venice, CA, where he pursued skydiving and hopes of being a decent surfer, and now resides in the Bay Area.