Hollin Wilkins

Co-Founder, Combust, Inc.

Hollin Wilkins is a founder of Combust, an ML/AI start-up in the Bay Area. He has been working on machine learning infrastructure since 2015, focusing on platforms that let data scientists and engineers rapidly iterate on ML algorithms and pipeline deployments. Previously he worked in the games industry at Linden Lab on Blocksworld and Versu, helping to build everything from game UI to servers to custom logic languages that drive user experiences. He holds a degree in Biology from Cornell University and spends his free time hiking with his dog and snowboarding.

SESSIONS

Fully-Reproducible ML Deployment with Spark, Pachyderm, and MLeap

After you train a machine learning pipeline, it can be challenging to deploy it to production, maintain versioned histories of both the model and the data used in training, and make the entire process reproducible. This talk will show you how to use Apache Spark with two open source tools, Pachyderm and MLeap, to achieve all of those goals. Data provenance, provided by Pachyderm, gives a detailed audit of every data source that feeds your data pipeline at every step, as well as a rich, versioned history of your data. Spark ML provides a platform for training full machine learning pipelines on these versioned, tracked datasets, including feature generation/extraction and predictive models. Finally, MLeap provides the tools to instantly deploy, version, share and audit these machine learning pipelines in production. You are left with instant model deployments, full reproducibility of your entire pipeline from data import to production ML pipeline, and a complete audit log, from raw training data sources all the way to the predictions made by the model. Learn how to set up these tools to produce a data pipeline with complete data provenance and model auditing, so that your company can develop ML pipelines quickly, reproducibly and safely, with a high level of visibility into every step. Session hashtag: #SFds3
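The training-and-export step at the heart of this workflow can be sketched in a few lines. The snippet below is a minimal sketch, assuming a recent MLeap release and its documented Spark integration; trainingDf stands in for a DataFrame read from a Pachyderm-versioned repo, and the feature columns are placeholders:

    import ml.combust.bundle.BundleFile
    import ml.combust.mleap.spark.SparkSupport._
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.bundle.SparkBundleContext
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
    import resource._

    // Train an ordinary Spark ML pipeline on the versioned training data.
    val pipeline = new Pipeline().setStages(Array(
      new StringIndexer().setInputCol("label").setOutputCol("labelIndex"),
      new VectorAssembler().setInputCols(Array("f1", "f2")).setOutputCol("features"),
      new LogisticRegression().setLabelCol("labelIndex").setFeaturesCol("features")))
    val model = pipeline.fit(trainingDf)

    // Serialize the fitted pipeline, feature stages and all, to one MLeap bundle.
    val sbc = SparkBundleContext().withDataset(model.transform(trainingDf))
    for (bf <- managed(BundleFile("jar:file:/tmp/model.zip"))) {
      model.writeBundle.save(bf)(sbc).get
    }

The resulting model.zip can then be committed to a Pachyderm repo alongside the data that produced it, which is what makes the training step reproducible and auditable end to end.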

MLLeap, or How to Productionize Data Science Workflows Using Spark

If you have worked on real-world data science deployments, this should be all too familiar to you: Data Scientists use myriad tools, analyze datasets, clean them, build offline models and validate their performance ad hoc. The resulting scripts are thrown over the wall to Data Engineers and Architects whose job is to productionize this workflow. The Engineers are left with the unenviable job of not only reproducing the Data Scientists' conclusions but also scaling the resulting pipeline, both of which require a deep understanding of Data Science itself. As a result, most if not all Data Science deployments in the wild either end up too simplistic or take too long to productionize. There are a variety of challenges in productionizing data science workflows, some of which are solved by Spark itself. However, there is still a large gap that needs to be plugged: how do you take workflows that have been trained offline and produce models that can be scored online? The current machine learning pipelines let us create, serialize and deserialize workflows, but only into transformers and estimators written in Spark. While Spark is a great fit for offline model training, the requirements for online scoring are very different and often include very low-latency querying. The challenge then becomes: how do you serialize machine learning workflows in Spark so that they can be reconstructed on the scoring side? In doing so, do we have to duplicate the prediction code for each transformer and estimator, or is there a better way? MLLeap is an open source Spark package designed to address these needs. In this talk we motivate the need for such a library, outline our work and provide examples of how to use MLLeap in practice.
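To make "reconstructed on the scoring side" concrete, here is a rough sketch of what deserialization looks like with the library's current runtime API (published under the name MLeap; the bundle path and feature names are placeholders). Note that nothing here requires a SparkContext:

    import ml.combust.bundle.BundleFile
    import ml.combust.mleap.core.types._
    import ml.combust.mleap.runtime.MleapSupport._
    import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row}
    import resource._

    // Deserialize the bundle into MLeap's own runtime transformers, so the
    // Spark prediction code does not have to be duplicated by hand.
    val pipeline = (for (bf <- managed(BundleFile("jar:file:/tmp/model.zip")))
      yield bf.loadMleapBundle().get.root).tried.get

    // Score a single request as a one-row leap frame.
    val schema = StructType(StructField("f1", ScalarType.Double),
                            StructField("f2", ScalarType.Double)).get
    val frame = DefaultLeapFrame(schema, Seq(Row(1.0, 2.0)))
    val scored = pipeline.transform(frame).get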

MLeap: Productionize Data Science Workflows Using Spark

If you have worked on real-world data science deployments, this should be all too familiar to you: Data Scientists use myriad tools, analyze datasets, clean them, build offline models and validate their performance ad hoc. The resulting scripts are thrown over the wall to Data Engineers and Architects whose job is to productionize this workflow. The Engineers are left with the unenviable job of not only reproducing the Data Scientists' conclusions but also scaling the resulting pipeline, both of which require a deep understanding of Data Science itself. As a result, most if not all Data Science deployments in the wild either end up too simplistic or take too long to productionize. There are a variety of challenges in productionizing data science workflows, some of which are solved by Spark itself. However, there is still a large gap that needs to be plugged: how do you take workflows that have been trained offline and produce models that can be scored online? MLeap is an open source Spark package designed to serialize your Spark-trained pipelines and transformers, deploy them to a JVM-based API server and execute real-time, one-off requests. In this talk we motivate the need for such a library, outline the programming time saved by using MLeap, show benchmarks of several online models, and provide a demo as well as examples of how to use MLeap in practice.
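The "JVM-based API server" piece can be pictured as follows. This is a toy sketch, not MLeap's actual serving layer: the route, port and request parsing are invented for illustration. It wraps an already-deserialized MLeap transformer in a bare JDK HTTP endpoint to show the shape of real-time, one-off scoring:

    import com.sun.net.httpserver.{HttpExchange, HttpServer}
    import java.net.InetSocketAddress
    import ml.combust.mleap.core.types.StructType
    import ml.combust.mleap.runtime.frame.{DefaultLeapFrame, Row, Transformer}

    // Toy scoring endpoint: one in-memory model, one route, JSON parsing elided.
    def serve(pipeline: Transformer, schema: StructType): Unit = {
      val server = HttpServer.create(new InetSocketAddress(8080), 0)
      server.createContext("/score", (exchange: HttpExchange) => {
        // A real server would parse features from the request body here.
        val frame = DefaultLeapFrame(schema, Seq(Row(1.0, 2.0))) // placeholder row
        val prediction = pipeline.transform(frame).get
          .select("prediction").get.dataset.head.getDouble(0)
        val body = s"""{"prediction": $prediction}""".getBytes("UTF-8")
        exchange.sendResponseHeaders(200, body.length)
        exchange.getResponseBody.write(body)
        exchange.close()
      })
      server.start()
    }

Because the transformer is plain JVM code with no Spark session behind it, each request costs little more than a method call per pipeline stage, which is what makes low-latency online scoring plausible.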

MLeap and Combust.ML: Deploying Machine Learning Models to Production

Data Scientists use myriad tools, analyze datasets, clean them, build offline models and validate their performance ad hoc. The resulting scripts are thrown over the wall to Data Engineers and Architects whose job it is to productionize these workflows. The Engineers are left with the unenviable job of not only reproducing the Data Scientists' conclusions but also scaling the resulting pipeline, both of which require a deep understanding of Data Science itself. As a result, most if not all Data Science deployments in the wild either end up too simplistic or take too long to productionize. There are a variety of challenges in productionizing data science workflows, some of which are solved by Spark itself. However, there is still a large gap that needs to be plugged: how do you take workflows that have been trained offline and produce models that can be scored online? MLeap is an open source Spark package designed to serialize your Spark-trained pipelines and transformers, deploy them to a JVM-based API server and execute real-time, one-off requests. In this talk we motivate the need for such a library, outline the programming time saved by using MLeap, show benchmarks of several online models, and provide a demo as well as examples of how to use MLeap in practice. In addition, we present a platform called Combust.ML that can be used to deploy Spark-trained algorithms to highly scalable, Scala-backed API servers.
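From the caller's point of view, a deployed pipeline is simply an HTTP endpoint. The sketch below is purely illustrative: the URL, route and JSON payload are hypothetical placeholders, not Combust.ML's actual API.

    import java.net.URI
    import java.net.http.{HttpClient, HttpRequest, HttpResponse}

    // Hypothetical one-off scoring request against a deployed model server.
    val client = HttpClient.newHttpClient()
    val request = HttpRequest.newBuilder(URI.create("http://localhost:8080/score"))
      .header("Content-Type", "application/json")
      .POST(HttpRequest.BodyPublishers.ofString("""{"f1": 1.0, "f2": 2.0}"""))
      .build()
    val response = client.send(request, HttpResponse.BodyHandlers.ofString())
    println(response.body()) // e.g. {"prediction": 1.0}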