Productionizing H2O Models with Apache Spark - Databricks

Spark pipelines are a powerful concept for productionizing machine learning workflows. Their API lets users combine data processing with machine learning algorithms and opens opportunities for integration with various machine learning libraries. However, to benefit from the power of pipelines, users need the freedom to choose and experiment with any machine learning algorithm or library.

Therefore, we developed Sparkling Water, which embeds the H2O machine learning library of advanced algorithms into the Spark ecosystem and exposes them via the pipeline API. Furthermore, the algorithms benefit from H2O MOJOs (Model Object, Optimized), a powerful concept shared across the entire H2O platform for storing and exchanging models. MOJOs are designed for effective model deployment, with a focus on scoring speed, traceability, exchangeability, and backward compatibility. In this talk we will explain the architecture of Sparkling Water, with a focus on its integration into Spark pipelines and on MOJOs.

We’ll demonstrate how to create pipelines that integrate H2O machine learning models and how to deploy them using Scala or Python. Furthermore, we will show how to use pre-trained model MOJOs with Spark pipelines.

Session hashtag: #ML4SAIS

About Jakub Hava

Jakub (or "Kuba") finished his bachelor's degree in computer science at Charles University in Prague, and is currently finishing his master's in software engineering there as well. For his bachelor's thesis, Kuba wrote a small platform for distributed computing of tasks of any type. In his current master's studies, he is developing a cluster monitoring tool for JVM-based languages, which should make debugging and reasoning about the performance of distributed systems easier using a concept called distributed stack traces. At H2O, Kuba mostly works on our Sparkling Water project.

About Michal Malohlava

Michal is a geek, developer, and Java, Linux, and programming languages enthusiast who has been developing software for over 10 years. He obtained his PhD from Charles University in Prague in 2012 and pursued a post-doc at Purdue University. During his studies, he was interested in the construction of distributed, embedded, and real-time component-based systems using model-driven methods and domain-specific languages. He participated in the design and development of various systems, including the SOFA and Fractal component systems and the jPapabench control system.