Applying Multiple ML Pipelines to Heterogeneous Data Streams - Databricks

Spark ML Pipelines provide a comprehensive framework for predictive modeling, covering feature engineering, batch model training, and real-time predictions on streams of data. For example, a model that predicts the likelihood of cart abandonment may be trained periodically on features derived from customers' Web activity and then applied to a stream of Web events to make real-time predictions for live users.

In multi-tenant environments, however, where streams contain events from different sources, applying ML Pipelines becomes difficult. The pipeline paradigm can still be applied to model training when the training datasets are separated by source, but generating real-time predictions in Spark Streaming poses multiple challenges, because a single micro-batch contains events that require evaluation by different pipelines.

In this talk we will show how Altocloud applies Spark Pipelines to train hundreds of predictive models and to enable real-time predictions on high-throughput heterogeneous data streams. In particular, we will focus on:

1. Training multiple models for activity streams from different sources.
2. Applying these models in real time to a heterogeneous stream of events containing behavioural data for millions of users.
3. Automated training, validation, selection, and deployment of multiple predictive models in a multi-tenant environment at scale.
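The core difficulty named in the abstract, a single micro-batch mixing events that need different pipelines, can be illustrated with a minimal sketch. This is plain Python, not Spark code and not Altocloud's implementation: the event shape, the `registry` of per-source models, and the function `score_micro_batch` are all hypothetical names chosen for illustration, and the toy lambdas stand in for fitted pipelines.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical event shape: (source_id, user_id, feature_vector).
Event = Tuple[str, str, List[float]]
# Hypothetical stand-in for a fitted pipeline: maps features to a score.
Model = Callable[[List[float]], float]


def score_micro_batch(
    batch: Iterable[Event], registry: Dict[str, Model]
) -> Dict[str, List[Tuple[str, float]]]:
    """Group a mixed micro-batch by source, then score each group with its own model."""
    by_source: Dict[str, List[Event]] = defaultdict(list)
    for event in batch:
        by_source[event[0]].append(event)

    results: Dict[str, List[Tuple[str, float]]] = {}
    for source_id, events in by_source.items():
        model = registry.get(source_id)
        if model is None:
            # Assumption: sources without a trained model are skipped.
            continue
        results[source_id] = [
            (user_id, model(features)) for _, user_id, features in events
        ]
    return results


# Two toy "models", one per tenant (illustrative only).
registry: Dict[str, Model] = {
    "tenant_a": lambda x: 0.5 * x[0] + 0.5 * x[1],
    "tenant_b": lambda x: max(x),
}

# One micro-batch containing events from three different sources.
batch: List[Event] = [
    ("tenant_a", "u1", [0.25, 0.75]),
    ("tenant_b", "u2", [0.1, 0.9]),
    ("tenant_a", "u3", [1.0, 0.0]),
    ("tenant_c", "u4", [0.3, 0.3]),  # no model registered for this source
]

scores = score_micro_batch(batch, registry)
```

Grouping by source before scoring means each model is looked up once per micro-batch rather than once per event. How an actual Spark Streaming job would partition the events and distribute the models is outside what this sketch covers.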

Session hashtag: #EUds4

About Maciej Dabrowski

Maciej has been building large-scale data analytics and AI products in both research and industry for over 10 years. As Chief Data Scientist at Altocloud, he is responsible for the Machine Learning platform that uses Apache Spark to train hundreds of predictive models and apply them in real time to millions of events a day. He is the founder of the Galway Data Meetup, which has over 250 members, and has received a number of awards for his work. He was one of the four finalists of the DatSci 2016 competition in the Data Scientist of the Year category.

About Gevorg Soghomonyan

Gevorg is the Lead AI Research Engineer at Altocloud. He received his BA from the Applied Math faculty of Yerevan State University and his MSc from the American University of Armenia. With many years of experience building software systems in various industries, he specializes in Machine Learning, Deep Learning, and functional programming. Gevorg won the 2013 Best Master Student in Technology and Sciences award, presented by the President of the Republic of Armenia.