Maciej has been building large-scale data analytics and AI products in both research and industry for over 10 years. As Chief Data Scientist at Altocloud, he is responsible for the Machine Learning platform that uses Apache Spark to train hundreds of predictive models and apply them in real time to millions of events a day. He is the founder of the Galway Data Meetup, which has over 250 members, and has received a number of awards for his work. He was shortlisted as one of the four finalists of the DatSci 2016 competition in the Data Scientist of the Year category.
Spark ML Pipelines provide a comprehensive framework for predictive modeling, including feature engineering, batch model training, and real-time predictions based on streams of data. For example, a model predicting the likelihood of cart abandonment may be trained periodically using features based on the Web activity of customers and then applied to a stream of Web events to make real-time predictions for live users.

However, in multi-tenant environments where streams contain events from different sources, applying ML Pipelines becomes difficult. The pipeline paradigm can still be applied to model training, because training datasets can be separated by source. Generating real-time predictions in Spark Streaming is harder, since a single micro-batch contains events that require evaluation of different pipelines.

In this talk we will show how Altocloud applies Spark Pipelines to train hundreds of predictive models and to enable real-time predictions on high-throughput heterogeneous data streams. In particular we will focus on:

1. Training multiple models for activity streams from different sources.
2. Application of these models in real time to a heterogeneous stream of events containing behavioural data for millions of users.
3. Automated training, validation, selection, and deployment of multiple predictive models in a multi-tenant environment at scale.

Session hashtag: #EUds4
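
The micro-batch problem described in the abstract can be illustrated with a minimal sketch in plain Python. This is not Altocloud's implementation and does not use Spark: the event shape, the `score_micro_batch` helper, the tenant names, and the toy lambda "models" are all hypothetical stand-ins, chosen only to show why a mixed batch has to be split by source before each group can be scored by its own model.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical event shape: (source_id, user_id, feature_vector)
Event = Tuple[str, str, List[float]]
# Hypothetical stand-in for a fitted per-source pipeline: features -> score
Model = Callable[[List[float]], float]


def score_micro_batch(
    events: Iterable[Event], models: Dict[str, Model]
) -> List[Tuple[str, str, float]]:
    """Group one micro-batch by source, then score each group with that source's model."""
    by_source: Dict[str, List[Event]] = defaultdict(list)
    for event in events:
        by_source[event[0]].append(event)

    predictions = []
    for source_id, group in by_source.items():
        model = models.get(source_id)
        if model is None:
            # No model registered for this source; skip its events.
            continue
        for _, user_id, features in group:
            predictions.append((source_id, user_id, model(features)))
    return predictions


# Two toy "models" standing in for the trained pipelines of two tenants.
models = {
    "tenant_a": lambda f: sum(f) / len(f),
    "tenant_b": lambda f: max(f),
}

# One heterogeneous micro-batch with events from three sources.
batch = [
    ("tenant_a", "u1", [0.2, 0.4]),
    ("tenant_b", "u2", [0.1, 0.9]),
    ("tenant_a", "u3", [1.0, 0.0]),
    ("tenant_c", "u4", [0.5, 0.5]),  # no model registered for this source
]

predictions = score_micro_batch(batch, models)
```

In a real Spark Streaming job the per-source models would be fitted pipeline objects rather than lambdas, and the grouping would happen on distributed data. The sketch only shows the routing step that a single shared pipeline cannot express.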