Productionizing Behavioural Features for Machine Learning with Apache Spark Streaming


We use Spark Streaming to build online machine learning (ML) features that are used for real-time prediction of our users' behaviour and preferences, for predicting demand for hotels, and for improving customer support processes. Our initial goals were to speed up experimentation with real-time features, make features reusable by data scientists (DS) within the company, and reduce training/serving data skew. The tooling that we built and integrated into the company's infrastructure simplifies development of new features to the point that online feature collection can be implemented and deployed into production by DS with little or no help from developers. That makes the approach scalable and allows us to iterate fast. We use Kafka as a streaming source of real-time events from the website and other sources, and connectivity to Cassandra and Hive allowed us to make data more consistent between the training and serving phases of ML pipelines.

Our key takeaways:
- It is possible to design production pipelines in a way that allows DS to build and deploy them without the help of a developer.
- Constructing online features is much more complex than constructing them offline, and from a business perspective it is not always a priority to invest in them, even when they are proven to benefit model performance.

We plan to invest further in the development of pipelines with Spark Streaming and are happy to see that support for streaming operations in Spark is evolving in the right direction.
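One common way to reduce training/serving skew is to define each feature once and run the same definition over both historical and live events. The sketch below illustrates that idea in plain Python. It is not the tooling described in the talk and it does not use Spark. The `Event` and `ClickCountFeature` names, the event fields, and the 60-second window are all hypothetical. In a real deployment the `update` step would presumably run inside a streaming job reading from Kafka, with state kept in an external store.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass(frozen=True)
class Event:
    """Hypothetical website event; the field names are assumptions."""
    user_id: str
    event_type: str
    timestamp: int  # seconds since epoch


class ClickCountFeature:
    """Counts a user's 'click' events inside a trailing time window."""

    def __init__(self, window_seconds: int) -> None:
        self.window_seconds = window_seconds
        self._clicks: Dict[str, List[int]] = defaultdict(list)

    def update(self, event: Event) -> None:
        # Only click events contribute to this feature.
        if event.event_type == "click":
            self._clicks[event.user_id].append(event.timestamp)

    def value(self, user_id: str, now: int) -> int:
        # Count clicks with cutoff < timestamp <= now.
        cutoff = now - self.window_seconds
        return sum(1 for ts in self._clicks.get(user_id, []) if cutoff < ts <= now)


def replay(events: Iterable[Event], window_seconds: int) -> ClickCountFeature:
    """Offline path: replay a historical log through the same feature code."""
    feature = ClickCountFeature(window_seconds)
    for event in sorted(events, key=lambda e: e.timestamp):
        feature.update(event)
    return feature


history = [
    Event("u1", "click", 10),
    Event("u1", "view", 20),
    Event("u1", "click", 70),
    Event("u2", "click", 80),
    Event("u1", "click", 95),
]

# Offline (training) path: batch replay of the log.
offline = replay(history, window_seconds=60)

# Online (serving) path: events arrive one at a time.
online = ClickCountFeature(window_seconds=60)
for incoming in history:
    online.update(incoming)

# Both paths share one definition, so they agree on the feature value.
print(offline.value("u1", now=100), online.value("u1", now=100))
```

Because both paths call the same `update` and `value` methods, a change to the feature definition applies to training and serving data at the same time.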
Session hashtag: #EUstr4

About Roman Studenikin

Roman gained most of his experience during his years at Yandex, where he was in charge of designing and building from scratch highly loaded online data processing systems, cloud storage, data transport systems, and consensus protocols, as well as full products such as automatic bidding and PPC fraud prevention systems. At some point he could no longer hold all of these responsibilities himself, so he grew a team around the products he built, which has since grown into multiple teams. Currently his team's mission is to scale machine learning usage company-wide.

About Ben Teeuwen

Ben Teeuwen spends most of his work time as a data scientist empowering other data scientists, through tooling and training, to use ML for further personalization. Ben guided widespread adoption of Spark throughout the company. Before joining the company in 2014, he studied cognitive psychology and public sector innovation.