Ben Teeuwen

Senior Data Scientist

Ben Teeuwen spends most of his working time as a data scientist empowering data scientists, through tooling and training, to use ML for further personalization. Ben guided widespread adoption of Spark throughout the company. Before joining the company in 2014, he studied cognitive psychology and public sector innovation.



Productionizing Behavioural Features for Machine Learning with Apache Spark Streaming

Summit Europe 2017

We use Spark Streaming to build online machine learning (ML) features for real-time prediction of our users' behaviour and preferences and of demand for hotels, and for improving customer support processes. Our initial goals were to speed up experimentation with real-time features, make features reusable by data scientists (DS) within the company, and reduce training/serving data skew. The tooling that we have built and integrated into the company's infrastructure simplifies feature development to the point that online feature collection can be implemented and deployed to production by DS with little or no help from developers. This makes the approach scalable and allows us to iterate fast. We use Kafka as a streaming source of real-time events from the website and other sources, and connectivity to Cassandra and Hive has let us make data more consistent between the training and serving phases of ML pipelines.

Our key takeaways:
- It is possible to design production pipelines in a way that allows DS to build and deploy them without the help of a developer.
- Constructing online features is much more complex than constructing offline ones, and from a business perspective it is not always a priority to invest in them, even when they are proven to benefit model performance.

We plan to invest further in developing pipelines with Spark Streaming and are happy to see that support for streaming operations in Spark is evolving in the right direction.

Session hashtag: #EUstr4
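One way to picture the training/serving skew goal from the abstract is to define a feature once and reuse that single definition both when replaying historical events for training and when updating state event by event for serving. The sketch below is plain Python rather than Spark Streaming, and it is not the tooling described in the talk. The event schema, the `update` function, and the `OnlineFeatureStore` class are all hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, Tuple

# Hypothetical event schema: (user_id, event_type).
Event = Tuple[str, str]


def update(state: Dict[str, int], event_type: str) -> Dict[str, int]:
    """Single feature definition: per-user counts of each event type."""
    new_state = dict(state)
    new_state[event_type] = new_state.get(event_type, 0) + 1
    return new_state


def offline_features(events: Iterable[Event]) -> Dict[str, Dict[str, int]]:
    """Training path: replay a historical event log through `update`."""
    features: Dict[str, Dict[str, int]] = {}
    for user_id, event_type in events:
        features[user_id] = update(features.get(user_id, {}), event_type)
    return features


class OnlineFeatureStore:
    """Serving path: apply the same `update` to one event at a time."""

    def __init__(self) -> None:
        self._state: Dict[str, Dict[str, int]] = {}

    def on_event(self, event: Event) -> None:
        user_id, event_type = event
        self._state[user_id] = update(self._state.get(user_id, {}), event_type)

    def get(self, user_id: str) -> Dict[str, int]:
        return self._state.get(user_id, {})


# Made-up events, used only to exercise both paths.
log = [
    ("u1", "search"),
    ("u1", "view_hotel"),
    ("u2", "search"),
    ("u1", "view_hotel"),
]

batch = offline_features(log)

store = OnlineFeatureStore()
for event in log:
    store.on_event(event)
```

Because both paths call the same `update` function, the feature values seen at training time and at serving time agree for the same sequence of events. In a real pipeline the online state would live in an external store and the events would arrive from a stream, which this sketch leaves out.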