Ben Teeuwen - Databricks

Ben Teeuwen

Senior Data Scientist, Booking.com

Ben Teeuwen spends most of his working time as a data scientist helping other data scientists, through tooling and training, use ML to further personalize Booking.com. Ben guided the widespread adoption of Spark throughout the company. Before joining Booking.com in 2014, he studied cognitive psychology and public sector innovation.

UPCOMING SESSIONS

Scaling Machine Learning at Booking.com with H2O Sparkling Water and FeatureStore - Summit 2018

At Booking.com we have a community of over 150 data scientists working on personalizing the experience of our customers, improving the visibility of our partners on the platform, and preventing fraud. Because of the company's growth and size, tasks like finding consistent data sources, building robust features, and productionizing models can be challenging and time-consuming for ML practitioners. In this talk we will share our journey from the early days, when models were very much hand-crafted, to today, when we have tooling for discovering and building reusable online and offline features and self-service tools for deploying models to production quickly. Along the way, we'll share how Spark and H2O's Sparkling Water play a key role in building scalable models with large training datasets while allowing fast predictions on our website. Session hashtag: #ML8SAIS

PAST SESSIONS

Productionizing Behavioural Features for Machine Learning with Apache Spark Streaming - Summit Europe 2017

We use Spark Streaming to build online Machine Learning (ML) features that are used at Booking.com for real-time prediction of the behaviour and preferences of our users and of demand for hotels, and for improving processes in customer support. Our initial goals were to speed up experimentation with real-time features, make features reusable by Data Scientists (DS) within the company, and reduce the training/serving data skew problem. The tooling that we've built and integrated into the company's infrastructure simplifies the development of new features to the point that online feature collection can be implemented and deployed into production by DS with very little or no help from developers. That makes this approach scalable and allows us to iterate fast. We use Kafka as a streaming source of real-time events from the website as well as from other sources, and with connectivity to Cassandra and Hive we were able to make data more consistent between the training and serving phases of ML pipelines.

Our key takeaways:

- It is possible to design production pipelines in a way that allows DS to build and deploy them without the help of a developer.
- Constructing online features is a much more complex job than constructing offline ones, and from a business perspective investing in them is not always a priority, even when they are proven to benefit model performance.

We plan to invest further in the development of pipelines with Spark Streaming and are happy to see that support for streaming operations in Spark is evolving in the right direction. Session hashtag: #EUstr4