Evgeny Shapiro is a software engineer on the Data Infrastructure team at Airbnb, where he works on the company's next-generation data architecture. Previously, he worked on the Trust team, where he built infrastructure to catch fraud in real time. Fraud detection was particularly challenging for the existing infrastructure because of its latency, volume, and correctness requirements. To address these challenges, he joined the Zipline project, where he worked on the core data aggregation algorithms and optimizations required to run large feature backfills for production machine learning models, as well as on the online feature serving infrastructure.
Zipline is Airbnb's data management platform, designed specifically for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time collecting data and writing transformations for machine learning tasks. Zipline reduces this task from months to days by making the process declarative. It allows data scientists to define features in a simple configuration language. The framework then provides access to point-in-time correct features for both offline model training and online inference. In this talk we will describe the architecture of our system and the algorithm that makes efficient point-in-time correct feature generation tractable. Attendees will learn:
1. Why point-in-time correct features matter for achieving better ML model performance.
2. Why change data capture matters for generating feature views.
3. An algorithm to efficiently generate features over change data. We use interval trees to compress time series features, and the algorithm generates feature aggregates over this compressed representation.
4. A lambda architecture that enables using the above algorithm for online feature generation.
5. A framework, based on category theory, for understanding how feature aggregations can be distributed and independently composed.
While the talk is fairly technical, we will introduce all the concepts from first principles, with examples. A basic understanding of data-parallel distributed computation and machine learning may help but is not required.
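To make two of the ideas above concrete, here is a minimal Python sketch of a point-in-time correct aggregate whose partial results can be merged across shards. It is an illustration under stated assumptions, not Zipline's API or its algorithm: the names `AvgState`, `point_in_time_avg`, and the sample `events` are hypothetical, and a plain sorted scan stands in for the interval-tree compression mentioned in the abstract.

```python
from bisect import bisect_left
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class AvgState:
    """Hypothetical partial aggregate for an average, kept as (total, count)."""
    total: float = 0.0
    count: int = 0

    def merge(self, other: "AvgState") -> "AvgState":
        # Merging is associative and AvgState() is its identity, so partial
        # results computed on different shards can be combined in any grouping.
        return AvgState(self.total + other.total, self.count + other.count)

    def finalize(self) -> float:
        return self.total / self.count if self.count else float("nan")


def point_in_time_avg(events: List[Tuple[int, float]], query_ts: int) -> AvgState:
    """Aggregate only events strictly before query_ts, so no future data leaks in."""
    ordered = sorted(events)
    timestamps = [ts for ts, _ in ordered]
    cutoff = bisect_left(timestamps, query_ts)  # first index with ts >= query_ts
    state = AvgState()
    for _, value in ordered[:cutoff]:
        state = state.merge(AvgState(value, 1))
    return state


# Illustrative data: (timestamp, value) pairs.
events = [(10, 4.0), (20, 2.0), (30, 5.0), (40, 1.0)]

at_25 = point_in_time_avg(events, 25)  # sees only the events at t=10 and t=20
at_35 = point_in_time_avg(events, 35)  # also sees the event at t=30

# The same feature computed on two shards and merged afterwards.
merged = point_in_time_avg(events[:2], 35).merge(point_in_time_avg(events[2:], 35))
```

Keeping the aggregate as a mergeable `(total, count)` pair rather than a finished average is what lets the shard-level results combine into the same answer as a single pass over all events.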