Varant Zanoyan is a software engineer on the Machine Learning Infrastructure team at Airbnb, where he works on tools for building and productionizing ML models. Previously, he worked closely with data scientists and engineers at Airbnb to build and deploy machine learning models. During this time he identified data management and feature engineering as the primary challenges faced by machine learning practitioners at Airbnb. Seeing these problems motivated him to solve them at the infrastructure level, and these efforts resulted in Zipline, the feature store and data management platform for machine learning. Zipline remains his primary focus. Prior to Airbnb, he solved data infrastructure problems at Palantir Technologies.
Zipline is Airbnb's data management platform specifically designed for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time collecting data and writing transformations for machine learning tasks. Zipline reduces this work from months to days by making the process declarative. It allows data scientists to define features in a simple configuration language. The framework then provides access to point-in-time correct features for both offline model training and online inference. In this talk we will describe the architecture of our system and the algorithm that makes efficient point-in-time correct feature generation tractable. Attendees will learn:
1. The importance of point-in-time correct features for achieving better ML model performance.
2. The importance of using change data capture for generating feature views.
3. An algorithm to efficiently generate features over change data. We use interval trees to efficiently compress time series features, and the algorithm generates feature aggregates over this compressed representation.
4. A lambda architecture that enables using the above algorithm for online feature generation.
5. A framework, based on category theory, for understanding how feature aggregations can be distributed and independently composed.
While the talk is fairly technical, we will introduce all the concepts from first principles with examples. A basic understanding of data-parallel distributed computation and machine learning might help, but is not required.
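To make "point-in-time correct" concrete, here is a minimal Python sketch. It is not Zipline's implementation and does not use the interval-tree algorithm the talk describes; it substitutes a simpler sorted-timestamps-plus-prefix-sums index. The names `build_index` and `sum_as_of`, the `(key, ts, value)` event shape, and the sample data are all hypothetical. The point it illustrates: each training row only sees events that happened strictly before its own timestamp, so no future information leaks into the feature.

```python
from bisect import bisect_left
from collections import defaultdict
from itertools import accumulate


def build_index(events):
    """Group (key, ts, value) events by key; keep sorted timestamps and prefix sums."""
    by_key = defaultdict(list)
    for key, ts, value in events:
        by_key[key].append((ts, value))
    index = {}
    for key, rows in by_key.items():
        rows.sort()  # order events by timestamp
        timestamps = [ts for ts, _ in rows]
        # prefix[i] is the sum of the first i values in timestamp order
        prefix = [0] + list(accumulate(value for _, value in rows))
        index[key] = (timestamps, prefix)
    return index


def sum_as_of(index, key, query_ts):
    """Sum of values for `key` over events with ts strictly before query_ts."""
    if key not in index:
        return 0
    timestamps, prefix = index[key]
    cut = bisect_left(timestamps, query_ts)  # number of events with ts < query_ts
    return prefix[cut]


# Hypothetical event log: (entity key, event time, value)
events = [
    ("listing_1", 10, 2),
    ("listing_1", 20, 3),
    ("listing_1", 30, 5),
    ("listing_2", 15, 7),
]
index = build_index(events)

# Hypothetical training labels: (entity key, label time)
labels = [("listing_1", 25), ("listing_1", 10), ("listing_2", 100)]
training_rows = [(key, ts, sum_as_of(index, key, ts)) for key, ts in labels]
```

For the label at time 25, only the events at times 10 and 20 contribute; the event at time 30 lies in the future relative to that label and is excluded.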
Zipline is Airbnb's data management platform specifically designed for ML use cases. Previously, ML practitioners at Airbnb spent roughly 60% of their time collecting data and writing transformations for machine learning tasks. Zipline reduces this work from months to days. It allows users to define features in an easy-to-use configuration language, then provides the following capabilities: resource-efficient and point-in-time correct training set backfills and scheduled updates; feature visualizations and automatic data quality monitoring; feature availability in the online scoring environment, both batch and streaming with batch correction (lambda architecture); collaboration and sharing of features; and data ownership and management. Spark powers many of Zipline's features, especially offline tasks for efficient training set backfills and feature computation. This talk covers Zipline's architecture and the main problems that Zipline solves. Although these problems are widespread, no open source software addresses them. As a result, we intend to open source our work. Session hashtag: #ML3SAIS
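The "batch and streaming with batch correction" idea can be sketched in a few lines of Python. This is an illustrative assumption about how a lambda-style merge might look, not Zipline's actual code: the `AvgState` class, the `fold` helper, and the sample values are hypothetical. The aggregate is kept as a mergeable intermediate state (sum and count) rather than a final average, so a state computed by a batch job and a state accumulated from a stream can be combined at serving time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AvgState:
    """Mergeable intermediate state for an average: (total, count)."""
    total: float = 0.0
    count: int = 0

    def add(self, value):
        # Fold one new event into the state.
        return AvgState(self.total + value, self.count + 1)

    def merge(self, other):
        # Combine two partial states; the operation is associative.
        return AvgState(self.total + other.total, self.count + other.count)

    def result(self):
        # Finalize only when serving; None if no events were seen.
        return self.total / self.count if self.count else None


def fold(values):
    """Build a state from a sequence of event values."""
    state = AvgState()
    for value in values:
        state = state.add(value)
    return state


# Hypothetical data: ratings covered by the last batch run, and one
# rating that arrived on the stream after the batch cutoff.
batch_state = fold([4.0, 5.0, 3.0])
stream_state = fold([5.0])

# Serving-time view: batch result plus the streaming delta.
served_average = batch_state.merge(stream_state).result()
```

Because `merge` is associative, partial states can be computed independently on different partitions and combined in any grouping. In this sketch, "batch correction" would amount to the next batch run recomputing `batch_state` over the full history and the streaming state being reset to cover only events after the new cutoff.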