Large Scale Feature Aggregation Using Apache Spark – Databricks


Aggregation-based features account for roughly a quarter of the several thousand features used by the ML-based decisioning system built by the Risk team at Uber. We observed the same repetitive, cumbersome steps every time a feature was onboarded. To accelerate developer velocity and enable feature engineering at scale, we decided to develop a generic Spark-based infrastructure that reduces the onboarding process to a single spec file containing a parameterized query, along with some metadata on where the feature should be aggregated and stored.
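As a rough illustration of what such a spec file might look like (the field names and layout here are our own sketch, not the exact schema used at Uber), a parameterized query plus aggregation and storage metadata could be expressed as:

```yaml
# Hypothetical feature spec: field names are illustrative only.
feature:
  name: trip_count
  entity: rider_id
  # Parameterized query; {start_ts} and {end_ts} are filled in per batch run.
  query: >
    SELECT rider_id, COUNT(*) AS value
    FROM trips
    WHERE completed_at >= '{start_ts}' AND completed_at < '{end_ts}'
    GROUP BY rider_id
  aggregation:
    windows: [1d, 7d, 30d]   # aggregation windows to maintain
    mode: incremental        # only deltas are computed per batch
  storage:
    source: hive             # batch input
    sink: cassandra          # real-time access store
```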

In the presentation, we will describe the architecture of the final solution, highlighting some of its advanced capabilities such as backfill support and self-healing for correctness. We will showcase how, using data stored in Hive and processed with Spark, we developed a highly scalable solution that carries out feature aggregation incrementally. By dividing the data aggregation responsibility between the real-time access layer and the batch computation components, we ensured that only entities whose values have actually changed are dispersed to our real-time access store (Cassandra). We will share how we did data modeling in Cassandra using its native capabilities such as counters, and how we worked around some of Cassandra's limitations. We will also cover the access service and how it stitches different types of features together: based on our data model, all features for an entity with the same aggregation window can be retrieved via a single query. Finally, we will cover how these incrementally aggregated features have enabled shorter turnaround times for the models that use them.
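The core incremental idea described above can be sketched in plain Python, independent of Spark or Cassandra. This is a deliberately simplified illustration of the principle, not the production implementation: the batch layer computes per-entity deltas, merges them into running aggregates, and emits only the entities whose values actually changed (in production, those emitted deltas would be applied to Cassandra counter columns, which natively support increment-only updates):

```python
def incremental_update(running, batch_deltas):
    """Merge a batch of per-entity deltas into running aggregates.

    running      -- dict mapping entity id -> current aggregate value
    batch_deltas -- dict mapping entity id -> delta from the latest batch

    Returns a dict containing only entities whose aggregate value
    changed, so downstream dispersal (e.g. counter increments in a
    real-time store) touches the minimum number of rows.
    """
    changed = {}
    for entity, delta in batch_deltas.items():
        if delta == 0:
            continue  # value unchanged; nothing to disperse for this entity
        running[entity] = running.get(entity, 0) + delta
        changed[entity] = running[entity]
    return changed


# Example: one prior aggregate, then a batch with two real changes
# and one no-op delta.
running = {"rider_1": 5}
changed = incremental_update(running, {"rider_1": 2, "rider_2": 3, "rider_3": 0})
```

Here `rider_1` is updated, `rider_2` is newly created, and `rider_3` is skipped entirely because its delta is zero, mirroring how only changed entities reach the real-time store.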

Session hashtag: #Dev1SAIS

About Pulkit Bhanot

Pulkit Bhanot is a tech lead at Uber. His team's focus is developing libraries that enable complex feature engineering for the Risk and Safety programs. Over the past few years, the team has developed solutions that streamline the feature engineering process into a mostly spec-driven workflow, while ensuring these solutions remain highly scalable and performant to accommodate an ever-increasing catalog of features. Prior to Uber, Pulkit worked at several startups focused on scaling data ecosystems. He holds a Bachelor's degree in Computer Engineering.

About Amit Nene

Amit Nene is a manager and founding member of Uber's Risk Data team. He has been responsible for driving the vision for data infrastructure for Risk programs and providing architectural and technical supervision to the team. Before Uber, Amit gained several years of industry experience at companies such as Apple and VMware in Staff Engineer and Manager roles, leading several platform and datacenter infrastructure efforts.