Prasanna is currently an engineer on the Personalization Infrastructure team at Netflix. His primary focus is on building various big data infrastructure components using Spark that help our algorithmic engineers to innovate faster and improve personalization for our members. In the past, he has built distributed data systems that leverages both batch and stream processing.
Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create feature is critical for machine learning projects to be successful. At Netflix, we spend significant time and effort experimenting with new features and new ways of building models. This involves generating features for our members from different regions over multiple days. To enable this, we built a time machine using Apache Spark that computes features for any arbitrary time in the recent past. The first step of building this time machine is to snapshot the data from various micro services on a regular basis. We built a general purpose workflow orchestration and scheduling framework optimized for machine learning pipelines and used it to run the snapshot and model training workflows. Snapshot data is then consumed by feature encoders to compute various features for offline experimentation and model training. Crucially, the same feature encoders are used in both offline model building and online scoring for production or A/B tests. Building this time machine helped us try new ideas quickly without placing stress on production services and without having to wait for data accumulation of the newly-implemented features. Moreover, building it with Apache Spark empowered us to both scale up the data size by an order of magnitude and train and validate the models in less time. Finally, using Apache Zeppelin notebook, we are able to interactively prototype features and run experiments.
Prasanna Rajaperumal and Vinoth Chandar will explore a specific problem of ingesting petabytes of data in Uber and why they ended up building an analytical datastore from scratch using Spark. Prasanna will discuss design choices and implementation approaches in building Hoodie to provide near-real-time data ingestion and querying using Spark and HDFS. Session hashtag: #SFexp4