Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create features is critical for machine learning projects to be successful. At Netflix, we spend significant time and effort experimenting with new features and new ways of building models. This involves generating features for our members from different regions over multiple days. To enable this, we built a time machine using Apache Spark that computes features for any point in time in the recent past. The first step in building this time machine is to snapshot the data from various microservices on a regular basis. We built a general-purpose workflow orchestration and scheduling framework optimized for machine learning pipelines and used it to run the snapshot and model training workflows. Snapshot data is then consumed by feature encoders to compute various features for offline experimentation and model training. Crucially, the same feature encoders are used in both offline model building and online scoring for production or A/B tests. Building this time machine helped us try new ideas quickly without placing stress on production services and without having to wait for data to accumulate for newly implemented features. Moreover, building it with Apache Spark empowered us both to scale up the data size by an order of magnitude and to train and validate the models in less time. Finally, using Apache Zeppelin notebooks, we are able to interactively prototype features and run experiments.
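The core idea of snapshots plus shared feature encoders can be illustrated with a minimal, in-memory sketch. This is not the Netflix implementation, which runs on Apache Spark. The names `SnapshotStore`, `as_of`, `watch_ratio_encoder`, `offline_features`, and `online_features`, as well as the `plays` and `impressions` fields, are hypothetical and chosen only for illustration. The sketch shows one way a "time travel" lookup might return the latest snapshot at or before a requested time, and how a single encoder function could serve both the offline and online paths.

```python
from bisect import bisect_right
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical types for illustration only.
Snapshot = Dict[str, float]  # raw member data captured at one moment
Encoder = Callable[[Snapshot], Dict[str, float]]


@dataclass
class SnapshotStore:
    """Keeps timestamped snapshots per member, sorted by timestamp."""
    _times: Dict[str, List[int]] = field(default_factory=dict)
    _data: Dict[str, List[Snapshot]] = field(default_factory=dict)

    def put(self, member_id: str, ts: int, snapshot: Snapshot) -> None:
        # Insert at the position that keeps the timestamps sorted.
        times = self._times.setdefault(member_id, [])
        data = self._data.setdefault(member_id, [])
        idx = bisect_right(times, ts)
        times.insert(idx, ts)
        data.insert(idx, snapshot)

    def as_of(self, member_id: str, ts: int) -> Snapshot:
        """Return the latest snapshot taken at or before ts."""
        times = self._times.get(member_id, [])
        idx = bisect_right(times, ts)
        if idx == 0:
            raise KeyError(f"no snapshot for {member_id} at or before {ts}")
        return self._data[member_id][idx - 1]


def watch_ratio_encoder(snapshot: Snapshot) -> Dict[str, float]:
    """Assumed example feature: plays divided by impressions."""
    plays = snapshot.get("plays", 0.0)
    impressions = snapshot.get("impressions", 0.0)
    return {"watch_ratio": plays / impressions if impressions else 0.0}


def offline_features(store: SnapshotStore, encoder: Encoder,
                     member_id: str, ts: int) -> Dict[str, float]:
    # Offline path: look up historical data, then apply the encoder.
    return encoder(store.as_of(member_id, ts))


def online_features(encoder: Encoder, live: Snapshot) -> Dict[str, float]:
    # Online path: apply the same encoder to live data.
    return encoder(live)
```

Because both paths call the same encoder, a feature computed for a past timestamp matches what online scoring would have produced from the same input data, which is the property the abstract calls out as crucial.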
DB Tsai is an Apache Spark PMC member and committer, and an open source and big data engineer at Apple Siri. He implemented several algorithms in Apache Spark, including linear models with Elastic-Net (L1/L2) regularization using L-BFGS/OWL-QN optimizers. Prior to joining Apple, DB worked on personalized recommendation ML algorithms at Netflix. DB was a Ph.D. candidate in Applied Physics at Stanford University. He holds a Master’s degree in Electrical Engineering from Stanford.
Prasanna is currently an engineer on the Personalization Infrastructure team at Netflix. His primary focus is on building big data infrastructure components using Spark that help our algorithmic engineers innovate faster and improve personalization for our members. In the past, he has built distributed data systems that leverage both batch and stream processing.