Shriya Arora - Databricks

Shriya Arora

Data Engineer, Netflix

I am a data engineer at Netflix on the Data Personalization team, which is responsible for generating the datasets used by the machine learning pipelines that power Netflix recommendations. We have been actively using Spark over Pig/Hive for our batch jobs and are now exploring Spark Streaming. Before Netflix, I was at Walmart Labs, where I helped build and architect their new-generation item setup, moving from batch processing to stream processing. We used Storm and Kafka to enable a microservices architecture that allows products to be updated in near real-time, as opposed to the once-a-day update on the legacy framework.



Going Real-Time: Creating Frequently-Updating Datasets for Personalization

Summit East 2017

Streaming applications have often been complex to design and maintain because of the significant upfront infrastructure investment required. However, with the advent of Spark, an easy transition to stream processing is now available, enabling personalization applications and experiments to consume near real-time data without massive development cycles. Our decision to evaluate Spark as our stream processing engine was driven primarily by the following considerations: 1) ease of development for the team (already familiar with Spark for batch), 2) the scope and requirements of our problem, 3) reusability of code from Spark batch jobs, and 4) Spark support from infrastructure teams within the company.

In this session, we will present our experience using Spark for stream processing of unbounded datasets in the personalization space. The datasets included, but were not limited to, the stream of playback events that are used as feedback for all personalization algorithms. These plays are used to extract specific behaviors that are highly predictive of a customer’s enjoyment of our service. This dataset is massive and has to be further enriched by other online and offline Netflix data sources. These datasets, when consumed by our machine learning models, directly affect the customer’s personalized experience, which means that the impact is high and the tolerance for failure is low. We’ll talk about the experiments we did to compare Spark with other streaming solutions like Apache Flink, the impact that we had on our customers, and, most importantly, the challenges we faced.

Takeaways for the audience: 1) A great example of stream processing large personalization datasets at scale. 2) An increased awareness of the costs and requirements of making the transition from batch to streaming successfully. 3) Exposure to some of the technical challenges that should be expected along the way.