Senior software engineer on the Personalization Infrastructure team at Netflix that builds scalable, distributed computing systems for the algorithmic engineers that help improve member personalization.
As a data driven company, we use Machine Learning algos and A/B tests to drive all of the content recommendations for our members. To improve the quality of our personalized recommendations, we try an idea offline using historical data. Ideas that improve our offline metrics are then pushed as A/B tests which are measured through statistically significant improvements in core metrics such as member engagement, satisfaction, and retention.The heart of such offline analyses are historical facts data that are used to generate features required by the machine learning model. For example, viewing history of a member, videos in mylist etc. Building a fact store at an ever evolving Netflix scale is non trivial. Ensuring we capture enough fact data to cover all stratification needs of various experiments and guarantee that the data we serve is temporally accurate is an important requirement. In this talk, we will present the key requirements, evolution of our fact store design, its implementation, the scale and our learnings. We will also take a deep dive into fact vs feature logging, design tradeoffs, infrastructure performance, reliability and query API for the store. We use Spark and Scala extensively and variety of compression techniques to store/retrieve data efficiently.
Discuss how we leveraged the BDAS stack within Netflix improve the rate of innovation in the algorithmic engineering teams. Also talk about how we are using Spark Streaming to solve different use cases within Netflix and cover various resiliency and scalability testing that we performed.
Spark, GraphX and MLlib play an important role in several current generation machine learning platforms. This talk outlines how we integrated Spark, Python, R, Docker and other toolkits into a scalable orchestration framework that supports heterogeneous workloads. We will demonstrate how our users design and run machine learning pipelines that seamlessly exchange data and artifacts between Spark jobs and other components. This is deployed on a single multi-tenant cluster that allows engineers to choose between multiple Spark versions with built in capabilities to monitor long running workflows. We will explain our journey building this framework atop Mesos and our special treatment of Spark within it.