Nitin is a Senior Software Engineer on the Personalization Infrastructure team at Netflix. His primary focus is building ML infrastructure components on Apache Spark that help Netflix research engineers innovate faster and improve personalized recommendations. He is passionate about large-scale distributed systems, search platforms, and performance optimization. He is an active open source contributor to Apache Solr and a few other Apache projects.
As a data-driven company, we use machine learning algorithms and A/B tests to drive all of the content recommendations for our members. To improve the quality of our personalized recommendations, we first try an idea offline using historical data. Ideas that improve our offline metrics are then pushed as A/B tests, which are evaluated by looking for statistically significant improvements in core metrics such as member engagement, satisfaction, and retention. At the heart of such offline analyses is the historical fact data used to generate the features required by the machine learning model, for example the viewing history of a member or the videos in their My List. Building a fact store at ever-evolving Netflix scale is nontrivial. It is an important requirement that we capture enough fact data to cover the stratification needs of various experiments and that we guarantee the data we serve is temporally accurate. In this talk, we will present the key requirements, the evolution of our fact store design, its implementation, its scale, and our learnings. We will also take a deep dive into fact vs. feature logging, design tradeoffs, infrastructure performance, reliability, and the query API for the store. We use Spark and Scala extensively, along with a variety of compression techniques, to store and retrieve data efficiently. Session hashtag: #DevSAIS11
As a data-driven company, we use machine learning algorithms and A/B tests to drive all of the content recommendations for our members. Traditionally, these recommendations are precomputed in batch, but such a model cannot react quickly to member interactions, title interests, popularity, and similar signals. With an ever-growing Netflix catalog, finding the right content for our audience in near real time would provide the best personalized experience. We'll take a deep dive into our real-time Spark Streaming ecosystem at Netflix, covering both its infrastructure and its business use cases. On the infrastructure front, we will delve into scale challenges, state management, data persistence, resiliency considerations, metrics, operations, and auto-remediation. We will talk about a few use cases that leverage real-time data for model training, such as providing the right personalized videos in a member's Billboard and choosing the right personalized image soon after the launch of a show. We will also reflect on the lessons learned while building such high-volume infrastructure. Session hashtag: #ML7SAIS