Preetam is currently a Senior Software Engineer on the Personalization Infrastructure team at Netflix. He builds systems that power machine learning models operating at petabyte scale. Prior to Netflix, he developed end-to-end machine learning models as part of the data science team at Thumbtack. He has also worked on the content recommendation and mobile search systems at Yahoo. He obtained his Master's degree from the College of Computing at Georgia Tech.
Personalization is one of the key pillars of Netflix, as it enables each member to experience the vast collection of content tailored to their interests. Our personalization system is powered by several machine learning models. These models are only as good as the data that is fed to them. They are trained on hundreds of terabytes of data every day, which makes it a non-trivial challenge to track and maintain data quality. To ensure high data quality, we require three things: automated monitoring of data; visualization to observe changes in the metrics over time; and mechanisms to control data-related regressions, where a data regression is defined as data loss or a distributional shift over a given period of time.
In this talk, we will describe the infrastructure and methods that we used to achieve the above:
- 'Swimlanes' that help us define data boundaries for the different environments used to develop, evaluate, and deploy ML models
- Pipelines that aggregate data metrics from various sources within each swimlane
- Time series and dashboard visualization tools that cover an atypically long period of time
- Automated audits that periodically monitor these metrics to detect data regressions

We will explain how we run aggregation jobs to optimize metric computations, SQL queries to quickly define and test individual metrics, and other ETL jobs to power the visualization and audit tools using Spark.
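
To make the idea of an automated audit concrete, here is a minimal sketch of how the two kinds of data regression defined above (data loss and distributional shift) might be checked against a trailing window of daily metrics. It is an illustration only and not the system described in the talk: the function names, the choice of row count and null rate as metrics, and the thresholds (`max_drop`, `z_threshold`) are all assumptions, and it uses plain Python rather than Spark.

```python
from statistics import mean, stdev


def detect_data_loss(history, current, max_drop=0.2):
    """Return True if `current` is more than `max_drop` (as a fraction)
    below the mean of the trailing `history`. Threshold is an assumption."""
    baseline = mean(history)
    if baseline == 0:
        return False
    return (baseline - current) / baseline > max_drop


def detect_shift(history, current, z_threshold=3.0):
    """Return True if `current` lies more than `z_threshold` sample standard
    deviations from the mean of `history` (a simple z-score check,
    used here as a stand-in for a real distribution test)."""
    if len(history) < 2:
        return False
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return current != mu
    return abs(current - mu) / sigma > z_threshold


def audit(swimlane, row_counts, today_rows, null_rates, today_null_rate):
    """Run both checks for one hypothetical swimlane and list any findings."""
    findings = []
    if detect_data_loss(row_counts, today_rows):
        findings.append(f"{swimlane}: possible data loss")
    if detect_shift(null_rates, today_null_rate):
        findings.append(f"{swimlane}: possible distributional shift")
    return findings


if __name__ == "__main__":
    # Hypothetical trailing metrics for a swimlane named "evaluation".
    print(audit("evaluation",
                [1000, 1020, 980, 1010], 600,
                [0.010, 0.012, 0.011, 0.009], 0.05))
```

In practice the trailing window, the metrics, and the thresholds would be tuned per swimlane, and a periodic job would run the audit and alert on any non-empty list of findings.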