I work as a senior software engineer in the Personalization Infrastructure team at Netflix. I work on distributed systems and big data, currently focusing on storing and querying petabytes of data. Previously, I have worked at Apple, Sumologic and Amazon in similar roles.
Netflix personalizes the experience for each member and this is achieved by several machine learning models. Our team builds infrastructure that powers these machine learning pipelines; primarily using Spark for feature generation and training. In this talk, we discuss two different Spark related problems and their solutions. - Memory optimization: our spark jobs can consume upwards of ten terabytes of memory. To optimize this memory footprint, we implemented an in memory cache that shares resources between different tasks in the same executor. We achieve this by using a combination of singletons, broadcast variables and task completion listeners. This implementation reduced the memory footprint by more than 20 percent. - Reliable metrics: we make use of several metrics that help debug different stages and sub-components of the pipelines. Spark accumulators can be cumbersome to use for metric computations, especially in scenarios where the spark process fails mid way during execution. We will describe solutions using spark listeners to solve this problem. This metric granularity helps reduce the time to debug complex jobs.
Personalization is one of the key pillars of Netflix as it enables each member to experience the vast collection of content tailored to their interests. Our personalization system is powered by several machine learning models. These models are only as good as the data that is fed to them. They are trained using hundreds of terabytes of data everyday, that make it a non-trivial challenge to track and maintain data quality. To ensure high data quality, we require three things: automated monitoring of data; visualization to observe changes in the metrics over time; and mechanisms to control data related regressions, wherein a data regression is defined as data loss or distributional shifts over a given period of time.
In this talk, we will describe infrastructure and methods that we used to achieve the above: - 'Swimlanes' that help us define data boundaries for different environments that are used to develop, evaluate and deploy ML models, - Pipelines that aggregate data metrics from various sources within each swimlane - Time series and dashboard visualization tools across an atypically larger period of time - Automated audits that periodically monitor these metrics to detect data regressions. We will explain how we run aggregation jobs to optimize metric computations, SQL queries to quickly define/test individual metrics and other ETL jobs to power the visualization/audits tools using Spark.'