Building flexible machine learning libraries adapted to Netflix’s use cases is paramount in our continued efforts to better model our users’ behavior and provide them with great personalized video recommendations.
This talk introduces one such library: a Spark-based stratification library developed at Netflix to aid “Training Set Stratification” in offline machine learning workflows. Originally created to implement user selection algorithms in our data snapshotting infrastructure, the library has evolved to cater to general-purpose stratification use cases in ML pipelines. We will show how the stratification library’s DSL (domain-specific language) and its underlying Spark-based implementation let one easily express complex sampling rules and dynamically carve out matching portions of a Spark DataFrame.
For example, arbitrary rules governing the distributions of user attributes (and combinations thereof), such as origin country, video play frequency, and tenure, can be easily enforced when constructing an ML training data set. The demo section of the talk will showcase example uses of the stratification library in a Jupyter notebook.
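To make the idea of a stratification rule concrete, here is a minimal sketch in plain Python. It is not the library's DSL or its Spark implementation, neither of which is shown in this abstract; the function `stratified_sample`, the `quotas` mapping, and the toy user records are all hypothetical names chosen for illustration. The sketch groups users by a combination of attributes (country and tenure) and draws a fixed number of users from each group.

```python
import random
from collections import defaultdict


def stratified_sample(rows, key, quotas, seed=0):
    """Draw up to quotas[s] rows from each stratum s, where s = key(row).

    Hypothetical helper for illustration only; not the Netflix library's API.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible

    # Bucket every row by its stratum.
    by_stratum = defaultdict(list)
    for row in rows:
        by_stratum[key(row)].append(row)

    # Sample without replacement inside each stratum named in the quotas.
    sample = []
    for stratum, quota in quotas.items():
        candidates = by_stratum.get(stratum, [])
        sample.extend(rng.sample(candidates, min(quota, len(candidates))))
    return sample


# Toy population: 100 made-up users, 25 in each (country, tenure) combination.
users = [
    {"user_id": i, "country": country, "tenure": tenure}
    for i, (country, tenure) in enumerate(
        [("US", "new"), ("US", "long"), ("BR", "new"), ("BR", "long")] * 25
    )
]

# Rule: the training set should hold twice as many new users as long-tenured
# users in each country.
quotas = {
    ("US", "new"): 10,
    ("US", "long"): 5,
    ("BR", "new"): 10,
    ("BR", "long"): 5,
}

sample = stratified_sample(
    users, key=lambda u: (u["country"], u["tenure"]), quotas=quotas
)
```

On a real Spark DataFrame the same grouping and per-stratum sampling would run in a distributed fashion; the in-memory version above only illustrates the shape of a rule over combined attributes.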
Session hashtag: #DSSAIS11
Shiva Chaitanya is a senior software engineer on the Personalization Infrastructure team at Netflix. His primary focus is building and improving machine learning libraries, mostly within the Spark/Scala ecosystem, that are consumed by several internal teams working on Netflix's recommendation algorithms. Shiva holds a Ph.D. in Computer Science & Engineering from Pennsylvania State University.