Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes. At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, and validation, are based on the Spark ML PipelineStage framework. While this gives developers the flexibility to build individual components as encapsulated pipeline stages, we find that coordination across stages can provide significant performance gains.
In this talk, we discuss how our Spark-based machine learning pipeline has been improved over the years. Techniques such as predicate pushdown and wide transformation minimization have led to significant run time improvements and resource savings.
Session hashtag: #SFexp9
DB Tsai is an Apache Spark PMC member and committer, and a Senior Research Engineer working on Personalized Recommendation Algorithms at Netflix. He implemented several algorithms in Apache Spark, including linear models with Elastic-Net (L1/L2) regularization using LBFGS/OWL-QN optimizers. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he led a team that developed innovative large-scale distributed learning algorithms and contributed them back to the open source Apache Spark project. DB was a Ph.D. candidate in Applied Physics at Stanford University. He holds a Master's degree in Electrical Engineering from Stanford.