DB Tsai - Databricks

DB Tsai

Senior Research Engineer, Netflix

DB Tsai is an Apache Spark PMC and committer and a Senior Research Engineer working on Personalized Recommendation Algorithms at Netflix. He implemented several algorithms including linear models with Elastici-Net (L1/L2) regularization using LBFGS/OWL-QN optimizers in Apache Spark. Prior to joining Netflix, DB was a Lead Machine Learning Engineer at Alpine Data Labs, where he led a team to develop innovative large-scale distributed learning algorithms, and then contributed back to open source Apache Spark project. DB was a Ph.D. candidate in Applied Physics at Stanford University. He holds a Master’s degree in Electrical Engineering from Stanford.



BoF-How to bring the pipeline built in Notebook / Apache Spark to production, and machine learning deployment cyclesSummit 2017

Notebook is a widely used tools for data scientists to analyze the data to find insight and build learning models. However, there is a gap bringing the notebook into production pipeline. How do we streamline the process to deploy the notebook into production? If we find something needed to be improved in production, how do we shorten the cycles? How do we make sure there is no discrepancy between the online feature generation which will be used for online service and the features generated offline for model training?

Large-Scale Lasso and Elastic-Net Regularized Generalized Linear ModelsSummit 2015

Nonlinear methods are widely used to produce higher performance compared with linear methods; however, nonlinear methods are generally more expensive in model size, training time, and scoring phase. With proper feature engineering techniques like polynomial expansion, the linear methods can be as competitive as those nonlinear methods. In the process of mapping the data to higher dimensional space, the linear methods will be subject to overfitting and instability of coefficients which can be addressed by penalization methods including Lasso and Elastic-Net. Finally, we’ll show how to train linear models with Elastic-Net regularization using MLlib.

Distributed Time Travel for Feature GenerationSummit East 2016

Learning is an analytic process of exploring the past in order to predict the future. Hence, being able to travel back in time to create feature is critical for machine learning projects to be successful. At Netflix, we spend significant time and effort experimenting with new features and new ways of building models. This involves generating features for our members from different regions over multiple days. To enable this, we built a time machine using Apache Spark that computes features for any arbitrary time in the recent past. The first step of building this time machine is to snapshot the data from various micro services on a regular basis. We built a general purpose workflow orchestration and scheduling framework optimized for machine learning pipelines and used it to run the snapshot and model training workflows. Snapshot data is then consumed by feature encoders to compute various features for offline experimentation and model training. Crucially, the same feature encoders are used in both offline model building and online scoring for production or A/B tests. Building this time machine helped us try new ideas quickly without placing stress on production services and without having to wait for data accumulation of the newly-implemented features. Moreover, building it with Apache Spark empowered us to both scale up the data size by an order of magnitude and train and validate the models in less time. Finally, using Apache Zeppelin notebook, we are able to interactively prototype features and run experiments.

Netflix’s Recommendation ML Pipeline Using Apache SparkSummit East 2017

Netflix is the world’s largest streaming service, with 80 million members in over 250 countries. Netflix uses machine learning to inform nearly every aspect of the product, from the recommendations you get, to the boxart you see, to the decisions made about which TV shows and movies are created.Given this scale, we utilized Apache Spark to be the engine of our recommendation pipeline. Apache Spark enables Netflix to use a single, unified framework/API – for ETL, feature generation, model training, and validation. With pipeline framework in Spark ML, each step within the Netflix recommendation pipeline (e.g. label generation, feature encoding, model training, model evaluation) is encapsulated as Transformers, Estimators and Evaluators – enabling modularity, composability and testability. Thus, Netflix engineers can build our own feature engineering logics as Transformers, learning algorithms as Estimators, and customized metrics as Evaluators, and with these building blocks, we can more easily experiment with new pipelines and rapidly deploy them to production. In this talk, we will discuss how Apache Spark is used as a distributed framework we build our own algorithms on top of to generate personalized recommendations for each of our 80+ million subscribers, specific techniques we use at Netflix to scale, and the various pitfalls we’ve found along the way.

Learn more:
  • ML Pipelines: A New High-Level API for MLlib
  • Audience Modeling With Apache Spark ML Pipelines

  • Performance Optimization of Recommendation Training Pipeline at NetflixSummit 2017

    Netflix is the world’s largest streaming service, with over 80 million members worldwide. Machine learning algorithms are used to recommend relevant titles to users based on their tastes.At Netflix, we use Apache Spark to power our recommendation pipeline. Stages in the pipeline, such as label generation, data retrieval, feature generation, training, validation, are based on Spark ML PipleStage framework. While this provides developers the flexibility to develop individual components as encapsulated pipeline stages, we find that coordination across stages can potentially provide significant performance gains. In this talk, we discuss how our machine learning pipeline based on Spark has been improved over the years. Techniques such as predicate pushdown, wide transformation minimization, have lead to significant run time improvement and resource savings. Session hashtag: #SFexp9

    VEGAS: The Missing Matplotlib for Scala/Apache SparkSummit Europe 2017

    In this talk, we'll present techniques for visualizing large scale machine learning systems in Spark. These are techniques that are employed by Netflix to understand and refine the machine learning models behind Netflix's famous recommender systems that are used to personalize the Netflix experience for their 99 millions members around the world. Essential to these techniques is Vegas, a new OSS Scala library that aims to be the "missing MatPlotLib" for Spark/Scala. We'll talk about the design of Vegas and its usage in Scala notebooks to visualize Machine Learning Models. Session hashtag: #EUds0