Apache Spark enables applications to efficiently process massive datasets. Its extensive machine learning libraries, such as MLlib, have showcased the power of the Spark system architecture and the elegance of its API. The existing Spark libraries assume that the number of parameters in machine learning models are small enough to fit in the memory of a single machine. This assumption, however, is not compatible with demanding use cases at Yahoo. In order to fit our needs, we developed a set of Spark ML libraries that can handle large models with billions of parameters. To enable models with billions of parameters, we have explored a system architecture that augments Spark driver/executors with Parameter Servers (PS). This provides distributed in-memory model storage and computation, and parameter servers enable Spark-executor-based learners to jointly learn large models efficiently. In this talk, we will illustrate the power of Spark+PS architecture via two algorithms: logistic regression and word2vec. We will elaborate on how Spark+PS has enabled us to achieve significant model size scale-up and speed-up of machine learning. We will also discuss how this approach could be applied to other ML algorithms.
Badri Narayan Bhaskar is a Machine Learning Scientist specializing in distributed inference and optimization. He has developed scalable prediction algorithms for content ranking and recommendation, advertising click prediction and demographic inference at Yahoo and Networked Insights Inc on Hadoop, Spark and Storm clusters. He obtained his M.S and Ph.D in Electrical Engineering from the University of Wisconsin.