Xiangrui Meng is an Apache Spark PMC member and a software engineer at Databricks. His main interests center around developing and implementing scalable algorithms for scientific applications. He has been actively involved in the development and maintenance of Spark MLlib since he joined Databricks. Before Databricks, he worked as an applied research engineer at LinkedIn, where he was the main developer of an offline machine learning framework in Hadoop MapReduce. His Ph.D. work at Stanford is on randomized algorithms for large-scale linear regression problems.
Prior to v1.0, MLlib only supports dense data in regression, classification, and clustering, while sparse data dominates in practice. In this talk, we will show the design choices we’ve made to support sparse data in MLlib and the optimizations we used to take advantage of sparsity in k-means, gradient descent, column summary statistics, tall-and-skinny SVD and PCA, etc.
Recommendation systems are among the most popular applications of machine learning. MLlib implements alternating least squares (ALS) for collaborative filtering, a very popular algorithm for making recommendations. We utilize Spark’s in-memory caching and a special partitioning strategy to make ALS efficient and scalable. MLlib’s ALS runs 10x faster than Apache Mahout’s implementation and it scales up to billions of ratings. In this talk, we present a more scalable implementation of ALS with scalability results on 100 billion ratings. It is based on the issues we experienced with the old implementation. We will review the ALS algorithm, and describe the internal data storage we used in the new implementation as well as techniques used to accelerate the computation and to improve JVM efficiency. We will also discuss the next steps for recommendation algorithms in MLlib.
Generalized linear models (GLMs) unify various statistical models such as linear regression and logistic regression through the specification of a model family and link function. They are widely used in modeling, inference, and prediction with applications in numerous fields. In this talk, we will summarize recent community efforts in supporting GLMs in Spark MLlib and SparkR. We will review supported model families, link functions, and regularization types, as well as their use cases, e.g., logistic regression for classification and log-linear model for survival analysis. Then we discuss the choices of solvers and their pros and cons given training datasets of different sizes, and implementation details in order to match R's model output and summary statistics. We will also demonstrate the APIs in MLlib and SparkR, including R model formula support, which make building linear models a simple task in Spark. This is a joint work with Eric Liang, Yanbo Liang, and some other Spark contributors.
Since its introduction in Spark 1.4, SparkR has received contributions from both the Spark community and the R community. In this talk, we will summarize recent community efforts on extending SparkR for scalable advanced analytics. We start with the computation of summary statistics on distributed datasets, including single-pass approximate algorithms. Then we demonstrate MLlib machine learning algorithms that have been ported to SparkR and compare them with existing solutions on R, e.g., generalized linear models, classification and clustering algorithms. We also show how to integrate existing R packages with SparkR to accelerate existing R workflows.
Graph analytics has a wide range of applications, from information propagation and network flow optimization to fraud and anomaly detection. The rise of social networks and the Internet of Things has given us complex web-scale graphs with billions of vertices and edges. However, in order to extract the hidden gems within those graphs, you need tools to analyze the graphs easily and efficiently. At Spark Summit 2016, Databricks introduced GraphFrames, which implemented graph queries and pattern matching on top of Spark SQL to simplify graph analytics. In this talk, you'll learn about work that has made graph algorithms in GraphFrames faster and more scalable. For example, new implementations like connected components have received algorithm improvements based on recent research, as well as performance improvements from Spark DataFrames. Discover lessons learned from scaling the implementation from millions to billions of nodes; compare its performance with other popular graph libraries; and hear about real-world applications. Session hashtag: #SFml1