Scalable Distributed Decision Trees in Spark MLLib - Databricks

Scalable Distributed Decision Trees in Spark MLLib

Download Slides

Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used since they are easy to interpret, handle categorical variables, extend to the multi-class classification setting, do not require feature scaling and are able to capture non-linearities and feature interactions. Spark RDDs are naturally suited for efficient computations on narrow dependencies and multiple passes over the data especially with in-memory caching, thus making Spark an ideal choice for decision tree implementation.

In this talk, we will describe the implementation of tree algorithms in MLlib that is able to handle massive datasets. We will describe the challenges for implementing a distributed decision tree, including repeated histogram computations on various subsets of the data defined by the branching filters. We will also demonstrate the performance of our implementation via weak and strong scaling results for multiple datasets and cluster sizes. Finally, we will demonstrate how the decision tree implementation can be used as a building block for ensemble methods like boosting and random forests, which are considered top performers for both classification and regression tasks.

About Evan Sparks

Evan R. Sparks is a Ph.D. student in Computer Science at UC Berkeley. He holds a Masters of Science in Computer Science from UC Berkeley and a Bachelor of Arts in Computer Science with High Honors from Dartmouth College. Prior to joining UC Berkeley, Evan worked in Quantitative Asset Management at MDT Advisers and as an Engineer at the Web Intelligence firm Recorded Future.

About Ameet Talkwalkar

Ameet Talwalkar is an NSF post-doctoral fellow at UC Berkeley and a consultant at Databricks. His research addresses scalability and ease-of-use issues in the field of statistical machine learning, with applications related to large-scale genomic sequencing. He started the MLlib project in Apache Spark and is also a co-author of the graduate-level textbook entitled “Foundations of Machine Learning” (2012, MIT Press). Next year he will join UCLA’s Computer Science Department as an Assistant Professor.