Evan R. Sparks is a Ph.D. student in Computer Science at UC Berkeley. He holds a Master of Science in Computer Science from UC Berkeley and a Bachelor of Arts in Computer Science with High Honors from Dartmouth College. Prior to joining UC Berkeley, Evan worked in Quantitative Asset Management at MDT Advisers and as an Engineer at the Web Intelligence firm Recorded Future.
Decision trees and their ensembles are popular methods for the machine learning tasks of classification and regression. Decision trees are widely used because they are easy to interpret, handle categorical variables, extend to multi-class classification, do not require feature scaling, and can capture non-linearities and feature interactions. Spark RDDs are well suited to efficient computation over narrow dependencies and to multiple passes over the data, especially with in-memory caching, which makes Spark a strong fit for implementing decision trees. In this talk, we will describe the implementation of tree algorithms in MLlib, which is able to handle massive datasets. We will describe the challenges of implementing a distributed decision tree, including the repeated histogram computations on the subsets of the data defined by the branching filters. We will also demonstrate the performance of our implementation through weak and strong scaling results for multiple datasets and cluster sizes. Finally, we will demonstrate how the decision tree implementation can be used as a building block for ensemble methods such as boosting and random forests, which are considered top performers for both classification and regression tasks.
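To make the histogram computation mentioned above concrete, here is a minimal sketch in plain Python. It is not MLlib's code and does not use Spark: the lists in `partitions` stand in for data partitions, and the names `partition_histogram`, `best_split`, and `bin_edges` are hypothetical, chosen for illustration. The assumed scheme is that each partition counts (bin, label) pairs for a feature, the per-partition counts are summed, and the best split is chosen from the merged histogram using Gini impurity.

```python
from collections import Counter


def partition_histogram(rows, feature, bin_edges):
    """Count (bin, label) pairs for one feature over one partition."""
    hist = Counter()
    for features, label in rows:
        # Bin index = number of edges the value is greater than or equal to.
        b = sum(features[feature] >= edge for edge in bin_edges)
        hist[(b, label)] += 1
    return hist


def gini(label_counts):
    """Gini impurity of a label -> count mapping."""
    n = sum(label_counts.values())
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in label_counts.values())


def best_split(hist, num_bins):
    """Return (weighted_gini, split_bin); bins <= split_bin go left."""
    total = Counter()
    for (_, label), count in hist.items():
        total[label] += count
    n = sum(total.values())

    left = Counter()
    best = None
    for split in range(num_bins - 1):
        # Move the counts of the current bin to the left side.
        for (b, label), count in hist.items():
            if b == split:
                left[label] += count
        right = total - left
        n_left = sum(left.values())
        n_right = n - n_left
        if n_left == 0 or n_right == 0:
            continue  # a split must leave data on both sides
        score = (n_left * gini(left) + n_right * gini(right)) / n
        if best is None or score < best[0]:
            best = (score, split)
    return best


# Two toy "partitions" of (features, label) rows; the data is made up.
partitions = [
    [([1.0], 0), ([2.0], 0), ([6.0], 1)],
    [([3.0], 0), ([7.0], 1), ([8.0], 1)],
]
bin_edges = [2.5, 5.0, 7.5]  # three edges define four bins

# Per-partition histograms are computed independently, then summed.
merged = sum((partition_histogram(p, 0, bin_edges) for p in partitions), Counter())
result = best_split(merged, len(bin_edges) + 1)
print(result)  # (0.0, 1): splitting after bin 1 (threshold 5.0) separates the labels
```

Because the per-partition histograms are small and simply add together, only the counts need to be combined across machines, not the rows themselves. In a tree, the same computation is repeated for each node on the subset of rows that satisfies the branching filters on the path to that node.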