Decision Trees, Gradient-Boosted Trees, and Random Forests are among the most commonly used learning methods in Spark MLlib. As datasets grow, there is a pressing need to model high-dimensional data and use highly expressive (i.e., deep) trees. However, most of the Decision Tree code in MLlib uses optimizations borrowed from Google’s PLANET framework, which scales poorly as data dimensionality and tree depths grow. Yggdrasil is a new distributed tree learning algorithm implemented in Spark that scales well to high-dimensional data and deep trees. Unlike PLANET, Yggdrasil partitions the training data vertically (by column) rather than horizontally (by row), leading to substantially lower communication costs. In our evaluation, we found that, for a single tree, Yggdrasil outperforms Spark MLlib’s standard Decision Tree algorithm by 13x on a large dataset (2 million rows, 3500 features) from a leading Web company. Yggdrasil is open-source, and we plan to publish it as a Spark package to let users take advantage of this improved performance.
Firas Abuzaid is a 3rd-year PhD student in the Stanford InfoLab, advised Profs. Peter Bailis and Matei Zaharia. Firas works on problems at the intersection of machine learning and systems; he enjoys building new systems and abstractions that make machine learning faster, more scalable, and easier to use. His work has applications across a broad variety of domains, such as video classification, recommendation serving, and data analytics; Firas has presented his research at multiple venues, including NIPS, VLDB, and HPTS. In his spare time, when he’s not running experiments, you can find Firas running the Dish behind Stanford, or some other scenic spot around Palo Alto or Menlo Park.