Firas Abuzaid is a 3rd-year PhD student in the Stanford InfoLab, advised by Profs. Peter Bailis and Matei Zaharia. Firas works on problems at the intersection of machine learning and systems; he enjoys building new systems and abstractions that make machine learning faster, more scalable, and easier to use. His work has applications across a broad variety of domains, such as video classification, recommendation serving, and data analytics; Firas has presented his research at multiple venues, including NIPS, VLDB, and HPTS. In his spare time, when he’s not running experiments, you can find Firas running the Dish behind Stanford, or some other scenic spot around Palo Alto or Menlo Park.
Many queries in Spark workloads execute over unstructured or text-based data formats, such as JSON or CSV files. Unfortunately, parsing these formats into queryable DataFrames or Datasets is often the slowest stage of these workloads, especially for interactive, ad-hoc analytics. In many instances, this bottleneck can be eliminated by taking filters expressed in the high-level query (e.g., a SQL query in Spark SQL) and pushing the filters into the parsing stage, thus reducing the total number of records that need to be parsed. In this talk, we present Sparser, a new parsing library in Spark for JSON, CSV, and Avro files. By aggressively filtering records before parsing them, Sparser achieves up to 9x end-to-end runtime improvement on several real-world Spark SQL workloads. Using Spark's Data Source API, Sparser extracts the filtering expressions specified by a Spark SQL query; these expressions are then compiled into fast, SIMD-accelerated "pre-filters" which can discard data an order of magnitude faster than the JSON and CSV parsers currently available in Spark. These pre-filters are approximate and may produce false positives; thus, Sparser intelligently selects the best set of pre-filters that minimizes the overall parsing runtime for any given query. We show that, for Spark SQL queries with low selectivity (i.e., very selective filters), Sparser routinely outperforms the standard parsers in Spark by at least 3x. Sparser can be used as a drop-in replacement for any Spark SQL query; our code is open-source, and our Spark package will be made public soon.
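The pre-filtering idea can be illustrated with a minimal Python sketch. This is not Sparser's implementation: it uses no SIMD and is not tied to Spark. The function name `prefilter_then_parse`, the sample records, and the choice of a plain substring search as the pre-filter are all assumptions made for illustration. The sketch also assumes the search string appears verbatim (unescaped) in the raw text of every matching record.

```python
import json

def prefilter_then_parse(raw_lines, needle, predicate):
    """Parse only the lines whose raw text contains `needle`,
    then apply the exact predicate to the parsed records.
    (Hypothetical helper, for illustration only.)"""
    results = []
    parsed_count = 0
    for line in raw_lines:
        # Cheap approximate check on the unparsed text. It can let
        # through records that do not really match (false positives),
        # so the exact predicate still runs after parsing.
        if needle not in line:
            continue
        record = json.loads(line)
        parsed_count += 1
        if predicate(record):
            results.append(record)
    return results, parsed_count

# Hypothetical sample data: one JSON record per line.
raw_lines = [
    '{"user": "alice", "text": "hello spark"}',
    '{"user": "bob", "text": "alice says hi"}',
    '{"user": "carol", "text": "nothing"}',
]

results, parsed_count = prefilter_then_parse(
    raw_lines, "alice", lambda r: r["user"] == "alice"
)
```

Here the second record passes the pre-filter because "alice" appears in its text field, but the exact predicate rejects it after parsing. The third record is discarded without being parsed at all, which is where the savings would come from when filters are highly selective.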
Decision Trees, Gradient-Boosted Trees, and Random Forests are among the most commonly used learning methods in Spark MLlib. As datasets grow, there is a pressing need to model high-dimensional data and use highly expressive (i.e., deep) trees. However, most of the Decision Tree code in MLlib uses optimizations borrowed from Google's PLANET framework, which scales poorly as data dimensionality and tree depths grow. Yggdrasil is a new distributed tree learning algorithm implemented in Spark that scales well to high-dimensional data and deep trees. Unlike PLANET, Yggdrasil partitions the training data vertically (by column) rather than horizontally (by row), leading to substantially lower communication costs. In our evaluation, we found that, for a single tree, Yggdrasil outperforms Spark MLlib's standard Decision Tree algorithm by 13x on a large dataset (2 million rows, 3500 features) from a leading Web company. Yggdrasil is open-source, and we plan to publish it as a Spark package to let users take advantage of this improved performance.
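To make the column-wise partitioning idea concrete, here is a minimal single-process Python sketch. It is not the Yggdrasil algorithm: each dictionary entry merely stands in for a worker that owns one feature column, and the function names, the Gini-based split criterion, the binary 0/1 labels, and the sample data are all assumptions chosen for illustration.

```python
def best_split_for_column(values, labels):
    """Return (weighted_gini, threshold) for the best binary split
    on a single feature column with 0/1 labels."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    total_pos = sum(label for _, label in pairs)
    best = (float("inf"), None)
    left_pos = 0
    for i in range(1, n):
        left_pos += pairs[i - 1][1]
        if pairs[i][0] == pairs[i - 1][0]:
            continue  # no threshold fits between equal values
        left_n, right_n = i, n - i
        right_pos = total_pos - left_pos
        p_left = left_pos / left_n
        p_right = right_pos / right_n
        gini_left = 1.0 - p_left ** 2 - (1 - p_left) ** 2
        gini_right = 1.0 - p_right ** 2 - (1 - p_right) ** 2
        score = (left_n * gini_left + right_n * gini_right) / n
        if score < best[0]:
            threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best = (score, threshold)
    return best

def best_split_vertical(columns, labels):
    """Each column is scanned independently, as if held by its own
    worker; only one small (score, name, threshold) tuple per column
    is collected to choose the overall best split."""
    candidates = []
    for name, values in columns.items():
        score, threshold = best_split_for_column(values, labels)
        if threshold is not None:
            candidates.append((score, name, threshold))
    return min(candidates)

# Hypothetical sample data: two feature columns, four rows.
columns = {
    "f0": [1.0, 2.0, 3.0, 4.0],
    "f1": [5.0, 1.0, 4.0, 2.0],
}
labels = [0, 0, 1, 1]

best = best_split_vertical(columns, labels)
```

In this sketch the only data combined across columns is one candidate tuple per feature, which gives a rough sense of why partitioning by column can keep communication small as the number of features and the depth of the tree grow.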