Experiences with Spark’s RDD APIs for Complex, Custom Applications

Download Slides

In this talk, we will discuss several advantages of the Spark RDD API for developing custom applications when compared to pure SQL-like interfaces such as Hive. In particular, we will describe how to control data distribution, avoid data skew, and implement application specific optimizations in order to build performant and reliable data pipelines. In order to illustrate these ideas, we will share our experiences redesigning a large-scale, complex (100+ stage) language model training pipeline for Spark that was originally built in Hive. The final Spark based pipeline is modular, readable, and more maintainable when compared to previous set of HQL queries. In addition to the qualitative improvements, we also observed a significant reduction in both resource usage and data landing time. Finally, we will also describe Spark optimizations that we implemented for this workload that can be applied toward batch workloads in general.