Apache Spark at Scale: A 60 TB+ Production Use Case


In this talk, we will describe our experiences and lessons learned while scaling Spark to replace one of our production Hive workloads, which processes more than 60 TB of compressed input data and shuffles 90 TB of compressed intermediate data. As far as we know, we are now running the largest real-world Spark job attempted in terms of shuffle data size (Databricks' petabyte sort used synthetic data). We will discuss the many reliability fixes and performance improvements that we contributed back to the Apache Spark project, both to support our workload and to benefit others in the community. We will present performance benchmarks showing that our Spark pipeline is 4-6x faster than its now-deprecated Hive counterpart. Finally, we will conclude by discussing future work and our initial experiences with Spark 2.0.