Making Sense of Spark Performance - Databricks



In this talk, I'll take a deep dive into Spark's performance on two benchmarks (TPC-DS and the Big Data Benchmark from UC Berkeley) and one production workload, and demonstrate that many commonly held beliefs about performance bottlenecks do not hold. In particular, I'll show that CPU (and not I/O) is often the bottleneck, that optimizing network performance can reduce job completion time by a median of at most 4%, and that the causes of most stragglers can be identified and fixed. After describing the takeaways from the workloads I studied, I'll give a brief demo of how the (open-source) tools I developed can be used by others to understand why Spark jobs are taking longer than expected. I'll conclude by proposing changes to Spark core that, based on my performance study, could significantly improve performance. This talk is based on a research talk that I'll be giving at NSDI 2015.
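Findings like the network-bottleneck number above can be sanity-checked on one's own workload by mining Spark's JSON event logs (written when `spark.eventLog.enabled` is true). Below is a minimal sketch that estimates the fraction of task time spent blocked on shuffle network fetches, using the `Executor Run Time` and `Fetch Wait Time` fields from `SparkListenerTaskEnd` events; the two sample events are synthetic illustrations, not real log output, and this is a rough per-task ratio rather than the blocked-time analysis from the study itself.

```python
import json

# Two synthetic SparkListenerTaskEnd events (illustrative values only).
# A real event log is a JSON-lines file: one event object per line.
log_lines = [
    '{"Event": "SparkListenerTaskEnd", "Task Metrics": {"Executor Run Time": 900, '
    '"Shuffle Read Metrics": {"Fetch Wait Time": 30}}}',
    '{"Event": "SparkListenerTaskEnd", "Task Metrics": {"Executor Run Time": 1100, '
    '"Shuffle Read Metrics": {"Fetch Wait Time": 50}}}',
]

def network_wait_fraction(lines):
    """Estimate the share of total executor run time spent waiting on shuffle fetches."""
    run_ms = wait_ms = 0
    for line in lines:
        event = json.loads(line)
        if event.get("Event") != "SparkListenerTaskEnd":
            continue
        metrics = event.get("Task Metrics", {})
        run_ms += metrics.get("Executor Run Time", 0)
        wait_ms += metrics.get("Shuffle Read Metrics", {}).get("Fetch Wait Time", 0)
    return wait_ms / run_ms if run_ms else 0.0

# For the sample events: 80 ms waiting out of 2000 ms running -> 0.04
print(network_wait_fraction(log_lines))
```

A small wait fraction like this is consistent with the talk's claim that network is rarely the dominant bottleneck, though fetch-wait time understates network cost when fetches overlap with computation.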

About Kay Ousterhout

Kay Ousterhout is a committer and PMC member for Apache Spark and a software engineer at LightStep. Kay received her PhD from UC Berkeley in 2017. Her thesis focused on building high-performance data analytics frameworks that allow users to reason about, and optimize for, performance. She also co-authored Sparrow and Drizzle, two high-throughput schedulers for Spark designed for low-latency workloads. At LightStep, she focuses on enabling users to understand the performance of complex distributed systems.