This talk will describe Monotasks, a new architecture for the core of Spark that makes performance easier to reason about. In Spark today, pervasive parallelism and pipelining make it difficult to answer even simple performance questions like “what is the bottleneck for this workload?”. As a result, it’s difficult for developers to know what to optimize, and it’s even more difficult for users to understand what hardware to use and what configuration parameters to set to get the best performance. Monotasks is a radically different execution model, designed to make performance easy to reason about. With this design, jobs are broken into units of work called monotasks that each use a single resource (e.g., CPU, disk, or network). Each resource is managed by a per-resource scheduler that assigns dedicated resources to each monotask; e.g., the CPU scheduler assigns each CPU monotask a dedicated CPU core. This design makes it simpler to reason about performance, because each resource is managed by a single scheduler that has complete visibility over how that resource is used. Per-resource schedulers can easily provide high utilization by scheduling concurrent tasks to match the concurrency of the underlying resource (e.g., 8 CPU monotasks to fully utilize 8 CPU cores), and allow the framework to auto-tune based on which resources are currently contended. For example, using monotasks, Spark can automatically determine whether to compress a particular dataset based on the length of the CPU and disk queues. We have implemented monotasks as an API-compatible replacement for Spark’s execution layer. This talk will first present results illustrating that Monotasks makes it simple to reason about performance. Next, we’ll present results demonstrating new performance optimizations enabled by Monotasks (e.g., automatically determining whether to compress data) that allow Monotasks to provide much better performance than today’s Spark.
Kay Ousterhout is a Spark committer and PMC member and a PhD student at UC Berkeley. In the Spark project, Kay is a committer of the scheduler, and her work on Spark has focused on improving scheduler performance. At UC Berkeley, Kay's research focuses on understanding and improving performance of large-scale analytics frameworks.