Mayuresh Kunjir is a PhD candidate in the Computer Science Department at Duke University. His research focuses on resource management and query optimization in data analytics systems. Prior to joining Duke, Mayuresh received his MS from the Indian Institute of Science, Bangalore, where he worked on improving the power efficiency of commercial database engines.
June 7, 2016 05:00 PM PT
Allocation and usage of memory in Spark is based on an interplay of algorithms at multiple levels: (i) at the resource-management level, across the containers allocated by Mesos or YARN; (ii) at the container level, between the OS and multiple processes such as the JVM and Python; (iii) at the Spark application level, for caching, aggregation, data shuffles, and program data structures; and (iv) at the JVM level, across pools such as the Young and Old Generations as well as on-heap versus off-heap memory. The goal of this talk is to give application developers and operational staff easy ways to understand the multitude of choices involved in Spark's memory management. This talk is based on an extensive experimental study of Spark on YARN, done using a representative suite of applications.
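As a rough illustration of those four levels, the configuration sketch below lists representative Spark knobs (parameter names from the Spark 1.6-era documentation; the values are placeholders, not tuning recommendations):

```
# (i)/(ii) resource-manager and container level: heap size requested per
# executor container, plus off-heap headroom for the OS, Python, and native code
spark.executor.memory               4g
spark.yarn.executor.memoryOverhead  768

# (iii) application level: split of the heap among caching, shuffle/aggregation,
# and user data structures (unified memory model, Spark 1.6+)
spark.memory.fraction               0.75
spark.memory.storageFraction        0.5

# (iv) JVM level: generation sizing and optional off-heap pool
spark.executor.extraJavaOptions     -XX:NewRatio=3
spark.memory.offHeap.enabled        true
spark.memory.offHeap.size           1g
```

Which of these knobs matters most depends on the workload, which is precisely the kind of interaction the study examines.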
Takeaways from this talk:
- We identify the memory pools used at different levels along with the key configuration parameters (i.e., tuning knobs) that control memory management at each level.
- We show how to collect resource usage and performance metrics for various memory pools, and how to analyze these metrics to identify contention versus underutilization of the pools.
- We show the impact of key memory-pool configuration parameters at the application, container, and JVM levels. We also highlight tradeoffs between memory usage and running time, which are important indicators of resource utilization and application performance.
- We demonstrate how application characteristics, such as shuffle selectivity and input data size, dictate the impact of memory pool settings on application response time, efficiency of resource usage, chances of failure, and performance predictability.
- We summarize our findings as key troubleshooting and tuning guidelines at each level for improving application performance while achieving the highest resource utilization possible in multi-tenant clusters.
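To make the application-level pool split concrete, here is a minimal sketch of the arithmetic behind Spark's unified memory model (Spark 1.6+): a reserved slice is carved off the executor heap, `spark.memory.fraction` of the remainder forms a pool shared by execution and storage, and `spark.memory.storageFraction` sets the storage region protected from eviction. The function name and the 0.75/0.5 defaults reflect the Spark 1.6 documentation; this is an illustration, not the talk's tuning method.

```python
# Sketch of the Spark 1.6 unified memory model arithmetic.
RESERVED_MB = 300  # system-reserved slice of the executor JVM heap

def memory_pools(heap_mb, memory_fraction=0.75, storage_fraction=0.5):
    """Return (unified, storage_floor, execution, user) pool sizes in MB."""
    usable = heap_mb - RESERVED_MB
    unified = usable * memory_fraction          # shared execution + storage pool
    storage_floor = unified * storage_fraction  # cached blocks safe from eviction
    execution = unified - storage_floor         # shuffle/aggregation share
    user = usable - unified                     # user data structures
    return unified, storage_floor, execution, user

# Example: a 4 GB executor heap with default fractions
unified, storage, execution, user = memory_pools(4096)
print(f"unified={unified:.1f}MB storage>={storage:.1f}MB "
      f"execution={execution:.1f}MB user={user:.1f}MB")
```

Shrinking `spark.memory.storageFraction` trades cache-eviction safety for more guaranteed shuffle memory, which is one of the tradeoffs between memory usage and running time discussed above.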