Asif works on the Spark team in Workday’s Prism Analytics division. He enjoys the nitty-gritty of Spark internals and has over 20 years of experience in distributed caching, SQL, and object-querying engine development. At Workday, he has been making various enhancements to Spark’s Catalyst optimizer for complex plans. At his previous company, SnappyData (acquired by TIBCO), he implemented an approximate querying engine on top of Spark and optimized Spark’s HashAggregateExec operator for their workloads. Asif holds a B.Tech in Chemical Engineering from IIT Bombay.
May 28, 2021 11:40 AM PT
For more than 6 years, Workday has been building various analytics products powered by Apache Spark. At the core of each product offering, customers use our UI to create data prep pipelines, which are then compiled to DataFrames and executed by Spark under the hood. As we built out our products, however, we started to notice places where vanilla Spark is not suitable for our workloads. For example, because our Spark plans are programmatically generated, they tend to be very complex, and often result in tens of thousands of operators. Another common issue is having case statements with thousands of branches, or worse, nested expressions containing such case statements.
With the right combination of these traits, the final DataFrame can easily take Catalyst hours to compile and optimize, if it doesn’t first cause the driver JVM to run out of memory.
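To make the scale concrete, here is a small illustrative sketch (not Workday’s code) of how a pipeline compiler that emits case expressions programmatically can inflate plan size: node counts grow multiplicatively once case statements nest, which is one way a generated plan reaches tens of thousands of operators. The `Case` class and the branch counts below are hypothetical stand-ins for the real expression tree.

```python
# Hypothetical sketch: counting expression nodes in programmatically
# generated, possibly nested, CASE expressions.
from dataclasses import dataclass

@dataclass
class Case:
    branches: list   # list of (condition, value) pairs
    default: object  # value returned when no branch matches

def size(expr):
    """Count expression-tree nodes, recursing into nested CASEs."""
    if isinstance(expr, Case):
        total = 1 + size(expr.default)
        for cond, value in expr.branches:
            total += size(cond) + size(value)
        return total
    return 1  # leaf: a column reference or literal

def make_case(n_branches, value):
    """Build a CASE with n_branches, each returning `value`
    (which may itself be another CASE)."""
    return Case([(f"col = {i}", value) for i in range(n_branches)],
                default=value)

leaf = "lit"
flat = make_case(1000, leaf)                   # 1,000 flat branches
nested = make_case(1000, make_case(10, leaf))  # each branch holds a 10-branch CASE

print(size(flat))    # 2002 nodes
print(size(nested))  # 23023 nodes: one level of nesting, ~11x larger
```

Under this toy model, a single level of nesting multiplies the node count by roughly the size of the inner expression, so a few nested layers of thousand-branch case statements are enough to overwhelm an optimizer that re-walks the tree per rule.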
In this talk, we discuss how we addressed some of our pain points around complex pipelines. Topics covered include memory-efficient plan logging, using common subexpression elimination to remove redundant subplans, rewriting Spark’s constraint propagation mechanism to avoid exponential growth of filter constraints, and other performance enhancements made to Catalyst rules.
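The constraint-propagation blowup mentioned above can be sketched abstractly: when an attribute referenced by a filter constraint has several equivalent aliases in the plan, a propagation scheme that materializes every alias-substituted variant of the constraint generates a combinatorial number of constraints. The `propagate` helper and alias shapes below are illustrative assumptions, not Catalyst’s actual implementation.

```python
# Illustrative sketch: why materializing every alias-substituted variant
# of a constraint grows exponentially with the number of aliased attributes.
from itertools import product

def propagate(constraint_attrs, aliases):
    """For a constraint over constraint_attrs, generate one variant per
    combination of equivalent names; aliases maps attr -> list of names."""
    choices = [aliases.get(a, [a]) for a in constraint_attrs]
    return [tuple(combo) for combo in product(*choices)]

# 10 attributes, each with 3 equivalent names (itself plus two aliases).
aliases = {f"a{i}": [f"a{i}", f"b{i}", f"c{i}"] for i in range(10)}
variants = propagate([f"a{i}" for i in range(10)], aliases)
print(len(variants))  # 3^10 = 59049 variants from a single constraint
```

With k aliases per attribute and n attributes in a constraint, this scheme produces k^n variants, which is why tracking alias equivalences instead of enumerating substitutions is the more scalable design.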
We then apply these changes to several production pipelines, showcasing the reduction in time spent in Catalyst, and list ideas for further improvements. Finally, we share tips on how you too can better handle complex Spark plans.