Code generation is integral to Spark’s physical execution engine. When implemented, the Spark engine creates optimized bytecode at runtime improving performance when compared to interpreted execution. Spark has taken the next step with whole-stage codegen which collapses an entire query into a single function. However, as the generated function sizes increase, new problems arise. Complex queries can lead to code generated functions ranging from thousands to hundreds of thousands of lines of code. This can lead to many problems such as OOM errors due to compilation costs, exceptions from exceeding the 64KB method limit in Java, and performance regressions when JIT compilation is turned off for a function whose bytecode exceeds 8KB. With whole-stage codegen turned off, Spark is able to split these functions into smaller functions to avoid these problems, but then the improvements of whole-stage codegen are lost. This talk will go over the improvements that Workday has made to code generation to handle whole-stage codegen for various queries. We will begin with the differences between expression codegen and whole-stage codegen. Then we will discuss how to split the collapsed function from whole-stage codegen. We will also present the performance improvements that we have seen from these improvements in our production workloads.
Michael Chen is a Software Development Engineer at Workday, working on the Prism Analytics product. Michael received his Bachelor's in computer science from the University of Michigan