After optimizing the query plan, Spark SQL's Catalyst optimizer compiles the SQL query to Java code. Without code generation, query expressions would have to be interpreted for each row of data by walking down a tree of nodes. This introduces a large number of branches and virtual function calls that slow down execution. With code generation, a query is collapsed into a single optimized function that eliminates multiple function calls and keeps intermediate data in CPU registers.
This code is then compiled at runtime to Java bytecode using the Janino compiler. This presentation focuses on further Catalyst code generation optimizations made possible by function outlining. Automatic code generation tools tend to generate huge optimized functions. Large functions that are executed frequently might degrade runtime performance by preventing JVM optimizations such as function inlining. To avoid this, code generation tools should try to place independent logic in separate functions.
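The contrast between per-row tree interpretation and a collapsed function can be illustrated with a minimal sketch. It is not Spark's actual expression API or generated output: the class and method names (EvalDemo, Expr, Col, Lit, Add, Mul, generated) are hypothetical, and the expression (a + 1) * b is chosen only for illustration.

```java
// Hypothetical illustration; none of these names come from Spark itself.
class EvalDemo {
    // Interpreted form: every node is an object, and evaluating a row
    // means one virtual eval() call per node in the tree.
    interface Expr {
        long eval(long[] row);
    }

    static final class Col implements Expr {
        final int ordinal;
        Col(int ordinal) { this.ordinal = ordinal; }
        public long eval(long[] row) { return row[ordinal]; }
    }

    static final class Lit implements Expr {
        final long value;
        Lit(long value) { this.value = value; }
        public long eval(long[] row) { return value; }
    }

    static final class Add implements Expr {
        final Expr left, right;
        Add(Expr left, Expr right) { this.left = left; this.right = right; }
        public long eval(long[] row) { return left.eval(row) + right.eval(row); }
    }

    static final class Mul implements Expr {
        final Expr left, right;
        Mul(Expr left, Expr right) { this.left = left; this.right = right; }
        public long eval(long[] row) { return left.eval(row) * right.eval(row); }
    }

    // Expression tree for (a + 1) * b, with a in column 0 and b in column 1.
    static Expr buildTree() {
        return new Mul(new Add(new Col(0), new Lit(1)), new Col(1));
    }

    // Collapsed form: roughly what generated code for the same expression
    // could look like. No node objects, no virtual calls; intermediates
    // are plain locals that the JIT can keep in registers.
    static long generated(long[] row) {
        long a = row[0];
        long b = row[1];
        return (a + 1) * b;
    }

    public static void main(String[] args) {
        long[] row = {4, 5};
        System.out.println(buildTree().eval(row)); // prints 25
        System.out.println(generated(row));        // prints 25
    }
}
```

Both paths compute the same value; the interpreted path performs five virtual calls per row for this small tree, while the collapsed path performs none.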
This presentation will take the audience through Spark Catalyst code generation, how automatic splitting of large functions into smaller functions was achieved, and the associated performance benefits.
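As a rough sketch of the outlining idea, assuming a projection that computes three independent output columns, the same logic can be emitted either as one method body or as a small driver that calls one helper per expression. The names (OutlineDemo, projectMonolithic, projectOutlined, sum, product, absDiff) are hypothetical, and the code is not the splitting mechanism presented in the talk.

```java
import java.util.Arrays;

// Hypothetical illustration of function outlining; not Spark's generated code.
class OutlineDemo {
    // Monolithic form: all expressions live in one method body. With many
    // expressions, a generated method like this can grow very large.
    static void projectMonolithic(long[] in, long[] out) {
        out[0] = in[0] + in[1];
        out[1] = in[0] * in[1];
        long d = in[0] - in[1];
        out[2] = d < 0 ? -d : d;
    }

    // Outlined form: each independent expression gets its own small method,
    // so the driver stays short and each helper is a candidate for inlining.
    static void projectOutlined(long[] in, long[] out) {
        out[0] = sum(in);
        out[1] = product(in);
        out[2] = absDiff(in);
    }

    private static long sum(long[] in) {
        return in[0] + in[1];
    }

    private static long product(long[] in) {
        return in[0] * in[1];
    }

    private static long absDiff(long[] in) {
        long d = in[0] - in[1];
        return d < 0 ? -d : d;
    }

    public static void main(String[] args) {
        long[] in = {3, 7};
        long[] a = new long[3];
        long[] b = new long[3];
        projectMonolithic(in, a);
        projectOutlined(in, b);
        System.out.println(Arrays.toString(a)); // prints [10, 21, 4]
        System.out.println(Arrays.equals(a, b)); // prints true
    }
}
```

With only three expressions the two forms behave alike; the difference would be expected to matter once a generated method holds enough expressions that the JIT compiler declines to inline or optimize it.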
Session hashtag: #SAISDD15
Madhusudanan Kandasamy is a Senior Technical Staff Member at IBM Systems - Bangalore. He has more than a decade of experience in UNIX operating system development and is an expert in virtual memory management, scheduling, the malloc subsystem, and system-level performance tuning of applications. Recently he has been focusing on improving the performance of Apache Spark by exploiting hardware features such as GPUs. He is also a Master Inventor at IBM, with 19 patents and 10 research disclosures to his name.
Kavana N Bhat is a Senior Developer with IBM Power Systems Development. She has 16 years of experience, including more than a decade in AIX OS development, mainly architecting ProbeVue, the dynamic tracing tool, and components such as WPARS, debuggers, Pthreads, and RAS features. She has experience performance-tuning big data workloads such as Apache Spark and Alluxio on POWER. In her recent role, she was involved in the development of IBM's Distributed Deep Learning Library. She holds a BE in CS from NIE, Mysore. She has around 5 patents and 5 research disclosures, and has presented at various technical conferences.