Spark SQL works very well with structured row-based data. Vectorized reader and writer for parquet/orc can make I/O much faster. It also used WholeStageCodeGen to improve the performance by Java JIT code. However Java JIT is usually not working very well on utilizing latest SIMD instructions under complicated queries. Apache Arrow provides columnar in-memory layout and SIMD optimized kernels as well as a LLVM based SQL engine Gandiva. These native based libraries can accelerate Spark SQL by reduce the CPU usage for both I/O and execution.
In this session, we would like to take a deep dive on we build Native SQL engine for Spark by leveraging Arrow Gandiva and its compute kernels. We will introduce the general design of commonly used operators like aggregation, sorting and joining, and discuss how can we optimize these operators with SIMD based instructions. We will also introduce how to implement WholeStageCodeGen with Native libraries. Finally we will use micro-benchmarks and TPCH workloads to explain how vectorized execution can benefit these workloads.
Speakers: Yuan Zhou and Chendi Xue
Yuan Zhou is a senior software development engineer Intel, where he primarily focused on big data storage software. He's been working in databases, virtualization, and cloud computing for most of his 10+ year career at Intel.
Chendi Xue is a software engineer from Intel data analytics team. She has six years’ experience in bigdata and cloud system optimization, focusing on computation, storage, network software stack performance analysis and optimization. She participated in the development works including Spark-Sql, Spark-Shuffle optimization, cache implementation, etc. Before that, she worked on Linux Device-Mapper optimization and iscsi optimization during her master degree study.a