This talk summarizes the results of using the TPC-DS workload to characterize the SQL capability, performance and scalability of Apache Spark SQL 2.0 at the multi-Terabyte scale in both single user dedicated and multi-user concurrent execution modes. We track the evolution of Spark SQL across versions 1.5, 1.6 and 2.0 to underscore the pace of improvement in Spark SQL capability and performance. We also provide best practices and configuration tuning parameters to support the concurrent execution of the 99 TPC-DS queries at scale. The key takeaways include 1) See the substantial progress made by Spark SQL 2.0 2) Understand what TPC-DS is and why it has become the preferred workload of SQL on Hadoop systems. 3) Experimental results supporting the optimized execution of multi-user, multi-terabyte TPC-DS-based workloads 4) Tuning and configuration changes used to attain excellent performance of Spark SQL.
Berni Schiefer is an IBM Fellow in the IBM Analytics Group. He is based in San Francisco at the IBM Spark Technology Centre and is responsible for a global team that focusses on the performance and scalability of products and solutions in the Analytics Group, specifically for Big Data technologies including Spark, BigInsights, Big SQL, dashDB, DB2 pureScale and DB2 with BLU acceleration. His passion is in bringing advanced technology to market with a particular emphasis on exploiting processor, memory, networking, storage technology and other hardware and software acceleration technologies.