Berni Schiefer is an IBM Fellow in the IBM Analytics Group. He is based in San Francisco at the IBM Spark Technology Centre and is responsible for a global team that focusses on the performance and scalability of products and solutions in the Analytics Group, specifically for Big Data technologies including Spark, BigInsights, Big SQL, dashDB, DB2 pureScale and DB2 with BLU acceleration. His passion is in bringing advanced technology to market with a particular emphasis on exploiting processor, memory, networking, storage technology and other hardware and software acceleration technologies.
This talk summarizes the results of using the TPC-DS workload to characterize the SQL capability, performance and scalability of Apache Spark SQL 2.0 at the multi-Terabyte scale in both single user dedicated and multi-user concurrent execution modes. We track the evolution of Spark SQL across versions 1.5, 1.6 and 2.0 to underscore the pace of improvement in Spark SQL capability and performance. We also provide best practices and configuration tuning parameters to support the concurrent execution of the 99 TPC-DS queries at scale. The key takeaways include 1) See the substantial progress made by Spark SQL 2.0 2) Understand what TPC-DS is and why it has become the preferred workload of SQL on Hadoop systems. 3) Experimental results supporting the optimized execution of multi-user, multi-terabyte TPC-DS-based workloads 4) Tuning and configuration changes used to attain excellent performance of Spark SQL.
Spark 2.0 provided strong performance enhancements to the Spark core while advancing Spark ML usability to use data frames. But what happens when you run Spark 2.0 machine learning algorithms on a large cluster with a very large data set? Do you even get any benefit from using a very large data set? It depends. :-) How do new hardware advances affect the topology of high performance Spark clusters. In this talk we will explore Spark 2.0 Machine Learning at scale and share our findings with the community. As our test platform we will be using a new cluster design, different from typical Hadoop clusters, with more cores, more RAM and latest generation NVMe SSD's and a 100GbE network with a goal of more performance, in a more space and energy efficient footprint.