Spark 2.0 delivered substantial performance enhancements to the Spark core while advancing Spark ML usability with a DataFrame-based API. But what happens when you run Spark 2.0 machine learning algorithms on a large cluster with a very large data set? Do you even get any benefit from using a very large data set? It depends. 🙂 How do new hardware advances affect the topology of high-performance Spark clusters? In this talk we will explore Spark 2.0 machine learning at scale and share our findings with the community.
As our test platform we will be using a new cluster design, different from typical Hadoop clusters, with more cores, more RAM, latest-generation NVMe SSDs, and a 100GbE network, with the goal of delivering more performance in a more space- and energy-efficient footprint.
Berni Schiefer is an IBM Fellow in the IBM Analytics Group. He is based in San Francisco at the IBM Spark Technology Centre and is responsible for a global team that focuses on the performance and scalability of products and solutions in the Analytics Group, specifically for Big Data technologies including Spark, BigInsights, Big SQL, dashDB, DB2 pureScale and DB2 with BLU Acceleration. His passion is bringing advanced technology to market, with a particular emphasis on exploiting processor, memory, networking, and storage technologies along with other hardware and software acceleration techniques.