Qifan Pu is a Ph.D. student in Computer Science. He works at the AMPLab with Prof. Ion Stoica on distributed systems. Previously, he received a bachelor degree from USTC in 2013. He has also worked/interned at Databricks (2016), Alluxio (2015), MSR (2014), UW (2012-2013) and MSR-Asia (2011-2012) in the past.
Spark started as a research project that has primarily focused on efficient data processing on a cluster of machines. Recently, advances in computer hardware and growing industry use cases have led to increasingly more powerful single-node machines. For example, Amazon EC2 has recently announced X1 instances that are equipped with 128 virtual cores and 2TB of memory. The capability of such machine already matches a small cluster, with the advantage of high network bandwidth between the cores. Efficiently leveraging such types of machines, even on a single node, however, requires fundamentally different software architecture. Naively running Spark on top of such machines results in sub-optimal performance because Spark was originally designed to operate across multiple nodes that communicate through expensive data transfers, which, in the case of intra-machine communication, is much unnecessary. In this talk, we will discuss our recent work in adapting Apache Spark to work on many-core machines. We will discuss reasons why Spark is a great fit for many-core systems, the performance gaps we have observed in the current architecture, as well as our new design and algorithmic improvements on Spark to bridge those gaps. As part of the efforts, we will also introduce a new in-memory shuffle mechanism that can greatly improve Spark shuffle performance on single node. This is a joint work between UC Berkeley AMPLab and Databricks.