The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark defines the state of the art in big data analytics platforms for (i) exploiting data-flow and in-memory computing and (ii) exhibiting superior scale-out performance on commodity machines, little effort has been devoted to understanding the performance of in-memory data analytics with Spark on modern scale-up servers. This thesis characterizes the performance of in-memory data analytics with Spark on scale-up servers. Through empirical evaluation of representative benchmark workloads on a dual-socket server, we have found that in-memory data analytics with Spark exhibits poor multi-core scalability beyond 12 cores due to thread-level load imbalance and work-time inflation. We have also found that workloads are bound by the latency of frequent data accesses to DRAM. As the input data size grows, application performance degrades significantly due to a substantial increase in wait time during I/O operations and garbage collection, despite a 10% better instruction retirement rate (due to fewer L1 cache misses and higher core utilization). We have found that simultaneous multi-threading is effective in hiding data access latencies. We have also observed that (i) data locality on NUMA nodes can improve performance by 10% on average, and (ii) disabling next-line L1-D prefetchers can reduce execution time by up to 14%. To reduce the impact of garbage collection, we match the memory behaviour of applications with the garbage collector, improving application performance by 1.6x to 3x, and recommend using multiple small executors, which can provide up to 36% speedup over a single large executor.
Ahsan Javed Awan is an Erasmus Mundus Joint Doctoral Fellow at KTH, Sweden, and UPC, Spain. He has been working on "Architecture Support for Apache Spark based Big Data Analytics" for the last four years. He has previously interned at IBM Research Tokyo, Japan, and Recore Systems, Netherlands. He was a visiting researcher at Barcelona Supercomputing Center, Spain, and also worked as a Lecturer at National University of Sciences and Technology (NUST), Pakistan. He holds an Erasmus Mundus Joint Master's Degree in Embedded Computing Systems from TU Kaiserslautern, Germany; University of Southampton, UK; and NTNU, Norway, and a B.E. degree in Mechatronics Engineering from NUST, Pakistan.