Spark Deployment and Performance Evaluation on a Petascale HPC Setup

Traditional HPC systems are designed according to a compute-centric paradigm: the focus is on computing power, and the goal is to process as many floating-point operations per second as possible. However, the growing importance of data-intensive applications is pushing many computing facilities toward a data-centric paradigm, in which the variable to maximize is the amount of data, measured in records or bytes, processed per second during data analysis. The emerging focus on big data, and the paradigm shift it implies, pose a dilemma for the managers of traditional HPC facilities, who must choose between deploying dedicated systems for data analytics and evolving their existing infrastructure to meet the new demands.

We have studied the second option, adapting an existing HPC setup to host a massively parallel dataflow platform able to execute big data workloads. Among the available massively parallel dataflow frameworks, we chose Apache Spark. We deployed Apache Spark 1.4.0 on a real-world, petascale HPC setup: the MareNostrum supercomputer, which is built on top of commodity hardware. We designed and developed a framework (Spark4MN) to efficiently run a Spark cluster in a Load Sharing Facility (LSF)-based environment while accounting for the hardware particularities of MareNostrum, such as GPFS storage, an InfiniBand network, and multicore nodes.

We have evaluated the behavior of two representative data-intensive applications: sorting and k-means. Especially for k-means, we show that MareNostrum’s performance is scalable and similar to, if not better than, the top
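
The abstract does not reproduce Spark4MN itself, but the general pattern for running Spark under an HPC batch scheduler is well established: the job script starts a standalone Spark master on one allocated node and workers on the rest, then launches the application against that master. The Scala sketch below illustrates only the application side of that pattern; the master host, GPFS scratch path, and per-node core count are hypothetical values, not details from the talk.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of an application configured for a Spark4MN-style run.
// All concrete values below are assumptions for illustration only.
val conf = new SparkConf()
  .setAppName("spark4mn-example")
  .setMaster("spark://node1001:7077")            // hypothetical standalone master started by the job script
  .set("spark.local.dir", "/gpfs/scratch/spark") // hypothetical scratch area on GPFS
  .set("spark.executor.cores", "16")             // hypothetical core count per multicore node

val sc = new SparkContext(conf)
```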
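
For reference, both benchmark workloads map onto standard Spark 1.4 APIs: a global sort is an RDD `sortByKey`, and k-means is available in MLlib. The sketch below, which reuses the SparkContext from the previous snippet, shows the general shape of each; the input paths, key width, number of clusters, and iteration count are hypothetical, since the talk does not specify them.

```scala
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

// Sorting: a global sort of key-value records by key.
val records = sc.textFile("/gpfs/data/sort-input")    // hypothetical input path
  .map(line => (line.take(10), line.drop(10)))        // hypothetical 10-byte keys
records.sortByKey().saveAsTextFile("/gpfs/data/sort-output")

// k-means: clustering with MLlib as of Spark 1.4.
val points = sc.textFile("/gpfs/data/kmeans-input")   // hypothetical input path
  .map(line => Vectors.dense(line.split(' ').map(_.toDouble)))
  .cache()
val model = KMeans.train(points, 100, 10)             // hypothetical k = 100, 10 iterations
```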