Alchemist: An Apache Spark MPI Interface - Databricks

Alchemist: An Apache Spark <=> MPI Interface

The need for efficient and scalable numerical linear algebra and machine-learning implementations continues to grow with the increasing importance of big data analytics. Since its introduction, Apache Spark has become an integral tool in this field, with attractive features such as ease of use, interoperability with the Hadoop ecosystem, and fault tolerance. However, it has been shown that numerical linear algebra routines implemented using MPI (the Message Passing Interface), a standard for parallel programming commonly used in high-performance computing, can outperform the equivalent Spark routines by an order of magnitude or more.

We describe Alchemist, a system for interfacing between Spark and existing MPI libraries that is designed to address this performance gap. The libraries can be called from a Spark application with little effort, and we illustrate how the resulting system leads to efficient and scalable performance on large datasets.

Session hashtag: #Res1SAIS

About Michael Mahoney

Michael Mahoney is at UC Berkeley. He works on algorithmic and statistical aspects of modern large-scale data analysis. His recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale with a dissertation in computational statistical mechanics, and he has worked and taught at Yale in the mathematics department, at Yahoo Research, and at Stanford in the mathematics department.

About Kai Rothauge

Kai Rothauge is currently a postdoc in the statistics department at UC Berkeley, working with Michael Mahoney on machine learning and numerical linear algebra on distributed systems. He earned his PhD in Applied Mathematics from the University of British Columbia and his MMath from the University of Bath. Prior to his doctoral studies, he also completed scientific internships at the Max Planck Institute, the Fraunhofer Institute, and CSIRO.