Michael Mahoney is at UC Berkeley. He works on algorithmic and statistical aspects of modern large-scale data analysis. His recent research has focused on large-scale machine learning, including randomized matrix algorithms and randomized numerical linear algebra, geometric network analysis tools for structure extraction in large informatics graphs, scalable implicit regularization methods, and applications in genetics, astronomy, medical imaging, social network analysis, and internet data analysis. He received his PhD from Yale with a dissertation in computational statistical mechanics, and he has worked and taught in the mathematics departments at Yale and Stanford, and at Yahoo Research.
The need for efficient and scalable numerical linear algebra and machine-learning implementations continues to grow with the increasing importance of big data analytics. Since its introduction, Apache Spark has become an integral tool in this field, with attractive features such as ease of use, interoperability with the Hadoop ecosystem, and fault tolerance. However, it has been shown that numerical linear algebra routines implemented using MPI, a tool for parallel programming commonly used in high-performance computing, can outperform the equivalent Spark routines by an order of magnitude or more. We describe Alchemist, a system for interfacing between Spark and existing MPI libraries that is designed to address this performance gap. The libraries can be called from a Spark application with little effort, and we illustrate how the resulting system leads to efficient and scalable performance on large datasets. Session hashtag: #Res1SAIS
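One reason bridging Spark and MPI takes work is that the two systems lay out a distributed matrix differently: Spark stores it as an RDD of row partitions, while MPI linear-algebra libraries typically expect contiguous block-rows assigned to each process. The sketch below illustrates that relayout step conceptually; the function name and layout choice are illustrative assumptions, not Alchemist's actual API.

```python
# Hedged sketch: repartition a matrix held as Spark-style row partitions
# into the contiguous block-row layout an MPI library might expect.
# `repartition_rows` is a hypothetical helper, not part of Alchemist.

def repartition_rows(spark_partitions, n_mpi_procs):
    """Flatten the row partitions, then split the rows into contiguous
    block-rows, one block per MPI process (rows are plain lists)."""
    rows = [row for part in spark_partitions for row in part]
    n = len(rows)
    base, extra = divmod(n, n_mpi_procs)
    blocks, start = [], 0
    for p in range(n_mpi_procs):
        size = base + (1 if p < extra else 0)  # spread the remainder
        blocks.append(rows[start:start + size])
        start += size
    return blocks

# Two Spark partitions of a 3 x 2 matrix, redistributed to 2 MPI processes.
blocks = repartition_rows([[[1, 2], [3, 4]], [[5, 6]]], 2)
```

In the real system this relayout also involves serializing the data and shipping it over the network between the Spark executors and the MPI processes, which is where much of the interfacing cost lies.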
Explore the trade-offs of performing linear algebra for data analysis and machine learning using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Apache Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. This session will examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). Learn how these methods are applied to terabyte-sized problems in particle physics, climate modeling and bioimaging, as use cases where interpretable analytics is of interest. The data matrices are tall-and-skinny, which enables the algorithms to map conveniently onto Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance for obtaining high performance. Based on joint work with Alex Gittens and many others. Session hashtag: #SFr2
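Why tall-and-skinny matrices fit the data-parallel model: for an n x d matrix with n huge and d small, PCA reduces to the small d x d Gram matrix A^T A, which is the sum of per-partition Gram matrices. That sum is exactly a map over partitions followed by a reduce. A minimal sketch of this pattern, simulated locally (the partition data and helper names are illustrative, not from the talk's code):

```python
# Hedged sketch: compute A^T A for a tall-and-skinny matrix as a sum of
# per-partition Gram matrices, i.e. a map + reduce over row partitions.
from functools import reduce

def gram(rows):
    """A_i^T A_i for one partition of rows (plain lists, d small)."""
    d = len(rows[0])
    g = [[0.0] * d for _ in range(d)]
    for r in rows:
        for i in range(d):
            for j in range(d):
                g[i][j] += r[i] * r[j]
    return g

def add(g1, g2):
    """Elementwise sum of two d x d matrices (the reduce step)."""
    return [[a + b for a, b in zip(r1, r2)] for r1, r2 in zip(g1, g2)]

# A 3 x 2 matrix split into two row partitions; in Spark this would be
# rdd.map(gram).reduce(add) over an RDD of partitions.
partitions = [[[1.0, 2.0], [3.0, 4.0]], [[5.0, 6.0]]]
gram_total = reduce(add, (gram(p) for p in partitions))
```

The eigendecomposition of the resulting d x d matrix then yields the principal components and is cheap enough to do on a single node, which is what makes the overall algorithm communication-light at scale.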