Machine learning in the enterprise is an iterative process. A data scientist will tweak or replace her learning algorithm until she finds an approach that works for the business problem and the available data. Apache SystemML is a new system that accelerates this kind of exploratory algorithm development for large-scale machine learning problems. SystemML provides a high-level language for quickly implementing and running machine learning algorithms on Spark. SystemML’s cost-based optimizer takes care of low-level decisions about how to use Spark’s parallelism, allowing users to focus on the algorithm and the real-world problem it is trying to solve.

This talk will explain how SystemML automates the design decisions involved in translating a high-level algorithm into Spark API calls. The explanation will center on a three-line snippet of R code. We’ll start by describing several different ways that one could implement this code snippet on Spark, and show how, depending on the characteristics of the data and the Spark cluster, each of these approaches might work very well or not work at all. Then we’ll explain how SystemML’s optimizer enumerates these different execution strategies and chooses one that works. By the end of this process, we will have walked through how the code changes as it passes through each stage of SystemML’s compilation chain, finally reaching the SystemML runtime for Spark.

The talk will conclude with pointers to how the audience can try out Apache SystemML and learn more about the parts of SystemML’s optimizer that weren’t covered in the talk.
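To give a flavor of the kind of decision the optimizer automates, here is a deliberately simplified Python sketch, not SystemML's actual code, of how a cost-based planner might pick a Spark execution strategy for a matrix multiply based on operand sizes. All names and the 2 GB broadcast budget are illustrative assumptions.

```python
# Hypothetical sketch of size-based strategy selection for a matrix
# multiply A %*% B on Spark. Not SystemML's real planner; the names
# (Matrix, choose_strategy) and thresholds are invented for illustration.
from dataclasses import dataclass

BROADCAST_LIMIT = 2 * 1024**3  # assumed 2 GB in-memory/broadcast budget


@dataclass
class Matrix:
    rows: int
    cols: int

    def nbytes(self) -> int:
        # Dense, double-precision representation: 8 bytes per cell.
        return self.rows * self.cols * 8


def choose_strategy(a: Matrix, b: Matrix) -> str:
    """Pick an execution strategy for a %*% b using simple size rules."""
    if a.nbytes() + b.nbytes() <= BROADCAST_LIMIT:
        # Both operands fit in memory: compute locally on the driver
        # and avoid distributed-job overhead entirely.
        return "local"
    if b.nbytes() <= BROADCAST_LIMIT:
        # Broadcast the small right operand to every executor; each
        # partition of A is multiplied without a shuffle.
        return "broadcast"
    # Both operands are large: fall back to a shuffle-based
    # distributed matrix multiply.
    return "shuffle"


# Tiny inputs run locally; a tall-skinny times small matrix broadcasts;
# two huge matrices force a shuffle.
print(choose_strategy(Matrix(1000, 1000), Matrix(1000, 10)))      # local
print(choose_strategy(Matrix(10**7, 1000), Matrix(1000, 10)))     # broadcast
print(choose_strategy(Matrix(10**7, 1000), Matrix(1000, 10**6)))  # shuffle
```

The real optimizer reasons over many more factors (sparsity, cluster memory, operator fusion), but the core idea is the same: the user writes the algorithm once, and the system picks the physical plan.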
Fred Reiss is the Chief Architect at IBM's Center for Open-Source Data and AI Technologies in San Francisco. Fred received his Ph.D. from UC Berkeley in 2006, then worked for IBM Research Almaden for the next nine years. At Almaden, Fred worked on the SystemML and SystemT projects, as well as on the research prototype of DB2 with BLU Acceleration. Fred has over 25 peer-reviewed publications and six patents.