Apache Spark continues to grow quickly in both community size and technical capabilities. Since the last Spark Summit, in December 2013, Spark’s contributor base has grown from 100 contributors to more than 200, and Spark has become the most active open source project in big data. We’ve also seen significant new components added, such as the Spark SQL runtime, a larger machine learning library, and rich integration with other data processing systems. Given all this activity, where is Spark heading? I’ll share our goal of Spark as a unifying platform between the diverse applications (e.g. stream processing, machine learning and SQL) and diverse storage and runtime systems in big data.
Matei Zaharia is an Assistant Professor of Computer Science at Stanford University and Chief Technologist at Databricks. He started the Apache Spark project during his PhD at UC Berkeley in 2009, and has worked broadly in datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Today, Matei tech-leads the MLflow development effort at Databricks in addition to other aspects of the platform. Matei’s research work was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science, an NSF CAREER Award, and the US Presidential Early Career Award for Scientists and Engineers (PECASE).