Matei Zaharia is an assistant professor of computer science at Stanford University and Chief Technologist at Databricks. He started the Spark project during his PhD at UC Berkeley in 2009. Before that, Matei worked broadly in datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Matei’s research was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science.
Data is the key ingredient to building high-quality, production AI applications. It comes in during the training phase, where more and higher-quality training data enables better models, as well as during the production phase, where understanding the model's behavior in production and detecting changes in the predictions and input data are critical to maintaining a production application. However, so far most data management and machine learning tools have been largely separate. In this presentation, I'll talk about several efforts from Databricks, in Apache Spark as well as other open source projects, to unify data and AI in order to make it significantly simpler to build production AI applications.
Over the past three years, Spark has quickly grown from a research project to one of the most active open source projects in parallel computing. I’ll go through a summary of recent growth, highlighting key contributions from across the community. At the same time, much remains to be done to make big data analysis truly accessible and fast. I’ll sketch how we at Databricks are approaching this problem through our continuing work on Apache Spark, and the aspects of the system that we believe make Spark truly unique for big data.
Apache Spark continues to grow quickly in both community size and technical capabilities. Since the last Spark Summit, in December 2013, Spark’s contributor base has grown from 100 contributors to more than 200, and Spark has become the most active open source project in big data. We’ve also seen significant new components added, such as the Spark SQL runtime, a larger machine learning library, and rich integration with other data processing systems. Given all this activity, where is Spark heading? I’ll share our goal of Spark as a unifying platform between the diverse applications (e.g. stream processing, machine learning and SQL) and diverse storage and runtime systems in big data.
As the Apache Spark userbase grows, the developer community is working to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability and standard libraries. In 2015, we also want to make Spark accessible to a wider set of users, through new high-level APIs targeted at data science: machine learning pipelines, data frames, and R language bindings. In addition, we are defining extension points to let Spark grow as a platform, making it easy to plug in data sources, algorithms, and third-party packages. Like all work on Spark, these APIs are designed to plug seamlessly into existing Spark applications, giving users a unified platform for streaming, batch and interactive data processing.
2015 was a year of continued growth for Spark, with numerous additions to the core project and very fast growth of use cases across the industry. In this talk, I'll look back at how the Spark community has grown and changed in 2015, based on a large Apache Spark user survey conducted by Databricks. We see some interesting trends in the diversity of runtime environments (which are increasingly not just Hadoop); the types of applications run on Spark; and the types of users, now that features like R support and DataFrames are available in Spark. I'll also cover the ongoing work in the upcoming releases of Spark to support new use cases.
The next release of Spark will be 2.0, marking a big milestone for the project. In this talk, I'll cover some of the large upcoming features that made us increase the version number to 2.0, as well as some of the roadmap for Spark in 2016.
The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I'll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I'll also discuss a bit of what's in the works for future versions.
Apache Spark 2.0 was released this summer and is already being widely adopted. I'll talk about how changes in the API have made it easier to write batch, streaming and realtime applications. The Dataset API, which is now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I'll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I'll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
2017 continues to be an exciting year for big data and Apache Spark. I will talk about two major initiatives that Databricks has been building: Structured Streaming, the new high-level API for stream processing, and new libraries that we are developing for machine learning. These initiatives can provide order-of-magnitude performance improvements over current open source systems while making stream processing and machine learning more accessible than ever before.
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.