Matei Zaharia is an assistant professor of computer science at Stanford University and Chief Technologist at Databricks. He started the Spark project during his PhD at UC Berkeley in 2009. Before that, Matei worked broadly in datacenter systems, co-starting the Apache Mesos project and contributing as a committer on Apache Hadoop. Matei’s research was recognized through the 2014 ACM Doctoral Dissertation Award for the best PhD dissertation in computer science.
Successfully building and deploying a machine learning model can be difficult to do once. Enabling other data scientists (or yourself, one month later) to reproduce your pipeline, to compare the results of different versions, to track what's running where, and to redeploy and roll back updated models is much harder. In this talk, I'll introduce MLflow, a new open source project from Databricks that simplifies the machine learning lifecycle. MLflow provides APIs for tracking experiment runs between multiple users within a reproducible environment, and for managing the deployment of models to production. MLflow is designed to be an open, modular platform, in the sense that you can use it with any existing ML library and development process. MLflow was launched in June 2018 and has already seen significant community contributions, with 45 contributors and new features including multiple language APIs, integrations with popular ML libraries, and storage backends. I’ll go through some of the newly released features and explain how to get started with MLflow.
ML development brings many new complexities beyond the traditional software development lifecycle. Unlike in traditional software development, ML developers want to try multiple algorithms, tools and parameters to get the best results, and they need to track this information to reproduce work. In addition, developers need to use many distinct systems to productionize models. To address these problems, many companies are building custom "ML platforms" that automate this lifecycle, but even these platforms are limited to a few supported algorithms and to each company's internal infrastructure. In this talk, I present MLflow, a new open source project from Databricks that aims to design an open ML platform where organizations can use any ML library and development tool of their choice to reliably build and share ML applications. MLflow introduces simple abstractions to package reproducible projects, track results, and encapsulate models that can be used with many existing tools, accelerating the ML lifecycle for organizations of any size.
Over the past three years, Spark has quickly grown from a research project to one of the most active open source projects in parallel computing. I’ll go through a summary of recent growth, highlighting key contributions from across the community. At the same time, much remains to be done to make big data analysis truly accessible and fast. I’ll sketch how we at Databricks are approaching this problem through our continuing work on Apache Spark, and the aspects of the system that we believe make Spark truly unique for big data.
Apache Spark continues to grow quickly in both community size and technical capabilities. Since the last Spark Summit, in December 2013, Spark’s contributor base has grown from 100 contributors to more than 200, and Spark has become the most active open source project in big data. We’ve also seen significant new components added, such as the Spark SQL runtime, a larger machine learning library, and rich integration with other data processing systems. Given all this activity, where is Spark heading? I’ll share our goal of Spark as a unifying platform between the diverse applications (e.g. stream processing, machine learning and SQL) and diverse storage and runtime systems in big data.
As the Apache Spark userbase grows, the developer community is working to adapt it for ever-wider use cases. 2014 saw fast adoption of Spark in the enterprise and major improvements in its performance, scalability and standard libraries. In 2015, we also want to make Spark accessible to a wider set of users, through new high-level APIs targeted at data science: machine learning pipelines, data frames, and R language bindings. In addition, we are defining extension points to let Spark grow as a platform, making it easy to plug in data sources, algorithms, and third-party packages. Like all work on Spark, these APIs are designed to plug seamlessly into existing Spark applications, giving users a unified platform for streaming, batch and interactive data processing.
2015 was a year of continued growth for Spark, with numerous additions to the core project and very fast growth of use cases across the industry. In this talk, I'll look back at how the Spark community has grown and changed in 2015, based on a large Apache Spark user survey conducted by Databricks. We see some interesting trends in the diversity of runtime environments (which are increasingly not just Hadoop); the types of applications run on Spark; and the types of users, now that features like R support and DataFrames are available in Spark. I'll also cover the ongoing work in the upcoming releases of Spark to support new use cases.
The next release of Spark will be 2.0, marking a big milestone for the project. In this talk, I'll cover some of the large upcoming features that made us increase the version number to 2.0, as well as some of the roadmap for Spark in 2016.
The next release of Apache Spark will be 2.0, marking a big milestone for the project. In this talk, I'll cover how the community has grown to reach this point, and some of the major features in 2.0. The largest additions are performance improvements for Datasets, DataFrames and SQL through Project Tungsten, as well as a new Structured Streaming API that provides simpler and more powerful stream processing. I'll also discuss a bit of what's in the works for future versions.
Apache Spark 2.0 was released this summer and is already being widely adopted. I'll talk about how changes in the API have made it easier to write batch, streaming and realtime applications. The Dataset API, which is now integrated with DataFrames, makes it possible to benefit from powerful optimizations such as pushing queries into data sources, while the Structured Streaming extension to this API makes it possible to run many of the same computations in a streaming fashion automatically.
Big data remains a rapidly evolving field with new applications and infrastructure appearing every year. In this talk, I'll cover new trends in 2016 / 2017 and how Apache Spark is moving to meet them. In particular, I'll talk about work Databricks is doing to make Apache Spark interact better with native code (e.g. deep learning libraries), support heterogeneous hardware, and simplify production data pipelines in both streaming and batch settings through Structured Streaming.
2017 continues to be an exciting year for big data and Apache Spark. I will talk about two major initiatives that Databricks has been building: Structured Streaming, the new high-level API for stream processing, and new libraries that we are developing for machine learning. These initiatives can provide order-of-magnitude performance improvements over current open source systems while making stream processing and machine learning more accessible than ever before.
2017 continues to be an exciting year for Apache Spark. I will talk about new updates in two major areas in the Spark community this year: stream processing with Structured Streaming, and deep learning with high-level libraries such as Deep Learning Pipelines and TensorFlowOnSpark. In both areas, the community is making powerful new functionality available in the same high-level APIs used in the rest of the Spark ecosystem (e.g., DataFrames and ML Pipelines), and improving both the scalability and ease of use of stream processing and machine learning.