Spark + AI Summit 2019, the world’s largest data and machine learning conference for the Apache Spark™ Community, brought nearly 5000 registered data scientists, engineers, and business leaders to San Francisco’s Moscone Center to find out what’s coming next. Watch the keynote recordings today and learn more about the latest product announcements for Apache Spark, MLflow, and our newest open source addition, Delta Lake!
The Spark + AI Summit keynotes included several major product announcements. Reynold Xin, Apache Spark PMC member and number-one code committer to Spark, opened the summit by presenting the upcoming work planned for Spark 3.0 later this year, with more than 1000 improvements, features, and bug fixes, ranging from Hydrogen-accelerator aware scheduling, Spark Graph, Spark on Kubernetes, ANSI-SQL Parser, and many more.
Reynold also announced a new release bringing the Pandas DataFrame API to Spark, under a new open-source project called Koalas. Pandas has long been the Python standard to manipulate and analyze data, particularly for small and medium-sized datasets, and the project opens up a more frictionless progression to large data sets on Spark. With compatible API syntax, data scientists trained on Pandas can now use Koalas to transition easily to working on larger, distributed data sets geared for production environments.
Ali Ghodsi, CEO and Co-founder of Databricks, announced the open-source release of Delta Lake, a storage layer that brings increased reliability and quality to data lakes. Previously, data lakes frequently faced garbage-in-garbage-out issues that made data quality too low for data science and machine learning, resulting in large, expensive “data swamps.” This project brings a suite of new features to data lakes, including ACID transactions, schema enforcement, and even data time travel, which help ensure data integrity for downstream analytics and projects. Customers who previously used Databricks Delta gave extremely positive feedback for the core problems that it solved for them, and we’re excited to open-source the project for the larger community. The ecosystem of Apache Spark, MLflow, and now Delta Lake, continues to expand to solve end-to-end data and ML challenges.
Rohan Kumar, CVP Azure Data from Microsoft, made a number of announcements:
- Azure Machine Learning support for MLflow, with Microsoft joining as an open-source supporter of MLflow
- .NET support for Apache Spark, to bring more developers into the Spark ecosystem
Matei Zaharia, Databricks Chief Technologist, announced the next new set of components for the open-source MLflow project with MLflow Workflows and MLflow Model Registry. These modules further extend management of the end-to-end machine learning lifecycle with multistep pipelines and model management. Matei also announced the upcoming work around MLflow 1.0, with a stabilized API for long-term usage and additional feature releases. Managed MLflow is also now Generally Available on AWS and Azure.
In addition to the product announcements was an extensive lineup of keynotes from industry luminaries, across a variety of topics. Turing Award winner David Patterson shared insights on the golden age of computer architectures and his vision for more open designs. Netflix VP of Data Science and Analytics Caitlin Smallwood presented on how Netflix uses data throughout the company, from predicting what content Netflix viewers will want to watch to investing in content production itself. Michael I. Jordan shared principles around human-centered AI and how a marketplace strategy to applying algorithms and data could create greater economic value. Timnit Gebru of Google Brain and Black in AI, spoke about biases and error rates found for different gender and racial groups in data and AI systems and the need for improved standards around such systems. Google’s Anitha Vijayakumar shared some of the latest features and developments with TensorFlow 2.0. Jitendra Malik of Facebook AI Research gavean overview into the rapid advances in visual understanding and computer vision with deep-learning techniques.
Jan Neumann and Jim Forsythe from Comcast shared how they architected a scalable data and machine learning platform with Delta Lake, MLflow, and Databricks to support their AI-powered voice remotes, including enriching petabytes of user session data and feeding it into an agile environment for model development and deployment. Check out the keynote recordings to hear perspectives from all the speakers.
Sessions and Trainings
Deeper technical content and tutorials continued outside the keynotes, with more than 170 sessions across over a dozen tracks, with speakers from Yelp, AirBNB, Lyft, Nike, Starbucks, Optum, FINRA, IBM, Verizon, Tencent, and many others. Topics varied widely, including stream processing, quality monitoring, testing, model serving, TensorFlow, SciKit-Learn, Keras, PyTorch, data pipelines, migrations, NLP, data governance, performance optimization, predictive modeling, graph algorithms, and many more. Sold out hands-on trainings dived deep into data engineering, data science, Spark programming, certification, deep learning, and machine learning.
Women in Unified Analytics
Female leaders shared their perspectives across a series of events sponsored by the Databricks Diversity Committee. With speakers from Netflix, Stanford, LinkedIn, Workday, Apple, Google, and more, women brought thought leadership and perspectives to attendees at the Bay Area Spark Meetup, Women’s Breakfast, and Lunch Tech Talk + Panel.
Spark+AI Summit 2019 keynote videos are now available to see the newest product announcements and thought leadership. Follow @Databricks on Twitter or subscribe to our newsletter if you want to be notified whenever new content becomes available. You can also learn Apache Spark today on our free Databricks Community Edition or build a production Spark application with a free Databricks trial. As always, thanks for your support, and we look forward to seeing you in Spark + AI Summit Europe in Amsterdam in October and next year as well!