Ali Ghodsi

Co-founder & CEO, Original Creator of Apache Spark, Databricks

Ali Ghodsi is the CEO and co-founder of Databricks, responsible for the growth and international expansion of the company. He previously served as the VP of Engineering and Product Management before taking the role of CEO in January 2016. In addition to his work at Databricks, Ali serves as an adjunct professor at UC Berkeley and is on the board at UC Berkeley’s RiseLab. Ali was one of the original creators of the open source project Apache Spark, and ideas from his academic research in the areas of resource management, scheduling, and data caching have been applied to Apache Mesos and Apache Hadoop. Ali received his MBA from Mid-Sweden University in 2003 and his PhD from KTH/Royal Institute of Technology in Sweden in 2006 in the area of Distributed Computing.

 

Past sessions

Summit 2021 Keynotes: Data Science and Machine Learning

May 27, 2021 08:30 AM PT

The pursuit of AI is one of the biggest priorities in data today. The Thursday morning keynote will be led by Databricks co-founder and CEO Ali Ghodsi and cover advances in data science, machine learning, MLOps and more in both open source and the Databricks Lakehouse Platform.

We’ll also be joined by data leaders from McDonald’s and Microsoft, as well as the legendary Bill Nye, a scientist, engineer, comedian and author.

Join the Wednesday morning keynote to hear from Databricks co-founders and original creators of popular projects Apache Spark, Delta Lake, and MLflow on how the open source community is tackling the biggest challenges in data.

Stay tuned for them to reveal some of the latest innovations in data engineering and data analytics to simplify and scale your work.

Thursday Morning Keynote

November 18, 2020 04:00 PM PT

Welcome from Ali Ghodsi, Databricks


Taking Machine Learning to Production with New Features in MLflow

Matei Zaharia
Assistant Professor of Computer Science Original Creator of Apache Spark & MLflow, Databricks

Deploying and operating machine learning applications is challenging because they are highly dependent on input data and can fail in complex ways. Problems such as training/inference differences in data format, data skew, and misconfigured software environments can easily sneak into a production application and impact its quality. To address these types of problems, organizations are adopting ML Platform software and MLOps practices specifically for managing machine learning applications.

In this talk, I’ll present some of the latest functionality added for productionizing machine learning in MLflow, the popular open source machine learning platform started by Databricks in 2018. These include built-in support for model management and review using the Model Registry, APIs for automatic Continuous Integration and Delivery (CI/CD), model schemas to catch differences in a model’s expected data format, and integration with model explainability tools. I’ll also talk about other work happening in the open source MLflow community, including deep integration with PyTorch and its growing ecosystem of model productionization tools.


Demo: CI/CD and MLOps with MLflow

Kasey Uhlenhuth
Sr Product Manager, Machine Learning, Databricks


PyTorch and MLflow, from Research to Production

Lin Qiao
Engineering Director, PyTorch, Facebook

Lin Qiao, engineering director on the Facebook AI team, talks about bringing machine learning to production at scale, including the PyTorch integration with MLflow. She talks about the guiding principles for PyTorch and the goals set back in 2016 during initial development through the present day, with a focus on ecosystem compatibility.

Lin reviews the PyTorch production ecosystem and discusses how MLflow and PyTorch are integrated for tracking, models and model serving.


Introducing the Next Generation Data Science Workspace

Clemens Mewald
Director of Product Management, Data Science and Machine Learning, Databricks

It is no longer a secret that data driven insights and decision making are essential in any company’s strategy to keep up with today’s rapid pace of change and remain relevant. Although we take this realization for granted, we are still in the very early stage of enabling data teams to deliver on their promise. One of the reasons is that we haven’t equipped this profession with the modern toolkit they deserve.

Existing solutions leave data teams with impossible trade-offs. Giving Data Scientists the freedom to use any open source tools on their laptops doesn’t provide a clear path to production and governance. Simply hosting those same tools in the Cloud may solve some of the data privacy and security issues, but doesn’t improve productivity nor collaboration. On the other hand, most robust and scalable production environments hinder innovation and experimentation by slowing Data Scientists down.

In this talk, we will give an update on the next generation Data Science Workspace on Databricks, originally unveiled at Spark + AI Summit 2020. Specifically, we will cover new capabilities added to Databricks Notebooks as well as Git-based Databricks Projects. Until now, the industry has assumed that collaborative notebooks are for experimentation only, and not for production. Our approach addresses these challenges and, for the first time, provides a single platform for data teams to rapidly and confidently move from experimentation to production.

In this talk, we will unveil the next generation of the Databricks Data Science Workspace: An open and unified experience for modern data teams specifically designed to address these hard tradeoffs. We will introduce new features that leverage the open source tools you are familiar with to give you a laptop-like experience that provides the flexibility to experiment and the robustness to create reliable and reproducible production solutions.


Discussion with Daimler

Stephan Schwarz
Production Planning: Manager Smart Data Processing (Mercedes Operations), Daimler

Sebastian Findeisen
Data Scientist, Daimler

When we think about luxury cars, what first comes to mind is often the end product: the sleek design, how fast it goes, and so on. But we often overlook the enormous amount of effort it takes before that car rolls off the assembly line. In this talk, Daimler will give us a peek into how data and ML are playing a critical role in driving car production automation, with MLOps and tools like MLflow being leveraged to automate a number of complex processes and provide insights that create production efficiencies.


Responsible ML – Bringing Accountability to Data Science Keynote

Rohan Kumar
Corporate Vice President, Azure Data, Microsoft

Responsible ML is the most talked-about field in AI at the moment. With the growing importance of ML, it is even more important for us to exercise ethical AI practices and ensure that the models we create live up to the highest standards of inclusiveness and transparency. Join Rohan Kumar as he talks about how Microsoft brings cutting-edge research into the hands of customers to make them more accountable for their models and responsible in their use of AI. For the AI community, this is an open invitation to collaborate and contribute to shaping the future of Responsible ML. This keynote is brought to you as an encore presentation from the global Summit.


Demo: Azure Tools for Responsible AI

Sarah Bird
Principal Program Manager, Microsoft Azure AI


Pursuing the Extraordinary: A Data Revolution

Keynote from Mae Jemison
First woman of color in the world to go into space, former NASA astronaut

An exploration of the opportunities and obstacles encountered, and the clarity of purpose needed, to achieve an extraordinary future (such as human interstellar travel or a sustainable human existence on planet Earth), and of the roles that big data and advancing IT can play.

Wednesday Morning Keynote

November 17, 2020 04:00 PM PT

Welcome from Ali Ghodsi, Databricks


Project Zen: Making Spark Pythonic

Reynold Xin
Co-founder & Chief Architect, Databricks

In this keynote from Reynold Xin, the top contributor to Apache Spark and PMC member, we will review the state of the project and highlight major community developments in the 10th anniversary release and beyond. Reynold will review how the recent Spark 3.0 release focused on making it easier to use, faster, and more ANSI standard compliant. With Python representing nearly 70% of notebook commands, he’ll focus on the development of Project Zen - the community effort to make Spark more Pythonic. This includes improvements in development tooling, API design, error handling and more, to make data scientists and engineers more productive with data.


Demo: Pythonic Spark with Real Koalas

Caryl Yuhas
Sr. Manager, Field Engineering, Databricks


The Rise of the Lakehouse

Ali Ghodsi
Co-founder & CEO, Original Creator of Apache Spark, Databricks

Data warehouses have a long history in decision support and business intelligence applications. But data warehouses were not well suited to dealing with the unstructured, semi-structured, and streaming data common in modern enterprises. This led organizations to start building data lakes of raw data about a decade ago. But data lakes also lacked important capabilities. The need for a better solution has given rise to lakehouse architecture, which implements similar data structures and data management features to those in a data warehouse, directly on the kind of low cost storage used for data lakes.

This keynote by Databricks CEO Ali Ghodsi explains how the open source Delta Lake project allows the industry to realize the full potential of lakehouse architecture. Additionally, Ali will discuss the newly announced SQL Analytics service that allows users to run traditional analytics on their data lake, instead of moving data out to data warehouses, without sacrificing performance, security, or quality. This service completes the vision of lakehouse architecture, allowing the data lake to be a single source of truth for all data workloads.


Discussion with Tableau Software

Francois Ajenstat
Chief Product Officer, Tableau Software


Demo: SQL Analytics and the Lakehouse Architecture

Brooke Wenig
Machine Learning Practice Lead, Databricks


How SQL Analytics Makes Lakehouse Fast

Reynold Xin
Co-founder & Chief Architect, Databricks

In this keynote, Reynold Xin, Co-founder and Chief Architect at Databricks, will explore how SQL Analytics brings a new level of performance to data lakes for analytics workloads. Traditionally, data lakes have struggled with analytics because they cannot deliver fast query performance with low latency at high user concurrency. Reynold will provide a technical deep dive into how Databricks has addressed these challenges. First, Delta Engine, Databricks' polymorphic vectorized execution engine, delivers extremely fast single query throughput. Second, the new auto-scaling SQL-optimized clusters in SQL Analytics make it easy to match compute capacity to user load. And third, optimizations in the new SQL Analytics Endpoints reduce the time required to get query results by up to 6x. Altogether, SQL Analytics is able to provide users with data warehousing performance at data lake economics for their analytics workloads.


Discussion with Peter Boncz

Professor, CWI & Vrije Universiteit Amsterdam


Discussion with Unilever

Phinean Woodward
Head of Architecture, Information and Analytics, Unilever

In this talk, we’ll discuss how the Lakehouse architecture has become a critical part of Unilever’s information management infrastructure to limit traditional enterprise data silos, and enable agile access to data both up and downstream that’s needed for faster decision making. As a result, IT is helping Unilever to deliver higher quality predictions in many areas of the business, thereby building trust in AI throughout the company.


Why Data Should Drive the Next Pandemic Response

Malcolm Gladwell
Best-selling author, journalist, and podcast host

Imagine what a data-driven response to the Covid-19 pandemic would have looked like — if we could set aside politics and ego. Award-winning author and journalist Malcolm Gladwell discusses the lessons we can learn from the current crisis, and how data and data teams will be critical in solving the world’s toughest problems – including future pandemic outbreaks. He also reveals the essential role that data teams play in his own work every day.


Close

Ali Ghodsi

Summit 2020 Spark + AI Summit 2020: Thursday Morning Keynotes

June 24, 2020 05:00 PM PT

Clemens Mewald - Next Generation Data Science Workspace (Databricks) - 9:06
Lauren Richie - DEMO: Next Generation Data Science Workspace (Databricks) - 17:55
Matei Zaharia - MLflow Community and Product Updates (Databricks) - 27:40
Sue Ann Hong - DEMO: MLflow (Databricks) - 42:57
Rohan Kumar - Responsible ML (Microsoft) - 51:52
Sarah Bird - DEMO: Responsible ML (Microsoft) - 1:00:21
Anurag Sehgal - Data and AI (Credit Suisse) - 1:12:58


Introducing the Next Generation Data Science Workspace
Ali Ghodsi, Clemens Mewald and Lauren Richie

It is no longer a secret that data driven insights and decision making are essential in any company’s strategy to keep up with today’s rapid pace of change and remain relevant. Although we take this realization for granted, we are still in the very early stage of enabling data teams to deliver on their promise. One of the reasons is that we haven’t equipped this profession with the modern toolkit they deserve.

Existing solutions leave data teams with impossible trade-offs. Giving Data Scientists the freedom to use any open source tools on their laptops doesn’t provide a clear path to production and governance. Simply hosting those same tools in the Cloud may solve some of the data privacy and security issues, but doesn’t improve productivity nor collaboration. On the other hand, most robust and scalable production environments hinder innovation and experimentation by slowing Data Scientists down.

In this talk, we will unveil the next generation of the Databricks Data Science Workspace: An open and unified experience for modern data teams specifically designed to address these hard tradeoffs. We will introduce new features that leverage the open source tools you are familiar with to give you a laptop-like experience that provides the flexibility to experiment and the robustness to create reliable and reproducible production solutions.


Simplifying Model Development and Management with MLflow
Matei Zaharia and Sue Ann Hong

As organizations continue to develop their machine learning (ML) practice, the need for robust and reliable platforms capable of handling the entire ML lifecycle is becoming crucial for successful outcomes. Building models is difficult enough to do once, but deploying them into production in a reproducible, agile, and predictable way is exponentially harder due to the dependencies on parameters, environments, and the ever changing nature of data and business needs.

Introduced by Databricks in 2018, MLflow is the most widely used open source platform for managing the full ML lifecycle. With over 2 million PyPI downloads a month and over 200 contributors, the growing support from the developer community demonstrates the need for an open source approach to standardize tools, processes, and frameworks involved throughout the ML lifecycle. MLflow significantly simplifies the complex process of standardizing MLOps and productionizing ML models. In this talk, we’ll cover what’s new in MLflow, including simplified experiment tracking, new innovations to the model format to improve portability, new features to manage and compare model schemas, and new capabilities for deploying models faster.


Responsible ML - Bringing Accountability to Data Science
Rohan Kumar and Sarah Bird

Responsible ML is the most talked-about field in AI at the moment. With the growing importance of ML, it is even more important for us to exercise ethical AI practices and ensure that the models we create live up to the highest standards of inclusiveness and transparency. Join Rohan Kumar as he talks about how Microsoft brings cutting-edge research into the hands of customers to make them more accountable for their models and responsible in their use of AI. For the AI community, this is an open invitation to collaborate and contribute to shaping the future of Responsible ML.


How Credit Suisse Is Leveraging Open Source Data and AI Platforms to Drive Digital Transformation, Innovation and Growth
Anurag Sehgal

Despite the increasing embrace of big data and AI, most financial services companies still experience significant challenges around data types, privacy, and scale. Credit Suisse is overcoming these obstacles by standardizing on open, cloud-based platforms, including Azure Databricks, to increase the speed and scale of operations, and the democratization of ML across the organization. Now, Credit Suisse is leading the way by successfully employing data and analytics to drive digital transformation, delivering new products to market faster, and driving business growth and operational efficiency.

Summit 2020 Spark + AI Summit 2020: Wednesday Morning Keynotes

June 23, 2020 05:00 PM PT

Ali Ghodsi - Intro to Lakehouse, Delta Lake (Databricks) - 46:40
Matei Zaharia - Spark 3.0, Koalas 1.0 (Databricks) - 17:03
Brooke Wenig - DEMO: Koalas 1.0, Spark 3.0 (Databricks) - 35:46
Reynold Xin - Introducing Delta Engine (Databricks) - 1:01:50
Arik Fraimovich - Redash Overview & DEMO (Databricks) - 1:27:25
Vish Subramanian - Brewing Data at Scale (Starbucks) - 1:39:50


Realizing the Vision of the Data Lakehouse
Ali Ghodsi

Data warehouses have a long history in decision support and business intelligence applications. But data warehouses were not well suited to dealing with the unstructured, semi-structured, and streaming data common in modern enterprises. This led organizations to start building data lakes of raw data about a decade ago. But data lakes also lacked important capabilities. The need for a better solution has given rise to the data lakehouse, which implements similar data structures and data management features to those in a data warehouse, directly on the kind of low cost storage used for data lakes.

This keynote by Databricks CEO, Ali Ghodsi, explains why the open source Delta Lake project takes the industry closer to realizing the full potential of the data lakehouse, including new capabilities within the Databricks Unified Data Analytics platform to significantly accelerate performance. In addition, Ali will announce new open source capabilities to collaboratively run SQL queries against your data lake, build live dashboards, and alert on important changes to make it easier for all data teams to analyze and understand their data.


Introducing Apache Spark 3.0: A Retrospective of the Last 10 Years, and a Look Forward to the Next 10 Years to Come
Matei Zaharia and Brooke Wenig

In this keynote from Matei Zaharia, the original creator of Apache Spark, we will highlight major community developments with the release of Apache Spark 3.0 to make Spark easier to use, faster, and compatible with more data sources and runtime environments. Apache Spark 3.0 continues the project’s original goal to make data processing more accessible through major improvements to the SQL and Python APIs and automatic tuning and optimization features to minimize manual configuration. This year is also the 10-year anniversary of Spark’s initial open source release, and we’ll reflect on how the project and its user base have grown, as well as how the ecosystem around Spark (e.g. Koalas, Delta Lake and visualization tools) is evolving to make large-scale data processing simpler and more powerful.


Delta Engine: High Performance Query Engine for Delta Lake
Reynold Xin


How Starbucks is Achieving its 'Enterprise Data Mission' to Enable Data and ML at Scale and Provide World-Class Customer Experiences
Vish Subramanian

Starbucks makes sure that everything we do is through the lens of humanity – from our commitment to the highest quality coffee in the world, to the way we engage with our customers and communities to do business responsibly. A key aspect of ensuring those world-class customer experiences is data. This talk highlights the Enterprise Data Analytics mission at Starbucks, which helps make decisions powered by data at tremendous scale. This includes everything from processing data at petabyte scale with governed processes to deploying platforms at the speed of business and enabling ML across the enterprise. This session will detail how Starbucks has built world-class Enterprise data platforms to drive world-class customer experiences.

In this talk, we will highlight the opportunity data presents to tackle the world’s toughest problems. Despite that promise, most data teams are challenged by data, technology, and organizational silos. Unified Data Analytics presents a radically different approach to unlocking the potential of data by unifying all your data with your analytics, from Business Intelligence to Machine Learning.

Summit 2019 Ali Ghodsi, Michael Armbrust | Delta Lake

April 23, 2019 05:00 PM PT

Ali Ghodsi (Databricks), Michael Armbrust (Databricks) - Keynote from Spark + AI Summit 2019

Summit Europe 2018 The Power of Unified Analytics – EU Keynote

October 15, 2021 01:12 PM PT

Summit 2018 Fireside Chat with Marc Andreessen and Ali Ghodsi

June 5, 2018 05:00 PM PT

Summit 2018 The Power of Unified Analytics – NA Keynote

June 5, 2018 05:00 PM PT

Summit 2017 Databricks Keynote

June 6, 2017 05:00 PM PT

Summit East 2016 Democratizing Access to Data

February 16, 2016 04:00 PM PT

Databricks' vision is to make big data simple for the enterprise. In this keynote, Databricks co-founder and CEO Ali Ghodsi will announce the beta release of Databricks Community Edition, a free version of our cloud-based Spark platform with the goal of making Spark easy to learn and accessible to the masses.

Summit 2016 Disrupting Big Data with Apache Spark in the Cloud

June 7, 2016 05:00 PM PT

Summit Europe 2016 Democratizing AI with Apache Spark

October 26, 2016 05:00 PM PT

Summit Europe 2017 Announcing Databricks Delta

October 24, 2017 05:00 PM PT

Databricks CEO Ali Ghodsi introduces Databricks Delta, a new data management system that combines the scale and cost-efficiency of a data lake, the performance and reliability of a data warehouse, and the low latency of streaming.

Learn more:

  • Databricks Delta Guide
  • Databricks Delta: A Unified Data Management System for Real-time Big Data