by Erika Ehrli, Prem Prakash and Sheldon D'Paiva
Data teams have never been more important to the world. Over the past few years we've seen many of our customers building a new generation of data and AI applications that are transforming every industry with the lakehouse.
The data lakehouse paradigm introduced by Databricks is the future for modern data teams seeking to build solutions that unify analytics, data engineering, machine learning, and streaming workloads across clouds on one simple, open data platform.
Many of our customers, from enterprises to startups across the globe, love and trust Databricks. In fact, half of the Fortune 500 are seeing the lakehouse drive impact. Organizations like John Deere, Amgen, AT&T, Northwestern Mutual and Walgreens are making the move to the lakehouse because of its ability to deliver analytics and machine learning on both structured and unstructured data.
Last month we unveiled innovation across the Databricks Lakehouse Platform to a sold-out crowd at the annual Data + AI Summit. Throughout the conference, we announced several contributions to popular data and AI open source projects as well as new capabilities across workloads.
Delta Lake is the fastest and most advanced multi-engine storage format. We've seen incredible success and adoption thanks to the reliability and performance it provides. Today, Delta Lake is the most widely used storage layer in the world, with over 7 million monthly downloads, a 10x increase in just one year.
We announced that Databricks will contribute all features and enhancements it has made to Delta Lake to the Linux Foundation and open source all Delta Lake APIs as part of the Delta Lake 2.0 release.
Delta Lake 2.0 will bring unmatched query performance to all Delta Lake users and enable everyone to build a highly performant data lakehouse on open standards. With this contribution, Databricks customers and the open source community will benefit from the full functionality and enhanced performance of Delta Lake 2.0. The Delta Lake 2.0 Release Candidate is now available and is expected to be fully released later this year. The breadth of the Delta Lake ecosystem makes it flexible and powerful in a wide range of use cases.
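For readers new to Delta Lake, the core API is simply the familiar Spark DataFrame reader and writer. Below is a minimal sketch, assuming a local Spark session with the open source delta-spark package installed; the table path and sample data are hypothetical.

```python
from delta import configure_spark_with_delta_pip
from pyspark.sql import SparkSession

# Configure a local Spark session with the Delta Lake extensions
# (pip install delta-spark pyspark).
builder = (
    SparkSession.builder.appName("delta-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

# Write a DataFrame as a Delta table: Delta stores data as Parquet files
# plus a transaction log, which is what enables ACID guarantees and time travel.
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.write.format("delta").mode("overwrite").save("/tmp/events")

# Read the latest version back, or an earlier version via time travel.
latest = spark.read.format("delta").load("/tmp/events")
first_version = (
    spark.read.format("delta").option("versionAsOf", 0).load("/tmp/events")
)
```

Because the table is just Parquet plus an open transaction log, any engine in the ecosystem that speaks the Delta protocol can read it, which is what makes the format multi-engine.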
As the leading unified engine for large-scale data analytics, Spark scales seamlessly to handle data sets of all sizes. However, the lack of built-in remote connectivity, and the requirement that applications be developed and run on the driver node, make it hard to meet the needs of modern data applications. To tackle this, Databricks introduced Spark Connect, a client-server interface for Apache Spark™ based on the DataFrame API that decouples the client from the server for better stability and allows for built-in remote connectivity. With Spark Connect, users can access Spark from any device.
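To illustrate the idea, here is a hedged sketch of what connecting through Spark Connect looks like from Python. The endpoint and port are hypothetical, and the client API shown follows the DataFrame-based design announced at the Summit.

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark cluster through a Spark Connect endpoint
# (hypothetical host and port). The thin client talks to the server over
# the wire, so the application no longer has to run on the driver node.
spark = (
    SparkSession.builder
    .remote("sc://spark-connect.example.com:15002")
    .getOrCreate()
)

# From here the DataFrame API looks the same as a local session: logical
# plans are sent to the server, which executes them and streams results back.
spark.range(10).filter("id % 2 = 0").show()
```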
Data streaming on the lakehouse is one of the fastest-growing workloads within the Databricks Lakehouse Platform and is the future of all data processing. In collaboration with the Spark community, Databricks also announced Project Lightspeed, the next generation of the Spark Structured Streaming engine for data streaming on the lakehouse.
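Project Lightspeed targets the engine underneath Structured Streaming rather than the user-facing API. As a point of reference, here is a minimal Structured Streaming sketch using the built-in rate source, which generates timestamped rows for testing; the trigger interval is arbitrary.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("streaming-demo").getOrCreate()

# The built-in "rate" source emits a steady stream of timestamped rows,
# which is handy for trying out the streaming API without real data.
stream = (
    spark.readStream.format("rate")
    .option("rowsPerSecond", 10)
    .load()
)

# Print each micro-batch to the console. Lightspeed's improvements
# (latency, connectors, observability) land beneath this same API.
query = (
    stream.writeStream
    .format("console")
    .outputMode("append")
    .trigger(processingTime="5 seconds")
    .start()
)
query.awaitTermination()
```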
For organizations, governance, security, and compliance are critical: they help guarantee that all data assets are maintained and managed securely across the enterprise and that the company complies with regulatory frameworks. Databricks announced several new capabilities that further expand governance, security, and compliance across the platform.
Data sharing has become important in the digital economy as enterprises wish to easily and securely exchange data with their customers, partners, suppliers and internal lines of business to better collaborate and unlock value from that data. To address the limitations of existing data sharing solutions, Databricks developed Delta Sharing, with various contributions from the OSS community, and donated it to the Linux Foundation. We announced that Delta Sharing will be generally available in the coming weeks.
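Because Delta Sharing is an open protocol, a recipient doesn't need a Databricks account to consume shared data. Below is a minimal sketch using the open source Python client (pip install delta-sharing); the profile file and table names are hypothetical placeholders issued by the data provider.

```python
import delta_sharing

# A profile file holds the credentials the provider shares with you.
# The table URL format is <profile-path>#<share>.<schema>.<table>.
table_url = "profile.share#retail_share.sales.transactions"

# Load a shared table directly into a pandas DataFrame...
df = delta_sharing.load_as_pandas(table_url)
print(df.head())

# ...or as a Spark DataFrame for larger tables (requires the
# Delta Sharing Spark connector on the cluster):
# df = delta_sharing.load_as_spark(table_url)
```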
Databricks is helping customers share data and collaborate across organizational boundaries, and we also unveiled enhancements to data sharing enabled by Databricks Marketplace and Data Cleanrooms.
Data warehousing is one of the most business-critical workloads for data teams. Databricks SQL (DBSQL) is a serverless data warehouse on the Databricks Lakehouse Platform that lets you run all your SQL and BI applications at scale with up to 12x better price/performance, a unified governance model, open formats and APIs, and your tools of choice, with no lock-in. Databricks unveiled new data warehousing capabilities in the platform to further enhance analytics workloads.
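One of those tools of choice is plain Python. Here is a sketch of querying a Databricks SQL warehouse with the databricks-sql-connector package; the hostname, HTTP path, token, and table are placeholders for your own workspace's values.

```python
from databricks import sql

# Connect to a SQL warehouse using workspace-specific values
# (all three arguments below are placeholders).
with sql.connect(
    server_hostname="your-workspace.cloud.databricks.com",
    http_path="/sql/1.0/warehouses/abc123",
    access_token="dapi...",
) as connection:
    with connection.cursor() as cursor:
        # Run a query against a hypothetical table and fetch the results.
        cursor.execute("SELECT order_id, amount FROM sales.orders LIMIT 10")
        for row in cursor.fetchall():
            print(row)
```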
Tens of millions of production workloads run daily on Databricks. With the Databricks Lakehouse Platform, data engineers have access to an end-to-end data engineering solution for ingesting and transforming batch and streaming data, orchestrating reliable production workflows at scale, and increasing the productivity of data teams with built-in data quality testing and support for software development best practices.
We recently announced general availability on all three clouds of Delta Live Tables (DLT), the first ETL framework to use a simple, declarative approach to building reliable data pipelines. Since its launch earlier this year, Databricks has continued to expand DLT with new capabilities. We are excited to announce that we are developing Enzyme, a performance optimization purpose-built for ETL workloads. Enzyme efficiently keeps a materialization of a given query's results up to date in a Delta table. It uses a cost model to choose among various techniques, including those used in traditional materialized views, delta-to-delta streaming, and the manual ETL patterns commonly used by our customers. Additionally, DLT now offers enhanced autoscaling, purpose-built to intelligently scale resources with the fluctuations of streaming workloads, and support for Slowly Changing Dimensions Type 2 (SCD Type 2), which tracks every change in source data for both compliance and machine learning experimentation purposes. When dealing with change data capture (CDC), you often need to update records to keep track of the most recent data; SCD Type 2 applies those updates to a target table in a way that preserves the original records.
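To make the SCD Type 2 behavior concrete, here is a hedged sketch of a DLT pipeline in Python. Table names, paths, and columns are illustrative, and the code assumes it runs inside a DLT pipeline, where the dlt module and the spark session are provided.

```python
import dlt
from pyspark.sql.functions import col

# Ingest the raw change feed with Auto Loader (the landing path is
# hypothetical); each record carries a key and an update timestamp.
@dlt.table(comment="Raw CDC feed for customers")
def customers_cdc():
    return (
        spark.readStream.format("cloudFiles")
        .option("cloudFiles.format", "json")
        .load("/data/customers_cdc/")
    )

# Declare the target table that will hold the full history of every record.
dlt.create_streaming_live_table("customers_history")

# Apply the change feed as SCD Type 2: instead of overwriting a row, DLT
# closes out the old version and inserts the new one, preserving history.
dlt.apply_changes(
    target="customers_history",
    source="customers_cdc",
    keys=["customer_id"],
    sequence_by=col("updated_at"),
    stored_as_scd_type=2,
)
```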
We also recently announced general availability on all three clouds of Databricks Workflows, the fully managed lakehouse orchestration service for all your teams to build reliable data, analytics and AI workflows on any cloud. Since its launch earlier this year, Databricks has continued to expand Databricks Workflows with new capabilities, including Git support for Workflows (now available in Public Preview), running dbt projects in production, a new SQL task type in Jobs, a new "Repair and Rerun" capability in Jobs, and context sharing between tasks.
Databricks Machine Learning on the lakehouse provides end-to-end machine learning capabilities from data ingestion and training to deployment and monitoring, all in one unified experience, creating a consistent view across the ML lifecycle and enabling stronger team collaboration. We continue to innovate across the ML lifecycle to help you put models into production faster.
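Much of that lifecycle is anchored by MLflow, which is built into Databricks Machine Learning. As a small illustration, here is a sketch of tracking a training run and registering the resulting model; the dataset, metric, and registry name are illustrative.

```python
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Track the run with MLflow: parameters, metrics, and the model artifact
# are logged so the run is reproducible and auditable.
with mlflow.start_run():
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_metric("train_accuracy", model.score(X, y))

    # Registering the model makes it visible in the Model Registry,
    # where it can be promoted from staging to production.
    mlflow.sklearn.log_model(
        model,
        artifact_path="model",
        registered_model_name="iris_classifier",  # hypothetical registry name
    )
```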
Modern data teams need innovative data architectures to meet the requirements of the next generation of data and AI applications. The lakehouse paradigm provides a simple, multicloud, and open platform, and it remains our mission to support all our customers who want to do business intelligence, AI, and machine learning on one platform. You can watch all of our Data and AI Summit keynotes and breakout sessions on demand to learn more about these announcements. You can also download the Data Team's Guide to the Databricks Lakehouse Platform for a deeper dive into the platform.