by Tathagata Das, Ryan Boyd, Denny Lee and Vini Jaiswal
At the Data + AI Summit, we were thrilled to announce the early release of Delta Lake: The Definitive Guide, published by O’Reilly. The guide teaches you how to build a modern lakehouse architecture that combines the performance, reliability, and data integrity of a warehouse with the flexibility, scale, and support for unstructured data available in a data lake. It also shows how to use Delta Lake as a key enabler of the lakehouse, providing ACID transactions, time travel, schema constraints, and more on top of the open Parquet format.
Get an early preview of O'Reilly's new ebook for the step-by-step guidance you need to start using Delta Lake.
What can you expect from reading this guide? Learn about all the buzz around bringing transactionality and reliability to data lakes using Delta Lake. You will also gain an understanding of how the big data technology landscape has evolved, from data warehousing to the data lakehouse.
There is no shortage of challenges associated with building data pipelines, and this guide walks through how to tackle them and make data pipelines robust and reliable, so that downstream users can both realize significant value from their data and rely on it to make critical data-driven decisions.
While many organizations have standardized on Apache Spark™ as the big data processing engine, data lakes still need transactionality to ensure a high-quality end-to-end data pipeline. This is where Delta Lake comes in. Delta Lake enhances Apache Spark and makes it easy to store and manage massive amounts of complex data by supporting data integrity, data quality, and performance. As Michael Armbrust and Matei Zaharia recently announced, Databricks released Delta Lake 1.0 on Apache Spark 3.1, with added experimental support for Google Cloud Storage, Oracle Cloud Storage, and IBM Cloud Object Storage. Alongside this release, we also introduced Delta Sharing, an open protocol for secure real-time exchange of large datasets, which enables organizations to share data in real time regardless of which computing platforms they use. We will cover these releases step by step in a future update of the book.
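To make the idea concrete, here is a minimal PySpark sketch of writing and reading a Delta table. It is not taken from the book; it assumes the delta-spark package is installed, and the table path and column name are hypothetical, for illustration only.

```python
# Minimal sketch: a Delta table is Parquet data files plus a transaction log,
# so writes are ACID and reads always see a consistent snapshot.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("delta-quickstart")
    # Enable Delta Lake's SQL extension and catalog (standard delta-spark setup).
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
    .getOrCreate()
)

# Write a small DataFrame in the Delta format (hypothetical path).
events = spark.range(0, 5).withColumnRenamed("id", "event_id")
events.write.format("delta").mode("overwrite").save("/tmp/delta/events")

# Read it back; only fully committed transactions are visible.
spark.read.format("delta").load("/tmp/delta/events").show()
```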
This guide is designed to walk data engineers, data scientists, and data practitioners through building reliable data lakes and data pipelines at scale using Delta Lake. Additionally, you will:
This guide doesn’t require any prior knowledge of the modern lakehouse architecture; however, some knowledge of big data, data formats, cloud architectures, and Apache Spark is helpful. While we invite anyone with an interest in data architectures and machine learning to check out our guide, it’s especially useful for:
The early release of the digital book is available now from Databricks and O’Reilly. You get to read the ebook in its earliest form, the authors’ raw and unedited content as they write, so you can take advantage of these technologies long before the official release. The final digital copy is expected to be released at the end of 2021, and printed copies will be available in April 2022. Thanks to Gary O’Brien, Jess Haberman, and Chris Faucher from O’Reilly, who have been helping us with the book’s publication.
To give you a sneak peek, here is an excerpt from Chapter 2 describing what Delta Lake is.
As previously noted, different storage solutions have been built over time to solve this problem of data quality, from databases to data lakes. The transition from databases to data lakes allows business logic to be decoupled from storage, and compute and storage to be scaled independently. But lost in this transition was data reliability. Bringing data reliability to data lakes is what led to the development of Delta Lake.
Built by the original creators of Apache Spark, Delta Lake was designed to combine the best of both worlds for online analytical workloads (i.e., OLAP style): the transactional reliability of databases with the horizontal scalability of data lakes.
Delta Lake is a file-based, open-source storage format that provides ACID transactions and scalable metadata handling, and unifies streaming and batch data processing. It runs on top of your existing data lakes and is compatible with Apache Spark and other processing engines. Specifically, it provides the following features:
The figure above (referenced from the VLDB 2020 paper) shows a data pipeline implemented either with three separate storage systems (a message queue, an object store, and a data warehouse) or with Delta Lake for both stream and table storage. The Delta Lake version removes the need to manage multiple copies of the data and uses only low-cost object storage. For more information, refer to the VLDB 2020 paper, Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores.
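To illustrate the Delta Lake version of that pipeline, here is a hedged PySpark sketch, not taken from the book or the paper: one Delta table serves as both the streaming sink and the batch/reporting source, so there is no separate message queue or warehouse copy to keep in sync. The paths and event schema below are hypothetical.

```python
# Sketch of the single-storage-system pipeline: stream into a Delta table,
# then run batch queries against the same table.
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Continuously ingest raw JSON events from low-cost object storage (hypothetical path).
raw_events = (
    spark.readStream.format("json")
    .schema("event_id STRING, ts TIMESTAMP, payload STRING")
    .load("/data/raw/events/")
)

stream = (
    raw_events.writeStream.format("delta")
    .option("checkpointLocation", "/data/checkpoints/events")
    .outputMode("append")
    .start("/data/delta/events")
)

# Batch and reporting jobs read the very same table; the transaction log
# guarantees they only ever see fully committed micro-batches.
daily_counts = (
    spark.read.format("delta").load("/data/delta/events")
    .groupBy("event_id")
    .count()
)
daily_counts.show()
```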
Additionally, we are planning to cover the following topics in the final release of the book.
Please be sure to check out some related content from the Data + AI Summit 2021 platform, including keynotes from Bill Inmon, the father of the data warehouse; Malala Yousafzai, Nobel Peace Prize winner and education advocate; Dr. Moogega Cooper and Adam Steltzner, trailblazing engineers on the famed Mars rover ‘Perseverance’ mission at NASA-JPL; Sol Rashidi, CAO at Estee Lauder; DJ Patil, who coined the title "data scientist" at LinkedIn; Michael Armbrust, distinguished software engineer at Databricks; Matei Zaharia, Databricks co-founder and Chief Technologist and original creator of Apache Spark and MLflow; and Ali Ghodsi, Databricks CEO and co-founder, among other featured speakers. Level up your knowledge with highly technical content presented by leading experts.