Delta Lake

Brings data reliability and performance to your data lakes

Delta Lake brings reliability, performance, and lifecycle management to data lakes. No more malformed data ingestion, difficulty deleting data for compliance, or issues modifying data for change data capture. Accelerate the velocity at which high-quality data gets into your data lake, and the rate at which teams can leverage that data, with a secure and scalable cloud service.

Benefits

OPEN & EXTENSIBLE

Delta Lake is an open source project with the Linux Foundation. Data is stored in the open Apache Parquet format, allowing data to be read by any compatible reader. APIs are open and compatible with Apache Spark™.
 

DATA RELIABILITY

Data lakes often have data quality issues, due to a lack of control over ingested data. Delta Lake adds a storage layer to data lakes to manage data quality, ensuring data lakes contain only high quality data for consumers.
 

MANAGE DATA LIFECYCLE

Handle changing records and evolving schemas as business requirements change. And go beyond Lambda architecture with truly unified streaming and batch using the same engine, APIs, and code.
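
For illustration, here is a minimal PySpark sketch of that unified batch and streaming model. It assumes a SparkSession named spark configured with the Delta Lake package; the paths used are hypothetical.

# Batch write into a Delta table.
batch_df = spark.createDataFrame([(1, "click"), (2, "view")], ["id", "event"])
batch_df.write.format("delta").mode("append").save("/data/events")

# The same table is a streaming source, using the same engine and APIs...
stream_df = spark.readStream.format("delta").load("/data/events")

# ...and a streaming sink (Structured Streaming requires a checkpoint location).
query = (stream_df.writeStream
    .format("delta")
    .option("checkpointLocation", "/data/_checkpoints/events")
    .start("/data/events_mirror"))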

Features

 

ACID Transactions: Multiple data pipelines can concurrently read from and write to a data lake. ACID transactions ensure data integrity with serializability, the strongest level of isolation.
 
Updates and Deletes: Delta Lake provides DML APIs to merge, update and delete datasets. This allows you to easily comply with GDPR/CCPA and simplify change data capture (see the sketch after this feature list).
 
Schema Enforcement: Specify your data lake schema and enforce it, ensuring that the data types are correct and required columns are present, and preventing bad data from causing data corruption.
 
Time Travel (Data Versioning): Data snapshots enable developers to access and revert to earlier versions of data to audit data changes, rollback bad updates or reproduce experiments.
 
Scalable Metadata Handling: Delta Lake treats metadata just like data, leveraging Spark’s distributed processing power. This allows for petabyte-scale tables with billions of partitions and files.

Open Format: All data in Delta Lake is stored in the Apache Parquet format, enabling Delta Lake to leverage the efficient compression and encoding schemes that are native to Parquet.
 
Unified Batch and Streaming Source and Sink: A table in Delta Lake is both a batch table and a streaming source and sink. Streaming data ingest, batch historic backfill, and interactive queries all just work out of the box.
 
Schema Evolution: Big data is continuously changing. Delta Lake enables you to make changes to a table schema that can be applied automatically, without the need for cumbersome DDL.
 
Audit History: The Delta Lake transaction log records details about every change made to data, providing a full history of changes, for compliance, audit, and reproduction.
 
100% Compatible with Apache Spark API: Developers can use Delta Lake with their existing data pipelines with minimal change as it is fully compatible with Spark, the commonly used big data processing engine.
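
To make a few of these concrete, here is a minimal PySpark sketch of the Updates and Deletes, Schema Evolution, Time Travel, and Audit History features above. It assumes a SparkSession named spark, the delta-spark package installed, and a hypothetical Delta table at /data/events.

from delta.tables import DeltaTable
from pyspark.sql.functions import lit

tbl = DeltaTable.forPath(spark, "/data/events")   # hypothetical path

# Updates and Deletes: modify rows in place with DML APIs.
tbl.update(condition="event = 'view'", set={"event": "'page_view'"})
tbl.delete(condition="id = 2")

# Merge (upsert) a DataFrame of changes into the table.
changes = spark.createDataFrame([(1, "click"), (3, "purchase")], ["id", "event"])
(tbl.alias("t")
    .merge(changes.alias("c"), "t.id = c.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute())

# Schema Evolution: merge a new column into the table schema on write.
(changes.withColumn("country", lit("US"))
    .write.format("delta")
    .mode("append")
    .option("mergeSchema", "true")
    .save("/data/events"))

# Time Travel: read the table as of an earlier version.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/data/events")

# Audit History: the transaction log records every change made to the table.
tbl.history().select("version", "timestamp", "operation").show()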

Instead of parquet

dataframe 
    .write
    .format("parquet") 
    .save("/data")

…simply say delta

dataframe 
    .write
    .format("delta")
    .save("/data")

How It Works

Delta Lake Under the Hood

From Michael Armbrust, Creator of Delta Lake

Delta Lake is an open source storage layer that sits on top of your existing data lake file storage, such as AWS S3, Azure Data Lake Storage, or HDFS. It uses versioned Apache Parquet™ files to store your data, and it maintains a transaction log that keeps track of every commit made to the table, providing expanded capabilities like ACID transactions, data versioning, and audit history. To access the data, you can use the open Spark APIs, any of the available connectors, or a Parquet reader to read the files directly.
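
As an illustration of that layout, here is a hedged sketch of a Delta table directory and a direct look at one commit file in the transaction log. The path /data/events is hypothetical and assumes local file storage.

# A Delta table directory holds Parquet data files plus a _delta_log/ folder
# of JSON commit files, one per transaction, for example:
#
#   /data/events/part-00000-<uuid>.snappy.parquet
#   /data/events/_delta_log/00000000000000000000.json
#   /data/events/_delta_log/00000000000000000001.json
#
# Each commit file is newline-delimited JSON, one action per line
# (commitInfo, protocol, metaData, add, remove, ...), which is how any
# protocol-aware reader reconstructs a consistent snapshot of the table.
import json

with open("/data/events/_delta_log/00000000000000000000.json") as log:
    for line in log:
        print(list(json.loads(line).keys()))   # e.g. ['commitInfo'], ['metaData'], ['add']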

Ready to get started?

TRY DATABRICKS FOR FREE


Follow the Quick Start Guide

Documentation

Resources