Delta Lake is an open source project with the Linux Foundation. Data is stored in the open Apache Parquet format, allowing data to be read by any compatible reader. APIs are open and compatible with Apache Spark™.
Data lakes often have data quality issues, due to a lack of control over ingested data. Delta Lake adds a storage layer to data lakes to manage data quality, ensuring data lakes contain only high quality data for consumers.
Handle changing records and evolving schemas as business requirements change. And go beyond Lambda architecture with truly unified streaming and batch using the same engine, APIs, and code.
Instead of parquet…
dataframe .write .format("parquet") .save("/data")
…simply say delta
dataframe .write .format("delta") .save("/data")
Delta Lake is an open source storage layer that sits on top of your existing data lake file storage, such AWS S3, Azure Data Lake Storage, or HDFS. It uses versioned Apache Parquet™ files to store your data. Delta Lake also stores a transaction log to keep track of all the commits made to provide expanded capabilities like ACID transactions, data versioning, and audit history. To access the data, you can use the open Spark APIs, any of the different connectors, or a Parquet reader to read the files directly.