Automatically Evolve Your Nested Column Schema, Stream From a Delta Table Version, and Check Your Constraints
We recently announced the release of Delta Lake 0.8.0, which introduces schema evolution and performance improvements in merge and operational metrics in table history. The key features in this release are: Unlimited MATCHED and NOT MATCHED clauses for merge operations in Scala, Java, and Python. Merge operations now support any number of whenMatched and whenNotMatched...
Natively Query Your Delta Lake With Scala, Java, and Python
Today, we’re happy to announce that you can natively query your Delta Lake with Scala and Java (via the Delta Standalone Reader) and Python (via the Delta Rust API). Delta Lake is an open-source storage layer that brings reliability to data lakes. Delta Lake provides ACID transactions, scalable metadata handling, and unifies streaming and batch...
Diving Into Delta Lake: DML Internals (Update, Delete, Merge)
In the previous posts "Diving Into Delta Lake: Unpacking The Transaction Log" and "Diving Into Delta Lake: Schema Enforcement & Evolution", we described how the Delta Lake transaction log works and the internals of schema enforcement and evolution. Delta Lake supports DML (data manipulation language) commands including DELETE, UPDATE, and MERGE. These commands simplify change data...
An Update on Project Zen: Improving Apache Spark for Python Users
Apache Spark™ has reached its 10th anniversary with Apache Spark 3.0, which has many significant improvements and new features, including but not limited to type hint support in pandas UDFs, better error handling in UDFs, and Spark SQL adaptive query execution. It has grown to be one of the most successful open-source projects as the...
Enabling Spark SQL DDL and DML in Delta Lake on Apache Spark 3.0
Last week, we had a fun Delta Lake 0.7.0 + Apache Spark 3.0 AMA where Burak Yavuz, Tathagata Das, and Denny Lee provided a recap of Delta Lake 0.7.0 and answered your Delta Lake questions. The theme for this AMA was that the release of Delta Lake 0.7.0 coincided with the release of Apache Spark 3.0...
Time Traveling with Delta Lake: A Retrospective of the Last Year
Try out Delta Lake 0.7.0 with Spark 3.0 today! It has been a little more than a year since Delta Lake became an open-source project under the Linux Foundation. While a lot has changed over the last year, the challenges for most data lakes remain stubbornly the same - the inherent unreliability of data...
Schema Evolution in Merge Operations and Operational Metrics in Delta Lake
Try this notebook to reproduce the steps outlined below. We recently announced the release of Delta Lake 0.6.0, which introduces schema evolution and performance improvements in merge and operational metrics in table history. The key features in this release are: Support for schema evolution in merge operations (#170) - You can now automatically evolve the...
Using AutoML Toolkit’s FamilyRunner Pipeline APIs to Simplify and Automate Loan Default Predictions
In the post "Using AutoML Toolkit to Automate Loan Default Predictions", we showed how the Databricks Labs’ AutoML Toolkit simplified machine learning model feature engineering and model building optimization (MBO). It also improved the area-under-the-curve (AUC) from 0.6732 (handmade XGBoost model) to 0.723 (AutoML XGBoost model). With AutoML Toolkit’s Release 0.6.1, we...
Simple, Reliable Upserts and Deletes on Delta Lake Tables using Python APIs
We are excited to announce the release of Delta Lake 0.4.0 which introduces Python APIs for manipulating and managing data in Delta tables. The key features in this release are: Python APIs for DML and utility operations (#89) - You can now use Python APIs to update/delete/merge data in Delta Lake tables and to run...
Diving Into Delta Lake: Schema Enforcement & Evolution
Data, like our experiences, is always evolving and accumulating. To keep up, our mental models of the world must adapt to new data, some of which contains new dimensions - new ways of seeing things we had no conception of before. These mental models are not unlike a table's schema, defining how we categorize and...