Databricks Delta Lake, the next-generation engine built on top of Apache Spark™, now supports the MERGE command, which allows you to efficiently upsert and delete records in your data lakes. MERGE dramatically simplifies how a number of common data pipelines can be built; all the complicated multi-hop processes that inefficiently rewrote entire partitions can now be replaced by simple MERGE queries. This finer-grained update capability simplifies how you build your big data pipelines for various use cases ranging from change data capture to GDPR.
There are a number of common use cases where existing data in a data lake needs to be updated or deleted: complying with GDPR's "right to be forgotten" by deleting a user's records, applying change data capture (CDC) from an external database, and incrementally updating session information computed from streaming events.
Since data lakes are fundamentally based on files, they have always been optimized for appending data rather than for changing existing data. Hence, building the above use cases has always been challenging. Users typically read the entire table (or a subset of its partitions) and then overwrite it. As a result, every organization tries to reinvent the wheel for its requirements by hand-writing complicated queries in SQL, Spark, etc. This approach is inefficient, because entire tables or partitions must be read and rewritten for even small changes; error-prone, because without snapshot isolation concurrent readers can see partially written data; and hard to maintain, because the hand-written logic has to be kept correct as the pipeline evolves.
With Databricks Delta, you can easily address the use cases above without any of the aforementioned problems using the following MERGE command:
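As a rough sketch of the command's shape, issued through spark.sql in Scala (the table names target and updates and the columns key and value are placeholders for illustration, not names prescribed by Delta):

```scala
// Hypothetical sketch of the general MERGE shape; `target`, `updates`,
// `key`, and `value` are placeholder names used only for illustration.
spark.sql("""
  MERGE INTO target
  USING updates
  ON target.key = updates.key
  WHEN MATCHED THEN
    UPDATE SET value = updates.value
  WHEN NOT MATCHED THEN
    INSERT (key, value) VALUES (updates.key, updates.value)
""")
```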
See our docs (Azure | AWS) for a more detailed explanation of the syntax.
Let’s understand how to use MERGE with a simple example. Suppose you have a slowly changing dimension table that maintains user information like addresses. Furthermore, you have a table of new addresses for both existing and new users. To merge all the new addresses into the main user table, you can run the following:
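A minimal sketch of that merge, assuming a target table users and a source table updates that both carry userId and address columns (these names are illustrative):

```scala
// Upsert new addresses into the users table.
// `users`, `updates`, `userId`, and `address` are assumed example names.
spark.sql("""
  MERGE INTO users
  USING updates
  ON users.userId = updates.userId
  WHEN MATCHED THEN
    UPDATE SET address = updates.address
  WHEN NOT MATCHED THEN
    INSERT (userId, address) VALUES (updates.userId, updates.address)
""")
```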
This will do exactly what the syntax says: for existing users (i.e., the MATCHED clause), it will update the address column, and for new users (i.e., the NOT MATCHED clause) it will insert all the columns. For large tables with TBs of data, this Databricks Delta MERGE operation can be orders of magnitude faster than overwriting entire partitions or tables, since Delta reads only the relevant files and updates them. Specifically, Delta’s MERGE has the following advantages: it is fine-grained, rewriting data at the granularity of files rather than partitions; it is efficient, reading only the files relevant to the merge condition; and it is transactional, so concurrent readers always see a consistent snapshot of the table.
Here is a visual explanation of how MERGE compares with hand-written pipelines.
Complying with the “right to be forgotten” clause of GDPR for data in data lakes could not be easier. You can set up a simple scheduled job with code like the example below to delete all the users who have opted out of your service.
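For instance, assuming the opted-out users are collected in a table called opted_out_users (an illustrative name, like users and userId here), the scheduled job could run something like:

```scala
// Delete all users who have opted out of the service.
// `users`, `opted_out_users`, and `userId` are assumed example names.
spark.sql("""
  MERGE INTO users
  USING opted_out_users
  ON users.userId = opted_out_users.userId
  WHEN MATCHED THEN DELETE
""")
```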
You can easily apply all data changes - updates, deletes, and inserts - generated from an external database to a Databricks Delta table with the MERGE syntax as follows:
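A sketch of such a query, assuming a changes table that carries at most one row per userId (e.g., the latest change) along with a boolean deleted flag; all of these names are illustrative assumptions:

```scala
// Apply inserts, updates, and deletes captured from an external database.
// Assumes `changes` has at most one row per userId (the latest change)
// and a boolean `deleted` column; all names are example assumptions.
spark.sql("""
  MERGE INTO users
  USING changes
  ON users.userId = changes.userId
  WHEN MATCHED AND changes.deleted = true THEN
    DELETE
  WHEN MATCHED THEN
    UPDATE SET address = changes.address
  WHEN NOT MATCHED AND changes.deleted = false THEN
    INSERT (userId, address) VALUES (changes.userId, changes.address)
""")
```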
If you have streaming event data flowing in and you want to sessionize it, incrementally updating and storing the sessions in a Databricks Delta table, you can accomplish this using foreachBatch in Structured Streaming together with MERGE. For example, suppose you have a Structured Streaming DataFrame that computes updated session information for each user. You can start a streaming query that applies all the session updates to a Delta table as follows (Scala).
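A sketch of such a query, where sessionUpdatesDF stands for the streaming DataFrame described above and sessions is the target Delta table (both names, and the userId/sessionId columns, are assumptions for illustration):

```scala
import org.apache.spark.sql.DataFrame

// Upsert one micro-batch of session updates into the Delta table.
// `sessions`, `userId`, and `sessionId` are assumed example names.
def upsertSessions(microBatchDF: DataFrame, batchId: Long): Unit = {
  microBatchDF.createOrReplaceTempView("updates")
  microBatchDF.sparkSession.sql("""
    MERGE INTO sessions
    USING updates
    ON sessions.userId = updates.userId AND sessions.sessionId = updates.sessionId
    WHEN MATCHED THEN UPDATE SET *
    WHEN NOT MATCHED THEN INSERT *
  """)
}

// `sessionUpdatesDF` is the (assumed) streaming DataFrame of updated sessions.
sessionUpdatesDF.writeStream
  .foreachBatch(upsertSessions _)
  .outputMode("update")
  .start()
```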
For a complete working example of foreachBatch and MERGE, see this notebook (Azure | AWS).
The fine-grained update capability in Databricks Delta simplifies how you build your big data pipelines. You no longer need to write complicated logic to overwrite tables and overcome a lack of snapshot isolation. With fine-grained updates, your pipelines will also be more efficient since you don’t need to read and overwrite entire tables. With changing data, another critical capability required is the ability to roll back in case of bad writes. Databricks Delta also offers rollback capabilities with the time travel feature, so that if you do a bad merge, you can easily roll back.
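For example, a hedged sketch of that rollback workflow (the users table name and the version number are placeholders): you can inspect the table history to find the version committed before the bad merge and then read that earlier snapshot with time travel to restore the affected data.

```scala
// Inspect the table history to find the version committed before the bad merge.
// `users` is an assumed example table name.
spark.sql("DESCRIBE HISTORY users").show()

// Read the table as of that earlier version (5 is a placeholder version number);
// this snapshot can then be used to restore the affected rows.
val beforeBadMerge = spark.sql("SELECT * FROM users VERSION AS OF 5")
```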
Read more (Azure | AWS) about the fine-grained updates feature. To see the feature in action, sign up for a free trial of Databricks and try it out.