As organizations adopt the data lakehouse architecture, data engineers are looking for efficient ways to capture continually arriving data. Even with the right tools, implementing this common use case can be challenging to execute – especially when replicating operational databases into their lakehouse or reprocessing data for each update. Using a reliable ETL framework to develop, monitor, manage and operationalize data pipelines at scale, we have made it easy to implement change data capture (CDC) into the Delta Lake with Delta Live Tables (DLT) giving users:
- Simplicity and convenience: Easy-to-use APIs for identifying changes, making your code simple, convenient and easy to understand.
- Efficiency: The ability to only insert or update rows that have changed, with efficient merge, update and delete operations.
- Scalability: The ability to capture and apply data changes across tens of thousands of tables with low-latency support.
Delta Live Tables enables data engineers to simplify data pipeline development and maintenance, enable data teams to self serve and innovate rapidly, provides built-in quality controls and monitoring to ensure accurate and useful BI, Data Science and ML and lets you scale with reliability through deep visibility into pipeline operations, automatic error handling, and auto-scaling capabilities.
With DLT, data engineers can easily implement CDC with a new declarative APPLY CHANGES INTO API, in either SQL or Python. This new capability lets ETL pipelines easily detect source data changes and apply them to data sets throughout the lakehouse. DLT processes data changes into the Delta Lake incrementally, flagging records to be inserted, updated or deleted when handling CDC events. The example below shows how easy it is to identify and delete records from a customer table using the new API:
CREATE STREAMING LIVE TABLE customer_silver; APPLY CHANGES INTO live.customer_silver FROM stream(live.customer_bronze) KEYS (id) APPLY AS DELETE WHEN active = 0 SEQUENCE BY update_dt ;
The default behavior is to upsert the CDC events from the source by automatically updating any row in the target table that matches the specified key(s) and insert a new row if there’s no preexisting match in the target table. DELETE events may also be handled by specifying the APPLY AS DELETE WHEN condition. APPLY CHANGES INTO is available in all regions. For more information, refer to the documentation (Azure, AWS, GCP) or check out an example notebook.