Simplify CDC Pipeline with Spark Streaming SQL and Delta Lake


Change Data Capture (CDC) is a typical use case in real-time data warehousing. It tracks the change log (binlog) of a relational OLTP database and replays those changes promptly to external storage, such as Delta Lake or Kudu, for real-time OLAP. Implementing a robust CDC streaming pipeline requires attention to many factors: how to ensure data accuracy, how to handle schema changes in the OLTP source, and whether the pipeline is easy to build for a variety of databases with little code. This talk shares our practice of simplifying the CDC pipeline with Spark Streaming SQL and Delta Lake. Users just need to write a simple Merge Into Streaming SQL statement to build a CDC pipeline from a relational database to Delta Lake. Behind this simple Streaming SQL, we cover data accuracy and automatic detection of schema changes, along with a number of Delta Lake improvements, including data skipping to improve merge performance and handling of transaction commit conflicts between the streaming job and compaction.
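To make the merge idea concrete, here is a minimal sketch in plain Python of how one micro-batch of change-log events might be folded into a keyed target table. It is not the Streaming SQL or the Delta Lake implementation discussed in the talk: the `ChangeEvent` type, its `seq`/`op`/`key`/`row` fields, and the `merge_batch` helper are all hypothetical names chosen for illustration, and the target table is modeled as an in-memory dict.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class ChangeEvent:
    """One change-log record (hypothetical shape, for illustration only)."""
    seq: int                    # position of the event in the change log
    op: str                     # "insert", "update", or "delete"
    key: int                    # primary key of the affected row
    row: Optional[dict] = None  # full row image; expected for insert/update


def latest_per_key(batch: Iterable[ChangeEvent]) -> Dict[int, ChangeEvent]:
    """Keep only the newest event per key, so out-of-order events cannot win."""
    latest: Dict[int, ChangeEvent] = {}
    for ev in batch:
        current = latest.get(ev.key)
        if current is None or ev.seq > current.seq:
            latest[ev.key] = ev
    return latest


def merge_batch(target: Dict[int, dict], batch: Iterable[ChangeEvent]) -> Dict[int, dict]:
    """Apply one batch to the target: delete when matched, otherwise upsert."""
    for key, ev in latest_per_key(batch).items():
        if ev.op == "delete":
            target.pop(key, None)      # deleting a missing row is a no-op
        else:
            target[key] = dict(ev.row)  # insert and update are both upserts
    return target


if __name__ == "__main__":
    table = {1: {"id": 1, "name": "old"}, 2: {"id": 2, "name": "gone"}}
    batch = [
        ChangeEvent(3, "update", 1, {"id": 1, "name": "new"}),
        ChangeEvent(2, "update", 1, {"id": 1, "name": "stale"}),  # arrives late
        ChangeEvent(4, "delete", 2),
        ChangeEvent(5, "insert", 3, {"id": 3, "name": "c"}),
    ]
    print(merge_batch(table, batch))
```

Deduplicating to the latest event per key before applying the batch is one way to keep the result accurate when a key changes several times within a batch, and it makes replaying the same batch harmless, which matters if a streaming job retries after a failure.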

About Jun Song


Jun Song is a senior engineer and big data expert at Alibaba, focusing on Spark, especially Spark Core and Spark SQL. He is also an Apache Spark contributor and a winner of the CloudSort Benchmark Competition 2016, using Spark as the compute engine. In addition, he submitted a benchmark report to the TPC-DS website that achieved a top-one ranking, accomplished through extensive optimization of Spark SQL.