Modern ETL Pipelines with Change Data Capture

In this talk we’ll present how, at GetYourGuide, we built a completely new ETL pipeline from scratch using Debezium, Kafka, Spark and Airflow, one that can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.

This is usually done through either full or partial copies of the data with tools such as Sqoop. Another approach that has become quite popular lately is to use Debezium as the change data capture (CDC) layer: it reads the databases’ binlogs and streams these changes directly to Kafka. Since having data refreshed only once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL around Debezium.
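As a rough illustration of that pattern (placeholder names only, not the pipeline presented in the talk), the sketch below reads Debezium change events from a Kafka topic with Spark Structured Streaming. The broker address, topic name and table columns are assumptions, it presumes Debezium’s JSON converter is configured without the embedded schema, and it needs the spark-sql-kafka connector on the classpath (available on Databricks). How the changes are landed in the lake is sketched after the next paragraph.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import (IntegerType, LongType, StringType,
                               StructField, StructType)

spark = SparkSession.builder.appName("debezium-cdc-reader").getOrCreate()

# Assumed shape of one row of the upstream table.
row_schema = StructType([
    StructField("id", IntegerType()),
    StructField("status", StringType()),
])

# Simplified Debezium envelope: `op` is the change type (c=create, u=update,
# d=delete, r=snapshot read); `before`/`after` hold the row state around it.
envelope = StructType([
    StructField("op", StringType()),
    StructField("ts_ms", LongType()),
    StructField("before", row_schema),
    StructField("after", row_schema),
])

changes = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "kafka:9092")  # assumed broker
         .option("subscribe", "mysql.shop.orders")         # assumed topic
         .option("startingOffsets", "earliest")
         .load()
         # The Kafka message value carries the Debezium envelope as JSON.
         .select(from_json(col("value").cast("string"), envelope).alias("c"))
         .select("c.op", "c.ts_ms", "c.before", "c.after")
)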

We’ll walk the audience through the steps we followed to architect and develop such a solution, using Databricks to reduce operational overhead. With this new pipeline we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
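Continuing the sketch above, the snippet below applies those change events to a lake table per micro-batch. It assumes a Delta Lake target table, an `id` key column and a checkpoint path that are all illustrative, and it reuses the `changes` DataFrame from the earlier sketch; the write path shown in the talk may differ.

from delta.tables import DeltaTable
from pyspark.sql.functions import coalesce, col

def upsert_batch(batch_df, batch_id):
    # Flatten the envelope: deletes carry the key in `before`, other ops in `after`.
    flat = batch_df.select(
        col("op"),
        coalesce(col("after.id"), col("before.id")).alias("id"),
        col("after.status").alias("status"),
    )
    target = DeltaTable.forName(batch_df.sparkSession, "lake.orders")  # assumed table
    (target.alias("t")
           .merge(flat.alias("s"), "t.id = s.id")
           .whenMatchedDelete(condition="s.op = 'd'")
           .whenMatchedUpdate(set={"status": "s.status"})
           .whenNotMatchedInsert(condition="s.op != 'd'",
                                 values={"id": "s.id", "status": "s.status"})
           .execute())

(changes.writeStream
        .foreachBatch(upsert_batch)
        .option("checkpointLocation", "/tmp/checkpoints/orders")  # assumed path
        .start())

In practice each micro-batch would first be deduplicated to the latest change per key (for example by `ts_ms`), since MERGE requires at most one source row per matched target row.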

About Thiago Rigo

GetYourGuide

Thiago has been working in software engineering for the past 7 years, with the last 3 focused on data engineering. As a data engineer, he has worked on a variety of projects related to data warehousing, data quality, and event processing. At GetYourGuide he's part of the Data Platform team, where he's responsible for architecting, building and monitoring data pipelines that serve internal and external users.

About David Mariassy

GetYourGuide

David works as a Data Engineer on GetYourGuide's Data Platform team, where he focuses on serving internal customers with high-quality, low-latency data products. He has over 5 years of experience in Business Intelligence and Data Engineering roles in the Berlin e-commerce scene. He enjoys developing data pipelines that are easy to maintain, test and evolve, and has a keen interest in functional programming.