In this talk we'll present how, at GetYourGuide, we built from scratch a completely new ETL pipeline using Debezium, Kafka, Spark and Airflow that can automatically handle schema changes. Our starting point was an error-prone legacy system that ran daily and was vulnerable to breaking schema changes, which caused many sleepless on-call nights. Like most companies, we also have traditional SQL databases that we need to connect to in order to extract relevant data.
This is usually done through either full or partial copies of the data with tools such as Sqoop. However, another approach that has become quite popular lately is to use Debezium as the Change Data Capture (CDC) layer, which reads the databases' binlogs and streams these changes directly to Kafka. Because having data once a day is no longer enough for our business, and we wanted our pipelines to be resilient to upstream schema changes, we decided to rebuild our ETL using Debezium.
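As a minimal sketch of the change data capture idea, and not the pipeline presented in the talk, the Python snippet below replays Debezium-style change events into an in-memory snapshot. It relies on the `op`, `before` and `after` fields of the Debezium event envelope. The function name `apply_change_events`, the primary key column `id` and the sample rows are assumptions made for illustration. Columns that appear upstream later are backfilled with `None`, which is one simple way to tolerate additive schema changes.

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def apply_change_events(
    events: Iterable[Row], key: str = "id"
) -> Tuple[Dict[Any, Row], List[str]]:
    """Replay change events (hypothetical helper) and return (snapshot, columns)."""
    table: Dict[Any, Row] = {}
    columns: List[str] = []  # every column seen so far, in order of first appearance

    for event in events:
        op = event["op"]
        if op in ("c", "r", "u"):  # create, snapshot read, update
            row = event["after"]
            for col in row:
                if col not in columns:
                    columns.append(col)  # a new upstream column: widen the schema
            table[row[key]] = dict(row)
        elif op == "d":  # delete: the old row image is carried in "before"
            table.pop(event["before"][key], None)
        else:
            raise ValueError(f"unsupported op: {op!r}")

    # Backfill rows written before a column existed, so all rows share one schema.
    for row in table.values():
        for col in columns:
            row.setdefault(col, None)
    return table, columns


# Illustrative events; the "country" column is added upstream midway through.
sample_events = [
    {"op": "c", "before": None, "after": {"id": 1, "city": "Berlin"}},
    {"op": "c", "before": None, "after": {"id": 2, "city": "Rome"}},
    {"op": "u", "before": {"id": 1, "city": "Berlin"},
     "after": {"id": 1, "city": "Berlin", "country": "DE"}},
    {"op": "c", "before": None, "after": {"id": 3, "city": "Paris", "country": "FR"}},
    {"op": "d", "before": {"id": 3, "city": "Paris", "country": "FR"}, "after": None},
]

snapshot, schema = apply_change_events(sample_events)
print(schema)
print(snapshot)
```

In a real deployment the events would be consumed from Kafka and merged into the data lake with Spark. The in-memory dictionary here only stands in for that target table.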
We'll walk the audience through the steps we followed to architect and develop such a solution, using Databricks to reduce operation time. By building this new pipeline, we are now able to refresh our data lake multiple times a day, giving our users fresh data and protecting our nights of sleep.
Attribution tracking is the process of recording which touch points result in a customer visiting a website or mobile app. Organizations track every interaction that brings a visitor to their website to properly attribute every step leading to a visit. With proper attribution, businesses can determine which marketing channels work better than others, and therefore allocate more of their marketing budget to the most effective channels. In this context, GetYourGuide developed a solution that cleans and structures logs from different data sources, applies rules to deal with channel assignment, and finally properly weights each channel's contribution to total revenue generated. Thiago will go through the business and technical challenges solved and how the solution was implemented at GetYourGuide using Spark and Databricks. Session hashtag: #SAISEnt13
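To make the channel weighting step concrete, here is a minimal Python sketch of one possible weighting rule. The abstract does not say which model GetYourGuide uses, so the equal-weight (linear) split below is an assumption, as are the function name `linear_attribution` and the channel names.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

# One journey = (ordered list of channels touched, revenue of the resulting booking)
Journey = Tuple[List[str], float]


def linear_attribution(journeys: Iterable[Journey]) -> Dict[str, float]:
    """Split each journey's revenue equally across its touch points (assumed model)."""
    totals: Dict[str, float] = defaultdict(float)
    for touch_points, revenue in journeys:
        if not touch_points:
            continue  # no recorded touch point, so nothing to credit
        share = revenue / len(touch_points)
        for channel in touch_points:
            totals[channel] += share
    return dict(totals)


# Hypothetical journeys used only for illustration.
example_journeys = [
    (["paid_search", "email"], 100.0),
    (["email"], 50.0),
]
print(linear_attribution(example_journeys))
```

Other rules, such as crediting only the last touch point or decaying credit over time, would change only how `share` is computed per channel.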