Data engineering is the backbone of modern data teams. Without high quality data, downstream projects for data science, machine learning, and analytics quickly run into bottlenecks.
Find out how to keep your data pipelines stable and your data lakes reliable in our data engineering track at Spark+AI Summit Europe, where community presenters will share their experiences and best practices with Apache Spark™ and Delta Lake. You’ll learn how to tackle your toughest data challenges. Here are a few sessions to check out:
Time travel is now possible with Delta Lake! We’ll show you how to “go back in time” with Delta Lake and why it’s such a powerful feature. Through presentation, notebooks, and code, you’ll learn what challenges Delta Lake addresses, how Delta Lake works, and how its time travel capability can improve several common data engineering pipelines.
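To give a sense of what time travel looks like in practice, here is a minimal sketch of reading earlier versions of a Delta table with Spark’s Scala API; the table path, version number, and timestamp below are illustrative, not taken from the session.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-time-travel-sketch")
  .getOrCreate()

// Read the table as it existed at a specific commit version
val eventsAtV5 = spark.read
  .format("delta")
  .option("versionAsOf", 5)               // illustrative version number
  .load("/delta/events")                  // illustrative table path

// Or read the table as it existed at a point in time
val eventsYesterday = spark.read
  .format("delta")
  .option("timestampAsOf", "2019-10-14")  // illustrative timestamp
  .load("/delta/events")

eventsAtV5.count()
eventsYesterday.count()
```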
Building Data Intensive Analytic Application on Top of Delta Lakes
All types of enterprises are building data lakes. However, data lakes are still plagued by low user adoption rates and poor data quality, resulting in lower ROI. BI tools may not be enough for your use case. We’ll explore options for building an analytics application using various back-end technologies, architectures, and frameworks. The session includes a demo analytics application built with the Play Framework (back end) and React (front end), using Structured Streaming to ingest data from a Delta table, with live query analytics on real-time data and ML predictions based on that analytics data.
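As a rough sketch of the ingestion piece described above, a Delta table can be read as a streaming source with Structured Streaming; the path and query name here are placeholders, and the memory sink stands in for whatever the application back end would actually consume.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-streaming-ingest-sketch")
  .getOrCreate()

// Treat the Delta table as a streaming source: new commits are picked up incrementally
val analyticsStream = spark.readStream
  .format("delta")
  .load("/delta/analytics_events")   // placeholder table path

// Feed the stream to the application layer; the memory sink is only for demos,
// a real app would write to a serving store or expose the data another way
val query = analyticsStream.writeStream
  .format("memory")
  .queryName("live_analytics")       // placeholder query name
  .outputMode("append")
  .start()
```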
Modern ETL Pipelines with Change Data Capture
In this talk, you’ll find out how GetYourGuide built a completely new ETL pipeline from scratch, using Debezium, Kafka, Spark, and Airflow. The previous legacy system was error-prone, vulnerable to breaking schema changes, and caused many sleepless on-call nights. In this session, we’ll review the steps we followed to architect and develop our ETL pipeline using Databricks to reduce operation time. Since building these new pipelines, we can now refresh our data lake multiple times daily to provide our users with fresher data than before.
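The talk covers GetYourGuide’s own architecture; as a very rough sketch of one piece of such a pipeline, ingesting Debezium change events from Kafka with Structured Streaming might look like the following. The bootstrap servers, topic name, and paths are assumptions, not the presenters’ actual setup.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("cdc-ingest-sketch")
  .getOrCreate()

// Read Debezium change events from Kafka; servers and topic are placeholders
val changes = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "kafka:9092")
  .option("subscribe", "dbserver1.public.bookings")
  .load()
  .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS change_json")

// Land the raw change events in a Delta table for downstream parsing and merging
changes.writeStream
  .format("delta")
  .option("checkpointLocation", "/delta/_checkpoints/bookings_cdc")  // placeholder
  .start("/delta/bronze/bookings_cdc")                               // placeholder
```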
Data Warehousing with Spark Streaming at Zalando
Zalando’s AI-driven products and distributed landscape of analytical data marts cannot wait for long-running, hard-to-recover, monolithic batch jobs that take all night to calculate data that’s already outdated. This talk will include a discussion of the challenges in our data platform and an architectural deep dive into separating integration from enrichment, providing streams and snapshots, and feeding the data to distributed data marts. We will also share lessons learned and best practices for Delta’s MERGE command, the Scala API vs. Spark SQL, and schema evolution, and provide additional insights and guidance for similar use cases.
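To give a flavor of the MERGE pattern discussed in this session, here is a minimal upsert sketch using the Delta Lake Scala API; the table path, join key, and incoming updates DataFrame are assumptions for illustration only.

```scala
import io.delta.tables.DeltaTable
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("delta-merge-sketch")
  .getOrCreate()
import spark.implicits._

// Assumed incoming batch of updated rows
val updates = Seq((1L, "amsterdam"), (2L, "berlin")).toDF("customer_id", "city")

// Upsert into the target Delta table: update matching rows, insert new ones
DeltaTable.forPath(spark, "/delta/customers")   // placeholder table path
  .as("t")
  .merge(updates.as("s"), "t.customer_id = s.customer_id")
  .whenMatched().updateAll()
  .whenNotMatched().insertAll()
  .execute()
```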
Simplify and Scale Data Engineering Pipelines with Delta Lake
This talk will review data engineering pipeline processes for transforming your data through different quality levels. Pipelines commonly use tables that correspond to different quality levels, progressively adding structure to the data, from data ingestion (“Bronze” tables) to transformation/feature engineering (“Silver” tables) to machine learning training or prediction (“Gold” tables). This “multi-hop” architecture allows data engineers to build a pipeline that begins with raw data as a “single source of truth” from which everything flows. In this session, we’ll demonstrate how to build a scalable data engineering pipeline using Delta Lake.
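As a sketch of the multi-hop idea, the hops below chain Bronze, Silver, and Gold Delta tables together; the paths, column names, and transformations are illustrative, not the session’s actual notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder()
  .appName("multi-hop-sketch")
  .getOrCreate()

// Bronze: land the raw data as-is, the single source of truth
spark.read.json("/data/raw/events")   // placeholder source
  .write.format("delta").mode("append").save("/delta/bronze/events")

// Silver: clean and structure the raw records
spark.read.format("delta").load("/delta/bronze/events")
  .filter(col("event_type").isNotNull)
  .dropDuplicates("event_id")
  .write.format("delta").mode("overwrite").save("/delta/silver/events")

// Gold: aggregate into a table ready for ML training or reporting
spark.read.format("delta").load("/delta/silver/events")
  .groupBy(col("event_type"), to_date(col("event_time")).as("day"))
  .count()
  .write.format("delta").mode("overwrite").save("/delta/gold/daily_event_counts")
```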
What’s Next
Check out the full list of sessions at Spark+AI Summit Europe 2019, including such tracks as Architecture, Developer, Data & ML Use Cases, and more.