Session

Scaling Data Engineering Pipelines: Preparing Credit Card Transactions Data for Machine Learning

Overview

Experience	In Person
Type	Breakout
Track	Data Engineering and Streaming
Industry	Enterprise Technology, Financial Services
Technologies	Apache Spark, Delta Lake, Databricks Workflows
Skill Level	Intermediate
Duration	40 min

We discuss two real-world use cases in big data engineering, focusing on building stable pipelines and managing storage at petabyte scale. The first use case covers the implementation of Delta Lake to optimize data pipelines, which resulted in an 80% reduction in query time and a 70% reduction in storage space. The second use case demonstrates how the Workflows ‘ForEach’ operator can execute compute-intensive pipelines across multiple clusters, cutting processing time from months to days. The approach relies on a reusable design pattern that isolates notebooks into units of work, so that data scientists can develop and test them independently.
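The unit-of-work idea in the second use case can be illustrated with a small, self-contained sketch. This is not the speakers' implementation and does not call any Databricks API: `month_units`, `process_unit`, and `for_each` are hypothetical names, the monthly split is an assumed partitioning scheme, and a local thread pool stands in for the fan-out that a ‘ForEach’ task would perform across clusters.

```python
from concurrent.futures import ThreadPoolExecutor
from datetime import date, timedelta


def month_units(start: date, end: date) -> list[dict]:
    """Split the half-open range [start, end) into one parameter dict per month."""
    units = []
    cursor = start
    while cursor < end:
        # Jump to the first day of the following month.
        next_month = (cursor.replace(day=1) + timedelta(days=32)).replace(day=1)
        stop = min(next_month, end)
        units.append({"start": cursor.isoformat(), "end": stop.isoformat()})
        cursor = stop
    return units


def process_unit(params: dict) -> dict:
    """Hypothetical stand-in for one notebook run.

    It depends only on its own parameters, so it can be developed and
    tested in isolation from the other units.
    """
    days = (date.fromisoformat(params["end"]) - date.fromisoformat(params["start"])).days
    return {"start": params["start"], "days": days}


def for_each(units: list[dict], worker, concurrency: int = 4) -> list[dict]:
    """Run the worker once per unit with bounded concurrency (local analogue only)."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return list(pool.map(worker, units))


if __name__ == "__main__":
    units = month_units(date(2024, 1, 15), date(2024, 3, 10))
    for result in for_each(units, process_unit):
        print(result)
```

Because each unit carries its own parameters, a single unit can be run and checked on its own (for example, `process_unit(units[0])`) before the full list is fanned out.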

Session Speakers

Brandon DeShon

Luke Garzia