Session

Scaling Data Engineering Pipelines: Preparing Credit Card Transactions Data for Machine Learning

Overview

Experience: In Person
Type: Breakout
Track: Data Engineering and Streaming
Industry: Enterprise Technology, Financial Services
Technologies: Apache Spark, Delta Lake, Databricks Workflows
Skill Level: Intermediate
Duration: 40 min

We discuss two real-world big data engineering use cases, focusing on building stable pipelines and managing storage at petabyte scale. The first use case shows how adopting Delta Lake to optimize data pipelines produced an 80% reduction in query time and a 70% reduction in storage space. The second use case demonstrates how the Workflows ‘ForEach’ operator runs compute-intensive pipelines across multiple clusters, cutting processing time from months to days. This approach relies on a reusable design pattern that isolates notebooks into units of work, so that data scientists can develop and test independently.
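The unit-of-work pattern in the second use case can be sketched roughly as follows. This is a minimal illustration and not the speakers' implementation: the function name, the notebook path, the `unit` parameter name, and the monthly partitions are all hypothetical. The field names (`for_each_task`, `inputs`, `concurrency`, `{{input}}`) follow the Databricks Jobs API For Each task as we understand it and should be checked against the current documentation. Cluster configuration is left out.

```python
import json


def build_for_each_job(job_name, notebook_path, work_units, concurrency=4):
    """Build a Jobs API-style payload with one For Each task.

    Each element of work_units becomes one run of the nested notebook task,
    so the notebook only needs to know how to process a single unit.
    All names and values here are illustrative assumptions.
    """
    if not work_units:
        raise ValueError("work_units must not be empty")
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "process_all_units",
                "for_each_task": {
                    # The iteration inputs are passed as a JSON-encoded array.
                    "inputs": json.dumps(work_units),
                    # Upper bound on how many iterations run at the same time.
                    "concurrency": concurrency,
                    "task": {
                        "task_key": "process_one_unit",
                        "notebook_task": {
                            "notebook_path": notebook_path,
                            # {{input}} is replaced with the current element.
                            "base_parameters": {"unit": "{{input}}"},
                        },
                    },
                },
            }
        ],
    }


# Hypothetical example: one unit of work per month of transactions.
months = ["2023-01", "2023-02", "2023-03"]
payload = build_for_each_job(
    "prepare_transactions", "/Shared/prepare_one_month", months, concurrency=2
)
print(json.dumps(payload, indent=2))
```

Because the notebook receives exactly one unit through its `unit` parameter, a data scientist could run it by hand against a single month while developing, and the same notebook can later be fanned out over every month by the job.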

Session Speakers

Brandon DeShon

Director, Data Scientist
Mastercard

Luke Garzia

Lead Data Engineer
Mastercard