SESSION

Streaming Data Pipelines and Optimization

OVERVIEW

EXPERIENCE: In Person
TYPE: Breakout
TRACK: Data Engineering and Streaming
INDUSTRY: Retail and CPG - Food
TECHNOLOGIES: Apache Spark, Delta Lake, ETL
SKILL LEVEL: Intermediate
DURATION: 40 min

In this session, we delve into streaming data pipelines and the optimizations we made on our foundational large datasets. Working with Databricks, we improved processing speed and efficiency. Key achievements include a significant increase in data processing speed and a major improvement in streaming capability while handling a large volume of daily events, along with reduced data backlog and latency. We discuss storage optimization techniques, including file size tuning, deletion vectors, table reorganization, and auto-compaction, as well as compute optimization strategies. We also addressed the small file problem, which slowed query reads and data processing: by optimizing file sizes and implementing data clean-up processes, we reduced storage needs and improved performance. We share insights into these technical improvements, the challenges we faced, and the strategies we used to overcome them.
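As a rough illustration of the storage techniques named above, the sketch below assembles the kind of Delta Lake maintenance statements that file size tuning, deletion vectors, auto-compaction, table reorganization, and clean-up usually involve on Databricks. It is not the speaker's actual configuration: the table name `events`, the 128 MB target file size, and the 168-hour retention window are assumptions chosen for the example, and the helper `maintenance_statements` is hypothetical.

```python
def maintenance_statements(
    table: str,
    target_file_size: str = "128mb",  # assumed value, tune per workload
    retain_hours: int = 168,          # assumed retention window (7 days)
) -> list[str]:
    """Build Delta Lake SQL statements for common storage maintenance.

    Hypothetical helper for illustration only; it returns SQL strings and
    does not run anything against a cluster.
    """
    # Table properties covering file size tuning, deletion vectors,
    # optimized writes, and auto-compaction.
    props = {
        "delta.targetFileSize": target_file_size,
        "delta.enableDeletionVectors": "true",
        "delta.autoOptimize.optimizeWrite": "true",
        "delta.autoOptimize.autoCompact": "true",
    }
    prop_sql = ", ".join(f"'{key}' = '{value}'" for key, value in props.items())

    return [
        f"ALTER TABLE {table} SET TBLPROPERTIES ({prop_sql})",
        # Compact existing small files into larger ones.
        f"OPTIMIZE {table}",
        # Rewrite files so rows soft-deleted via deletion vectors are purged.
        f"REORG TABLE {table} APPLY (PURGE)",
        # Remove data files no longer referenced by the table.
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]


if __name__ == "__main__":
    for statement in maintenance_statements("events"):
        print(statement)
```

In a Databricks notebook, each returned string could be passed to `spark.sql(...)`. Whether these settings help depends on table size and write patterns, so they are best treated as starting points to measure against rather than recommended values.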

SESSION SPEAKERS

Amogh Antarkar

Senior Software Engineer
H-E-B