Streaming Data Pipelines and Optimization
OVERVIEW
EXPERIENCE | In Person
---|---
TYPE | Breakout
TRACK | Data Engineering and Streaming
INDUSTRY | Retail and CPG - Food
TECHNOLOGIES | Apache Spark, Delta Lake, ETL
SKILL LEVEL | Intermediate
DURATION | 40 min
In this session, we delve into streaming data pipelines and the optimizations we made to the foundational large datasets that feed them. Through collaborative efforts with Databricks, we enhanced processing speed and efficiency. Key achievements include a significant increase in data processing speed and an exponential improvement in streaming capability while handling large volumes of daily events, along with reduced data backlog and latency. We discuss storage optimization techniques, including file size tuning, deletion vectors, table reorganization, and auto-compaction, as well as other compute optimization strategies. We also address the small file problem, which hindered query reads and data processing; by tuning file sizes and implementing data clean-up processes, we reduced storage needs and improved performance. We share insights into these technical improvements, the challenges we faced, and the strategies we used to overcome them.
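The storage techniques named above (file size tuning, deletion vectors, auto-compaction, and table reorganization) correspond to standard Delta Lake table properties and maintenance commands. A minimal sketch in Databricks SQL, assuming a hypothetical `events` table — the property values shown are illustrative defaults, not the speaker's actual settings:

```sql
-- Enable write-time compaction and optimized writes to avoid
-- producing many small files (the "small file problem").
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.autoCompact'   = 'true',
  'delta.autoOptimize.optimizeWrite' = 'true',
  -- Deletion vectors mark deleted rows without rewriting files.
  'delta.enableDeletionVectors'      = 'true',
  -- File size tuning: target size for compacted files.
  'delta.targetFileSize'             = '128mb'
);

-- Table reorganization: compact existing small files and cluster
-- by a frequently filtered column (event_date is hypothetical).
OPTIMIZE events ZORDER BY (event_date);

-- Data clean-up: remove files no longer referenced by the table,
-- reducing storage needs (default retention threshold is 7 days).
VACUUM events RETAIN 168 HOURS;
```

`OPTIMIZE` and `VACUUM` are typically run as scheduled maintenance jobs rather than inline in the streaming pipeline, so compaction does not contend with ingest.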
SESSION SPEAKERS
Amogh Antarkar
Senior Software Engineer
H-E-B