Fast Copy-On-Write in Apache Parquet for Data Lakehouse Upserts
OVERVIEW
EXPERIENCE | In Person
TYPE | Breakout
TRACK | Data Lakehouse Architecture
INDUSTRY | Retail and CPG - Food, Travel and Hospitality
TECHNOLOGIES | Apache Spark, Delta Lake, SQL Analytics / BI / Visualizations
SKILL LEVEL | Intermediate
DURATION | 40 min
Efficient ACID upserts on tables are essential for today’s Lakehouse. Important use cases, such as the GDPR Right to be Forgotten and Change Data Capture, rely heavily on them. While Delta Lake, Apache Iceberg, and Apache Hudi are widely adopted, upserts become slow as data volume scales up, particularly in copy-on-write mode. At times, slow upserts become a blocker to meeting compliance deadlines. We introduced partial copy-on-write within Parquet, using a row-level index to efficiently skip unnecessary column chunks. "Partial" here means performing copy-on-write only for the chunks that need to change while skipping the unrelated ones. In general, only a small portion of a file needs to be updated, so most of its data chunks can be skipped. We have observed speedups of up to 20x compared to existing upserts.
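To make the idea concrete, the following is a minimal sketch of partial copy-on-write at the row-group level, not the speakers' implementation. It uses pyarrow, and the `updates`, `key_col`, and `value_col` names are hypothetical. The approach described in the talk operates on column chunks inside Parquet with a row-level index, so untouched chunks are carried over without being decoded and re-encoded; this sketch re-reads untouched row groups only for simplicity.

```python
import pyarrow as pa
import pyarrow.parquet as pq

def partial_copy_on_write(src_path, dst_path, updates, key_col, value_col):
    """Rewrite src_path into dst_path, applying `updates` (key -> new value)
    to `value_col` only in the row groups that actually contain matching keys."""
    pf = pq.ParquetFile(src_path)
    with pq.ParquetWriter(dst_path, pf.schema_arrow) as writer:
        for rg in range(pf.num_row_groups):
            # Probe only the key column to decide whether this row group is affected.
            keys = pf.read_row_group(rg, columns=[key_col])[key_col].to_pylist()
            if not any(k in updates for k in keys):
                # No matching rows: a chunk-level implementation would copy the
                # raw column chunks byte-for-byte; here we simply pass the
                # row group through unchanged.
                writer.write_table(pf.read_row_group(rg))
                continue
            # Affected row group: rewrite the updated column, keep the rest.
            table = pf.read_row_group(rg)
            old_values = table[value_col].to_pylist()
            new_values = [updates.get(k, v) for k, v in zip(keys, old_values)]
            idx = table.schema.get_field_index(value_col)
            new_array = pa.array(new_values, type=table.schema.field(value_col).type)
            writer.write_table(table.set_column(idx, value_col, new_array))
```

Skipping unaffected row groups (and, in the real design, unaffected column chunks within a file) is what avoids rewriting the vast majority of the data during an upsert.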
SESSION SPEAKERS
Xinli Shang
Engineering Manager, Uber
Mingmin Chen
Director of Engineering, Uber Technologies, Inc.