A Unified Solution for Data Management and Model Training With Apache Iceberg and Mosaic Streaming
Overview
Experience | In Person |
---|---|
Type | Breakout |
Track | Data Lakehouse Architecture and Implementation |
Industry | Media and Entertainment, Public Sector |
Technologies | Apache Iceberg, Mosaic AI |
Skill Level | Advanced |
Duration | 40 min |
This session introduces ByteDance’s challenges in data management and model training and shows how two systems address them: Magnus (an enhanced Apache Iceberg) and Byted Streaming (a customized Mosaic Streaming).
Magnus uses Iceberg’s branches and tags to manage massive datasets and checkpoints efficiently. With enhanced metadata and a custom C++ data reader, Magnus optimizes sharding, shuffling and data loading. Flexible table migration, detailed metrics and built-in full-text indexes on Iceberg tables further improve training reliability.
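To make the sharding and shuffling idea concrete, here is a minimal Python sketch of one common approach: every worker derives the same per-epoch permutation from a shared seed, then takes a strided slice of it. This is an illustration only, not Magnus's implementation (which the abstract describes as a custom C++ reader driven by enhanced Iceberg metadata). The function name `shard_for_rank` and all of its parameters are hypothetical.

```python
import random
from typing import List


def shard_for_rank(num_samples: int, world_size: int, rank: int,
                   epoch: int, seed: int = 0) -> List[int]:
    """Return the sample indices that worker `rank` reads in `epoch`.

    Hypothetical helper for illustration; not part of Magnus or Iceberg.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    indices = list(range(num_samples))
    # Seed from (seed, epoch) so every rank computes the same permutation
    # without any communication, and each epoch gets a different order.
    random.Random(seed * 1_000_003 + epoch).shuffle(indices)
    # Strided split: rank r takes positions r, r + world_size, ...
    # Shards are disjoint and together cover every sample exactly once.
    return indices[rank::world_size]


if __name__ == "__main__":
    for r in range(3):
        print(r, shard_for_rank(num_samples=10, world_size=3, rank=r, epoch=0))
```

Because the permutation depends only on the seed and the epoch, a restarted job can recompute exactly which samples each worker was assigned, which is one reason deterministic sharding pairs well with checkpointed training.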
When training on ultra-large datasets, ByteDance faced scalability and performance issues. Given Mosaic Streaming's scalability in distributed training and its clean code structure, the team chose it and customized it to resolve challenges such as slow startup, high resource consumption, and limited data source compatibility.
In this session, we will explore Magnus and Byted Streaming, discuss their enhancements and demonstrate how they enable efficient and robust distributed training.
Session Speakers
Zilong Zhou, ByteDance
Jia Wei, ByteDance