Session

A Unified Solution for Data Management and Model Training With Apache Iceberg and Mosaic Streaming

Overview

Experience: In Person
Type: Breakout
Track: Data Lakehouse Architecture and Implementation
Industry: Media and Entertainment, Public Sector
Technologies: Apache Iceberg, Mosaic AI
Skill Level: Advanced
Duration: 40 min

This session introduces the challenges ByteDance faces in data management and model training, and shows how it addresses them with Magnus (an enhanced Apache Iceberg) and Byted Streaming (a customized Mosaic Streaming).

Magnus uses Iceberg’s branches and tags to manage massive datasets and checkpoints efficiently. With enhanced metadata and a custom C++ data reader, Magnus optimizes sharding, shuffling and data loading. Flexible table migration, detailed metrics and built-in full-text indexes on Iceberg tables further improve training reliability.
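
The sharding and shuffling step can be pictured with a small sketch. It is an illustration only, not Magnus's implementation, which the abstract describes as built on enhanced metadata and a custom C++ data reader. The function name `shard_and_shuffle` and its parameters are hypothetical. The idea shown is a common one: every rank derives the same permutation from a shared seed and epoch, then takes a strided slice, so shards are disjoint without any coordination between workers.

```python
import random


def shard_and_shuffle(num_samples, world_size, rank, seed, epoch):
    """Return the sample indices that one rank reads in one epoch.

    Hypothetical helper for illustration; not part of Magnus or Mosaic Streaming.
    """
    if not 0 <= rank < world_size:
        raise ValueError("rank must be in [0, world_size)")
    indices = list(range(num_samples))
    # All ranks seed an identical generator, so they agree on the
    # permutation without exchanging any messages.
    random.Random(seed * 1_000_003 + epoch).shuffle(indices)
    # A strided slice gives disjoint shards whose sizes differ by at most one.
    return indices[rank::world_size]


if __name__ == "__main__":
    for rank in range(4):
        print(rank, shard_and_shuffle(10, world_size=4, rank=rank, seed=0, epoch=0))
```

Changing `epoch` reshuffles the data while keeping the run reproducible, which is one reason seed-based schemes like this are popular when a job has to resume from a checkpoint.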

When training on ultra-large datasets, ByteDance ran into scalability and performance issues. Because Mosaic Streaming scales well in distributed training and has a clean code structure, the team adopted and customized it to address slow startup, high resource consumption, and limited data source compatibility.

In this session, we will explore Magnus and Byted Streaming, discuss their enhancements and demonstrate how they enable efficient and robust distributed training.

Session Speakers

Zilong Zhou, ByteDance

Jia Wei, ByteDance