SESSION

Supercharging AI Training with Mosaic Streaming and Lance Columnar Format


OVERVIEW

EXPERIENCE: In Person
TYPE: Breakout
TRACK: Generative AI
TECHNOLOGIES: Databricks Experience (DBX), AI/Machine Learning, GenAI/LLMs
SKILL LEVEL: Intermediate
DURATION: 40 min

AI models are becoming larger and increasingly multimodal. Under the hood, large-scale training datasets are becoming more costly to store and more complicated to manage. Solving this problem requires new, purpose-built data infrastructure. Mosaic StreamingDataset is designed to make multi-node, distributed training of large models easy. As a drop-in replacement for your PyTorch dataset, StreamingDataset integrates into your existing training workflows. Lance is a columnar format, an alternative to Parquet for ML workloads, that delivers 100x better random access performance, which is critical for shuffling, sampling, vector search, debugging, and more. By integrating StreamingDataset with Lance, developers can stream data directly from object storage with much higher performance and lower cost. This combination delivers higher-quality training datasets and much faster training loops. In this talk we’ll take a peek under the hood of Mosaic StreamingDataset and Lance, discuss how Lance differs from Parquet, and show you how to build a simple training pipeline.

SESSION SPEAKERS

Chang She

CEO / Co-founder
LanceDB

Karan Jariwala

Engineering Manager
Databricks