Supercharging AI Training with Mosaic Streaming and Lance Columnar Format
OVERVIEW
EXPERIENCE | In Person |
---|---|
TYPE | Breakout |
TRACK | Generative AI |
TECHNOLOGIES | Databricks Experience (DBX), AI/Machine Learning, GenAI/LLMs |
SKILL LEVEL | Intermediate |
DURATION | 40 min |
AI models are becoming larger and increasingly multi-modal. Under the hood, large-scale training datasets are becoming more costly to store and more complicated to manage. Solving this problem requires new, purpose-built data infrastructure. Mosaic StreamingDataset is designed to make multi-node, distributed training of large models easy. As a drop-in replacement for your PyTorch dataset, StreamingDataset integrates seamlessly into your existing training workflows. The Lance columnar format is an alternative to Parquet for ML workloads that delivers 100x better random access performance, which is critical for shuffling, sampling, vector search, debugging, and more. By integrating StreamingDataset with Lance, developers can stream data directly from object storage with much higher performance and lower cost. This combination delivers higher quality training datasets and much faster training loops. In this talk we’ll take a peek under the hood of Mosaic StreamingDataset and Lance, discuss how Lance differs from Parquet, and show you how to build a simple training pipeline.
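To make the link between random access and shuffling concrete, here is a minimal sketch using only the Python standard library. It is not how StreamingDataset or Lance is implemented; the fixed-width record layout and every name in it (`RECORD`, `RandomAccessDataset`, `shuffled_epoch`, `build_demo`) are hypothetical and chosen for illustration. It shows a map-style dataset (the `__len__` / `__getitem__` protocol that PyTorch datasets follow) where any sample can be fetched by seeking directly to its offset, so a shuffled epoch never has to scan the file.

```python
import os
import random
import struct
import tempfile

# Hypothetical fixed-width sample layout: (int32 label, float32 value).
RECORD = struct.Struct("<if")


def write_samples(path, samples):
    """Write samples back to back, so sample i starts at byte i * RECORD.size."""
    with open(path, "wb") as f:
        for label, value in samples:
            f.write(RECORD.pack(label, value))


class RandomAccessDataset:
    """Map-style dataset exposing __len__ and __getitem__."""

    def __init__(self, path):
        self.path = path
        self.num_samples = os.path.getsize(path) // RECORD.size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, index):
        if not 0 <= index < self.num_samples:
            raise IndexError(index)
        with open(self.path, "rb") as f:
            f.seek(index * RECORD.size)  # jump straight to the sample, no scan
            return RECORD.unpack(f.read(RECORD.size))


def shuffled_epoch(dataset, seed):
    """Yield every sample exactly once, in a seeded random order."""
    order = list(range(len(dataset)))
    random.Random(seed).shuffle(order)
    for index in order:
        yield dataset[index]


def build_demo(num_samples=8):
    """Create a small temporary sample file and return a dataset over it."""
    path = os.path.join(tempfile.mkdtemp(), "samples.bin")
    write_samples(path, [(i, i * 0.5) for i in range(num_samples)])
    return RandomAccessDataset(path)
```

Calling `list(shuffled_epoch(build_demo(), seed=0))` visits all eight samples in a reproducible random order. Each lookup costs one seek regardless of where the sample sits in the file, which is the property that makes per-epoch shuffling and sampling cheap.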
SESSION SPEAKERS
Chang She
CEO / Co-founder
LanceDB
Karan Jariwala
Engineering Manager
Databricks