
Training MoEs at Scale with PyTorch and Databricks

Mixture-of-Experts (MoE) has emerged as a promising LLM architecture for efficient training and inference. MoE models like DBRX, which use multiple expert...
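
The core idea behind the "multiple experts" phrasing is that each token is routed to a small subset of expert feed-forward networks rather than to one large one. The sketch below is a minimal plain-PyTorch illustration of top-k expert routing only; it is not DBRX's or MegaBlocks' implementation, and the layer sizes, expert count, and top_k value are placeholders chosen for the example.

```python
# Minimal sketch of a top-k routed Mixture-of-Experts layer in plain PyTorch.
# Illustrates the general pattern (a router picks a few expert MLPs per token);
# real systems replace the dense loop below with grouped/block-sparse GEMMs.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TinyMoE(nn.Module):
    def __init__(self, d_model=64, d_hidden=128, num_experts=4, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, d_model)
        scores = self.router(x)                 # (tokens, num_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # normalize over the chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e           # tokens whose k-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, k].unsqueeze(-1) * expert(x[mask])
        return out


tokens = torch.randn(8, 64)
print(TinyMoE()(tokens).shape)  # torch.Size([8, 64])
```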

Building DBRX-class Custom LLMs with Mosaic AI Training

We recently introduced DBRX: an open, state-of-the-art, general-purpose LLM. DBRX was trained, fine-tuned, and evaluated using Mosaic AI Training, scaling training to...

Bringing MegaBlocks to Databricks

At Databricks, we’re committed to building the most efficient and performant training tools for large-scale AI models. With the recent release of DBRX...

Turbocharged Training: Optimizing the Databricks Mosaic AI Stack With FP8

At Databricks, we believe that the best companies in the world, in every sector, will have AI-powered systems that are trained and customized...

How We Trained Stable Diffusion for Less than $50k (Part 3)

In our previous blog post, we showed how we used the MosaicML platform, Streaming datasets, and the Composer library to train a Stable...

Training Stable Diffusion from Scratch for <$50k with MosaicML (Part 2)

We've replicated Stable Diffusion 2 for less than $50k, and we've open-sourced the training code so you can too! This is a 3x...

Farewell, CUDA OOM: Automatic Gradient Accumulation

June 23, 2022 by Mihir Patel and Erica Ji Yuen
With automatic gradient accumulation, Composer lets users seamlessly change GPU types and number of GPUs without having to worry about batch size. CUDA...
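
The mechanism being automated is standard gradient accumulation: split the optimizer-level batch into microbatches, accumulate gradients across them, and take one optimizer step. The sketch below shows that underlying mechanism in plain PyTorch; it is not Composer's API, and the model, data, and microbatch count are placeholders (Composer's automatic version chooses the microbatch count for you, retrying when a CUDA OOM occurs).

```python
# Minimal sketch of gradient accumulation in plain PyTorch: smaller microbatches
# reduce peak memory while the effective batch size stays the same.
import torch
import torch.nn as nn

model = nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

batch_x = torch.randn(64, 32)       # one optimizer-level batch (placeholder data)
batch_y = torch.randn(64, 1)
num_microbatches = 4                # what automatic grad accum would pick for you

optimizer.zero_grad()
for micro_x, micro_y in zip(batch_x.chunk(num_microbatches), batch_y.chunk(num_microbatches)):
    loss = loss_fn(model(micro_x), micro_y) / num_microbatches  # average over microbatches
    loss.backward()                 # gradients accumulate in .grad
optimizer.step()                    # one step for the full effective batch
```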