
Introducing Simple, Fast, and Scalable Batch LLM Inference on Mosaic AI Model Serving

Ahmed Bilal
Youngbin Kim
Ankit Mathur

Over the years, organizations have amassed a vast amount of unstructured text data—documents, reports, and emails—but extracting meaningful insights has remained a challenge. Large Language Models (LLMs) now offer a scalable way to analyze this data, with batch inference as the most efficient solution. However, many tools still focus on online inference, leaving a gap for better batch processing capabilities.

Today, we’re excited to announce a simpler, faster, and more scalable way to apply LLMs to large documents. No more exporting data as CSV files to unmanaged locations—now you can run batch inference directly within your workflows, with full governance through Unity Catalog. Simply write the SQL query below and execute it in a notebook or workflow.

Using ai_query, you can now run batch inference at high scale, with fast processing of even the largest datasets. The interface supports all AI models, so you can securely apply LLMs, traditional AI models, or compound AI systems to analyze your data at scale.

SELECT ai_query('llama-70b', 'Summarize this call transcript: ' || transcript) AS summary_analysis
FROM call_center_transcripts;
Figure 1: A batch inference job of any scale - millions or billions of tokens - is defined using the same, familiar SQL interface
"With Databricks, we processed over 400 billion tokens by running a multi-modal batch pipeline for document metadata extraction and post-processing. Working directly where our data resides with familiar tools, we ran the unified workflow without exporting data or managing massive GPU infrastructure, quickly bringing generative AI value directly to our data. We are excited to use batch inference for even more opportunities to add value for our customers at Scribd, Inc." - Steve Neola, Senior Director at Scribd

What are people doing with Batch LLM Inference?

Batch inference enables businesses to apply LLMs to large datasets all at once, rather than one at a time, as with real-time inference. Processing data in bulk provides cost efficiency, faster processing, and scalability. Some common ways businesses are using batch inference include:

  • Information Extraction: Extract key insights or classify topics from large text corpora, supporting data-driven decisions from documents like reviews or support tickets (see the sketch after this list).
  • Data Transformation: Translate, summarize, or convert unstructured text into structured formats, improving data quality and preparation for downstream tasks.
  • Bulk Content Generation: Automatically create text for product descriptions, marketing copy, or social media posts, enabling businesses to scale content production effortlessly.
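
For example, here is a minimal sketch of the information-extraction pattern, assuming a hypothetical support_tickets table (with ticket_id and ticket_text columns) and the same llama-70b endpoint used throughout this post:

-- Classify the topic of each support ticket (table, column, and endpoint names are illustrative)
SELECT
    ticket_id,
    ai_query(
        'llama-70b',
        'Classify this support ticket as billing, shipping, or product issue. Reply with only the label: '
            || ticket_text
    ) AS topic
FROM support_tickets;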

Current Batch Inference Challenges

Existing batch inference approaches present several challenges, such as:

  • Complex Data Handling: Existing solutions often require manual data export and upload, leading to higher operational costs and compliance risks.
  • Fragmented Workflows: Most production batch workflows involve multiple steps, like preprocessing, multi-model inference, and post-processing. This often requires stitching together various tools, slowing execution and increasing the risk of errors.
  • Performance and Cost Bottlenecks: Large-scale inference requires specialized infrastructure and teams for configuration and optimization, limiting analysts' and data scientists' ability to self-serve and scale insights.

Batch LLM Inference on Mosaic AI Model Serving

"With Databricks, we could automate tedious manual tasks by using LLMs to process one million+ files daily for extracting transaction and entity data from property records. We exceeded our accuracy goals by fine-tuning Meta Llama3 8b and, using Mosaic AI Model Serving, we scaled this operation massively without the need to manage a large and expensive GPU fleet." - Prabhu Narsina, VP Data and AI, First American


Effortless AI on Governed Data

Mosaic AI allows you to perform batch LLM inference directly where your governed data resides, with no data movement or preparation needed. Running batch LLM inference is as simple as creating an endpoint with any AI model and running a SQL query (as shown in the figure). You can deploy any AI model (base, fine-tuned, or traditional) and execute SQL functions from any development environment on Databricks, whether interactively in the SQL editor or a notebook, or scheduled through Workflows and Delta Live Tables (DLT).
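
As a minimal sketch, assuming a Unity Catalog table main.support.call_center_transcripts (the catalog and schema names are illustrative) and the same llama-70b endpoint, the query below materializes batch inference results into another governed table and can be run interactively or on a schedule:

-- Write batch inference results back to a governed Unity Catalog table
CREATE OR REPLACE TABLE main.support.transcript_summaries AS
SELECT
    call_id,
    ai_query('llama-70b', 'Summarize this call transcript: ' || transcript) AS summary_analysis
FROM main.support.call_center_transcripts;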


Run Fast Inference on Millions of Rows 

This release introduces multiple infrastructure improvements, enabling you to process millions of rows quickly and cost-effectively. The infrastructure scales automatically, adjusting resources to handle even the largest workloads efficiently. Additionally, built-in fault tolerance with automatic retries allows you to run large workflows confidently, seamlessly handling any errors along the way.

Real-world use cases require preprocessing and post-processing, with LLM inference often just one part of a broader workflow. Instead of piecing together multiple tools and APIs, Databricks enables you to execute the entire workflow on a single platform, reducing complexity and saving valuable time. Below is an example of how to run an end-to-end workflow with the new solution.

Figure 2: An end-to-end batch workflow - preprocessing, LLM inference, and post-processing - run with the new solution

Or, if you prefer, you can use SQL's common table expressions (CTEs) to blend these steps into a single query.

-- Step 1: Preprocessing
WITH cleaned_data AS (
    SELECT LOWER(regexp_replace(transcript_raw_text, '[^a-zA-Z\\s]', '')) AS transcript_text, call_id, call_timestamp
    FROM call_center_transcripts
),

-- Step 2: LLM Inference
inference_result AS (
    SELECT call_id, call_timestamp,
        ai_query('llama-70b', 'Summarize this call transcript: ' || transcript_text) AS summary_analysis
    FROM cleaned_data
),

-- Step 3: Post-processing
final_result AS (
    SELECT call_id, call_timestamp, summary_analysis,
        CASE WHEN summary_analysis LIKE '%angry%' THEN 'High Risk'
             WHEN summary_analysis LIKE '%upset%' THEN 'Medium Risk' ELSE 'Low Risk' END AS risk_level,
        CASE WHEN summary_analysis LIKE '%refund%' THEN 'Refund Request'
             WHEN summary_analysis LIKE '%complaint%' THEN 'Complaint' ELSE 'General Inquiry' END AS action_required
    FROM inference_result
)

-- Retrieve Results
SELECT call_id, call_timestamp, summary_analysis, risk_level, action_required
FROM final_result
WHERE risk_level IN ('High Risk', 'Medium Risk');

Getting Started with Batch LLM Inference

  • Explore our getting started guide for step-by-step instructions on batch LLM inference.
  • Watch the demo.
  • Discover other built-in SQL AI functions that allow you to apply AI directly to your data; a short example follows below.
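
For instance, here is a short sketch using one of the built-in SQL AI functions, ai_analyze_sentiment, assuming a hypothetical product_reviews table with review_id and review_text columns:

-- Score sentiment for each review with a built-in SQL AI function (table and column names are illustrative)
SELECT
    review_id,
    ai_analyze_sentiment(review_text) AS sentiment
FROM product_reviews;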