
Model Serving

Unified deployment and governance for all AI models


Introduction

Mosaic AI Model Serving is a unified service for deploying, governing, querying and monitoring AI models, whether they are fine-tuned or pre-deployed by Databricks (such as Meta Llama 3, DBRX or BGE) or hosted by another provider (such as Azure OpenAI, AWS Bedrock, AWS SageMaker or Anthropic). This unified approach makes it easy to experiment with models from any cloud or provider and productionize the best candidate for your real-time application. With batch inference support, you can efficiently run AI inference on large datasets, complementing real-time serving for comprehensive model evaluation. Once models are deployed, you can A/B test them and monitor their quality on live production data. Model Serving also offers pre-deployed models such as Llama 3 70B, so you can jump-start AI applications like retrieval augmented generation (RAG) with pay-per-token access, or use provisioned compute for guaranteed throughput.
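For example, a deployed endpoint can be queried over a simple REST API. The sketch below assumes a workspace URL and personal access token in environment variables, plus a pay-per-token Llama 3 endpoint whose name may differ in your workspace:

```python
import os
import requests

# Assumes workspace URL and a personal access token in environment variables.
host = os.environ["DATABRICKS_HOST"]    # e.g. "https://<workspace>.cloud.databricks.com"
token = os.environ["DATABRICKS_TOKEN"]

# Hypothetical pay-per-token endpoint name; check the Serving UI in your workspace.
endpoint = "databricks-meta-llama-3-70b-instruct"

resp = requests.post(
    f"{host}/serving-endpoints/{endpoint}/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "messages": [
            {"role": "user", "content": "Explain retrieval augmented generation in one sentence."}
        ],
        "max_tokens": 128,
    },
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```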

Simplified deployment for all AI models

Deploy any model type, from pretrained open source models to custom models built on your own data, on both CPUs and GPUs. Automated container builds and infrastructure management reduce maintenance costs and speed up deployment, so you can focus on building your AI projects and delivering value for your business faster.
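As an illustration, here is a minimal sketch of deploying a registered custom model with the MLflow deployments client. The Unity Catalog model name and endpoint name are hypothetical, and the config fields should be checked against the current serving-endpoints API reference:

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Hypothetical Unity Catalog model and endpoint names; the config schema
# mirrors the serving-endpoints REST API and is worth double-checking there.
client.create_endpoint(
    name="churn-classifier-endpoint",
    config={
        "served_entities": [
            {
                "entity_name": "main.default.churn_classifier",  # registered model
                "entity_version": "1",
                "workload_size": "Small",
            }
        ]
    },
)
```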

Unified management for all models

Manage all models in one place: custom ML models such as PyFunc, scikit-learn and LangChain models; foundation models (FMs) on Databricks, such as Llama 3, MPT and BGE; and foundation models hosted elsewhere, such as ChatGPT, Claude 3, Cohere and Stable Diffusion. Model Serving makes every model accessible through a unified user interface and API, whether it is hosted by Databricks or by another model provider on Azure or AWS.
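To make this concrete, the following sketch registers an externally hosted model (Anthropic's Claude, in this example) behind a serving endpoint and queries it through the same client used for Databricks-hosted models. The endpoint name and secret reference are hypothetical, and the external-model config fields should be verified against the current docs:

```python
from mlflow.deployments import get_deploy_client

client = get_deploy_client("databricks")

# Register an external model behind an endpoint. The secret reference and
# provider config block are assumptions; consult the external-models docs.
client.create_endpoint(
    name="claude-chat",
    config={
        "served_entities": [
            {
                "external_model": {
                    "name": "claude-3-5-sonnet-20240620",
                    "provider": "anthropic",
                    "task": "llm/v1/chat",
                    "anthropic_config": {
                        "anthropic_api_key": "{{secrets/my_scope/anthropic_key}}"
                    },
                }
            }
        ]
    },
)

# Queried exactly like a Databricks-hosted endpoint:
print(client.predict(
    endpoint="claude-chat",
    inputs={"messages": [{"role": "user", "content": "Hello!"}]},
))
```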

Effortless batch inference

Model Serving enables efficient AI inference on large datasets, supporting all data types and models in a serverless environment. You can seamlessly integrate with Databricks SQL, Notebooks and Workflows to apply AI models across vast amounts of data in one streamlined operation. Enhance data processing, generate embeddings and evaluate models — all without complex rework.
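For instance, the built-in ai_query SQL function applies a serving endpoint across a table. A notebook sketch, with hypothetical table, column and endpoint names:

```python
# Runs in a Databricks notebook, where `spark` is predefined.
# Table, column and endpoint names are hypothetical.
df = spark.sql("""
    SELECT
      review_id,
      ai_query(
        'databricks-meta-llama-3-70b-instruct',
        CONCAT('Classify this review as positive or negative: ', review_text)
      ) AS sentiment
    FROM main.default.customer_reviews
""")
df.write.mode("overwrite").saveAsTable("main.default.review_sentiment")
```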

Governance built-in

Integrate with Mosaic AI Gateway to meet stringent security and advanced governance requirements. You can enforce proper permissions, monitor model quality, set rate limits, and track lineage across all models, whether they are hosted by Databricks or by any other model provider.
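A sketch of what attaching gateway settings to an endpoint might look like over the REST API; the route and payload schema here are assumptions to verify against the AI Gateway documentation:

```python
import os
import requests

host = os.environ["DATABRICKS_HOST"]
token = os.environ["DATABRICKS_TOKEN"]

# Enable usage tracking and a per-user rate limit on a (hypothetical) endpoint.
resp = requests.put(
    f"{host}/api/2.0/serving-endpoints/claude-chat/ai-gateway",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "usage_tracking_config": {"enabled": True},
        "rate_limits": [
            {"calls": 100, "key": "user", "renewal_period": "minute"}
        ],
    },
)
resp.raise_for_status()
```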

Data-centric models

Accelerate deployments and reduce errors through deep integration with the Data Intelligence Platform. You can easily host a variety of generative AI models, whether augmented with retrieval (RAG) or fine-tuned on your enterprise data. Model Serving offers automated lookups, monitoring and governance across the entire AI lifecycle.

Cost-effective

Serve models as a low-latency API on a highly available serverless service with both CPU and GPU support. Effortlessly scale from zero to meet your most critical needs, then back down as requirements change. You can get started quickly with one or more pre-deployed models on a pay-per-token basis (on demand, with no commitments), or pay for provisioned compute to guarantee throughput. Databricks takes care of infrastructure management and maintenance, so you can focus on delivering business value.
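For example, scale-to-zero is a per-endpoint setting. A sketch using the Databricks SDK, reusing the hypothetical endpoint from above; field names follow the serving-endpoints API and are worth verifying against the SDK docs:

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ServedEntityInput

w = WorkspaceClient()  # reads credentials from the environment

# Update the (hypothetical) endpoint to scale down to zero when idle.
w.serving_endpoints.update_config(
    name="churn-classifier-endpoint",
    served_entities=[
        ServedEntityInput(
            entity_name="main.default.churn_classifier",
            entity_version="1",
            workload_size="Small",
            scale_to_zero_enabled=True,  # scale to zero when idle; cold start on next request
        )
    ],
)
```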

Ready to get started?