Foundation Model Serving

Serve state-of-the-art foundation models for both real-time and batch inference workloads. You can build applications on high-quality generative AI models without maintaining your own model deployment.

* Displayed pricing does not guarantee product availability in a given region. For product availability, see the availability pages for AWS, Azure, GCP, and SAP.
1. Azure Databricks, as a first-party service on Microsoft Azure, offers unified billing and support by Microsoft. The Premium tier on Azure Databricks corresponds to the Enterprise tier on AWS and GCP.
2. Hourly pricing is charged in per-minute increments.
3. Throughput in a single unit of Provisioned Throughput (PT) capacity varies by model and query shape (input vs. output tokens). Use the GenAI Calculator to estimate workload-specific throughput and total cost.

Foundation Model Serving DBU rates

Model	Pay-Per-Token: DBU / 1M input tokens	Pay-Per-Token: DBU / 1M output tokens	Provisioned Throughput: DBU / hour (entry capacity)¹	Provisioned Throughput: DBU / hour (scaling capacity)²
Llama 4 Maverick	7.143	21.429	85.714	85.714
Llama 3.3 70B	7.143	21.429	85.714	342.857
Qwen 3 Next 80B	2.143	17.143	78.571	78.571
Qwen 3.5 122B	3.143	31.429	85.714	85.714
GPT OSS 120B	2.143	8.571	71.429	71.429
Gemma 3 12B	2.143	7.143	71.429	71.429
Llama 3.1 8B	2.143	6.429	53.571	106.000
GPT OSS 20B	1.000	4.286	53.571	53.571
Llama 3.2 3B	n/a	n/a	46.429	92.857
Llama 3.2 1B	n/a	n/a	42.857	85.714
Qwen 3 0.6B Embedding	0.286	n/a	25.000	25.000
GTE	1.857	n/a	20.000	20.000
BGE Large	1.429	n/a	24.000	24.000
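
To turn the pay-per-token rates in the table into a DBU estimate, scale each rate by the token count in millions. The Python sketch below illustrates that arithmetic for a few rows of the table. The names `PAY_PER_TOKEN_RATES` and `pay_per_token_dbus` are hypothetical helpers for illustration, not part of any Databricks API, and the result is in DBUs only because the dollar price per DBU is not listed here.

```python
# Pay-per-token rates copied from the table above:
# (DBU per 1M input tokens, DBU per 1M output tokens).
# None marks an "n/a" entry in the table.
PAY_PER_TOKEN_RATES = {
    "Llama 3.3 70B": (7.143, 21.429),
    "GPT OSS 20B": (1.000, 4.286),
    "Qwen 3 0.6B Embedding": (0.286, None),
}


def pay_per_token_dbus(model, input_tokens, output_tokens=0):
    """Estimate DBUs consumed by a pay-per-token workload (hypothetical helper)."""
    input_rate, output_rate = PAY_PER_TOKEN_RATES[model]
    dbus = input_tokens / 1_000_000 * input_rate
    if output_tokens:
        if output_rate is None:
            raise ValueError(f"{model} has no output-token rate in the table")
        dbus += output_tokens / 1_000_000 * output_rate
    return dbus


# 2M input tokens and 0.5M output tokens on Llama 3.3 70B:
# 2 * 7.143 + 0.5 * 21.429 DBUs
example_dbus = pay_per_token_dbus("Llama 3.3 70B", 2_000_000, 500_000)
```

For this example workload the estimate comes to about 25.0 DBUs. For workload-specific throughput and total cost, the GenAI Calculator mentioned in the notes remains the authoritative tool.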

¹: Entry capacity is a smaller, lower-cost PT capacity unit that gives customers a more affordable starting point. It provides proportionally reduced throughput compared to scaling capacity. It is available only on Azure and AWS in US, Canada, and Brazil regions, and only for base (not fine-tuned) models.

²: Scaling capacity is the standard PT capacity increment that can be provisioned for a model. Beyond entry capacity (available in select clouds and regions), Provisioned Throughput capacity scales up and down in increments of scaling capacity units. In clouds or regions where entry capacity is not available, the minimum PT purchase increment is one full scaling capacity unit.
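
The hourly Provisioned Throughput rates combine with per-minute billing roughly as in the Python sketch below. It assumes that partial minutes are rounded up to the next whole minute and that cost scales linearly with the number of capacity units; neither detail is stated on this page, so treat both as assumptions. The function name `provisioned_throughput_dbus` is a hypothetical helper, not a Databricks API.

```python
import math


def provisioned_throughput_dbus(dbu_per_hour, units, seconds_running):
    """Estimate DBUs for a Provisioned Throughput endpoint (hypothetical helper).

    dbu_per_hour: entry- or scaling-capacity rate from the table above.
    units: number of capacity units provisioned at that rate.
    seconds_running: how long the capacity was provisioned.
    """
    # Assumption: billing uses whole minutes, with partial minutes rounded up.
    billed_minutes = math.ceil(seconds_running / 60)
    return dbu_per_hour * units * billed_minutes / 60


# Two scaling-capacity units of Llama 3.3 70B (342.857 DBU / hour each)
# provisioned for 90 minutes: 342.857 * 2 * 90 / 60 DBUs
scaling_example = provisioned_throughput_dbus(342.857, 2, 90 * 60)

# One entry-capacity unit of the same model (85.714 DBU / hour) for 90 minutes.
entry_example = provisioned_throughput_dbus(85.714, 1, 90 * 60)
```

Under these assumptions the scaling-capacity example comes to about 1,028.6 DBUs and the entry-capacity example to about 128.6 DBUs.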

Pay as you go with a 14-day free trial or contact us for committed-use discounts or custom requirements.

