Mosaic AI Foundation Model Serving

Serve state-of-the-art open foundation models for both real-time and batch inference workloads. This lets you quickly build applications that use high-quality generative AI models without maintaining your own model deployment.

* For regional availability: AWS, Azure
1. Showing the lowest regional price

2. Maximum provisioned throughput per band for Batch Inference workloads is ~50% higher than for real-time workloads shown in the table

3. Hourly pricing is charged on a per-minute increment

Foundation Model Serving DBU rates and Throughput

| Model | Pay-Per-Token: DBU / 1M input tokens (Global) | Pay-Per-Token: DBU / 1M output tokens (Global) | Provisioned Throughput for scaling bands [1]: DBU / hour (Global) | Provisioned Throughput for scaling bands [1]: throughput band [2] (max tokens / sec) | Provisioned Throughput for entry band (available only for base models in US, Canada, and Brazil) [3]: DBU / hour (Global) | Provisioned Throughput for entry band [3]: max tokens / sec |
|---|---|---|---|---|---|---|
| Current Models | | | | | | |
| Llama 3.1 405B | 35.714 | 142.857 | 600.000 | 3,400 | 150.000 | 850 |
| Llama 3.3 70B | 7.143 | 21.429 | 342.857 | 9,500 | 85.714 | 2,400 |
| Llama 3.1 70B | n/a | n/a | 342.857 | 9,500 | 85.714 | 2,400 |
| Llama 3.1 8B | n/a | n/a | 106.000 | 19,000 | 50.000 | 9,500 |
| Llama 3.2 3B | n/a | n/a | 92.857 | 22,000 | 46.429 | 10,900 |
| Llama 3.2 1B | n/a | n/a | 85.714 | 35,000 | 42.857 | 15,800 |
| GTE | 1.857 | n/a | 20.000 | 9,450 | 20.000 | 9,450 |
| BGE Large | 1.429 | n/a | 24.000 | 11,800 | 24.000 | 11,800 |
| Legacy Models | | | | | | |
| DBRX | 10.714 | 32.143 | 171.429 | 650 | 171.429 | 650 |
| Llama 3 70B | n/a | n/a | 212.143 | 1,000 | 212.143 | 1,000 |
| Llama 3 8B | n/a | n/a | 106.000 | 3,000 | 106.000 | 3,000 |
| Llama 2 70B | n/a | n/a | 290.800 | 1,200 | 290.800 | 1,200 |
| Llama 2 13B | n/a | n/a | 112.000 | 980 | 112.000 | 980 |
| Mixtral 8x7B | 7.143 | 14.286 | 290.857 | 5,000 | 290.857 | 5,000 |
| MPT 30B | n/a | n/a | 112.000 | 450 | 112.000 | 450 |
| MPT 7B | n/a | n/a | 20.000 | 2,450 | 20.000 | 2,450 |

1: A throughput band is the model-specific maximum throughput (tokens per second) provided at the per-hour price above. With Provisioned Throughput Serving, model throughput is provisioned in increments of the model's throughput band; for higher throughput, the customer sets an appropriate multiple of the throughput band and is charged that multiple of the per-hour price above.

2: Throughput shown is an example based on a typical real-time use case with input / output of 3500 / 300 tokens. Actual throughput will vary, depending on the use case, query shape and other factors. Input/output ratios do not apply to embedding models.

3: The entry band is not available outside the US, Canada, and Brazil. It is also not available for fine-tuned versions of the base models.
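As a rough illustration of note 1, the sketch below estimates how many throughput bands a workload might need and the resulting hourly DBU charge. It assumes that bands are purchased in whole multiples and that throughput scales linearly with the number of bands; the function names `bands_needed` and `hourly_dbu` are hypothetical, and actual throughput will vary as note 2 describes.

```python
import math


def bands_needed(required_tokens_per_sec: float, band_tokens_per_sec: float) -> int:
    """Smallest whole number of bands whose combined throughput covers the target.

    Assumption: throughput scales linearly with the number of bands.
    """
    if required_tokens_per_sec <= 0 or band_tokens_per_sec <= 0:
        raise ValueError("throughput values must be positive")
    return math.ceil(required_tokens_per_sec / band_tokens_per_sec)


def hourly_dbu(required_tokens_per_sec: float,
               band_tokens_per_sec: float,
               dbu_per_hour_per_band: float) -> float:
    """Hourly DBU charge: number of bands times the per-band DBU / hour rate."""
    bands = bands_needed(required_tokens_per_sec, band_tokens_per_sec)
    return bands * dbu_per_hour_per_band


# Llama 3.3 70B scaling band from the table: 342.857 DBU / hour, 9,500 tokens / sec.
# The 20,000 tokens / sec target is an illustrative value, not from the source.
print(bands_needed(20_000, 9_500))
print(f"{hourly_dbu(20_000, 9_500, 342.857):.3f}")
```

Rounding up to a whole band is an interpretation of "an appropriate multiple"; check the product documentation for the exact increments allowed.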

Pay-Per-Token Serving Pricing Examples

| Model | Input tokens | Output tokens | Region | Unit price ($ / DBU) | Total price |
|---|---|---|---|---|---|
| Llama 3.1 405B | 4,000,000 | 1,000,000 | US East | $0.070 | $20.00 |
| Llama 3.3 70B | 4,000,000 | 1,000,000 | US East | $0.070 | $3.50 |
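The totals in these examples can be reproduced from the DBU rates in the rate table. The following is a minimal sketch of that arithmetic, assuming cost is simply DBUs consumed times the unit price; the function name `pay_per_token_cost` is hypothetical and not part of any Databricks API.

```python
def pay_per_token_cost(input_tokens: int,
                       output_tokens: int,
                       dbu_per_1m_input: float,
                       dbu_per_1m_output: float,
                       usd_per_dbu: float) -> float:
    """Cost in USD: (DBUs for input + DBUs for output) * unit price per DBU."""
    dbus = (input_tokens / 1_000_000) * dbu_per_1m_input \
         + (output_tokens / 1_000_000) * dbu_per_1m_output
    return dbus * usd_per_dbu


# Rates from the rate table; unit price of $0.070 / DBU from the US East examples.
llama_405b = pay_per_token_cost(4_000_000, 1_000_000, 35.714, 142.857, 0.070)
llama_70b = pay_per_token_cost(4_000_000, 1_000_000, 7.143, 21.429, 0.070)

print(f"Llama 3.1 405B: ${llama_405b:.2f}")  # Llama 3.1 405B: $20.00
print(f"Llama 3.3 70B: ${llama_70b:.2f}")    # Llama 3.3 70B: $3.50
```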

Provisioned Throughput Serving Pricing Examples

| Model | Throughput bands | Hours / month | Region | Unit price ($ / DBU) | Total price |
|---|---|---|---|---|---|
| Llama 3.1 405B | 1 | 720 | US East | $0.070 | $7,560 |
| Llama 3.3 70B | 1 | 720 | US East | $0.070 | $4,320 |
| Llama 3.1 8B | 2 | 720 | US East | $0.070 | $5,040 |
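The totals above are consistent with the entry-band DBU / hour rates in the rate table (150.000, 85.714, and 50.000 respectively), not the scaling-band rates. A minimal sketch of the calculation, assuming cost is bands times DBU / hour times hours times unit price; the function name `provisioned_cost` is hypothetical.

```python
def provisioned_cost(bands: int,
                     dbu_per_hour_per_band: float,
                     hours: float,
                     usd_per_dbu: float) -> float:
    """Cost in USD for running a number of throughput bands for a number of hours."""
    return bands * dbu_per_hour_per_band * hours * usd_per_dbu


# Entry-band DBU / hour rates from the rate table; 720 hours / month at $0.070 / DBU.
examples = [
    ("Llama 3.1 405B", 1, 150.000),
    ("Llama 3.3 70B", 1, 85.714),
    ("Llama 3.1 8B", 2, 50.000),
]

for model, bands, rate in examples:
    cost = provisioned_cost(bands, rate, 720, 0.070)
    print(f"{model}: ${cost:,.0f}")
```

Passing a fractional `hours` value approximates per-minute billing; how partial minutes are rounded is not stated in the source.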

Pay as you go with a 14-day free trial or contact us for committed-use discounts or custom requirements.

Mosaic AI Foundation Model Serving FAQ