by Ahmed Bilal, Kasey Uhlenhuth, Hanlin Tang and Ana Nieto
As enterprises build agent systems to deliver high-quality AI apps, we continue to optimize our platform for the best overall cost-efficiency. We’re excited to announce the availability of the Meta Llama 3.3 model on the Databricks Data Intelligence Platform, along with significant pricing and efficiency updates to Mosaic AI Model Serving. Together, these updates can reduce your inference costs by up to 80%, making it significantly more cost-effective than before for enterprises building AI agents or doing batch LLM processing.
We’re proud to partner with Meta to bring Llama 3.3 70B to Databricks. This model rivals the larger Llama 3.1 405B in instruction-following, math, multilingual, and coding tasks while offering a cost-efficient solution for domain-specific chatbots, intelligent agents, and large-scale document processing.
While Llama 3.3 sets a new benchmark for open foundation models, building production-ready AI agents requires more than just a powerful model. Databricks Mosaic AI is the most comprehensive platform for deploying and managing Llama models, with a robust suite of tools to build secure, scalable, and reliable AI agent systems that can reason over your enterprise data.
We’re rolling out proprietary efficiency improvements across our inference stack, enabling us to reduce prices and make GenAI even more accessible to everyone. Here’s a closer look at the new pricing changes:
Pay-per-Token Serving Price Cuts:
Provisioned Throughput Price Cuts:
With the more efficient, high-quality Llama 3.3 70B model combined with these pricing reductions, you can now achieve up to an 80% reduction in your total cost of ownership (TCO).
Let’s look at a concrete example. Suppose you’re building a customer service chatbot agent designed to handle 120 requests per minute (RPM). This chatbot processes an average of 3,500 input tokens and generates 300 output tokens per interaction, creating contextually rich responses for users.
With Llama 3.3 70B, the monthly cost of running this chatbot (counting LLM usage alone) is 88% lower than with Llama 3.1 405B and 72% lower than with leading proprietary models.
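For intuition, here is a back-of-the-envelope sketch of the token volumes this workload generates and how the monthly cost follows from pay-per-token rates. The per-million-token rates are deliberately left as parameters rather than hard-coded: plug in the current Databricks pay-per-token prices for each model to reproduce the comparison.

```python
# Token volume and cost arithmetic for the chatbot workload described above.
# Per-million-token rates are parameters, NOT actual Databricks prices.

REQUESTS_PER_MINUTE = 120
INPUT_TOKENS = 3_500    # average input tokens per request
OUTPUT_TOKENS = 300     # average output tokens per request
MINUTES_PER_MONTH = 60 * 24 * 30

requests_per_month = REQUESTS_PER_MINUTE * MINUTES_PER_MONTH   # 5,184,000
input_tokens_per_month = requests_per_month * INPUT_TOKENS     # ~18.1B tokens
output_tokens_per_month = requests_per_month * OUTPUT_TOKENS   # ~1.6B tokens

def monthly_cost(input_rate_per_m: float, output_rate_per_m: float) -> float:
    """Monthly LLM cost in dollars, given per-million-token rates."""
    return (input_tokens_per_month * input_rate_per_m
            + output_tokens_per_month * output_rate_per_m) / 1_000_000

print(f"{requests_per_month:,} requests/month")
print(f"{input_tokens_per_month / 1e9:.1f}B input tokens, "
      f"{output_tokens_per_month / 1e9:.2f}B output tokens per month")
```

Evaluating `monthly_cost` at two models’ rates and taking the ratio is all it takes to derive percentage savings like the 88% figure above, since both models process the same token volumes.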
Now let’s look at a batch inference example. For tasks like document classification or entity extraction across a 100K-record dataset, Llama 3.3 70B offers remarkable efficiency compared to Llama 3.1 405B. Processing rows of 3,500 input tokens and generating 300 output tokens each, the model achieves the same high-quality results while cutting costs by 88% relative to Llama 3.1 405B, and it is 58% more cost-effective than leading proprietary models. This lets you classify documents, extract key entities, and generate actionable insights at scale without excessive operational expense; a minimal sketch of such a job follows.
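To make this concrete, here is a minimal entity-extraction sketch against Databricks’ OpenAI-compatible serving endpoints. The workspace URL and endpoint name below are assumptions (check the Serving page in your workspace for the exact values), and a production job over 100K records would batch requests rather than loop serially, for example via the `ai_query` SQL function.

```python
# Minimal entity-extraction sketch over a handful of records.
# Assumptions: the workspace URL and endpoint name are placeholders;
# substitute the values from your own Databricks workspace.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url="https://<your-workspace>/serving-endpoints",  # placeholder URL
)

def extract_entities(document: str) -> str:
    """Ask Llama 3.3 70B to pull entities out of a single record."""
    response = client.chat.completions.create(
        model="databricks-meta-llama-3-3-70b-instruct",  # assumed endpoint name
        messages=[
            {"role": "system",
             "content": "Extract all people, organizations, and dates "
                        "from the text as a JSON array."},
            {"role": "user", "content": document},
        ],
        max_tokens=300,
    )
    return response.choices[0].message.content

# In practice this would iterate over your 100K-record dataset.
records = [
    "Acme Corp. hired Jane Doe as CFO on March 3, 2024.",
    "Globex and Initech announced a merger on July 9, 2023.",
]
for text in records:
    print(extract_entities(text))
```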
Visit the AI Playground to quickly try Llama 3.3 directly from your workspace. For more information, please refer to the following resources: