Unlocking the Potential of AI Agents: From Pilots to Production Success

Introducing New Tools to Build Scalable and Trusted AI Agents

Tools to take agents from pilot to production

Published: March 10, 2025

Summary

Scaling AI agents beyond pilots is tough due to accuracy, governance, and risk challenges.
Introducing new tools to simplify AI deployment, monitoring, and integration.
Learn how these innovations can help you confidently scale AI in high-value use cases.

While 85% of global enterprises already use Generative AI (GenAI), organizations face significant challenges scaling these projects beyond the pilot phase. Even the most advanced GenAI models struggle to deliver business-specific, accurate, and well-governed outputs, largely because they lack awareness of relevant enterprise data. While many customers are comfortable deploying GenAI solutions across low-risk, limited-scope use cases, most do not have the confidence to deploy for external or internal use cases that carry financial risk.

Today we are excited to introduce several key innovations that will help enterprises scale and deploy AI agents with confidence. These include:

Centralized governance for all AI models: Integrate and manage both open source and commercial AI models all in one place with Mosaic AI Gateway support for custom LLM providers (Public Preview).
Simplified integration into existing app workflows: AI/BI Genie Conversation API suite (Public Preview) enables developers to embed natural language-based chatbots directly into custom-built apps or popular productivity tools like Microsoft Teams, SharePoint, and Slack.
Streamlined human-in-the-loop workflows: The upgraded Agent Evaluation Review App (Public Preview) makes it easier for domain experts to provide targeted feedback, send traces for labeling, and customize evaluation criteria.
Provision-Less Batch Inference: A new way to run batch inference with Mosaic AI using a single SQL query (Public Preview)—eliminating the need to provision infrastructure while enabling seamless unstructured data integration.

These new capabilities will empower organizations to deploy AI agents in high-value, mission-critical applications while ensuring accuracy, governance, and ease of use. Now, let’s dive into the details of each launch.

Building and governing high-quality agents

At Databricks, we believe the best foundation model is the one that is most effective in addressing your specific use case. Sometimes this may be an open source model, while at other times it might be GPT-4o or another commercial AI model. To help customers govern and manage both open source and proprietary AI models, we have created Mosaic AI Gateway. The AI Gateway allows you to bring in external model endpoints so you can have unified governance, monitoring, and integration across all of your models.

Starting today, we are expanding the scope of AI Gateway to support any LLM endpoint, so you can also bring endpoints from your own internal gateway. This will allow companies to gain all of the value of Databricks without having to give up any bespoke capabilities that have been built into their own systems. Many customers have asked for this, and we are excited to announce that it is in Public Preview today. We hope you will stay tuned for more AI Gateway announcements on Tuesday.

Additionally, we are introducing the Genie Conversation API suite, which enables users to self-serve data insights using natural language from various platforms, including Databricks Apps, Slack, Teams, SharePoint, and custom-built applications. With the Genie API, users can programmatically submit prompts and receive insights just as they would in the Genie UI. The API is stateful, allowing it to retain context across multiple follow-up questions within a conversation thread.
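To make the stateful behavior concrete, here is a minimal Python sketch of a client that remembers the conversation ID so that follow-up prompts land in the same thread. It is an illustration only: the URL paths, the `content` request field, and the `conversation_id` response field are assumptions to be checked against the Genie API documentation, and `GenieConversation`, `fake_transport`, and the space ID are hypothetical names. The transport is injected so the sketch runs without network access; in a real app it would be an authenticated HTTP call.

```python
from typing import Callable, Optional

# Transport signature: (method, path, json_body) -> decoded JSON response.
Transport = Callable[[str, str, Optional[dict]], dict]


class GenieConversation:
    """Keeps the conversation ID so follow-up prompts share context."""

    def __init__(self, space_id: str, send: Transport):
        self.space_id = space_id
        self.send = send
        self.conversation_id: Optional[str] = None

    def ask(self, prompt: str) -> dict:
        # Assumed path layout; verify against the official API reference.
        base = f"/api/2.0/genie/spaces/{self.space_id}"
        if self.conversation_id is None:
            # The first prompt opens a new conversation thread.
            reply = self.send("POST", f"{base}/start-conversation", {"content": prompt})
            self.conversation_id = reply["conversation_id"]
        else:
            # Later prompts are posted to the existing thread.
            path = f"{base}/conversations/{self.conversation_id}/messages"
            reply = self.send("POST", path, {"content": prompt})
        return reply


def fake_transport(method: str, path: str, body: Optional[dict]) -> dict:
    """Stand-in for an authenticated HTTP request; returns canned JSON."""
    return {"conversation_id": "conv-1", "path": path, "echo": (body or {}).get("content")}


chat = GenieConversation("space-123", fake_transport)
first = chat.ask("What were total sales last quarter?")
follow_up = chat.ask("Break that down by region")
print(first["path"])
print(follow_up["path"])
```

Holding the conversation ID in one object keeps the calling code (a Slack bot or a Teams handler, for example) free of thread bookkeeping.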

In our upcoming blog, we’ll review the key endpoints available in Public Preview, explore Genie’s integration with the Mosaic AI Agent Framework, and highlight an example of embedding Genie into a Microsoft Teams channel.

Ensuring agents deliver accurate, reliable results

Building high-quality AI agents is challenging because it isn’t always clear how to improve the response to one prompt without degrading many others. Practitioners have spent considerable time and effort trying to understand whether their agent will perform successfully and how it is performing in production. In mid-December, we launched an API that allows customers to synthetically build an evaluation dataset based on their proprietary data. Today, we are excited to announce new updates to the Agent Evaluation Review App to streamline human-in-the-loop feedback. This upgraded tool enables domain experts to provide targeted evaluations, send traces from development or production for labeling, and define custom evaluation criteria, all without needing spreadsheets or custom-built applications. Because structured feedback becomes easier to collect, teams can continuously refine AI agent performance and drive systematic accuracy improvements.

As customers seek to deploy agents in domains that carry reputational or financial risk, measuring accuracy and having the tools to systematically drive accuracy improvements are critical. If you want to learn more about our new features for evaluating agents, look out for our blog post this Wednesday, where we will go deep into how you can use them to improve the accuracy of new or existing agents.

Scaling AI without infrastructure headaches

While model selection, governance, and evaluation are critical to building high-quality agents, we know that simplifying the experience is also important to companies wanting to scale this technology across the business. Over the past year, more organizations have adopted batch inference for foundation models and agents. With Mosaic AI now supporting batch inference with AI Functions, scaling these workloads is simpler than ever.

Whether using an LLM to do classification or natural language processing, or using an agent to execute more complex data intelligence tasks, customers have appreciated using simple SQL statements to access the power of these models at scale.

While writing the SQL statements is not difficult, many customers have gotten stuck provisioning and scaling serving endpoints. Now, you no longer need to set up the infrastructure to run ai_query. Instead, we take care of it for you, and you pay only for what you use. Customers are already seeing success with these capabilities:

“Batch AI with AI Functions is streamlining our AI workflows. It's allowing us to integrate large-scale AI inference with a simple SQL query–no infrastructure management needed. This will directly integrate into our pipelines cutting costs and reducing configuration burden. Since adopting it we've seen dramatic acceleration in our developer velocity when combining traditional ETL and data pipelining with AI inference workloads.”
— Ian Cadieu, Altana CTO
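As a rough illustration of the single-SQL-query pattern described above, the Python sketch below builds a statement that applies ai_query to every row of a table. The helper name `build_batch_inference_sql`, the table `main.demo.reviews`, the column `review_text`, and the endpoint name `my-llm-endpoint` are all hypothetical; only the `ai_query(endpoint, request)` call shape is taken from the Databricks SQL function. The helper returns a string and executes nothing, so you would pass the result to whatever SQL client you already use.

```python
def build_batch_inference_sql(
    source_table: str, text_column: str, endpoint: str, instruction: str
) -> str:
    """Return a SQL statement that runs ai_query over every row of a table.

    Table, column, and endpoint names are inserted as-is, so they must come
    from trusted configuration rather than user input.
    """
    # Double any single quotes so the instruction stays a valid SQL literal.
    safe_instruction = instruction.replace("'", "''")
    return (
        f"SELECT {text_column},\n"
        f"       ai_query('{endpoint}', "
        f"CONCAT('{safe_instruction} ', {text_column})) AS result\n"
        f"FROM {source_table}"
    )


# Hypothetical names, used only to show the shape of the generated query.
sql = build_batch_inference_sql(
    source_table="main.demo.reviews",
    text_column="review_text",
    endpoint="my-llm-endpoint",
    instruction="Classify the sentiment of this review's text:",
)
print(sql)
```

Generating the statement from configuration keeps the inference step in the same SQL pipeline as the surrounding ETL, which is the workflow the quote describes.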

We are excited to share more about this launch and other exciting capabilities with you in our blog on Thursday.

More to come during the week of agents

This is going to be a big week as we celebrate a “Week of Agents” with a wide variety of new capabilities. Despite two years of GenAI advancements, many enterprises still struggle to deploy AI agents in high-value use cases due to concerns around accuracy, governance, and security. From our conversations with customers, it’s clear that confidence—not just technology—remains the biggest hurdle.

The innovations we’ve introduced this week address these challenges head-on, enabling businesses to move beyond pilots and into full-scale production with AI agents they can trust.

We look forward to sharing more with you this week and hope you will try our products and share your feedback with us so that we can continue to help you unlock the promised value of this technology.

Check out the Compact Guide to AI Agents

Watch the demo video

Get started with documentation: