
AI Agent Systems: Modular Engineering for Reliable Enterprise AI Applications

Naveen Rao
Matei Zaharia
Patrick Wendell

Monolithic to Modular

The proof of concept (POC) of any new technology often starts as a large, monolithic unit that is difficult to characterize. By definition, POCs are designed to show that a technology works, without considering extensibility, maintenance, or quality. Once a technology matures and is deployed widely, however, those needs drive development toward smaller, more manageable units. This is the fundamental idea behind systems thinking, and it is why AI implementation is moving from standalone models to AI agent systems.

 

The concept of modular design has been applied to:

  • Cars: seats, tires, lights, and engines can all be sourced from different vendors.
  • Computer chips: chip designs now integrate pre-built modules for memory interfaces, IO interfaces, or specialized circuits such as FLASH memory.
  • Buildings: windows, doors, floors, and appliances are built as standardized, interchangeable units.
  • Software: object-oriented programming and APIs break software into smaller, manageable components.

 

Virtually every engineered system matures into modular, composable units that can be independently verified and connected. While software could be written 50 years ago as a single stream of commands, this is almost unthinkable in a modern development environment. Software engineering evolved practices to manage this complexity, resulting in portable, extensible, maintainable code. Today, developers divide problems into manageable subunits with well-defined interfaces between them. Functionality can be compartmentalized; modifying a component does not require changes to the entire system. As long as a component correctly services its interfaces to other modules, the integrated system continues to work as intended. This composability enables extensibility: components can be recombined in new ways, or with new components, to build different systems.

 

Large language models (LLMs) have been in a monolithic regime until recently; incorporating new training data often required full retraining of the model, and the impact of customizations was difficult to characterize. Early on, LLMs were unreliable, inscrutable units; it was unclear whether their output relied on supplied, verified data or on information already present in the training data. This “black box” behavior made them ill-suited for enterprise applications that require a high degree of control, reliability, and predictability in customer-facing settings. In addition, regulated industries have legal and compliance frameworks to which interactions with customers must conform. For instance, healthcare systems are required to provide healthcare data to patients, but there are restrictions on how that data may be interpreted for patients. By separating the retrieval of data from its interpretation, healthcare systems can verify the correctness of the data separately from the correctness of its interpretation. AI agent systems give organizations the ability to parcel out different functions and control each of them separately. One such function is giving these systems deterministic access to data (calling functions or incorporating databases) that forms a foundation for all responses. In scenarios like these, the goal is to provide a defined set of data as the source of ground truth for ALL responses from the system.

 

A new development paradigm for intelligence applications

These requirements necessitate a new way to build end-to-end intelligence applications. Earlier this year, we introduced the concept of compound AI systems (CAIS) in a post on the Berkeley AI Research (BAIR) blog. AI agent systems apply the concept of CAIS and modular design theory to real-world AI systems development. AI agent systems use multiple components (including models, retrievers, and vector databases) as well as tools for evaluation, monitoring, security, and governance. These multiple interacting components offer much higher quality outputs than a single monolithic foundation model and enable AI developers to deploy independently verifiable components that are easier to maintain and update. We are now seeing large AI labs like OpenAI move in this direction: ChatGPT can access the internet through a tools interface, and their latest reasoning model, o1, has multiple interacting components in its reasoning chain.

 

In contrast to standard application software, intelligence applications have probabilistic components and deterministic components that must interact in predictable ways. Human inputs are inherently ambiguous; LLMs now give us the ability to use context to interpret the intent of a request and convert it into something more deterministic. To service the request, it might be necessary to retrieve specific facts, execute code, and apply a reasoning framework based on previously learned transformations. All of this information must be reassembled into a coherent output that is formatted correctly for whoever (or whatever) will consume it. Modularizing allows the developer to separate the parts of the application that are completely deterministic (such as database lookups or calculators), partially ambiguous (such as contextual processing of a prompt), and completely creative (rendering new designs or novel prose).
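One way to picture this separation is a pipeline whose stages talk through typed interfaces. The sketch below is illustrative only: `Intent`, `Retriever`, `Generator`, and `parse_intent` are hypothetical names, and `llm` stands in for any chat-completion client, not a specific API.

```python
import json
from dataclasses import dataclass
from typing import Protocol

@dataclass
class Intent:
    """Deterministic, structured form of an ambiguous user request."""
    action: str      # e.g. "lookup_balance" or "draft_summary"
    arguments: dict

class Retriever(Protocol):
    """Completely deterministic: database lookups, calculators, function calls."""
    def fetch(self, intent: Intent) -> list[str]: ...

class Generator(Protocol):
    """Probabilistic: composes the final response, grounded in retrieved facts."""
    def answer(self, intent: Intent, facts: list[str]) -> str: ...

def parse_intent(request: str, llm) -> Intent:
    """Partially ambiguous: the LLM maps free text onto a fixed schema."""
    raw = llm(f"Return JSON with keys 'action' and 'arguments' for: {request}")
    return Intent(**json.loads(raw))

def handle(request: str, llm, retriever: Retriever, generator: Generator) -> str:
    intent = parse_intent(request, llm)      # probabilistic -> deterministic boundary
    facts = retriever.fetch(intent)          # verifiable in isolation
    return generator.answer(intent, facts)   # creative, but grounded in the facts
```

Because each stage depends only on its neighbors' interfaces, the retriever can be tested and audited without touching the model, and the model can be swapped or fine-tuned without touching the data layer.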

 

Most intelligence applications will have these logical components:

Logical Components of Intelligence Applications
  • Input and output formatting: The format or language specific to an application. For example, tax code is a very specific kind of human language and might require a specialized LLM to interpret and produce it. Formats may even come in highly structured ways like JSON or domain-specific languages which require other kinds of processing (e.g., executing code).
  • Data foundation: The set of facts needed to support the application. Today, this is usually a database that can provide context and facts for the user’s queries. Common approaches are to run a Mosaic AI Vector Search lookup on each query or to simply append all needed facts to the query as a prompt to the system.
  • Deterministic processing: The set of functions and tools required to produce correct, high-quality responses. The LLM can extract fields from a query and pass them to a standard function call for deterministic processing. Within the Databricks Platform, the Mosaic AI Tools and Functions capabilities enable this behavior. User-defined functions can perform most activities inside Databricks, and they can be invoked using natural language, mixing deterministic and probabilistic capabilities (a minimal sketch of this pattern follows this list).
  • General reasoning: What most LLMs do today. These LLMs are trained on general information from the internet to contextualize normal language usage, idioms, and common knowledge. These LLMs typically understand some basic jargon in various domains; however, they are not trained to parse domain information and can give unreliable results.
  • Domain reasoning: Understanding how to parse and phrase language specific to a domain and how to correctly answer questions in that domain. It is important for the system’s domain reasoning to be matched to the domain of the data foundation so that the data foundation can effectively ground responses. These LLMs might be fine-tuned or heavily prompted to achieve this domain specialization. Function calls might be used to augment the capabilities of models here.
  • General and domain evaluation: How we define success for our application. Evaluations are a set of questions and responses that we define as correct behavior for our task. It is important to build evaluations for a task early in the development process; doing so lets us understand the required quality for our application and how various interventions change this score. The Mosaic AI Agent Evaluation Framework gives us a structured way to define these evaluations, as well as a method to run them against the intelligence application. This capability is rapidly improving, so keep an eye on this area.
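As a concrete illustration of the deterministic-processing component above, here is a minimal sketch of the pattern, assuming a generic `llm` callable; the tool name and prompt format are hypothetical, and this is not a specific Mosaic AI Tools API.

```python
import json

def monthly_payment(principal: float, annual_rate: float, months: int) -> float:
    """Plain, deterministic business logic -- testable and auditable on its own."""
    r = annual_rate / 12
    return principal * r / (1 - (1 + r) ** -months)

TOOLS = {"monthly_payment": monthly_payment}

def answer_with_tools(question: str, llm) -> str:
    # Probabilistic step: the model extracts a tool name and typed arguments.
    prompt = (
        "Choose a tool from ['monthly_payment'] and reply with JSON of the form "
        '{"tool": "...", "args": {"principal": 0, "annual_rate": 0, "months": 0}} '
        f"for this question: {question}"
    )
    call = json.loads(llm(prompt))
    # Deterministic step: the selected Python function computes the actual value.
    result = TOOLS[call["tool"]](**call["args"])
    # Probabilistic step: the model phrases the result for the user.
    return llm(f"Answer the question '{question}' using the computed value {result:.2f}.")
```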

 

Putting it into practice

At Databricks, we have created the Mosaic AI Agent Framework to make it easy to build these end-to-end systems. This framework can be used to define evaluation criteria for a system and score its quality for the given application. The Mosaic AI Gateway provides access controls, rate limiting, payload logging, and guardrails (filtering of system inputs and outputs). The gateway gives users continuous monitoring of running systems for safety, bias, and quality.
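To make the evaluation idea concrete, here is a hand-rolled sketch of an evaluation set and scoring loop. It is illustrative only and not the Mosaic AI Agent Evaluation API; `agent` and `judge` stand in for the system under test and an LLM judge, and the questions are invented examples.

```python
# A small evaluation set: questions paired with the behavior we define as correct.
EVAL_SET = [
    {"question": "What is the standard VAT rate in Germany?", "expected": "19%"},
    {"question": "Which US form reports employee wage income?", "expected": "Form W-2"},
]

def evaluate(agent, judge) -> float:
    """Return the fraction of cases where the judge accepts the agent's answer."""
    passed = 0
    for case in EVAL_SET:
        answer = agent(case["question"])
        verdict = judge(
            f"Question: {case['question']}\n"
            f"Expected: {case['expected']}\n"
            f"Answer: {answer}\n"
            "Reply PASS if the answer is consistent with the expected answer, otherwise FAIL."
        )
        passed += verdict.strip().upper().startswith("PASS")
    return passed / len(EVAL_SET)
```

Pinning the evaluation set down as plain data early lets every later intervention (prompting, fine-tuning, new tools) be measured against the same bar.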

 

Today, the typical components of an AI agent system include models, retrievers, vector databases, and tools for evaluation, monitoring, security, and governance.

 

We have already seen customers take advantage of this modularity to drive better end-to-end quality and maintainability of intelligence applications. As an example, FactSet provides financial data, analytics, and software solutions for investment and corporate professionals. They created their own query language, known as FQL, to structure queries on their data. They wanted to add an English-language interface to their platform while maintaining high-quality information output. By using a combination of fine-tuning, Vector Search, and prompting, they were able to deploy their AI agent system to production.

FactSet AI Agent System
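The general pattern behind such a system can be sketched roughly as follows; all names here (`index`, `finetuned_llm`, `validate_fql`) are hypothetical placeholders, not FactSet's actual implementation.

```python
def english_to_fql(question: str, index, finetuned_llm, validate_fql) -> str:
    # Vector search over previously validated (English question, FQL query) pairs.
    examples = index.search(question, k=3)
    shots = "\n".join(f"Q: {e['english']}\nFQL: {e['fql']}" for e in examples)
    # A fine-tuned, heavily prompted model translates the new question.
    candidate = finetuned_llm(f"{shots}\nQ: {question}\nFQL:")
    # Deterministic check: reject queries that fail syntax or schema validation.
    if not validate_fql(candidate):
        raise ValueError("Generated query failed validation; route to fallback or review.")
    return candidate
```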

We see AI agent systems as the vanguard of a novel application development paradigm for intelligence applications. Moving from monolithic, unmaintainable LLMs to a modular, customizable approach is a natural progression that comes with many advantages: higher reliability, easier maintainability, and greater extensibility. Databricks provides the fabric to sew together these applications in a unified platform with the necessary monitoring and governing structures for enterprise needs. Developers who learn to wield these tools for their organizations will have a distinct advantage in building quality applications quickly.

