Many AI use cases now depend on transforming unstructured inputs into structured data. Developers are increasingly relying on LLMs to extract structured data from raw documents, build assistants that retrieve data from API sources, and create agents capable of taking action. Each of these use cases requires the model to generate outputs that adhere to a structured format.
Today, we're excited to introduce Structured Outputs on Mosaic AI Model Serving, a unified API for generating JSON objects that can optionally adhere to a provided JSON schema. This new feature supports all types of models, including open LLMs like Llama, fine-tuned models, and external LLMs like OpenAI's GPT-4o, giving you the flexibility to select the best model for your specific use cases. Structured Outputs can be used both for batched structured generation with the newly introduced response_format field and for creating agentic applications with function calling.
Two primary use cases get massive boosts in quality and consistency with structured outputs:
- Batched feature extraction with response_format: Because batch inference feature extraction is sometimes done with millions of data points, reliably outputting complete JSON objects that adhere to a strict schema is hard. Using structured outputs, customers can easily fill JSON objects with relevant information for each of the documents in their databases. Batched feature extraction is accessible through the response_format API field, which works with all LLMs on the Databricks FMAPI platform, including fine-tuned models!
- Function calling for agents with the tools API field. See our blog on evaluating function calling quality here. The tools API field currently only works on Llama 3 70B and Llama 3 405B.

Using response_format lets users detail how a model serving output should be constrained to a structured format. The three response formats supported are:
- text: Unstructured text output from the model based on a prompt.
- json_object: Output a JSON object with an unspecified schema that the model intuits from the prompt.
- json_schema: Output a JSON object that adheres to a JSON schema supplied in the API request.

With the latter two response_format modes, users can get reliable JSON outputs for their use cases.
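As a rough sketch of the simpler json_object mode, here is what a request could look like with the OpenAI client pointed at a Databricks serving endpoint. The workspace environment variables, model name, and prompt below are illustrative assumptions, not code from this post:

```python
import os
from openai import OpenAI

# Assumes an OpenAI-compatible Databricks serving endpoint; the environment
# variables and model name are placeholders for your own workspace.
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # illustrative endpoint name
    messages=[
        {"role": "system", "content": "Extract key fields from the text as JSON."},
        {"role": "user", "content": "Order #1234 was shipped to Alice on 2024-10-01."},
    ],
    # json_object: the model returns a JSON object whose shape it infers from the prompt.
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
```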
As an example use case for the response_format field, here is how to adhere to a JSON schema to extract a calendar event from a prompt. The OpenAI SDK makes it easy to define object schemas using Pydantic that you can pass to the model instead of writing out a JSON schema by hand.
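A minimal sketch of that pattern, again assuming an OpenAI-compatible Databricks endpoint; the CalendarEvent fields, prompt, and model name are illustrative assumptions:

```python
import os
from openai import OpenAI
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

# The SDK converts the Pydantic model into a JSON schema, sends it as the
# json_schema response format, and parses the response back into the model.
completion = client.beta.chat.completions.parse(
    model="databricks-meta-llama-3-1-70b-instruct",  # illustrative endpoint name
    messages=[
        {"role": "system", "content": "Extract the calendar event from the text."},
        {"role": "user", "content": "Alice and Bob are meeting for the project kickoff on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed
print(event.name, event.date, event.participants)
```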
Using tools and tool_choice lets users detail how an LLM makes a function call. With the tools parameter, users can specify a list of potential tools that the LLM can call, where each tool is a function defined with a name, description, and parameters in the form of a JSON schema. Users can then use tool_choice to determine how tools are called. The options are:
- none: The model will not call any tool listed in tools.
- auto: The model will decide whether a tool from the tools list is relevant enough to call. If no tool is called, the model outputs unstructured text as normal.
- required: The model will always call one of the tools in the list, regardless of relevance.
- {"type": "function", "function": {"name": "my_function"}}: If "my_function" is the name of a valid function in the list of tools, the model will be forced to pick that function.

Here is an example of a model picking between two tools, get_delivery_date and get_relevant_products. For the following code snippet, the model should return a call to get_relevant_products.
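A sketch of what that call might look like. The tool names come from the example above, but their parameter schemas, the user message, and the model name are assumptions for illustration:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_delivery_date",
            "description": "Look up the delivery date for a customer's order.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_relevant_products",
            "description": "Search the catalog for products matching a query.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # illustrative endpoint name
    messages=[{"role": "user", "content": "What winter jackets do you carry?"}],
    tools=tools,
    tool_choice="auto",  # let the model decide which tool, if any, to call
)

# The question is about products, so we expect a call to get_relevant_products.
print(response.choices[0].message.tool_calls)
```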
Under the hood, constrained decoding powers structured outputs. Constrained decoding is a technique in which we limit the set of tokens that can be returned by a model at each step of token generation based on an expected structural format. For example, consider the beginning of a JSON object, which always starts with a left curly bracket. Since only one initial character is possible, we constrain generation to consider only tokens that start with a left curly bracket when applying token sampling. Although this is a simple example, the same idea applies to other structural components of a JSON object, such as required keys that the model knows to expect or the type of a specific key-value pair. At each position in the output, the set of tokens adherent to the schema is identified and sampled from accordingly. More technically, the raw logits output by the LLM that do not correspond to the schema are masked at each time step before sampling.
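As a toy illustration of that masking step (not the actual serving implementation), here is how logits outside an allowed token set could be masked before sampling:

```python
import numpy as np

def mask_and_sample(logits: np.ndarray, allowed_token_ids: list[int]) -> int:
    """Sample a token, restricted to the IDs the schema allows at this step."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_token_ids] = logits[allowed_token_ids]

    # Softmax over the masked logits: disallowed tokens get probability zero.
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# Toy vocabulary: suppose token 7 is "{" and it is the only token that can
# legally start a JSON object, so the constraint allows only that ID.
vocab_size = 10
logits = np.random.randn(vocab_size)
print(mask_and_sample(logits, allowed_token_ids=[7]))  # always prints 7
```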
With constrained decoding, we can guarantee that a model's output will be a JSON object that adheres to the provided JSON schema, as long as we generate enough tokens to complete the JSON object, because constrained decoding eliminates syntax and type errors. As a result, our customers can get consistent and reliable outputs from LLMs that scale to millions of data points, eliminating the need to write any custom retry or parsing logic.
There has been a ton of open source interest in constrained decoding, reflected in popular libraries like Outlines and Guidance. At Databricks, we are actively researching better ways to conduct constrained decoding, as well as the quality and performance implications of constrained decoding at scale.
In addition to the examples provided above, here are some tips and tricks for maximizing the quality of your batch inference workloads: keep the schema flat and free of extraneous keys, give each field an explicit, descriptive name, constrain each property to the right type, add helpful descriptions, and mark which keys are required.
Let's run through an example. Say you are extracting legal and point-of-contact (POC) information from leases and you start with the following schema:
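As a stand-in for the starting schema, here is a hypothetical "before" version with the problems discussed next. Only the if_pets and pets fields come from the text; the other field names are illustrative:

```python
# Hypothetical "before" schema: nested, vaguely named, untyped, and carrying a
# redundant if_pets flag that duplicates what the pets list already tells us.
lease_schema_v0 = {
    "type": "object",
    "properties": {
        "info": {
            "type": "object",
            "properties": {
                "name": {},     # whose name? untyped and ambiguous
                "contact": {},  # untyped
            },
        },
        "if_pets": {},          # redundant: we can check the length of pets instead
        "pets": {},
        "lease_terms": {},
    },
}
```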
We can use the above tips to guide us to an optimal schema. First, we can remove extraneous keys and flatten the schema down; for example, we don't need if_pets if we can check the length of the pets field. We can also make all names more explicit for the model to recognize. Next, we can constrain each property to the right type and add helpful descriptions. Finally, we can mark which key values are required to arrive at an optimal JSON schema for our use case.
Here is the full code to run structured outputs with the schema after we’ve optimized it.
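A sketch of what that full call could look like, with a flattened, typed, and described schema along the lines above. The exact fields, descriptions, and model name are illustrative assumptions rather than the original code:

```python
import json
import os
from openai import OpenAI

# Flat, explicitly named, typed, and described lease-extraction schema.
lease_schema = {
    "type": "object",
    "properties": {
        "tenant_name": {"type": "string", "description": "Full legal name of the tenant."},
        "landlord_name": {"type": "string", "description": "Full legal name of the landlord."},
        "poc_email": {"type": "string", "description": "Email address of the point of contact."},
        "monthly_rent_usd": {"type": "number", "description": "Monthly rent in US dollars."},
        "lease_start_date": {"type": "string", "description": "Lease start date, YYYY-MM-DD."},
        "pets": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Pets allowed under the lease, if any.",
        },
    },
    "required": ["tenant_name", "landlord_name", "poc_email", "lease_start_date"],
}

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

lease_text = open("lease.txt").read()  # the raw lease document to extract from

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # illustrative endpoint name
    messages=[
        {"role": "system", "content": "Extract the lease details as JSON."},
        {"role": "user", "content": lease_text},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "lease_info", "schema": lease_schema, "strict": True},
    },
)

lease_info = json.loads(response.choices[0].message.content)
print(lease_info)
```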
Stay tuned for more developments on structured outputs. Structured outputs will soon be available on ai_query, an easy way to run batch inference on millions of rows with a single command.