Many AI use cases now depend on transforming unstructured inputs into structured data. Developers are increasingly relying on LLMs to extract structured data from raw documents, build assistants that retrieve data from API sources, and create agents capable of taking action. Each of these use cases requires the model to generate outputs that adhere to a structured format.
Today, we're excited to introduce Structured Outputs on Mosaic AI Model Serving, a unified API for generating JSON objects that can optionally adhere to a provided JSON schema. This new feature supports all types of models, including open LLMs like Llama, fine-tuned models, and external LLMs like OpenAI's GPT-4o, giving you the flexibility to select the best model for your specific use cases. Structured Outputs can be used both for batched structured generation with the newly introduced response_format field and for creating agentic applications with function calling.
Two primary use cases get massive boosts in quality and consistency with structured outputs:
- Batched feature extraction with response_format: Because batch inference feature extraction is sometimes done with millions of data points, reliably outputting complete JSON objects that adhere to a strict schema is hard. Using structured outputs, customers can easily fill JSON objects with relevant information for each of the documents in their databases. Batched feature extraction is accessible through the response_format API field, which works with all LLMs on the Databricks FMAPI platform, including fine-tuned models!
- Function calling for agents with the tools API field. See our blog on evaluating function calling quality here. The tools API field currently only works on Llama 3 70B and Llama 3 405B.

Using response_format lets users detail how a model serving output should be constrained to a structured format. The three response formats supported are:
- text: Unstructured text output from the model based on a prompt.
- json_object: Output a JSON object with an unspecified schema that the model intuits from the prompt.
- json_schema: Output a JSON object that adheres to a JSON schema supplied in the API request.

With the latter two response_format modes, users can get reliable JSON outputs for their use cases.
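As a rough sketch of the simpler json_object mode, here is what a request could look like with the OpenAI client pointed at a Databricks serving endpoint. The workspace environment variables, model name, and prompt below are illustrative assumptions, not code from this post:

```python
import os
from openai import OpenAI

# Assumes an OpenAI-compatible Databricks serving endpoint; the environment
# variables and model name are placeholders for your own workspace.
client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # illustrative endpoint name
    messages=[
        {"role": "system", "content": "Extract key fields from the text as JSON."},
        {"role": "user", "content": "Order #1234 was shipped to Alice on 2024-10-01."},
    ],
    # json_object: the model returns a JSON object whose shape it infers from the prompt.
    response_format={"type": "json_object"},
)

print(response.choices[0].message.content)
```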
As an example use case for the response_format field, here is how to adhere to a JSON schema to extract a calendar event from a prompt. The OpenAI SDK makes it easy to define object schemas using Pydantic that you can pass to the model instead of writing out a JSON schema by hand.
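A minimal sketch of that pattern, again assuming an OpenAI-compatible Databricks endpoint; the CalendarEvent fields, prompt, and model name are illustrative assumptions:

```python
import os
from openai import OpenAI
from pydantic import BaseModel

class CalendarEvent(BaseModel):
    name: str
    date: str
    participants: list[str]

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

# The SDK converts the Pydantic model into a JSON schema, sends it as the
# json_schema response format, and parses the response back into the model.
completion = client.beta.chat.completions.parse(
    model="databricks-meta-llama-3-1-70b-instruct",  # illustrative endpoint name
    messages=[
        {"role": "system", "content": "Extract the calendar event from the text."},
        {"role": "user", "content": "Alice and Bob are meeting for the project kickoff on Friday."},
    ],
    response_format=CalendarEvent,
)

event = completion.choices[0].message.parsed
print(event.name, event.date, event.participants)
```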
Using tools and tool_choice lets users detail how an LLM makes a function call. With the tools parameter, users can specify a list of potential tools that the LLM can call, where each tool is a function defined with a name, description, and parameters in the form of a JSON schema. Users can then use tool_choice to determine how tools are called. The options are:
- none: The model will not call any tool listed in tools.
- auto: The model will decide whether a tool from the tools list is relevant enough to call. If no tool is called, the model outputs unstructured text as normal.
- required: The model will always call one of the tools in the list, regardless of relevance.
- {"type": "function", "function": {"name": "my_function"}}: If "my_function" is the name of a valid function in the list of tools, the model will be forced to pick that function.

Here is an example of a model picking between two tools, get_delivery_date and get_relevant_products. For the following code snippet, the model should return a call to get_relevant_products.
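A sketch of what that call might look like. The tool names come from the example above, but their parameter schemas, the user message, and the model name are assumptions for illustration:

```python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_delivery_date",
            "description": "Look up the delivery date for a customer's order.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_relevant_products",
            "description": "Search the catalog for products matching a query.",
            "parameters": {
                "type": "object",
                "properties": {"query": {"type": "string"}},
                "required": ["query"],
            },
        },
    },
]

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # illustrative endpoint name
    messages=[{"role": "user", "content": "What winter jackets do you carry?"}],
    tools=tools,
    tool_choice="auto",  # let the model decide which tool, if any, to call
)

# The question is about products, so we expect a call to get_relevant_products.
print(response.choices[0].message.tool_calls)
```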
Under the hood, constrained decoding powers structured outputs. Constrained decoding is a technique in which we limit the set of tokens that can be returned by a model at each step of token generation based on an expected structural format. For example, consider the beginning of a JSON object, which always starts with a left curly bracket. Since only one initial character is possible, we constrain generation to consider only tokens that start with a left curly bracket when applying token sampling. Although this is a simple example, the same idea applies to other structural components of a JSON object, such as required keys that the model knows to expect or the type of a specific key-value pair. At each position in the output, the set of tokens adherent to the schema is identified and sampled from accordingly. More technically, the raw logits output by the LLM that do not correspond to the schema are masked at each time step before sampling.
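As a toy illustration of that masking step (not the actual serving implementation), here is how logits outside an allowed token set could be masked before sampling:

```python
import numpy as np

def mask_and_sample(logits: np.ndarray, allowed_token_ids: list[int]) -> int:
    """Sample a token, restricted to the IDs the schema allows at this step."""
    masked = np.full_like(logits, -np.inf)
    masked[allowed_token_ids] = logits[allowed_token_ids]

    # Softmax over the masked logits: disallowed tokens get probability zero.
    probs = np.exp(masked - masked.max())
    probs /= probs.sum()
    return int(np.random.choice(len(logits), p=probs))

# Toy vocabulary: suppose token 7 is "{" and it is the only token that can
# legally start a JSON object, so the constraint allows only that ID.
vocab_size = 10
logits = np.random.randn(vocab_size)
print(mask_and_sample(logits, allowed_token_ids=[7]))  # always prints 7
```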
With constrained decoding, we can guarantee that a model's output will be a JSON object that adheres to the provided JSON schema, as long as we generate enough tokens to complete the JSON object, because constrained decoding eliminates syntax and type errors. As a result, our customers can get consistent and reliable outputs from LLMs that scale to millions of data points, eliminating the need to write any custom retry or parsing logic.
There has been a ton of open source interest in constrained decoding, reflected in popular libraries like Outlines and Guidance. At Databricks, we are actively researching better ways to conduct constrained decoding, as well as the quality and performance implications of constrained decoding at scale.
In addition to the examples provided above, here are some tips and tricks for maximizing the quality of your batch inference workloads: keep the schema flat and free of extraneous keys, give each field an explicit, descriptive name, constrain each property to the right type, add helpful descriptions, and mark which keys are required.
Let's run through an example. Say you are extracting legal and point-of-contact (POC) information from leases and you start with the following schema:
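As a stand-in for the starting schema, here is a hypothetical "before" version with the problems discussed next. Only the if_pets and pets fields come from the text; the other field names are illustrative:

```python
# Hypothetical "before" schema: nested, vaguely named, untyped, and carrying a
# redundant if_pets flag that duplicates what the pets list already tells us.
lease_schema_v0 = {
    "type": "object",
    "properties": {
        "info": {
            "type": "object",
            "properties": {
                "name": {},     # whose name? untyped and ambiguous
                "contact": {},  # untyped
            },
        },
        "if_pets": {},          # redundant: we can check the length of pets instead
        "pets": {},
        "lease_terms": {},
    },
}
```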
We can use the above tips to guide us to an optimal schema. First, we can remove extraneous keys and flatten the schema down; for example, we don't need if_pets if we can check the length of the pets field. We can also make all names more explicit for the model to recognize. Next, we can constrain each property to the right type and add helpful descriptions. Finally, we can mark which key values are required to arrive at an optimal JSON schema for our use case.
Here is the full code to run structured outputs with the schema after we’ve optimized it.
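A sketch of what that full call could look like, with a flattened, typed, and described schema along the lines above. The exact fields, descriptions, and model name are illustrative assumptions rather than the original code:

```python
import json
import os
from openai import OpenAI

# Flat, explicitly named, typed, and described lease-extraction schema.
lease_schema = {
    "type": "object",
    "properties": {
        "tenant_name": {"type": "string", "description": "Full legal name of the tenant."},
        "landlord_name": {"type": "string", "description": "Full legal name of the landlord."},
        "poc_email": {"type": "string", "description": "Email address of the point of contact."},
        "monthly_rent_usd": {"type": "number", "description": "Monthly rent in US dollars."},
        "lease_start_date": {"type": "string", "description": "Lease start date, YYYY-MM-DD."},
        "pets": {
            "type": "array",
            "items": {"type": "string"},
            "description": "Pets allowed under the lease, if any.",
        },
    },
    "required": ["tenant_name", "landlord_name", "poc_email", "lease_start_date"],
}

client = OpenAI(
    api_key=os.environ["DATABRICKS_TOKEN"],
    base_url=f"{os.environ['DATABRICKS_HOST']}/serving-endpoints",
)

lease_text = open("lease.txt").read()  # the raw lease document to extract from

response = client.chat.completions.create(
    model="databricks-meta-llama-3-1-70b-instruct",  # illustrative endpoint name
    messages=[
        {"role": "system", "content": "Extract the lease details as JSON."},
        {"role": "user", "content": lease_text},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "lease_info", "schema": lease_schema, "strict": True},
    },
)

lease_info = json.loads(response.choices[0].message.content)
print(lease_info)
```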
Stay tuned for more developments on structured outputs. Structured outputs will soon be available on ai_query, an easy way to run batch inference on millions of rows with a single command.