We recently announced our AI-generated documentation feature, which uses large language models (LLMs) to automatically generate documentation for tables and columns in Unity Catalog. We have been humbled by the reception of this feature among our customers. Today, more than 80% of the table metadata updates on Databricks are AI-assisted.
In this blog post, we share our experience developing this feature – from prototyping as a hackathon project using off-the-shelf SaaS-based LLMs to creating a bespoke LLM that is better, faster, and cheaper. The new model took 2 engineers, 1 month, and less than $1,000 in compute cost to develop. We hope you will find the learnings useful, as we believe they apply to a wide class of GenAI use cases. More importantly, building our own model has allowed us to take advantage of the rapid advances being made in open-source LLMs.
At the center of each data platform lies a (potentially enormous) collection of datasets (often in the form of tables). In virtually every organization we have worked with, the vast majority of tables are not documented. The absence of documentation presents a number of challenges: it makes it difficult for humans to discover the data needed to answer a business question and, more recently, for AI agents to automatically find the datasets to use in response to questions (a key capability in our platform that we’re calling Data Intelligence).
Rather than relying on humans to document these datasets, we prototyped a new workflow as part of our quarterly hackathon, using an off-the-shelf SaaS-based LLM to automatically generate documentation for tables and their columns based on their schema. This workflow automatically suggests descriptions for the tables and columns and allows users to individually accept, bulk accept, or modify the suggestions for higher fidelity, as shown below. When we showed this prototype to some users, their immediate, universal question was, “When can I have it?!”
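For illustration, here is a minimal sketch of what such a workflow can look like. The prompt, helper functions, and JSON shape are placeholders of our own, not the actual Databricks implementation; the SQL statements at the end are standard Unity Catalog commands for setting comments.

```python
# Minimal sketch of the suggestion workflow (hypothetical helper names, not the
# actual Databricks implementation): prompt an LLM with a table's schema and
# turn the comments the user accepts into standard SQL.

import json
from typing import Callable

PROMPT_TEMPLATE = """You are a data documentation assistant.
Given this table schema, return JSON with a short "table_description"
and a "column_comments" map from column name to a one-sentence comment.

{create_table_stmt}
"""

def generate_doc_suggestions(create_table_stmt: str, llm: Callable[[str], str]) -> dict:
    """Ask an LLM (any text-in/text-out callable) for documentation suggestions."""
    raw = llm(PROMPT_TEMPLATE.format(create_table_stmt=create_table_stmt))
    return json.loads(raw)  # assumes the model was instructed to answer in JSON

def apply_accepted_suggestions(table: str, suggestions: dict, accepted_columns: set[str]) -> list[str]:
    """Turn accepted suggestions into Unity Catalog SQL (quoting/escaping omitted for brevity)."""
    stmts = [f"COMMENT ON TABLE {table} IS '{suggestions['table_description']}'"]
    for col, comment in suggestions["column_comments"].items():
        if col in accepted_columns:
            stmts.append(f"ALTER TABLE {table} ALTER COLUMN {col} COMMENT '{comment}'")
    return stmts
```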
As we moved towards launching this feature to all our customers, we ran into three challenges with the model: quality, performance, and cost.
We have heard similar concerns from a variety of customers as they try to move their LLM-based applications from proof-of-concept to production, and we saw this as an excellent opportunity to explore alternatives for an organization like ours.
We experimented with different versions of the SaaS LLMs, but they all had the same challenges. This is not surprising in hindsight. The SaaS LLMs are an engineering marvel, but they are very general models that need to address every use case from table generation to conversing about the meaning of life. That generality requires an extremely large number of parameters, which limits how fast and how cheaply these models can return answers. As they continue to evolve to optimize for different use cases, they might also regress on the narrower use case we have.
To address the aforementioned challenges, we started building a bespoke model. It took a team of two engineers one month to build a customized, smaller LLM that was better, faster, and cheaper: better quality on our task, higher throughput with scale-out, and a more than 10-fold reduction in cost.
The first step was to treat this as an applied machine learning problem. “Applied machine learning” sounds daunting and complicated, but all it meant was that we needed to assemble a training dataset, establish a way to evaluate model quality, select and fine-tune a base model, and deploy the result for serving.
We created the initial training dataset for this fine-tuning task, using two different sources of data:
We then synthesized CREATE TABLE statements covering these use cases to yield a diverse set of tables, and used another LLM to generate sample responses including table descriptions and column comments. In total, we generated ~3600 training examples.
Notably, we didn’t use any customer data for training this powerful feature that all of our customers can benefit from.
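Below is a hedged sketch of the data synthesis described above, with hypothetical helper names rather than our exact pipeline: a “teacher” LLM invents a schema for each use case and then documents it, and each pair is written out as a JSONL fine-tuning example.

```python
# Hedged sketch of synthetic training-data generation (hypothetical names, not
# the exact pipeline): for each use case, synthesize a CREATE TABLE statement,
# ask a teacher LLM for reference documentation, and emit prompt/response pairs.

import json
from typing import Callable, Iterable

def synthesize_examples(use_cases: Iterable[str],
                        teacher_llm: Callable[[str], str],
                        out_path: str) -> None:
    with open(out_path, "w") as f:
        for use_case in use_cases:
            # 1) Ask the teacher model to invent a realistic schema for this use case.
            create_stmt = teacher_llm(
                f"Write a realistic CREATE TABLE statement for: {use_case}"
            )
            # 2) Ask it to document that schema; this becomes the training target.
            target = teacher_llm(
                "Write a table description and one-sentence column comments "
                f"for this schema:\n{create_stmt}"
            )
            f.write(json.dumps({"prompt": create_stmt, "response": target}) + "\n")
```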
After the feature launch, we could measure a model’s quality through production metrics such as the rate at which users accept the suggestions. But before launch, we needed a way to evaluate the model’s quality against that of the SaaS LLM.
To do that in an unbiased fashion, we set up a simple double-blind evaluation framework in which we asked 4 employees to rate the table descriptions generated by the two models we wanted to compare, using a set of 62 unseen tables. Our framework generated a sheet in which each row showed the input along with both outputs in randomized order. The evaluator voted for the better sample (or marked a tie). The framework then processed the votes from the different evaluators to generate a report; it also summarized the degree to which the evaluators agreed.
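The sketch below shows one way such a double-blind comparison can be wired up; the data shapes and function names are assumptions for illustration, not our internal tool. The key ideas are randomizing which model appears as “A” or “B” for each row, un-blinding only when tallying, and reporting simple pairwise agreement between evaluators.

```python
# Sketch of a double-blind A/B comparison (assumed data shapes, not the
# internal tool): shuffle which model appears as "A" or "B" per row, collect
# votes, then un-blind to tally wins and compute pairwise evaluator agreement.

import random
from collections import Counter
from itertools import combinations

def build_blind_rows(inputs, outputs_model1, outputs_model2, seed=0):
    rng = random.Random(seed)
    rows = []
    for inp, o1, o2 in zip(inputs, outputs_model1, outputs_model2):
        flipped = rng.random() < 0.5
        a, b = (o2, o1) if flipped else (o1, o2)
        rows.append({"input": inp, "A": a, "B": b, "flipped": flipped})
    return rows

def tally(rows, votes_by_evaluator):
    """votes_by_evaluator: {evaluator: ["A" | "B" | "tie", ...]} aligned with rows."""
    wins = Counter()
    for votes in votes_by_evaluator.values():
        for row, vote in zip(rows, votes):
            if vote == "tie":
                wins["tie"] += 1
            else:
                # Un-blind: map the A/B label back to the underlying model.
                chose_first = (vote == "A") != row["flipped"]
                wins["model1" if chose_first else "model2"] += 1
    # Pairwise agreement: fraction of rows where two evaluators cast the same vote.
    pairs = list(combinations(votes_by_evaluator.values(), 2))
    agree = sum(v1 == v2 for va, vb in pairs for v1, v2 in zip(va, vb))
    total = sum(len(rows) for _ in pairs) or 1
    return wins, agree / total
```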
Based on our experience so far, an evaluation dataset of tens to hundreds of data points is a sufficient initial milestone, and this approach generalizes well to other use cases.
We considered the following criteria for model selection:
Based on these criteria, MPT-7B and Llama2-7B were the leading candidates, as shown in our LLM guide. We also considered larger models such as MPT-30B and Llama2-13B. In the end we chose MPT-7B, as it offered the best combination of quality and inference performance.
The total time it took to fine-tune the model on the ~3600 examples was only around 15 minutes!
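As a rough illustration of the fine-tuning step, the sketch below uses generic open-source Hugging Face tooling rather than the managed Databricks fine-tuning we relied on; the base checkpoint, data format, and hyperparameters are illustrative assumptions.

```python
# Hedged sketch of supervised fine-tuning on the JSONL examples using generic
# Hugging Face tooling (not the Databricks fine-tuning service); the checkpoint
# and hyperparameters here are illustrative, not the settings we shipped.

from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

MODEL = "mosaicml/mpt-7b"  # base model; swap in any 7B-class checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(MODEL, trust_remote_code=True)

dataset = load_dataset("json", data_files="train_examples.jsonl")["train"]

def tokenize(example):
    # Concatenate prompt and response into a single causal-LM training sequence.
    text = example["prompt"] + "\n" + example["response"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = dataset.map(tokenize, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="doc-model",
        per_device_train_batch_size=8,
        num_train_epochs=2,
        learning_rate=1e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```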
While we chose MPT-7B for our model, we believe the LLM landscape is changing rapidly and the best model today won’t be the best model tomorrow. That’s why we consider this to be an iterative and continuous process and are focused on using tools that make our evaluation efficient and fast.
We were able to build this quickly by relying on the following key components of the Databricks Data Intelligence Platform:
The fine-tuning compute cost for the whole project was less than $1,000 (each fine-tuning run cost only a few dollars). And the final result is a more than 10-fold reduction in cost. Why are the cost savings so significant? It is not surprising if we consider the following:
Having well-documented data is critical to all data users, and it is growing more important by the day to power AI-based data platforms (what we’re calling Data Intelligence). We started with SaaS LLMs for prototyping this new GenAI feature but ran into challenges with quality, performance, and cost. We ended up building a bespoke model that does the same task with better quality, higher throughput with scale-out, and a 10x reduction in cost. To recap, developing the model took two engineers, one month, and less than $1,000 in compute cost.
This experience demonstrates how easy it is to develop and deploy bespoke LLMs for specific tasks. This model is now live on Databricks in Amazon Web Services and Google Cloud and is being used to power most data annotations on the platform.