TL;DR: Finetuning an embedding model on in-domain data can significantly improve vector search and retrieval-augmented generation (RAG) accuracy. With Databricks, it’s easy to finetune, deploy, and evaluate embedding models to optimize retrieval for your specific use case—leveraging synthetic data without manual labeling.
Why It Matters: If your vector search or RAG system isn’t retrieving the best results, finetuning an embedding model is a simple yet powerful way to boost performance. Whether you’re dealing with financial documents, knowledge bases, or internal code documentation, finetuning can give you more relevant search results and better downstream LLM responses.
What We Found: We finetuned and tested two embedding models on three enterprise datasets and saw major improvements in retrieval metrics (Recall@10) and downstream RAG performance. This means finetuning can be a game-changer for accuracy without requiring manual labeling, leveraging only your existing data.
Want to try embedding finetuning? We provide a reference solution to help you get started. Databricks makes vector search, RAG, reranking, and embedding finetuning easy. Reach out to your Databricks Account Executive or Solutions Architect for more information.
Embedding models power modern vector search and RAG systems. An embedding model transforms text into vectors, making it possible to find relevant content based on meaning rather than just keywords. However, off-the-shelf models aren’t always optimized for your specific domain—that’s where finetuning comes in.
Finetuning an embedding model on domain-specific data helps in several ways:

- More relevant search results: the model learns the vocabulary and semantics of your domain, so retrieved documents better match user intent.
- Better downstream responses: the generative model in a RAG system receives more relevant context, improving answer quality.
- No manual labeling required: training pairs can be generated synthetically from your existing documents.
In this blog post, we show that finetuning an embedding model is an effective way to improve retrieval and RAG performance for task-specific, enterprise use cases.
We finetuned two embedding models (gte-large-en-v1.5 and e5-mistral-7b-instruct) on synthetic data and evaluated them on three datasets from our Domain Intelligence Benchmark Suite (DIBS) (FinanceBench, ManufactQA, and Databricks DocsQA). We then compared them against OpenAI’s text-embedding-3-large.
Key Takeaways:
Across the three datasets, we found that embedding finetuning improved accuracy on two of them. Figure 1 shows that for FinanceBench and ManufactQA, finetuned embeddings outperformed their base versions, sometimes even beating OpenAI’s API model (light grey). For Databricks DocsQA, however, OpenAI’s text-embedding-3-large surpassed all finetuned models, possibly because it has been trained on public Databricks documentation. This shows that while finetuning can be effective, its impact strongly depends on the training dataset and the evaluation task.
We then compared the above results with API-based reranking using voyageai/rerank-1 (Figure 2). A reranker typically takes the top k results retrieved by an embedding model, reranks these results by relevance to the search query, and then returns the reranked top k (in our case k=30 followed by k=10). This works because rerankers are usually larger, more powerful models than embedding models and also model the interaction between the query and the document in a way that is more expressive.
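To make this two-stage flow concrete, here is a minimal sketch of retrieve-then-rerank. The models below are public stand-ins used purely for illustration (not the voyageai/rerank-1 API from our experiments), and the corpus is assumed to fit in memory.

```python
# Minimal sketch of two-stage retrieval: bi-encoder retrieval followed by reranking.
# The checkpoints below are illustrative stand-ins, not the models used in our experiments.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

embedder = SentenceTransformer("thenlper/gte-large")                # first-stage embedding model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")     # second-stage reranker

def retrieve_and_rerank(query: str, corpus: list[str], k_retrieve: int = 30, k_final: int = 10):
    # Stage 1: embed query and corpus, keep the top-k_retrieve by cosine similarity.
    corpus_emb = embedder.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
    query_emb = embedder.encode(query, convert_to_tensor=True, normalize_embeddings=True)
    hits = util.semantic_search(query_emb, corpus_emb, top_k=k_retrieve)[0]
    candidates = [corpus[h["corpus_id"]] for h in hits]

    # Stage 2: score each (query, candidate) pair with the reranker and return the top k_final.
    scores = reranker.predict([(query, doc) for doc in candidates])
    reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, _ in reranked[:k_final]]
```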
What we found was:
Rerankers usually incur additional per-query inference latency and cost relative to embedding models. However, they can be used with existing vector databases and can in some cases be more cost effective than re-embedding data with a newer embedding model. The choice of whether to use a reranker depends on your domain and your latency/cost requirements.
For FinanceBench, better retrieval translated directly to better RAG accuracy when combined with GPT-4o (see Appendix). However, in domains where retrieval was already strong, such as Databricks DocsQA, finetuning didn’t add much—highlighting that finetuning works best when retrieval is a clear bottleneck.
Here are some of the more technical details of our synthetic data generation, finetuning, and evaluation.
We finetuned two open-source embedding models:

- gte-large-en-v1.5 (0.4B parameters)
- e5-mistral-7b-instruct (7B parameters)
We then compared them against OpenAI’s text-embedding-3-large.
We evaluated all models on the following datasets from our Domain Intelligence Benchmark Suite (DIBS): FinanceBench, ManufactQA, and Databricks DocsQA.
| Dataset | Description | # Queries | # Corpus Documents |
|---|---|---|---|
| FinanceBench | Questions about SEC 10-K documents, written by human experts. Retrieval is done over individual pages from a superset of 360 SEC 10-K filings. | 150 | 53,399 |
| ManufactQA | Questions and answers sampled from public forums of an electronic devices manufacturer. | 6,787 | 6,787 |
| Databricks DocsQA | Questions about publicly available Databricks documentation, written by Databricks experts. | 139 | 7,561 |
We report Recall@10 as our main retrieval metric; it measures whether the correct document appears in the top 10 retrieved documents.
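As a reference point, the metric itself is simple to compute. Here is a small sketch that assumes each query has a single labeled gold document, as in the benchmarks above.

```python
def recall_at_k(retrieved: dict[str, list[str]], gold: dict[str, str], k: int = 10) -> float:
    """Fraction of queries whose gold document appears in the top-k retrieved documents.

    `retrieved` maps query id -> ranked list of document ids; `gold` maps query id -> correct doc id.
    Assumes a single gold document per query.
    """
    hits = sum(1 for q, doc_ids in retrieved.items() if gold[q] in doc_ids[:k])
    return hits / len(retrieved)
```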
The gold standard for embedding model quality is the MTEB benchmark, which incorporates retrieval tasks such as BEIR as well as many other non-retrieval tasks. While models such as gte-large-en-v1.5 and e5-mistral-7b-instruct do well on MTEB, we were curious to see how they performed on our internal enterprise tasks.
We trained separate models on synthetic data tailored for each of the benchmarks above:
| Training Set | Description | # Unique Samples |
|---|---|---|
| Synthetic FinanceBench | Queries generated from 2,400 SEC 10-K documents. | ~6,000 |
| Synthetic Databricks Docs QA | Queries generated from public Databricks documentation. | 8,727 |
| ManufactQA | Queries generated from electronics manufacturing PDFs. | 14,220 |
To generate the training set for each domain, we took existing documents and used LLMs such as Llama 3 405B to generate sample queries grounded in the content of each document. The synthetic queries were then filtered for quality by an LLM-as-a-judge (GPT-4o). The filtered queries and their associated documents were then used as contrastive pairs for finetuning. We used in-batch negatives for contrastive training; adding hard negatives could further improve performance (see Appendix).
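Below is an illustrative sketch of this pipeline. The prompts, the OpenAI-style client, and the helper names are simplified stand-ins (our actual setup used Llama 3 405B for generation and GPT-4o as the judge), so treat this as the shape of the approach rather than the exact implementation.

```python
# Illustrative sketch: generate queries grounded in a document, then keep only pairs
# that an LLM judge rates as answerable from that document. Prompts are simplified examples.
from openai import OpenAI

client = OpenAI()

def generate_queries(document: str, n: int = 3) -> list[str]:
    prompt = (
        f"Write {n} realistic search queries that a user could answer using ONLY the document below. "
        f"Return one query per line.\n\nDocument:\n{document[:4000]}"
    )
    resp = client.chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return [q.strip() for q in resp.choices[0].message.content.splitlines() if q.strip()]

def judge_pair(query: str, document: str) -> bool:
    prompt = (
        "Answer YES or NO: can the following query be answered from the document?\n\n"
        f"Query: {query}\n\nDocument:\n{document[:4000]}"
    )
    resp = client.chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content.strip().upper().startswith("YES")

def build_training_pairs(documents: list[str]) -> list[tuple[str, str]]:
    # Each surviving (query, document) pair becomes a positive example for contrastive finetuning.
    return [(q, doc) for doc in documents for q in generate_queries(doc) if judge_pair(q, doc)]
```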
We ran sweeps across a range of training hyperparameters.
All finetuning was done using the open source mosaicml/composer, mosaicml/llm-foundry, and mosaicml/streaming libraries on the Databricks platform.
Finetuning is only one approach for improving vector search and RAG performance; reranking, discussed above, is another.
Finetuning embeddings can be an easy win for improving retrieval and RAG in your AI systems. On Databricks, you can:

- Finetune embedding models on your own data, using synthetic training data instead of manual labels
- Deploy finetuned models and use them for vector search and RAG
- Evaluate retrieval and end-to-end RAG quality for your specific use case
Ready to try it? We’ve built a reference solution to make finetuning easier; reach out to your Databricks Account Executive or Solutions Architect to get access.
| Model | Size | FinanceBench Recall@10 (Baseline) | FinanceBench Recall@10 (Finetuned) | ManufactQA Recall@10 (Baseline) | ManufactQA Recall@10 (Finetuned) | DocsQA Recall@10 (Baseline) | DocsQA Recall@10 (Finetuned) |
|---|---|---|---|---|---|---|---|
| gte-large-en-v1.5 | 0.4B | 0.293 | 0.552 | 0.821 | 0.873 | 0.849 | 0.884 |
| e5-mistral-7b-instruct | 7B | 0.479 | 0.670 | 0.836 | 0.913 | 0.899 | 0.899 |
| text-embedding-3-large | Unknown | 0.44 | NA | 0.895 | NA | 0.95 | NA |
Table 1: Comparison of gte-large-en-v1.5, e5-mistral-7b-instruct and text-embedding-3-large. Same data as Figure 1.
Generating Synthetic Training Data
For all datasets, the queries in the training set were not the same as the queries in the test set. However, in the case of Databricks DocsQA (but not FinanceBench or ManufactQA), the documents used to generate synthetic queries were the same documents used in the evaluation set. The focus of our study is to improve retrieval on particular tasks and domains (as opposed to a zero-shot, generalizable embedding model); we therefore see this as a valid approach for certain production use cases. For FinanceBench and ManufactQA, the documents used to generate synthetic data did not overlap with the corpus used for evaluation.
There are various ways to select negative passages for contrastive training: they can either be selected randomly or pre-defined. In the first case, negative passages are drawn from within the training batch; these are often referred to as "in-batch negatives" or "soft negatives". In the second case, the user preselects text examples that are semantically difficult, i.e., passages that are potentially related to the query but subtly incorrect or irrelevant; these are sometimes called "hard negatives". In this work, we used only in-batch negatives; the literature indicates that adding hard negatives would likely lead to even better results.
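For reference, here is a minimal PyTorch sketch of the in-batch-negative objective: each query treats its own paired document as the positive and every other document in the batch as a negative. The temperature value is illustrative, not the setting used in our sweeps.

```python
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(query_emb: torch.Tensor, doc_emb: torch.Tensor, temperature: float = 0.02):
    """InfoNCE-style loss with in-batch negatives.

    query_emb, doc_emb: [batch_size, dim] embeddings where row i of doc_emb is the positive
    for query i; every other row in the batch serves as a negative for that query.
    """
    q = F.normalize(query_emb, dim=-1)
    d = F.normalize(doc_emb, dim=-1)
    logits = q @ d.T / temperature                      # [batch, batch] cosine similarities
    labels = torch.arange(q.size(0), device=q.device)   # diagonal entries are the positives
    return F.cross_entropy(logits, labels)
```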
Finetuning Details
For all finetuning experiments, the maximum sequence length was set to 2048 tokens, and we evaluated all checkpoints. For all benchmarking, corpus documents were truncated to 2048 tokens (not chunked), which was a reasonable constraint for our particular datasets. We chose the strongest baselines on each benchmark after sweeping over query prompts and pooling strategies.
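For context, the pooling strategy determines how per-token hidden states are collapsed into a single embedding, and a query prompt is simply an instruction prepended to the query before embedding. The sketch below shows mean pooling with Hugging Face transformers; the loading details and the query prefix are illustrative assumptions, and CLS pooling would instead take the first token's hidden state.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# Illustrative mean pooling over token embeddings; not necessarily the best setting for every model.
tokenizer = AutoTokenizer.from_pretrained("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)
model = AutoModel.from_pretrained("Alibaba-NLP/gte-large-en-v1.5", trust_remote_code=True)

def embed(texts: list[str], max_length: int = 2048) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, max_length=max_length, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state      # [batch, seq_len, dim]
    mask = batch["attention_mask"].unsqueeze(-1)        # ignore padding tokens when averaging
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)

# A query prompt prepends an instruction to the query before embedding it, e.g.:
query_emb = embed(["query: How do I create a Delta table?"])
```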
Improving RAG Performance
A RAG system consists of both a retriever and a generative model: the retriever selects a set of documents relevant to a particular query and feeds them to the generative model. We selected the best finetuned gte-large-en-v1.5 models and used them for the first retrieval stage of a simple RAG system (following the general approach described in Long Context RAG Performance of LLMs and The Long Context RAG Capabilities of OpenAI o1 and Google Gemini). In particular, we retrieved k=10 documents, each with a maximum length of 512 tokens, and used GPT-4o as the generative LLM. Final accuracy was evaluated using an LLM-as-a-judge (GPT-4o).
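A condensed sketch of this evaluation loop is below, assuming a `retrieve` callable backed by the embedding index (hypothetical here) and an OpenAI-style client; the prompts and the character-based truncation are illustrative simplifications.

```python
# Sketch of a simple RAG answer-and-judge loop; prompts and truncation are illustrative.
from openai import OpenAI

client = OpenAI()

def answer_with_rag(query: str, retrieve, k: int = 10, max_doc_tokens: int = 512) -> str:
    # `retrieve` is assumed to return the top-k documents for the query (e.g. from the finetuned gte index).
    docs = retrieve(query, k=k)
    context = "\n\n".join(doc[: max_doc_tokens * 4] for doc in docs)  # rough character-based truncation
    prompt = f"Answer the question using only the context below.\n\nContext:\n{context}\n\nQuestion: {query}"
    resp = client.chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content

def judge_answer(query: str, answer: str, reference: str) -> bool:
    prompt = (
        "Answer YES or NO: does the candidate answer correctly answer the question, "
        f"judged against the reference answer?\n\nQuestion: {query}\nReference: {reference}\nCandidate: {answer}"
    )
    resp = client.chat.completions.create(model="gpt-4o", messages=[{"role": "user", "content": prompt}])
    return resp.choices[0].message.content.strip().upper().startswith("YES")
```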
On FinanceBench, Figure 3 shows that using a finetuned embedding model leads to an improvement in downstream RAG accuracy. Additionally, it is competitive with text-embedding-3-large. This is expected, since finetuning gte led to a large improvement in Recall@10 over baseline gte (Figure 1). This example highlights the efficacy of embedding model finetuning on particular domains and datasets.
On the Databricks DocsQA dataset, we do not find any improvements when using the finetuned gte model above baseline gte. This is somewhat expected, since the margins between the baseline and finetuned models in Figures 1 and 2 are small. Interestingly, even though text-embedding-3-large has (slightly) higher Recall@10 than any of the gte models, it does not lead to higher downstream RAG accuracy. As shown in Figure 1, all the embedding models had relatively high Recall@10 on the Databricks DocsQA dataset; this indicates that retrieval is likely not the bottleneck for RAG, and that finetuning an embedding model on this dataset is not necessarily the most fruitful approach.
We would like to thank Quinn Leng and Matei Zaharia for their feedback on this blog post.