In collaboration with Intercontinental Exchange (ICE), we are breaking new ground by pioneering structured RAG applications that empower business users to effortlessly extract insights from their most valuable asset: structured data (tables). This application eliminates the need for end users to understand data models, schemas, or SQL queries. To achieve this, we leveraged the full stack of Mosaic AI products – Unity Catalog, Vector Search, Foundation Model APIs, and Model Serving – implementing the end-to-end lifecycle of RAG with robust evaluation. We adapted the widely recognized Spider evaluation benchmark for state-of-the-art text-to-SQL applications to suit our enterprise use case. By comparing syntax match and execution match metrics between ground truth queries and LLM-generated queries, we can identify incorrect queries for few-shot learning, thereby refining the quality of our SQL query outputs.
Retrieval-augmented generation (RAG) has revolutionized how enterprises harness their unstructured knowledge base using Large Language Models (LLMs), and its potential has far-reaching impacts. Intercontinental Exchange (ICE) is a global financial organization operating exchanges, clearing houses, data services, and mortgage technology, including the largest stock exchange group in the world, the New York Stock Exchange (NYSE). ICE is breaking new ground by pioneering a seamless natural language search solution for structured data products, built on a structured RAG pipeline that requires no data movement from the pre-existing application. This solution eliminates the need for end users to understand data models, schemas, or SQL queries.
The ICE team collaborated with Databricks engineers to leverage the full stack of Databricks Mosaic AI products (Unity Catalog, Vector Search, Foundation Model APIs, and Model Serving) and implement an end-to-end RAG lifecycle with robust evaluation. The team adapted the widely recognized Spider evaluation benchmark for state-of-the-art text-to-SQL applications to suit their enterprise use case. By comparing syntax match and execution match metrics between ground truth queries and LLM-generated queries, ICE is able to identify incorrect queries for few-shot learning, thereby refining the quality of their SQL query outputs.
For the purpose of confidentiality, synthetic data is referenced in the code snippets shown throughout this blog post.
The team leveraged Vector Search for indexing table metadata to enable rapid retrieval of relevant tables and columns. Foundation Model APIs gave ICE access to a suite of large language models (LLMs), facilitating seamless experimentation with various models during development.
Inference Tables, part of the Mosaic AI Gateway, were used to track all incoming queries and outgoing responses. To compute evaluation metrics, the team compared LLM-generated responses with ground truth SQL queries. Incorrect LLM-generated queries were then streamed into a query sample table, providing valuable data for few-shot learning.
This closed-loop approach enables continuous improvement of the text-to-SQL system, allowing for refinement and adaptation to evolving SQL queries. This system is designed to be highly configurable, with component settings easily adjustable via a YAML file. This modularity ensures the system remains adaptable and future-proof, ready to integrate with best-in-breed solutions for each component.
Read on for more details on how ICE and Databricks collaborated to build this text-to-SQL system.
To generate accurate SQL queries from natural language inputs, we used few-shot learning in our prompt. We further augmented the input question with relevant context (table DDLs, sample data, sample queries) using two specialized retrievers: ConfigRetriever and VectorSearchRetriever.
ConfigRetriever reads context from a YAML configuration file, allowing users to quickly experiment with different table definitions and sample queries without the need to create tables and vector indexes in Unity Catalog. This retriever provides a flexible and lightweight way to test and refine the text-to-SQL system. Here is an example of the YAML configuration file:
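(The configuration below is an illustrative sketch rather than the production file: the table names, endpoint name, and queries are synthetic placeholders.)

```yaml
# Illustrative sketch only -- every name and value here is synthetic.
llm:
  endpoint: databricks-meta-llama-3-1-70b-instruct  # assumed Foundation Model endpoint name
  temperature: 0.0
  max_tokens: 512
retriever: config   # use ConfigRetriever; switch to vector_search for production
tables:
  - name: samples.finance.trades
    ddl: |
      CREATE TABLE trades (
        trade_id BIGINT COMMENT 'Unique trade identifier',
        symbol   STRING COMMENT 'Ticker symbol',
        trade_ts TIMESTAMP COMMENT 'Execution timestamp',
        price    DOUBLE COMMENT 'Execution price in USD'
      ) COMMENT 'Synthetic equity trade data'
sample_queries:
  - question: What was the average trade price for IBM yesterday?
    sql: |
      SELECT AVG(price) FROM trades
      WHERE symbol = 'IBM' AND trade_ts >= current_date() - INTERVAL 1 DAY
```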
VectorSearchRetriever reads context from two metadata tables: table_definitions and sample_queries. These tables store detailed information about the database schema and sample queries, and they are indexed using Vector Search to enable efficient retrieval of relevant context. By leveraging the VectorSearchRetriever, the text-to-SQL system can tap into a rich source of contextual information to inform its query generation.
We created two metadata tables to store information about the tables and queries:

- table_definitions: The table_definitions table stores metadata about the tables in the database, including column names, column types, column descriptions/comments, and table descriptions. Table-level descriptions can be defined using COMMENT ON TABLE, and individual column comments/descriptions can be defined using ALTER TABLE {table_name} ALTER COLUMN {column} COMMENT "{comment}". Table DDLs can be extracted from a Delta table using the SHOW CREATE TABLE command (see the sketch after this list). These table- and column-level descriptions are tracked and versioned using GitHub. The table_definitions table is indexed by the table Data Definition Language (DDL) via Vector Search, enabling efficient retrieval of relevant table metadata.
- sample_queries: The sample_queries table stores pairs of questions and corresponding SQL queries, which serve as a starting point for the text-to-SQL system. This table is initialized with a set of predefined question-SQL pairs. During evaluation, incorrect LLM-generated queries are streamed into the sample_queries table, and the ground truth for these incorrect queries can be utilized as context for related upcoming queries.

To enable efficient retrieval of relevant context, we indexed both metadata tables using Vector Search so that the most relevant tables can be retrieved via similarity search.
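As a concrete illustration of those commands, the notebook snippet below attaches table- and column-level descriptions and pulls the DDL back out; the table name and comment text are synthetic placeholders.

```python
# Minimal sketch, assuming a Databricks notebook where `spark` is defined.
# The table name and comments below are synthetic placeholders.
table_name = "samples.finance.trades"

# Attach a table-level description.
spark.sql(f"COMMENT ON TABLE {table_name} IS 'Synthetic equity trade data'")

# Attach a column-level description.
spark.sql(f"ALTER TABLE {table_name} ALTER COLUMN price COMMENT 'Execution price in USD'")

# Extract the DDL that gets stored and indexed in the table_definitions table.
ddl = spark.sql(f"SHOW CREATE TABLE {table_name}").collect()[0][0]
print(ddl)
```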
When a question is submitted, an embedding vector is created and matched against the vector indexes of the table_definitions and sample_queries tables. This retrieves the relevant table DDLs, sample data, and example question-SQL pairs.
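A minimal sketch of that retrieval step using the databricks-vectorsearch client might look like the following; the endpoint and index names are hypothetical placeholders.

```python
# Sketch of context retrieval via Databricks Vector Search (databricks-vectorsearch).
# Endpoint and index names are hypothetical placeholders.
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()

table_index = vsc.get_index(
    endpoint_name="text2sql-endpoint",
    index_name="main.text2sql.table_definitions_index",
)
query_index = vsc.get_index(
    endpoint_name="text2sql-endpoint",
    index_name="main.text2sql.sample_queries_index",
)

question = "What was the average trade price for IBM yesterday?"

# Retrieve the most relevant table DDLs and example question-SQL pairs.
relevant_tables = table_index.similarity_search(
    query_text=question, columns=["table_name", "ddl"], num_results=3
)
relevant_examples = query_index.similarity_search(
    query_text=question, columns=["question", "sql"], num_results=3
)
```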
The retrieved context is used to augment the input question, creating a prompt that provides the LLM with a rich understanding of the relevant tables, data, and queries. The prompt includes the retrieved table definitions, sample data, and example question-SQL pairs alongside the user's question.
Here is an example of a prompt augmented with retrieved context:
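(The prompt below is a synthetic illustration of the general shape; the tables, sample rows, and queries are placeholders, not production content.)

```
You are an expert SQL assistant. Generate a Spark SQL query that answers the
user's question using only the tables described below.

-- Table definitions --
CREATE TABLE trades (
  trade_id BIGINT COMMENT 'Unique trade identifier',
  symbol STRING COMMENT 'Ticker symbol',
  trade_ts TIMESTAMP COMMENT 'Execution timestamp',
  price DOUBLE COMMENT 'Execution price in USD'
) COMMENT 'Synthetic equity trade data'

-- Sample data --
trade_id | symbol | trade_ts            | price
1001     | IBM    | 2024-06-03 09:30:01 | 168.42

-- Example question-SQL pairs --
Q: How many IBM trades were executed yesterday?
A: SELECT COUNT(*) FROM trades WHERE symbol = 'IBM' AND trade_ts >= current_date() - INTERVAL 1 DAY

-- Question --
What was the average trade price for IBM yesterday?
```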
The augmented prompt is sent to an LLM of choice, e.g., Llama 3.1 70B, via the Foundation Model APIs. The LLM generates a response based on the provided context, from which we use a regular expression to extract the SQL statement.
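A simplified version of that extraction step is sketched below; the exact pattern used in production may differ.

```python
import re

def extract_sql(llm_response: str) -> str:
    """Pull the first SQL statement out of an LLM response.

    A simplified sketch: it assumes the statement starts with SELECT or WITH and
    runs until a semicolon or the end of the text. The production pattern may
    also strip markdown code fences around the query.
    """
    match = re.search(r"\b(SELECT|WITH)\b.*?(?=;|$)", llm_response,
                      re.IGNORECASE | re.DOTALL)
    return match.group(0).strip() if match else ""

# Example usage with a synthetic response.
response = "Sure! Here is the query:\nSELECT AVG(price) FROM trades WHERE symbol = 'IBM';"
print(extract_sql(response))  # SELECT AVG(price) FROM trades WHERE symbol = 'IBM'
```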
We adapted the popular Spider benchmark to comprehensively assess the performance of our text-to-SQL system. SQL statements can be written in various syntactically correct forms while producing identical results. To account for this flexibility, we employed two complementary evaluation approaches: syntactic matching and execution matching.
To ensure compatibility with the Spider evaluation framework, we preprocessed the generated LLM responses to standardize their formats and structures. This step involves modifying the SQL statements to conform to the expected input format of the evaluation framework, for example, ensuring that table aliases are declared with the as keyword and removing column aliases from the select clause.
After generating the initial response, we applied a post-processing function to extract the SQL statement from the generated text. This critical step isolates the SQL query from any surrounding text or metadata, enabling accurate evaluation and comparison with the ground truth SQL statements.
This streamlined evaluation-with-processing approach offers significant advantages.
By automating these processes, we ensure consistent, objective, and scalable evaluation of our text-to-SQL system's performance, paving the way for continuous improvement and refinement. We will provide additional details on our evaluation process later on in this blog post.
We evaluated the syntactic correctness of our generated SQL queries by computing an F1 score for component matching and an accuracy score for exact matching. More details are below:
For this evaluation, we have 48 queries with ground truth SQL statements. Spider implements SQL Hardness Criteria, which categorizes queries into four levels of difficulty: easy, medium, hard, and extra hard. There were 0 easy, 36 medium, 7 hard, and 5 extra hard queries. This categorization helps analyze model performance across different levels of query difficulty.
Prior to computing syntactic matching metrics, we made sure that the table schemas conformed to Spider's format. In Spider, table names, column names, and column types are all defined in individual lists that are linked together by indexes. Here is an example of table definitions:
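(The snippet below is a simplified, synthetic example in the style of Spider's tables.json schema file, not one of the actual table definitions used in this project.)

```json
{
  "db_id": "synthetic_trading",
  "table_names": ["trades", "symbols"],
  "column_names": [
    [-1, "*"],
    [0, "trade id"],
    [0, "symbol"],
    [0, "price"],
    [1, "symbol"],
    [1, "company name"]
  ],
  "column_types": ["text", "number", "text", "number", "text", "text"]
}
```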
Each column name is a pair of the table it belongs to and the column name, where the table is represented as an integer index into the table_names list. The column types are listed in the same order as the column names.
Another caveat is that table aliases need to be defined with the as keyword. Column aliases in the select clause are not supported and are removed before evaluation. SQL statements from both the ground truth and the prediction are preprocessed according to these requirements before running the evaluation.
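A lightweight sketch of this preprocessing, covering the two caveats above, might look like the following; the production logic likely handles more cases.

```python
import re

def normalize_for_spider(sql: str) -> str:
    """Lightweight normalization sketch (real preprocessing may be more involved).

    - Ensures table aliases are introduced with the as keyword,
      e.g. "FROM trades t" -> "FROM trades as t".
    - Strips column aliases from the select clause,
      e.g. "SELECT AVG(price) AS avg_price" -> "SELECT AVG(price)".
    """
    # Add `as` between a table name and a bare alias after FROM/JOIN.
    sql = re.sub(
        r"\b(FROM|JOIN)\s+(\w+)\s+(?!(?:AS|ON|WHERE|GROUP|ORDER|LEFT|RIGHT|INNER|OUTER|JOIN|LIMIT)\b)(\w+)",
        r"\1 \2 as \3",
        sql,
        flags=re.IGNORECASE,
    )
    # Drop column aliases in the select list.
    sql = re.sub(r"\s+AS\s+\w+\s*(,|\s+FROM\b)", r"\1", sql, flags=re.IGNORECASE)
    return sql

print(normalize_for_spider("SELECT AVG(price) AS avg_price FROM trades t"))
# SELECT AVG(price) FROM trades as t
```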
In addition to syntactic matching, we implemented execution matching to evaluate the accuracy of our generated SQL queries. We executed both the ground truth SQL queries and the LLM-generated SQL queries against the same dataset and compared the resulting dataframes for equivalence.
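A minimal sketch of such an execution check in PySpark, assuming both queries run against the same workspace, could look like this:

```python
from pyspark.sql import DataFrame

def execution_match(pred_sql: str, gt_sql: str) -> bool:
    """Return True when the predicted and ground-truth queries produce the same
    result set, ignoring row order.

    A sketch that assumes a Databricks notebook where `spark` is defined and that
    both queries return the same columns in the same positions; the production
    comparison may track additional metrics beyond a simple pass/fail.
    """
    pred_df: DataFrame = spark.sql(pred_sql)
    gt_df: DataFrame = spark.sql(gt_sql)
    # Equal row counts and an empty symmetric difference imply identical results.
    return (
        pred_df.count() == gt_df.count()
        and pred_df.exceptAll(gt_df).count() == 0
        and gt_df.exceptAll(pred_df).count() == 0
    )
```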
In summary, this dual-pronged evaluation strategy, involving both syntactic and execution matching, allowed us to robustly and deterministically assess our text-to-SQL system's performance. By analyzing both the syntactic accuracy and the functional equivalence of generated queries, we gained comprehensive insights into our system's capabilities. This approach not only provided a more nuanced understanding of the system's strengths but also helped us pinpoint specific areas for improvement, driving continuous refinement of our text-to-SQL solution.
To effectively monitor our text-to-SQL system's performance, we leveraged the Inference Table feature within Model Serving. Inference Tables continuously ingest serving request inputs (user-submitted questions) and responses (LLM-generated answers) from Mosaic AI Model Serving endpoints. By consolidating all questions and responses into a single Inference Table, we simplified monitoring and diagnostics processes. This centralized approach enables us to detect trends and patterns in LLM behavior. We then extract the generated SQL queries from the Inference Table and compare them against the ground truth SQL statements to evaluate model performance.
To create ground-truth SQL statements, we extracted user questions from the Inference Table, downloaded the table as a .csv file, and then imported it into an open-source labeling tool called Label Studio. Subject matter experts add ground-truth SQL statements in Label Studio, and the labeled data is imported back into Databricks as an input table and merged with the Inference Table using the table key databricks_requests_id.
We then evaluated the predictions against the ground truth SQL statements using the syntactic and execution matching methods discussed above. Incorrect queries can be detected and logged into the sample_queries table. This process allows for a continuous loop that identifies the incorrect SQL queries and then uses those queries for few-shot learning, enabling the model to learn from its mistakes and improve its performance over time. This closed-loop approach ensures that the model is continuously learning and adapting to changing user needs and query patterns.
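That feedback step can be sketched as an append into the sample_queries table; the table and column names below are hypothetical placeholders.

```python
# Sketch of the feedback loop, assuming a Databricks notebook where `spark` is defined.
# Table and column names are hypothetical placeholders.
from pyspark.sql import functions as F

eval_df = spark.table("main.text2sql.eval_results")

# Keep only questions whose generated SQL failed the execution match, paired with
# the ground-truth SQL supplied by subject matter experts.
incorrect = (
    eval_df.filter(~F.col("execution_match"))
           .select(F.col("question"), F.col("ground_truth_sql").alias("sql"))
)

# Append them to sample_queries so they become few-shot examples for future prompts.
incorrect.write.mode("append").saveAsTable("main.text2sql.sample_queries")
```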
We chose to implement this text-to-SQL application as a Python library, designed to be fully modular and configurable. Configurable components (retrievers, LLM names, inference parameters, etc.) are loaded dynamically based on a YAML configuration file, allowing easy customization and extension of the application. A basic ConfigRetriever can be utilized for quick testing based on hard-coded context in the YAML configuration. For production-level deployment, the VectorSearchRetriever is used to dynamically retrieve table DDLs, sample queries, and data from the Databricks Lakehouse.
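Conceptually, the configuration-driven assembly might look like the following sketch; the module path, class names, and config keys are hypothetical illustrations, not the library's actual API.

```python
import yaml

# Hypothetical module layout and factory mapping, shown only to illustrate the
# config-driven design; the real library's structure may differ.
from text2sql.retrievers import ConfigRetriever, VectorSearchRetriever

RETRIEVERS = {"config": ConfigRetriever, "vector_search": VectorSearchRetriever}

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Instantiate whichever retriever the YAML file names, with its parameters.
retriever = RETRIEVERS[cfg["retriever"]](**cfg.get("retriever_params", {}))
```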
We deployed this application as a Python .whl file and uploaded it to a Unity Catalog Volume so it can be logged with the model as a dependency. We can then seamlessly serve this model using Model Serving endpoints. To invoke a query from an MLflow model, use the following code snippet:
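(The original snippet is not reproduced here; the sketch below uses the standard MLflow pyfunc API, with a hypothetical model URI and an assumed single-question input format.)

```python
import mlflow

# Hypothetical Unity Catalog model name and version; adjust to the registered model.
model = mlflow.pyfunc.load_model("models:/main.text2sql.text2sql_model/1")

# The input format depends on the logged model signature; a single-question
# dictionary is assumed here.
result = model.predict({"question": "What was the average trade price for IBM yesterday?"})
print(result)
```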
In just five weeks, the Databricks and ICE team was able to develop a robust text-to-SQL system that answers non-technical business users’ questions with remarkable accuracy: 77% syntactic accuracy and 96% execution matches across ~50 queries. This achievement underscores two important insights:
The Databricks Data Intelligence Platform's comprehensive capabilities, including data storage and governance (Unity Catalog), state-of-the-art LLM querying (Foundation Model APIs), and seamless application deployment (Model Serving), eliminated the technical complexities typically associated with integrating diverse tool stacks. This streamlined approach enabled us to deliver a high-caliber application in several weeks' time.
Ultimately, the Databricks Platform has empowered ICE to accelerate the journey from raw financial data to actionable insights, revolutionizing their data-driven decision-making processes.
This blog post was written in collaboration with the NYSE/ICE AI Center of Excellence team led by Anand Pradhan, along with Suresh Koppisetti (Director of AI and Machine Learning Technology), Meenakshi Venkatasubramanian (Lead Data Scientist) and Lavanya Mallapragada (Data Scientist).