
During my MBA internship this summer, I worked on several data projects. My favorite project was building a "virtual analyst" for our strategy team using AI/BI Genie.

AI/BI Genie is a new text-to-SQL data analysis tool that lets users ask questions of their data in natural language and receive SQL-generated data tables and charts in return. Once properly set up and curated, it allows any business user to run data analytics queries. It's built on AI foundation models and integrates with Unity Catalog for governance.

Data Curation Process

A lot of enterprise data today lives across scattered tables. Pulling a specific piece of information often requires searching, merging, and cleaning tables with SQL (or an equivalent language) to compile dashboards and execute data pulls.

As part of my internship, I built a tool that bypasses these complex processes, making data analysis 10x more efficient. After polling my team for their most critical and common data questions, I set out to curate a custom Genie Space that could quickly and accurately answer these requests. I took a 3-part approach:

  1. Defining data
  2. Tactical & narrow reasoning
  3. Output cleansing

Defining the Data

After connecting the Genie Space to 4 large data tables, I sought to provide the Genie Space with a contextual understanding of each dataset and where they sat in relation to each other. This meant curating a set of instructions around critical data definitions.

First, I tagged first-order definitions: quick descriptions explaining the columns of every dataset and what each dataset covered. Then, I tagged second-order definitions: jargon and acronyms specific to my team's language that weren't necessarily represented directly in the tables. For example, "UCOs" meant use cases and "BUs" meant business units.
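
One way such first-order definitions can live alongside the data itself is as Unity Catalog table and column comments, which the Genie Space can draw on. The snippet below is a minimal sketch under that assumption; the schema, table, and column names are made up for illustration and are not the project's actual datasets.

    -- Hypothetical first-order definitions attached as Unity Catalog comments
    -- (schema, table, and column names are illustrative only)
    COMMENT ON TABLE strategy.use_cases IS
      'One row per customer use case (UCO); joins to strategy.business_units on bu_id.';

    ALTER TABLE strategy.use_cases
      ALTER COLUMN bu_id COMMENT 'Foreign key to strategy.business_units (BU = business unit).';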

Tactical and Narrow Reasoning

Once I had set up the Genie Space to comfortably understand basic definitions around the data, I had to extend it to better approach common data questions, beyond simply reading out values. To do this, I added instructions to help it answer both high-level data questions and specific edge cases.

Luckily, Genie Spaces make tactical, high-level reasoning easy because you can provide sample SQL code as templates for how you expect common data question types to be approached. I added SQL snippets covering, for example, the best way to join specific data tables and how to calculate business measures such as time series data.
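
As an illustration of what such a template might look like, here is a minimal sketch of a join plus a monthly time series. The schema (strategy.use_cases, strategy.business_units, and their columns) is an assumption made up for the example, not the project's actual tables.

    -- Hypothetical template: "How many use cases were created per month, by business unit?"
    -- (All table and column names are illustrative.)
    SELECT
      b.bu_name,                                   -- human-readable business unit name
      date_trunc('MONTH', u.created_at) AS month,  -- monthly time series bucket
      COUNT(*) AS use_case_count
    FROM strategy.use_cases AS u
    JOIN strategy.business_units AS b
      ON u.bu_id = b.bu_id                         -- preferred join key between the two tables
    GROUP BY b.bu_name, date_trunc('MONTH', u.created_at)
    ORDER BY month, b.bu_name;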

For narrow reasoning around specific "edge case" queries, I added custom instructions, including how to interpret niche strategy questions that may require a non-intuitive analytical approach. For example, I defined terms like slippage in the Databricks context and instructed the Genie Space to interpret it as a specific trend within one data table rather than by its usual business definition.
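
The team's actual definition of slippage isn't reproduced here, but as a sketch of the general pattern, an edge-case instruction can pair a short natural-language rule with the query shape the Genie Space should fall back on. Every name in the example below is made up.

    -- Hypothetical edge-case instruction (term, table, and column names are invented):
    -- "When a user asks about 'stalled use cases', do not apply the generic meaning of 'stalled';
    --  interpret it as use cases whose status has not changed in 90 or more days."
    SELECT *
    FROM strategy.use_cases
    WHERE datediff(current_date(), last_status_change) >= 90;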

Output Cleansing

Finally, I instructed the Genie Space to output its answers in the format most useful to our strategy team. This came with a range of instructions, including the following (a sketch of an output shaped by these rules follows the list):

  • Ensure all SQL outputs include a comment at the top stating the ask, as well as in-line comments for most sections
  • Always show the name of a data item versus just its ID string
  • When showing X object, always include A+B+C attributes
  • Return specific error messages if the query can't be computed using the included data tables rather than just returning a null outcome
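
Put together, a response shaped by these output rules might look something like the sketch below; the ask, tables, columns, and attributes are all illustrative, and the error-message behavior would come from a separate plain-text instruction rather than the SQL itself.

    -- Ask: list active use cases for the EMEA business unit, with owner, status, and creation date.
    -- (Hypothetical output; all names are illustrative.)
    SELECT
      u.use_case_name,  -- show the name, not just the use case ID
      b.bu_name,        -- business unit name rather than bu_id
      u.owner,          -- attribute A
      u.status,         -- attribute B
      u.created_at      -- attribute C
    FROM strategy.use_cases AS u
    JOIN strategy.business_units AS b
      ON u.bu_id = b.bu_id
    WHERE b.bu_name = 'EMEA'
      AND u.status = 'Active';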

Limitations

Through this 2-week curation process, I increased this custom Genie Space's answer accuracy from 13% to 86% on the most critical and commonly asked questions within our strategy team.

A limitation of this curation approach is that it has diminishing returns to scale. Up to a certain point, adding more instructions meant more accurate responses and only a marginally slower runtime. However, as more data tables are added, compounding permutations of instructions are required to fully map the relations between data elements. Accuracy begins to fall as it becomes hard for the Genie Space to settle on a clear course of action; being over-specific sometimes ends up confusing the output.

Conclusion

With Databricks Genie, anyone with a working knowledge of SQL, along with the company's jargon and datasets, can build a bespoke data analytics tool, no AI engineering needed. And anyone with a grasp of the English language can then use the finished Genie Space to grab data faster than ever before. You go from a scrambled mess of datasets to a magic tool that pulls data in the language of your workflow.

It has been an incredible summer at Databricks being able to work on several cross-functional projects. I'm especially grateful to get to experiment with these new data tools and get a peek into the future of what's possible for enterprises in the age of advanced business intelligence.

"A sufficiently advanced technology is indistinguishable from magic."

Learn more about Databricks AI/BI Genie Spaces here.


If you're interested in learning more about our intern and new grad roles, check out our University Recruiting page.

