by Tim Lortz, Parth Vakil and Lisa Sion
Over the past few months, interest in Large Language Models (LLMs) from Public Sector agencies has skyrocketed as LLMs are fundamentally changing the expectations that people have in their interactions with computers and data. From Databricks' point of view, practically every Public Sector customer and prospect we interact with feels a mandate to inject LLMs into their mission. We repeatedly hear questions about what LLMs (like Databricks' Dolly) are, what they can be used for, and how the Databricks Lakehouse will support LLM-related applications. In this post, we will touch on these questions in the context of the unique needs, opportunities and constraints of Public Sector organizations. We will also focus on the benefits of creating, owning and curating your own LLM versus adopting a technology like ChatGPT that requires sharing data with a third party.
Today's LLMs represent the latest version in a series of innovations in natural language processing, starting roughly in 2017 with the rise of the transformer model architecture. These transformer-based models have long possessed uncanny abilities to understand human language well enough to accomplish tasks such as identifying sentiment; extracting named people, places and things; and translating documents from one language to another. They have also been capable of generating interesting text from a prompt, with varying degrees of quality and accuracy. More recently, researchers and developers have discovered that very large language models, "pre-trained" on very large and diverse sources of text, can be "fine-tuned" to follow a variety of instructions from a human to generate useful information.
Previously, the best practice was to train separate models for each language-related task. The model training process required resources: curated data, compute (typically one or more GPUs), and advanced data science and software development expertise. While such models can be highly accurate, there are clearly resource constraints - both in terms of computation and human effort - when scaling up their usage. With the rapid rise of ChatGPT to stardom, we now see that a single LLM - with the appropriate amount of context and the right prompt - can be used to deliver on many different tasks, sometimes with better accuracy than a more specialized model. And the LLMs' ability to generate new text - "Generative AI" - is both fascinating and extremely useful.
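To make the "one model, many tasks" idea concrete, here is a minimal sketch in Python. It only builds the prompt text; the template wording, the `TASK_TEMPLATES` dictionary and the `build_prompt` helper are hypothetical names chosen for illustration, and sending the prompt to a model is left out because that depends on the LLM and serving setup you choose.

```python
# Illustrative only: one general-purpose LLM can cover several tasks
# when the task is expressed in the prompt. All names here are hypothetical.
TASK_TEMPLATES = {
    "sentiment": (
        "Classify the sentiment of the following text as positive, "
        "negative, or neutral.\n\nText: {text}\n\nSentiment:"
    ),
    "entities": (
        "List the people, places and organizations mentioned in the "
        "following text.\n\nText: {text}\n\nEntities:"
    ),
    "translate": (
        "Translate the following text into {target_language}."
        "\n\nText: {text}\n\nTranslation:"
    ),
}


def build_prompt(task: str, text: str, **options: str) -> str:
    """Fill in the template for `task`.

    Raises KeyError if the task is unknown or a required option
    (such as target_language) is missing.
    """
    template = TASK_TEMPLATES[task]
    return template.format(text=text, **options)


if __name__ == "__main__":
    feedback = "The permit office was quick and helpful."
    print(build_prompt("sentiment", feedback))
    print(build_prompt("translate", feedback, target_language="Spanish"))
```

Previously, each of these three tasks would typically have required its own trained model; in this sketch, only the prompt changes while the model stays the same.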
Private sector organizations have reported amazing benefits from LLMs, such as code generation and migration, automated customer feedback categorizations and responses, call center chatbots, report generation, and much more. As a microcosm of many different industries, Public Sector agencies have the same LLM opportunities, in addition to other unique needs. Common Public Sector use cases include:
While certainly powerful, LLMs also introduce a new set of challenges that is amplified by some of the operating constraints native to Public Sector organizations. Let's dissect a few of these and align them with the Databricks Lakehouse capabilities:
Most Public Sector organizations have strict regulatory controls around their data. These controls exist for privacy, security, and, in some cases, the need to preserve secrecy. Even the simple task of asking an LLM a question or set of questions could reveal proprietary information. Furthermore, most Federal agencies will need to fine-tune LLMs to meet their particular requirements. For these reasons, it's logical to assume that Public Sector agencies will be limited in their use of public models. It's likely that they'll require the models to be fine-tuned in an environment that ensures their confidentiality and security, and that interactions with the models via various prompting methods are also confidential.
Databricks' Lakehouse platform has the tools necessary to develop and deploy end-to-end LLM applications. (More on that later.) Moreover, Databricks possesses the necessary certifications to process data for the vast majority of U.S. Public Sector organizations. Databricks is a trusted and capable partner for organizations seeking to harness the full power of LLMs without the risks that come from leveraging proprietary LLMs-as-a-service like ChatGPT or Bard.
Beyond Databricks, the industry is seeing increased evidence that open-source LLMs - used appropriately - can deliver results that approach parity with the leading proprietary LLMs. The evidence is strongest in use cases where the proprietary LLMs must understand nuanced context or instructions on which they have not previously been trained. In these cases, open-source LLMs can be either prompted with or fine-tuned on organization-specific data to deliver astounding results. With this approach, organizations can achieve world-class results with modest amounts of compute and development time, without data ever leaving approved boundaries. For Public Sector organizations, this represents a significant advantage that cannot be overlooked.
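As a rough illustration of "prompting with organization-specific data," the Python sketch below picks the internal passages most relevant to a question and places them in the prompt. It assumes a deliberately naive word-overlap ranking as a stand-in for a real retrieval step such as vector search, and the function names and sample passages are hypothetical. Everything runs locally, so no data leaves the environment in which the code executes.

```python
import re

# Illustrative only: naive keyword overlap stands in for a real
# retrieval system. All names and sample data are hypothetical.


def tokenize(text: str) -> set:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def top_passages(question: str, passages: list, k: int = 2) -> list:
    """Return the k passages sharing the most words with the question."""
    question_words = tokenize(question)
    ranked = sorted(
        passages,
        key=lambda passage: len(question_words & tokenize(passage)),
        reverse=True,
    )
    return ranked[:k]


def build_grounded_prompt(question: str, passages: list, k: int = 2) -> str:
    """Build a prompt that asks the model to answer from the given context."""
    context = "\n".join(f"- {p}" for p in top_passages(question, passages, k))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )


if __name__ == "__main__":
    internal_docs = [
        "Passport renewals take six to eight weeks to process.",
        "The cafeteria opens at 7 am.",
        "Expedited passport service requires an additional fee.",
    ]
    print(build_grounded_prompt(
        "How long does a passport renewal take?", internal_docs
    ))
```

The resulting prompt can then be passed to whichever open-source LLM is hosted inside the approved boundary. Fine-tuning is the alternative when the model needs to absorb the material rather than read it at question time.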
Databricks' belief in the power of open-source LLMs is reinforced by our release of Dolly 2.0, the first open-source, instruction-following LLM, fine-tuned on a human-generated instruction dataset licensed for research and commercial use. Dolly's release has been followed by a wave of other capable open-source LLMs, some of which have very impressive performance. Databricks strives to give Public Sector organizations a platform to build applications with their LLM of choice - open-source or commercial - and we are excited for what's yet to come.
Data estate modernization continues to be top of mind for most technical leaders in the Public Sector. Mostly gone are the days of on-premises data warehouses, typically replaced by a data warehouse or lakehouse in the cloud. Organizations that have not yet migrated to the cloud - or that opted for a data warehouse in the cloud - now face another inflection point: how to adopt LLMs in an architecture that can't accommodate them? Given the immense potential of LLMs to impact agencies' missions and the public servants delivering on them, it is critical to establish a future-proof architecture. Enter the lakehouse.
Databricks has long been a capable home for machine learning (ML) and artificial intelligence (AI) workloads. Customers have been using production-grade LLMs and their predecessors on Databricks for years, taking advantage of features such as:
None of these features are offered in a data warehouse, even in the cloud. To use LLMs in conjunction with a data warehouse, an organization would need to procure other software services for all facets of the model training and deployment processes, and send data back and forth between these services. Only the Databricks Lakehouse architecture offers the architectural simplicity of performing all LLM operations in a single platform, fully delivering on the benefits explained in our discussion of data sovereignty above.
At Data and AI Summit 2023, Databricks presented Lakehouse AI, which adds several major new LLM-related features that significantly simplify the architecture for LLMOps, including:
Government agencies have struggled with a persistent "brain drain" in recent years, particularly in roles that overlap with hot technological trends such as cybersecurity, cloud computing, and ML/AI. The current intense focus on LLMs is driving even more demand for talented practitioners in ML/AI. Inevitably, the allure and perks that come with employment in big tech and the startup scene will exacerbate the talent shortage in the public sector. Government leadership needs access to platforms and partnerships that will help them to easily adopt LLMs and empower their employees to become self-sufficient with them.
Databricks is busy rolling out features that simplify and expand upon the existing capabilities to work with LLMs in the lakehouse platform. These include:
In addition to making LLMs easy to use in Databricks, we are also introducing LLM training and enablement programs to help organizations scale up their LLM proficiency. These are delivered at a level that is approachable for Databricks' public sector users.
Opportunities to harness LLMs to accelerate Public Sector use cases abound. Immense value remains buried in legacy data, just waiting to be discovered and applied to current problems. Come learn more about how Databricks can help you adopt LLMs in your mission by participating in our webinar, Large Language Models in the Public Sector, on August 2 at noon EDT. Also, peruse the feature preview signups listed in the Lakehouse AI announcement and see which ones your organization qualifies for.