
Large language models (LLMs) have set the corporate world ablaze, and everyone wants to take advantage. In fact, 47% of enterprises expect to increase their AI budgets this year by more than 25%, according to a recent survey of technology leaders from Databricks and MIT Technology Review. 

Despite this momentum, many companies are still unsure exactly how LLMs, AI, and machine learning can be used within their own organization. Privacy and security concerns compound this uncertainty, as a breach or hack could result in significant financial or reputational fallout and place the organization under the watchful eye of regulators. 

However, the rewards of embracing AI innovation far outweigh the risks. With the right tools and guidance, organizations can quickly build and scale AI models in a private and compliant manner. Given the influence of generative AI on the future of many enterprises, bringing model building and customization in-house becomes a critical capability.

GenAI can’t exist without data governance in the enterprise

Responsible AI requires good data governance. Data has to be securely stored, a task that grows harder as cyber villains get more sophisticated in their attacks. It must also be used in accordance with applicable regulations, which are increasingly unique to each region, country, or even locality. The situation gets tricky fast. Per the Databricks-MIT survey cited above, the vast majority of large businesses are running 10 or more data and AI systems, while 28% have more than 20.

Compounding the problem is what enterprises want to do with their data: model training, predictive analytics, automation, and business intelligence, among other applications. They want to make outcomes accessible to every employee in the organization (with guardrails, of course). Naturally, speed is paramount, so the most accurate insights can be accessed as quickly as possible. 

Depending on the size of the organization, distributing all that information internally in a compliant manner may become a heavy burden. Which employees are allowed to access what data? Complicating matters further, data access policies are constantly shifting as employees leave, acquisitions happen, or new regulations take effect. 
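The access question above can be made concrete with a deny-by-default policy check. This is a minimal sketch, assuming a hypothetical in-code policy table (real deployments would manage policies in a governance layer, and the dataset and role names here are purely illustrative):

```python
# Hypothetical policy table mapping each dataset to the roles allowed to read it.
POLICIES = {
    "payroll": {"hr", "finance"},
    "web_analytics": {"marketing", "data_science"},
}

def can_access(user_roles: set, dataset: str) -> bool:
    # Deny by default: a dataset with no policy entry is readable by no one,
    # which keeps newly ingested data safe until a policy is defined for it.
    return bool(user_roles & POLICIES.get(dataset, set()))

print(can_access({"finance"}, "payroll"))       # True
print(can_access({"marketing"}, "payroll"))     # False
print(can_access({"finance"}, "m_and_a_docs"))  # False: no policy yet, no access
```

Because policies shift as employees leave or regulations change, keeping them in one declarative table (rather than scattered through application code) makes updates a single edit rather than a hunt.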

Data lineage is also important; businesses should be able to track who is using what information. Not knowing where files are located and what they are being used for could expose a company to heavy fines, and improper access could jeopardize sensitive information, exposing the business to cyberattacks. 
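Lineage at its simplest is an append-only record of who touched which dataset, and why. The sketch below illustrates the idea with an in-memory log (the class, field names, and sample events are all invented for illustration; production systems would persist this to an audit store):

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class AccessEvent:
    user: str
    dataset: str
    purpose: str
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class LineageLog:
    """Minimal append-only audit log: who used which dataset, and for what."""
    def __init__(self):
        self._events = []

    def record(self, user: str, dataset: str, purpose: str) -> None:
        self._events.append(AccessEvent(user, dataset, purpose))

    def consumers_of(self, dataset: str) -> set:
        # Answers "who is using this information?" for audits or breach triage
        return {e.user for e in self._events if e.dataset == dataset}

log = LineageLog()
log.record("alice", "claims_2024", "model training")
log.record("bob", "claims_2024", "BI dashboard")
print(log.consumers_of("claims_2024"))  # {'alice', 'bob'} (set order may vary)
```

Even this toy version shows why lineage matters: when a regulator asks where a file went, the answer is a query, not an investigation.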

Why customized LLMs matter

AI models are giving companies the ability to operationalize massive troves of proprietary data and use insights to run operations more smoothly, improve existing revenue streams and pinpoint new areas of growth. We’re already seeing this in motion: in the next two years, 81% of technology leaders surveyed expect AI investments to result in at least a 25% efficiency gain, per the Databricks-MIT report.

For most businesses, making AI operational requires organizational, cultural, and technological overhauls. It may take many starts and stops to achieve a return on the money and time spent on AI, but the barriers to AI adoption will only get lower as hardware gets cheaper to provision and applications become easier to deploy. AI is already becoming more pervasive within the enterprise, and the first-mover advantage is real.

So, what’s wrong with using off-the-shelf models to get started? While these models can be useful to demonstrate the capabilities of LLMs, they’re also available to everyone. There’s little competitive differentiation. Employees might input sensitive data without fully understanding how it will be used. And because the way these models are trained often lacks transparency, their answers can be based on dated or inaccurate information—or worse, the IP of another organization. The safest way to understand the output of a model is to know what data went into it.

Most importantly, there’s no competitive advantage when using an off-the-shelf model; in fact, creating custom models on valuable data can be seen as a form of IP creation. AI is how a company brings its unique data to life. It’s too precious a resource to let someone else use it to train a model that’s available to all (including competitors). That’s why it’s imperative for enterprises to have the ability to customize or build their own models. Not every company needs to build its own GPT-4, however. Smaller, more domain-specific models can be just as transformative, and there are several paths to success. 

LLMs and RAG: Generative AI’s jumping-off point 

In an ideal world, organizations would build their own proprietary models from scratch. But with engineering talent in short supply, businesses should also think about supplementing their internal resources by customizing a commercially available AI model.

By fine-tuning best-of-breed LLMs instead of building from scratch, organizations can use their own data to enhance the model’s capabilities. Companies can further enhance a model’s capabilities by implementing retrieval-augmented generation, or RAG. As new data comes in, it’s added to the retrieval index rather than baked into the model’s weights, so the LLM can pull the most up-to-date and relevant information when prompted. RAG capabilities also enhance a model’s explainability. For regulated industries, like healthcare, law, or finance, it’s essential to know what data is going into the model, so that the output is understandable and trustworthy. 
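The retrieve-then-prompt loop at the heart of RAG can be sketched in a few lines. This toy version uses naive word-overlap scoring where a production system would use embedding search and a vector database; the corpus, query, and prompt template are illustrative assumptions:

```python
import re

def tokens(text: str) -> set:
    # Crude tokenizer; real RAG pipelines embed text instead of splitting it
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    # Rank documents by shared-word count with the query, keep the top k
    score = lambda doc: len(tokens(query) & tokens(doc))
    return sorted(corpus, key=score, reverse=True)[:k]

def build_prompt(query: str, corpus: list) -> str:
    # Ground the LLM in the freshest internal documents at query time
    context = "\n".join(retrieve(query, corpus))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

corpus = [
    "Refund requests over $500 require manager approval.",
    "The cafeteria is open from 8am to 3pm.",
    "A refund is processed within 5 business days.",
]
prompt = build_prompt("How long does a refund take to process?", corpus)
print(prompt)
```

Because the context is assembled fresh on every query, updating the model’s knowledge means updating the corpus, not retraining. That same property is what makes RAG outputs explainable: the retrieved passages show exactly what the answer was grounded in.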

This approach is a great stepping stone for companies that are eager to experiment with generative AI. Using RAG to improve an open source or best-of-breed LLM can help an organization begin to understand the potential of its data and how AI can help transform the business.

Custom AI models: level up for more customization

Building a custom AI model requires a large amount of information (as well as compute power and technical expertise). The good news: companies are flush with data from every part of their business. (In fact, many are probably unaware of just how much they actually have.) 

Both structured data sets—like the ones that power corporate dashboards and other business intelligence—and internal libraries that house “unstructured” data, like video and audio files, can be instrumental in helping to train AI and ML models. If necessary, organizations can also supplement their own data with external sets. 

However, businesses may overlook critical inputs that could meaningfully improve model quality. They also need guidance to wrangle the data sources and compute nodes needed to train a custom model. That’s where we can help. The Data Intelligence Platform is built on lakehouse architecture to eliminate silos and provide an open, unified foundation for all data and governance. The MosaicML platform was designed to abstract away the complexity of large model training and fine-tuning, stream in data from any location, and run in any cloud-based computing environment.

Plan for AI scale

One common mistake when building AI models is a failure to plan for mass consumption. Often, LLMs and other AI projects work well in test environments where everything is curated, but that’s not how businesses operate. The real world is far messier, and companies need to consider factors like data pipeline corruption or failure.

AI deployments require constant monitoring of data to make sure it’s protected, reliable, and accurate. Increasingly, enterprises require a detailed log of who is accessing the data (what we call data lineage).    

Consolidating to a single platform means companies can more easily spot abnormalities, making life easier for overworked data security teams. This now-unified hub can serve as a “source of truth” on the movement of every file across the organization. 

Don’t forget to evaluate AI progress

The only way to make sure AI systems are continuing to work correctly is to constantly monitor them. A “set-it-and-forget-it” mentality doesn’t work. 

There are always new data sources to ingest. Problems with data pipelines can arise frequently. A model can “hallucinate” and produce bad results, which is why companies need a data platform that allows them to easily monitor model performance and accuracy.

When evaluating system success, companies also need to set realistic parameters. For example, if the goal is to streamline customer service and free up employees, the business should track how many queries still get escalated to a human agent.
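The escalation metric described above is simple to compute once support interactions are logged. A minimal sketch, assuming a hypothetical ticket log with an `escalated` flag (the field names and sample data are invented for illustration):

```python
def escalation_rate(tickets: list) -> float:
    """Fraction of AI-handled queries that still needed a human agent."""
    escalated = sum(1 for t in tickets if t["escalated"])
    return escalated / len(tickets)

tickets = [
    {"id": 1, "escalated": False},
    {"id": 2, "escalated": True},
    {"id": 3, "escalated": False},
    {"id": 4, "escalated": False},
]
rate = escalation_rate(tickets)
print(f"{rate:.0%} of queries escalated")  # 25% of queries escalated
```

Tracking this number over time, rather than as a one-off snapshot, is what turns it into a realistic success parameter: a rising escalation rate is an early signal that the model or its data pipeline is degrading.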

To read more about how Databricks helps organizations track the progress of their AI projects, check out these pieces on MLflow and Lakehouse Monitoring.

Conclusion 

By building or fine-tuning their own LLMs and GenAI models, organizations can gain the confidence that they are relying on the most accurate and relevant information possible, for insights that deliver unique business value. 

At Databricks, we believe in the power of AI on data intelligence platforms to democratize access to custom AI models with improved governance and monitoring. Now is the time for organizations to use generative AI to turn their valuable data into insights that lead to innovations. We're here to help.

Join this webinar to learn more about how to get started with and build Generative AI solutions on Databricks!
