Large Language Models (LLMs)

What are Large Language Models (LLMs)?

Large language models (LLMs) are a new class of natural language processing (NLP) models that have significantly surpassed their predecessors in performance and ability in a variety of tasks such as answering open-ended questions, chat, content summarization, execution of near-arbitrary instructions, translation as well as content and code generation. LLMs are trained from massive data sets using advanced machine learning algorithms to learn the patterns and structures of human language.

Here’s more to explore

The Big Book of MLOps

A must-read for ML engineers and data scientists seeking a better way to do MLOps.

Get the eBook

A compact guide to large language models thumbnail

Tap the Potential of LLMs

How to boost efficiency and reduce costs with AI.

Download now

Build Your Own LLM, Like Dolly

Discover how to fine-tune and deploy your custom LLM.

Watch now

How do large language models (LLMs) work?

Large language models or LLMs typically have three architectural elements:

Encoder: After a tokenizer converts large amounts of text into tokens, which are numerical values, the encoder creates meaningful embeddings of tokens that put words with similar meanings close together in vector space.
Attention mechanisms: These algorithms are used in LLMs that enable the model to focus on specific parts of the input text, for related words of text. This is not separate from the encoder and decoder.
Decoder: The tokenizer converts the tokens back into words so we can understand. In this process, the LLM predicts the next word, and the next word, for millions of words. Once the models complete their training process, they can now accomplish new tasks such as answering questions, doing language translations, semantic search and more.

How do LLMs work?

A simplified version of the LLM training process

Learn more about transformers, the foundation of every LLM

What is the history of large language models (LLMs)?

The techniques used in LLMs are a culmination of research and work in the field of artificial intelligence that originated in the 1940s.

1940s

The first scientific paper on neural networks was published in 1943.

1989

A scientific paper was published by Lecun on digit recognition showing a back propagation network can be applied to image-recognition problems.

2012

A paper from Hinton et al showed deep neural networks significantly outperforming any previous models for speech recognition.

A convolutional neural network (AlexNet) halved the existing error rate on Imagenet visual recognition, becoming the first to break 75% accuracy. It highlighted new techniques, including the use of GPUs to train models.

2017

The ground-breaking paper, “Attention is All you Need,” introduced the transformer architecture which is the underlying architecture for all LLM models.

2018

Google introduces BERT (Bidirectional Encoder Representations from Transformers), which is a big leap in architecture and paves the way for future large language models.

2020

OpenAI releases GPT-3, which becomes the largest model at 175B parameters and sets a new performance benchmark for language-related tasks.

2022

ChatGPT is launched, which turns GPT-3 and similar models into a service that is widely accessible to users through a web interface and kicks off a huge increase in public awareness of LLMs and generative AI.

2023

Open source LLMs show increasingly impressive results with releases such as LLaMA 2, Falcon and MosaicML MPT. GPT-4 was also released, setting a new benchmark for both parameter size and performance.

What are the use cases for LLMs?

LLMs can drive business impact across use cases and different industries. Example use cases include:

Chatbots and virtual assistants: LLMs are used to power chatbots to give customers and employees the ability to have open-ended conversations to help on customer support, website lead follow-up, and be a personal assistant.
Code generation and debugging: LLMs can generate useful code snippets, identify and fix errors in code and complete programs based on input instructions.
Sentiment analysis: LLMs can automatically understand the sentiment of a piece of text to automate understanding of customer satisfaction.
Text classification and clustering: LLMs can organize, categorize and sort large volumes of data to identify common themes and trends to support informed decision-making.
Language translation: LLMs can translate documents and web pages into different languages.
Summarization and paraphrasing: LLMs can summarize papers, articles, customer calls or meetings and surface the most important points.
Content generation: LLMs can develop an outline or write new content that can be a good first draft to build from.

What are customer examples where LLMs have been deployed effectively?

JetBlue

JetBlue has deployed “BlueBot,” a chatbot that uses open source generative AI models complemented by corporate data, powered by Databricks. This chatbot can be used by all teams at JetBlue to get access to data which is governed by role. For example, the finance team can see data from SAP and regulatory filings, but the operations team will only see maintenance information.

Chevron Phillips

Chevron Phillips Chemical uses Databricks to support their generative AI initiatives, including document process automation.

Thrivent Financial

Thrivent Financial is looking at generative AI to make search better, produce better summarized and more accessible insights and improve the productivity of engineering.

Why are large language models (LLMs) suddenly becoming popular?

There are many recent technological advancements that have propelled LLMs into the spotlight:

Advancement of machine learning technologies
- LLMs utilize many advancements in ML techniques. The most notable is the transformer architecture which is the underlying architecture for most LLM models.
Increased accessibility
- The release of ChatGPT opened the door for anyone with internet access to interact with one of the most advanced LLMs through a simple web interface. This lets the world understand the power of LLMs.
Increased computational power
- The availability of more powerful computing resources, like graphics processing units (GPUs), and better data processing techniques allowed researchers to train much larger models.
Quantity and quality of training data
- The availability of large data sets and the ability to process them have improved model performance dramatically. For example, GPT-3 was trained on big data (about 500 billion tokens) that included high-quality subsets such as the WebText2 data set (17 million documents), which contains publicly crawled web pages with an emphasis on quality.

How do I customize an LLM with my organization’s data?

There are four architectural patterns to consider when customizing an LLM application with your organization’s data. These techniques are outlined below and are not mutually exclusive. Rather, they can (and should) be combined to take advantage of the strengths of each.

Method	Definition	Primary use case	Data requirements	Advantages	Considerations
Prompt engineering	Crafting specialized prompts to guide LLM behavior	Quick, on-the-fly model guidance	None	Fast, cost-effective, no training required	Less control than fine-tuning
Retrieval augmented generation (RAG)	Combining an LLM with external knowledge retrieval	Dynamic data sets and external knowledge	External knowledge base or database (e.g., vector database)	Dynamically updated context, enhanced accuracy	Increases prompt length and inference computation
Fine-tuning	Adapting a pre-trained LLM to specific data sets or domains	Domain or task specialization	Thousands of domain-specific or instruction examples	Granular control, high specialization	Requires labeled data, computational cost
Pre-training	Training an LLM from scratch	Unique tasks or domain-specific corpora	Large data sets (billions to trillions of tokens)	Maximum control, tailored for specific needs	Extremely resource-intensive

Regardless of the technique selected, building a solution in a well-structured, modularized manner ensures organizations will be prepared to iterate and adapt. Learn more about this approach and more in The Big Book of MLOps.

What does prompt engineering mean as it relates to large language models (LLMs)?

Prompt engineering is the practice of adjusting the text prompts given to an LLM to elicit more accurate or relevant responses. Not every LLM model will produce the same quality, as prompt engineering is model-specific. Some generalized tips that work for a variety of models include:

Use clear, concise prompts, which may include an instruction, context (if needed), a user query or input, and a description of the desired output type or format.
Provide examples in your prompt (“few-shot learning”) to help the LLM understand what you want.
Tell the model how to behave, such as telling it to admit if it cannot answer a question.
Tell the model to think step-by-step or explain its reasoning.
If your prompt includes user input, use techniques to prevent prompt hacking, such as making it very clear which parts of the prompt correspond to your instruction vs. user input.

What does retrieval augmented generation (RAG) mean as it relates to large language models (LLMs)?

Retrieval augmented generation or RAG is an architectural approach that can improve the efficacy of large language model (LLM) applications by leveraging custom data. This is done by retrieving relevant data/documents relevant to a question or task and providing them as context for the LLM. RAG has shown success in support chatbots and Q&A systems that need to maintain up-to-date information or access domain-specific knowledge.
Learn more about RAG here.

What does it mean to fine-tune large language models (LLMs)?

Fine-tuning is the process of adapting a pre-trained LLM on a comparatively smaller data set that is specific to an individual domain or task. During the fine-tuning process, it continues training for a short time, possibly by adjusting a relatively smaller number of weights compared to the entire model.

The term “fine-tuning” can refer to several concepts, with the two most common forms being:

Supervised instruction fine-tuning: This approach involves continued training of a pre-trained LLM on a data set of input-output training examples — typically conducted with thousands of training examples.
Continued pre-training: This fine-tuning method does not rely on input and output examples but instead uses domain-specific unstructured text to continue the same pre-training process (e.g., next token prediction, masked language modeling).

What does it mean to pre-train a large language model (LLM)?

Pre-training an LLM model from scratch refers to the process of training a language model on a large corpus of data (e.g., text, code) without using any prior knowledge or weights from an existing model. This is in contrast to fine-tuning, where an already pre-trained model is further adapted to a specific task or data set. The output of full pre-training is a base model that can be directly used or further fine-tuned for downstream tasks. Pre-training is typically the largest and most expensive training tasks one would encounter, and not typical for what most organizations would undertake.

What are the most common LLMs and how are they different?

The field of large language models is crowded with many options to choose from. Generally speaking, you can group LLMs into two categories: proprietary services and open source models.

Proprietary services

The most popular LLM is ChatGPT from OpenAI which was released with much fanfare. ChatGPT provides a friendly search interface where users can feed prompts and typically receive a fast and relevant response. Developers can access the ChatGPT API to integrate this LLM into their own applications, products or services. Other services include Google Bard and Claude from Anthropic.

Open source models

Another option is to self-host an LLM, typically using a model that is open source and available for commercial use. The open source community has quickly caught up to the performance of proprietary models. Popular open source LLM models include Llama 2 from Meta, and MPT from MosaicML (acquired by Databricks).

How to evaluate the best choice

The biggest considerations and differences in approach between using an API from a closed third-party vendor vs. self-hosting your own open source (or fine-tuned) LLM model are future-proofing, managing costs and leveraging your data as a competitive advantage. Proprietary models can be deprecated and removed, breaking your existing pipelines and vector indexes; open source models will be accessible to you forever. Open source and fine-tuned models can offer more choice and tailoring to your application, allowing better performance-cost trade-offs. Planning for future fine-tuning your own models will allow you to leverage your organization’s data as a competitive advantage for building better models than are available publicly. Finally, proprietary models may raise governance concerns as these “black box” LLMs permit less oversight of their training processes and weights.

Hosting your own open source LLM models does require more work than using proprietary LLMs. MLflow from Databricks makes this easier for someone with Python experience to pull any transformer model and use it as a Python object.

How do I choose which LLM to use based on a set of evaluation criteria?

Evaluating LLMs is a challenging and evolving domain, primarily because LLMs often demonstrate uneven capabilities across different tasks. An LLM might excel in one benchmark, but slight variations in the prompt or problem can drastically affect its performance.

Some prominent tools and benchmarks used to evaluate LLM performance include:

MLflow
- Provides a set of LLMOps tools for model evaluation.
Mosaic Model Gauntlet
- An aggregated evaluation approach, categorizing model competency into six broad domains (shown below) rather than distilling to a single monolithic metric.
Hugging Face gathers hundreds of thousands of models from open LLM contributors
BIG-bench (Beyond the Imitation Game benchmark)
- A dynamic benchmarking framework, currently hosting over 200 tasks, with a focus on adapting to future LLM capabilities.
EleutherAI LM Evaluation Harness
- A holistic framework that assesses models on over 200 tasks, merging evaluations like BIG-bench and MMLU, promoting reproducibility and comparability.

Also read the Best Practices for LLM Evaluation of RAG Applications.

How do you operationalize the management of large language models (LLMs) via large language model ops (LLMOps)?

Large language model ops (LLMOps) encompasses the practices, techniques and tools used for the operational management of large language models in production environments.

LLMOps allows for the efficient deployment, monitoring and maintenance of large language models. LLMOps, like traditional machine learning ops (MLOps), requires a collaboration of data scientists, DevOps engineers and IT professionals. See more details of LLMOps here.

Where can I find more information about large language models (LLMs)?

There are many resources available to find more information on LLMs, including:

Training

LLMs: Foundation Models From the Ground Up (EDX and Databricks Training) — Free training from Databricks that dives into the details of foundation models in LLMs
LLMs: Application Through Production (EDX and Databricks Training) — Free training from Databricks that focuses on how to build LLM-focused applications with the latest and most well-known frameworks

eBooks

The Big Book of MLOps

Technical blogs

Next steps

Contact Databricks to schedule a demo and talk to someone about your large language model (LLM) projects
Read about Databricks’ offerings for LLMs
Read more about the retrieval augmented generation (RAG) use case (the most common LLM architecture)

Back to Glossary