Cache Augmented Generation — Intuitively and Exhaustively Explained
A new way to inject context into LLMs

In this article we’ll discuss “Cache Augmented Generation” (CAG), a new approach that allows data scientists to efficiently inject large amounts of contextual information into LLMs.
I’ve heard CAG referred to as a “RAG killer,” but I don’t think that’s the case. We’ll explore what CAG is and exactly what it’s good for.
Who is this useful for? Anyone who’s currently developing applications that interface AI with large amounts of contextual information.
How advanced is this post? This article is geared towards intermediate to experienced data scientists/engineers.
Prerequisites: To get the most out of this article, I recommend having a strong understanding of transformers and LLMs in general. If you find yourself lost, these will both be good resources for understanding LLMs and transformers to a greater depth:
If you’re more of a beginner and know very little about the transformer architecture, I recommend the following articles:
I also recommend having a solid understanding of RAG.
Note: The code in this article is derived from this excellent article. I’ll be breaking it down further and adding some modifications for explanation’s sake, but I highly recommend that article if you’re looking for a quick rundown of CAG.
A Brief Review of Retrieval Augmented Generation
Depending on who you are and what your application is, “Retrieval Augmented Generation” (RAG) can take on many different forms. The fundamental idea of RAG is to stick contextually relevant information into a language model’s prompt, and then ask some question in that same prompt.

Originally, RAG was created to make relatively weak LLMs better at questions that require specific information. The idea was that language models have a decent ability to reason, but they don’t have enough room to know all the information that might be relevant to every question. If we inject relevant information into an LLM, then it can supplement its internal knowledge with this externally sourced information to come up with better answers to questions.
As AI has progressed, the emphasis on making models better at knowledge-intensive tasks has waned in the RAG space. Massive multi-trillion parameter models are increasingly able to retain a lot of information on their own, making RAG over general questions less relevant.

Lately, the emphasis of RAG has shifted to task-specific and private information. It doesn’t matter how much public information a language model might know; if you ask a language model to help you prepare your taxes, it can’t do that without seeing relevant information about your business. Thus, enterprise customers with large amounts of internal information (medical, legal, construction, logistics) are still very much in the purview of RAG.
And RAG has been evolving to meet the challenges of those domains. I just released a big article on how to make robust RAG systems for enterprise environments:
RAG is great; I personally work for a company that’s been using RAG to great effect in environments with large amounts of proprietary data. But, like any technology, RAG has some drawbacks.
Problems with RAG
Like any technology, RAG has its costs and benefits. The authors of the paper we’ll be discussing mention a few potential drawbacks to the approach:
The need for real-time retrieval introduces latency, while errors in selecting or ranking relevant documents can degrade the quality of the generated responses. Additionally, integrating retrieval and generation components increases system complexity, necessitating careful tuning and adding to the maintenance overhead. — From the CAG paper, on the drawbacks of RAG.
While latency and complexity are certainly valid considerations, the main thrust of the paper is that RAG can cause errors due to poor retrieval quality. This boils down to the essential way RAG functions.
In the vast majority of RAG implementations, the overall system consists of the following elements:

Parsing: re-represents complex documents like PDFs and HTML files into content an LLM can readily understand.
Chunking: divides that parsed content into small chunks. The idea is that these chunks contain manageable bites of information which are neither too big nor too small and can be mixed together to provide context to an LLM from a variety of documents.
Embedding in a Queryable Manner: all the chunks get passed to a special embedding model which represents each chunk as a vector. The user’s query is passed through the same model, resulting in a vector. Retrieval is ultimately done by comparing the similarity of the chunk vectors with the query vector (a minimal sketch of this step and the next follows the list).
Augmented Prompting: a prompt is constructed consisting of the system prompt, the retrieved chunks, and the user’s query.
LLM Completion: the augmented prompt, with retrieved context and user query, is passed to an LLM for generation.
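To make steps 3 and 4 concrete, here’s a minimal sketch of embedding-based retrieval and prompt augmentation. The embed() function is a hypothetical stand-in; in a real system it would be an actual embedding model, and the chunks would come from the parsing and chunking steps.

import numpy as np

def embed(text: str) -> np.ndarray:
    """Hypothetical embedding model. A real system would call an actual
    embedding model here; deterministic random vectors are just a stand-in."""
    rng = np.random.default_rng(abs(hash(text)) % (2**32))
    v = rng.normal(size=384)
    return v / np.linalg.norm(v)

def retrieve(query: str, chunks: list[str], k: int = 3) -> list[str]:
    """Rank chunks by cosine similarity between their vectors and the query vector."""
    q = embed(query)
    return sorted(chunks, key=lambda c: float(embed(c) @ q), reverse=True)[:k]

def augment_prompt(query: str, chunks: list[str]) -> str:
    """Build the final prompt: instructions, retrieved chunks, then the user's query."""
    context = "\n\n".join(retrieve(query, chunks))
    return (
        "Answer the question using the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )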
“Cache Augmented Generation” (CAG) still requires parsing (step 1), some flavor of augmentation (step 4), and an LLM to do completion (step 5), but replaces chunking and embedding with a different approach. The authors of the CAG paper claim that the retrieval step is a major drawback of RAG, and that incorrect retrievals are a major source of performance degradation in many RAG systems.

The Fundamental Idea of CAG
In its most naive sense, CAG says “Why bother retrieving information? Just stick all of the context into the LLM.” The authors justify this by citing the recent advent of long context models.
Recent advances in long-context LLMs have extended their ability to process and reason over substantial textual inputs. For example, Llama 3.1 [1] was trained with a 128K context length, and its effective context length is 32K in Llama 3.1 8B and 64K in Llama 3.1 70B [3]. This 32K to 64K context window is sufficient for storing knowledge sources such as internal company documentation, FAQs, customer support logs, and domain-specific databases, making it practical for many real-world applications. — From the CAG paper
If you have a knowledge base that can fit into the context window of your model, then this is a very compelling option. Why risk building some RAG pipeline with a retrieval system that might miss some critical information when you can just pass all of the data into the LLM?
There are a few reasons, but the main ones are cost and computational time.
Every time you pass a query through your language model, regardless of how small that query is, you’ll need to pass your whole knowledge base through the LLM along with it. That can get expensive in both time and money, especially for large contexts and large models.

One of the fundamental ideas of CAG is to employ a “key-value cache” to get around this problem. Basically, we can pass just the context through our model, and then store all of the keys and values throughout our LLM. Essentially, this allows us to store how an LLM thinks about our knowledge base.

We can then save these values for later. When we get a query, we can load up the pre-computed calculations for our context and only run the new calculations required by the query.

This works because modern LLMs generate output “causally”, meaning throughout an LLM, a given token is only influenced by previous tokens.

So, if we put the context at the beginning of the input to our model, the contextual calculations will be the same regardless of what query we append to the end of the context.
The mechanism used to store this information is called a “key-value” cache, which is a cornerstone implementation detail of how modern LLMs are served. In a key-value cache, the keys and values from the various attention layers are saved, effectively preserving the intermediary representations throughout the model so that they can be reused in subsequent autoregressive generation passes. I’ll make a “By Hand” article on key-value caching soon, but for now, you can think through my by-hand article on self-attention if you want to work through why caching only the keys and values is required.
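If you want to convince yourself that the context’s keys and values really don’t depend on what comes after them, here’s a small sanity check. I’m using GPT-2 purely because it’s tiny and quick to download; the same property holds for any causal decoder, and the exact cache indexing may differ slightly between transformers versions.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

context = "Franz Ferdinand are a Scottish rock band formed in Glasgow."
n_ctx = tok(context, return_tensors="pt").input_ids.shape[-1]

def context_keys(full_text: str):
    """Run the model on context + query and return the layer-0 keys
    for just the context positions: [batch, heads, n_ctx, head_dim]."""
    ids = tok(full_text, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, use_cache=True)
    return out.past_key_values[0][0][:, :, :n_ctx, :]

k_a = context_keys(context + " When was the band formed?")
k_b = context_keys(context + " Who sings lead vocals?")

# Causal attention means the context tokens never "see" the query,
# so their keys (and values) are identical regardless of which query follows.
print(torch.allclose(k_a, k_b, atol=1e-5))  # True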
Let’s explore how CAG can be implemented in practice.
Implementing CAG
Full code can be found here, and is based on this article.
Downloading the Model
First of all, we’re going to need to install bitsandbytes
!pip install -U bitsandbytes
We’re going to be using bitsandbytes to load an LLM via a process called “quantization”. I’m planning on covering bitsandbytes in a future article but, basically, if all the parameters in an LLM look like this:
1.1214
-1.0001
0.9328
0.0000
-0.2312
1.8273
-2.0134
0.5789
-0.0001
0.4444
we’re going to load them at a reduced precision which is expressible with a smaller memory footprint:
1.1250
-1.0000
0.9375
0.0000
-0.2500
1.8125
-2.0000
0.5625
0.0000
0.4375
This means our model still has the same fundamental parameters, just rounded to fit within a less precise representation. This will result in our model being slightly less performant but with a massively reduced memory footprint.
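The rounded numbers above correspond to snapping each weight onto a coarser grid of representable values. Here’s a toy illustration of that idea; this is simple uniform quantization I wrote for explanation, not the “nf4” scheme bitsandbytes actually uses (which has non-uniform levels and per-block scaling).

import torch

def fake_quantize(weights: torch.Tensor, n_bits: int = 4) -> torch.Tensor:
    """Snap each weight to the nearest of 2**n_bits - 1 evenly spaced levels
    (15 for 4 bits). Real 4-bit schemes store the small integer plus a scale
    factor, rather than the rounded float itself."""
    levels = 2 ** (n_bits - 1) - 1            # 7 for signed 4-bit
    scale = weights.abs().max() / levels      # one scale per tensor; real schemes use per-block scales
    return torch.round(weights / scale) * scale

w = torch.tensor([1.1214, -1.0001, 0.9328, 0.0000, -0.2312, 1.8273])
print(fake_quantize(w))  # each value snapped onto the coarse grid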
To download the model, we first set up authentication with Hugging Face:
from google.colab import userdata
import os
# Retrieve the Hugging Face token from Colab secrets
hf_token = userdata.get('HuggingFace')
# Set the environment variable so that Hugging Face Transformers uses it
os.environ['HF_TOKEN'] = hf_token
Then we configure how we want bitsandbytes to quantize the model’s parameters, and download the model with that quantization applied.
#importing libraries for downloading model
import torch
from transformers import (
    AutoTokenizer,
    BitsAndBytesConfig,
    AutoModelForCausalLM)
import bitsandbytes as bnb
from transformers.cache_utils import DynamicCache

#configuring quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16)

#downloading model with quantization
model_id = "meta-llama/Meta-Llama-3.1-8B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map='auto')
Here, BitsAndBytesConfig is saying that we’ll be quantizing all model parameters into a four-bit representation. We’re also doing some fancy recommended stuff (the “nf4” data type and double quantization) to make the quantization more accurate and memory-efficient; I’ll be covering this in more detail when I cover bitsandbytes.
After we set quantization, we’ll download our tokenizer and model. The tokenizer allows us to convert text into numbers that a language model can understand, and then convert numerical output from our model back into words.
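As a quick aside, here’s what that tokenizer round trip looks like. The exact IDs depend on the Llama 3.1 vocabulary, so treat the output as illustrative.

ids = tokenizer.encode("Franz Ferdinand are a Scottish rock band.")
print(ids)                    # a list of integer token IDs
print(tokenizer.decode(ids))  # the original text, preceded by any special tokens like <|begin_of_text|>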

Then, when we download the LLM we’ll be using, we pass in the quantization_config, which loads the model in a 4-bit representation. None of this is specific to cache-augmented generation, but hey, this is an exhaustive article.
Defining the Context
The whole point of CAG is to be able to inject some context into a language model. For this toy example, I copied the Franz Ferdinand (band) Wikipedia article and pasted the content into a text block. We’ll use that as our context in this example.
This context is truncated. Full context can be found here.
knowledge = """
Franz Ferdinand are a Scottish rock band formed in Glasgow in 2002. Their original line-up was composed of Alex Kapranos (lead vocals, guitar, keyboards), Nick McCarthy (guitar, keyboards, vocals), Bob Hardy (bass, percussion) and Paul Thomson (drums, percussion, backing vocals). Julian Corrie (keyboards, guitar, backing vocals) and Dino Bardot (guitar, backing vocals) joined the band in 2017 after McCarthy left during the previous year, and Audrey Tait (drums, percussion) joined the band after Thomson left in 2021. The band were categorised as a post-punk revival band and garnered multiple UK top 20 hits.[1] They have been nominated for several Grammy Awards and have received two Brit Awards—winning one for Best British Group—as well as one NME Award.
The band's first single, "Darts of Pleasure", just missed out on the Top 40 of the UK Singles Chart, peaking at number 44. Their second single, "Take Me Out", proved their big commercial breakthrough, peaking at number three. "Take Me Out" charted in several other countries and earned a Grammy nomination for Best Rock Performance by a Duo or Group with Vocal; it became the band's signature song. Their debut album Franz Ferdinand won the 2004 Mercury Prize and earned a Grammy nomination for Best Alternative Album.
In 2005, the band released their second studio album, You Could Have It So Much Better, produced by Rich Costey. It peaked within the top-ten in multiple countries and earned Grammy-nominations for Best Alternative Album and for one of the singles, "Do You Want To". The band's third studio album, Tonight: Franz Ferdinand, was released in January 2009; by then the band had shifted from a post-punk-focused sound to a more dance-oriented sound. A remix album of Tonight, titled Blood, was released in July 2009.
Four years after the release of Tonight, the band released their fourth studio album, Right Thoughts, Right Words, Right Action, in August 2013. In 2015, Franz Ferdinand and American rock band Sparks formed the supergroup FFS and released a one-off album, FFS, in June 2015. The band underwent multiple line-up changes following FFS, beginning with McCarthy's departure in 2016. After acquiring Corrie and Bardot, the band released their fifth studio album Always Ascending in February 2018. Thomson departed in 2021 and was replaced by Tait. The band's sixth studio album, The Human Fear, was released in January 2025 to commercial success.
History
Formation (2001–2003)
The Archduke Franz Ferdinand of Austria inspired the band's name.
The band's members played in various bands during the 1990s, including The Karelia, Yummy Fur, 10p Invaders, and Embryo. Alex Kapranos and Paul Thomson met at a party and began a close friendship and played together in Yummy Fur, and subsequently teamed up to write songs. Around the same time, Kapranos taught his friend Bob Hardy how to play bass after being given a bass by Mick Cooke of Belle & Sebastian. Kapranos met guitarist Nick McCarthy, who had returned to Scotland after studying jazz bass in Germany, in 2001.[2]
Once the members came together, they settled on the name Franz Ferdinand for their band. The name was originally inspired by a racehorse called Archduke Ferdinand.[3] After seeing the horse win the Northumberland Plate in 2001, the band began to discuss Archduke Franz Ferdinand and thought it would be a good band name because of the alliteration of the name and the implications of the Archduke's death: his assassination was a significant factor in the lead-up to World War I.[4] In an interview, Hardy recollected that "mainly we just liked the way it sounded. We liked the alliteration." Kapranos continued, saying "he was an incredible figure as well. His life, or at least the ending of it, was the catalyst for the complete transformation of the world and that is what we want our music to be. But I don't want to over-intellectualise the name thing. Basically a name should just sound good ... like music." Thomson concluded, saying “I like the idea that, if we become popular, maybe the words Franz Ferdinand will make people think of the band instead of the historical figure.”[5]
Franz Ferdinand and international breakthrough (2003–2005)
Main article: Franz Ferdinand (album)
The band performing in 2004
In May 2003 the band signed to Laurence Bell's independent record label, Domino Recording Company.[6] The band moved to Gula Studios in Malmö, Sweden, with Cardigans producer Tore Johansson to record their debut album. In the latter part of 2003, the band released their debut single, "Darts of Pleasure". In January 2004, the single "Take Me Out" reached No. 3 in the UK charts. The album, Franz Ferdinand, was released in early 2004, debuting at No. 3 in the UK Albums Chart in February 2004, and at No. 12 in the Australian album charts in April 2004. The album only reached the lowest levels of the Billboard 200 album charts in the US as of early 2004, but reached the top 5 of the indie rock chart and the Heatseeker chart for debut artists. After a couple of North American tours and heavy rotation of the "Take Me Out" video on MTV, the album eventually reached No. 32 on the Billboard 200 later in 2004, and sold over a million copies in the United States.[2] Franz Ferdinand received a generally strong positive response from critics. NME rated it 9 out of 10, and said that the band was the latest in the line of art school rock bands featuring the Beatles, The Rolling Stones, The Who, Roxy Music, the Sex Pistols, Wire, Travis and Blur.
...
"""
Creating the KV Cache
HuggingFace has a variety of “KV Caching” strategies, the explanation of which warrants an article in itself. Basically, though, if you have some transformer-style model generating text via causal autoregression (which is basically all modern generative LLMs), KV Caching makes it so you don’t have to re-compute the keys and values for previous tokens over and over again for each newly generated output. Typically it’s used for keeping track of values within the model as it’s generating an output, but we can also manually define the previous values that should be saved in the cache. Here’s a high-level example of that from the HuggingFace docs:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM, DynamicCache
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-chat-hf")
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-chat-hf", torch_dtype=torch.float16).to("cuda:0")
inputs = tokenizer("I like rock music because", return_tensors="pt").to(model.device)
# here the past_key_values are initialized as empty, but they could have
# some content. We'll be using this functionality to implement CAG
past_key_values = DynamicCache()
out = model.generate(**inputs, do_sample=False, max_new_tokens=20, past_key_values=past_key_values)
Recall that the fundamental idea of CAG is to pre-compute passing the context through the model so that we can save on computational costs. We do that by constructing a KV cache of the keys and values generated throughout the LLM based on a particular contextual input.
def preprocess_knowledge(
        model,
        tokenizer,
        prompt: str) -> DynamicCache:
    """
    Prepare knowledge kv cache for CAG.
    Args:
        model: HuggingFace model with automatic device mapping
        tokenizer: HuggingFace tokenizer
        prompt: The knowledge to preprocess, which is basically a prompt
    Returns:
        DynamicCache: KV Cache
    """
    # check which device the embedding layer is on
    embed_device = model.model.embed_tokens.weight.device
    input_ids = tokenizer.encode(prompt, return_tensors="pt").to(embed_device)
    past_key_values = DynamicCache()
    with torch.no_grad():
        outputs = model(
            input_ids=input_ids,
            past_key_values=past_key_values,
            use_cache=True,
            output_attentions=False,
            output_hidden_states=False)
    return outputs.past_key_values
In this code, the function preprocess_knowledge takes in the model, its corresponding tokenizer, and a prompt (representing the context we want to bake into our KV cache), and then outputs a KV Cache based on running the prompt through the model.
First, we do some bookkeeping by setting embed_device, which tells us which device (CPU or GPU) the model’s embedding layer lives on. After that, we run our prompt through our tokenizer to get it ready to pass to the model, and we put the tokens on that same device.
We then define an empty KV Cache with past_key_values = DynamicCache(). I don’t think this is strictly necessary, as setting use_cache=True should cause a DynamicCache() to be generated automatically, but I like how explicit this is so I’m not changing it.
Anyway, we can then pass our context through our model. We use torch.no_grad() to skip gradient calculation (because we’re doing inference, not training), then pass our data to the model. input_ids is the tokenized input, past_key_values is our initialized (and maybe not strictly necessary) KV Cache, use_cache says we want to use a KV Cache, and output_attentions and output_hidden_states are set to False because we don’t need those outputs. I don’t actually know if there’s much benefit to explicitly setting those last two to False (they’re typically False by default), but it doesn’t hurt to be explicit.
After we pass our query (which is the context we want to save as a cache) through our model, we get some output. By returning outputs.past_key_values, we get the KV Cache which corresponds to the creation of that output.
One quirk of Cache Augmented Generation is that the system prompt is typically baked into the cache along with the context. In most LLM setups the system prompt is provided to the model first, and then things like context and queries are provided later. This is useful because, as the model processes the context, it can do so with the system prompt in mind. However, the idea of CAG is to cache everything at the beginning of the prompt, up to the query. So, if you want a system prompt to appear before the context, you’ll have to include it in the KV Cache.
def prepare_kvcache(documents, answer_instruction: str = None):
    # Prepare the knowledges kvcache
    if answer_instruction is None:
        answer_instruction = "Answer the question with a super short answer."
    knowledges = f"""
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|>
You are an assistant designed to answer questions from wikipedia articles.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
Context information is bellow.
------------------------------------------------
{documents}
------------------------------------------------
{answer_instruction}
Question:
"""
    # Get the knowledge cache
    kv = preprocess_knowledge(model, tokenizer, knowledges)
    kv_len = kv.key_cache[0].shape[-2]
    print("kvlen: ", kv_len)
    return kv, kv_len
knowledge_cache, kv_len = prepare_kvcache(documents=knowledge)
# kvlen: 610
That’s exactly what this function does: it adds a system prompt before the context and then passes the system prompt along with the context into our preprocess_knowledge function. If you wanted the KV Cache to be more general purpose, you could experiment with providing a system prompt after the context, so that baking a system prompt into the KV Cache isn’t required.
At this point, we have a KV Cache and are ready to use it for cache-augmented generation.
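Since the entire point is to compute this cache once and reuse it, it’s worth noting that you can also persist it to disk and reload it in a later session. Here’s a minimal sketch, assuming the DynamicCache in your transformers version stores its state in the key_cache and value_cache lists we’ve been poking at (newer versions may track extra bookkeeping, so treat this as a starting point; the file name is my own).

import torch
from transformers.cache_utils import DynamicCache

def save_kv_cache(kv: DynamicCache, path: str = "franz_ferdinand_kv.pt"):
    # The cache is just per-layer key/value tensors; move them to CPU before saving.
    torch.save({
        "key_cache": [k.cpu() for k in kv.key_cache],
        "value_cache": [v.cpu() for v in kv.value_cache],
    }, path)

def load_kv_cache(path: str, device: str = "cuda") -> DynamicCache:
    data = torch.load(path)
    kv = DynamicCache()
    kv.key_cache = [k.to(device) for k in data["key_cache"]]
    kv.value_cache = [v.to(device) for v in data["value_cache"]]
    return kv

# save_kv_cache(knowledge_cache)
# knowledge_cache = load_kv_cache("franz_ferdinand_kv.pt", device=model.device)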
Doing CAG with our KV Cache
Before we get into actually creating a response with CAG, there's one useful helper function we should implement.
def clean_up(kv: DynamicCache, origin_len: int):
    """
    Truncate the KV Cache to the original length.
    """
    for i in range(len(kv.key_cache)):
        kv.key_cache[i] = kv.key_cache[i][:, :, :origin_len, :]
        kv.value_cache[i] = kv.value_cache[i][:, :, :origin_len, :]
Recall that the primary reason KV caches exist is to make autoregressive (one token at a time) output faster by making it so we don’t have to re-compute keys and values for previous tokens. Once we call our model, it will add its response to the KV Cache as it generates. If we want to call our LLM multiple times, we’ll have to clean up the response portion of the cache before we make a new inference.
This code iterates through all layers in the transformer via for i in range(len(kv.key_cache)), then truncates both the keys and values to the original sequence length of the context, discarding anything that comes after. If we run this on a newly constructed KV Cache, nothing happens, but it’s necessary when generating a new output using a KV Cache that’s been used in a previous run.
The shape of both key_cache and value_cache is [batch, num_heads, sequence_length, head_dim], which is why we’re truncating along only the sequence-length dimension.
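You can sanity-check those shapes on the cache we built earlier. The numbers in the comments assume Llama 3.1 8B’s grouped-query attention layout, so don’t worry if yours differ on another model.

print(knowledge_cache.key_cache[0].shape)
# e.g. torch.Size([1, 8, 610, 128]) -> 1 batch, 8 KV heads, 610 cached tokens (our kv_len), 128 head_dim

clean_up(knowledge_cache, kv_len)
print(knowledge_cache.key_cache[0].shape)
# unchanged on a fresh cache; after a generation run it would shrink back to kv_len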
Now that we have that code set up, we can go ahead and generate some output using CAG.
def generate(
        model,
        input_ids: torch.Tensor,
        past_key_values=None,
        max_new_tokens: int = 300
) -> torch.Tensor:
    """
    Greedy decoding generation.
    Args:
        model: HuggingFace model with device mapping
        input_ids: Prompt input token IDs
        past_key_values: Optional KV cache for fast generation
        max_new_tokens: Number of tokens to generate
    Returns:
        Generated token IDs (excluding prompt)
    """
    device = model.device if hasattr(model, "device") else model.model.embed_tokens.weight.device
    input_ids = input_ids.to(device)
    generated = input_ids.clone()
    for _ in range(max_new_tokens):
        with torch.no_grad():
            outputs = model(
                # pass the full query on the first step; after that, only the newest token
                input_ids=generated[:, -1:] if generated.shape[1] > input_ids.shape[1] else input_ids,
                past_key_values=past_key_values,
                use_cache=True
            )
        next_token = outputs.logits[:, -1].argmax(dim=-1, keepdim=True)
        past_key_values = outputs.past_key_values
        generated = torch.cat([generated, next_token], dim=-1)
        if next_token.item() in model.config.eos_token_id:
            break
    return generated[:, input_ids.shape[-1]:]
This function generates some output given a model, tokenized query, and pre-loaded KV Cache with a system prompt and context.
First it checks which device the model is on, then it moves the tokenized input_ids to that device. It also clones the tokenized input and names the copy generated. Then, up to max_new_tokens rounds of autoregressive generation are performed.

Because past_key_values is being used as a KV cache, the model only needs to see each token once. On the first pass we feed in the full tokenized query; after that, every token the model has already seen lives in the cache, so we only pass the single most recently generated token. That is accomplished by input_ids=generated[:, -1:] if generated.shape[1] > input_ids.shape[1] else input_ids.
Once that last token is passed through the decoder, it produces logits; taking the argmax of the final position via outputs.logits[:, -1].argmax(dim=-1, keepdim=True) gives us the next token (greedy decoding). We then update our cache to include the new token via past_key_values = outputs.past_key_values, and add the newly predicted token to the list of generated tokens via generated = torch.cat([generated, next_token], dim=-1). We also check if the generated token is an end-of-sequence token so we can stop generating new output.
At this point, because generated started as a copy of the original input_ids, it contains both the query we gave the model and the generated response. We can get only the newly generated output by slicing off the input sequence length via generated[:, input_ids.shape[-1]:].
Finally, we can tie all this together and do cache-augmented generation via the following:
query = 'recite the first paragraph of "Franz Ferdinand and international breakthrough (2003–2005)" from wikipedia without referencing the internet. Do it from memory.'
clean_up(knowledge_cache, kv_len)
input_ids = tokenizer.encode(query, return_tensors="pt").to(model.device)
output = generate(model, input_ids, knowledge_cache)
generated_text = tokenizer.decode(output[0], skip_special_tokens=True)
print(f"Response of the model:\n {generated_text}")
This code:
cleans up the knowledge cache we defined previously, allowing us to run the same code block several times and not have our responses accumulate
tokenizes the query
generates response tokens from the model based on the tokenized query and KV cache of our context
decodes those tokens into text
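If you want to see the payoff discussed earlier, you can informally compare a cached query against re-processing the entire knowledge text from scratch on every query. This is a rough sketch rather than a benchmark; actual numbers depend entirely on your hardware, and for the uncached run I’m simply prepending the raw knowledge string to the query rather than rebuilding the full chat template.

import time

query = "What was Franz Ferdinand's first single?"

# With CAG: the knowledge base is already baked into the KV cache.
clean_up(knowledge_cache, kv_len)
start = time.time()
input_ids = tokenizer.encode(query, return_tensors="pt").to(model.device)
output = generate(model, input_ids, knowledge_cache)
print(f"cached:   {time.time() - start:.2f}s")

# Without CAG: pass the whole knowledge base through the model along with the query.
start = time.time()
input_ids = tokenizer.encode(knowledge + "\n" + query, return_tensors="pt").to(model.device)
output = generate(model, input_ids, DynamicCache())
print(f"uncached: {time.time() - start:.2f}s")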
Some Thoughts
Part of the reason I was introduced to cache augmented generation was that it was being framed, in some circles, as a replacement for retrieval augmented generation. Working at a company that does a lot of RAG work, I was naturally interested in this.
For a long time, long context window models have been knocking on the door of RAG, but a “long” context is still relatively short compared to many of the sets of documents that require a RAG approach. Law, medical, and engineering applications often have tens or hundreds of thousands of documents. CAG is an efficient way to employ long context windows but does not magically make models capable of handling truly large knowledge bases.
I, personally, see CAG as an exciting tool in the RAG developer’s toolbelt, rather than a replacement for RAG. It’s not hard to imagine retrieving KV Caches rather than textual context, allowing a RAG system to select which long context might be relevant to an LLM. Thus, long context window models could be efficiently applied to very large documents, while the documents themselves could be queried from massive stores that could never fit in a single model’s context window.
Also, while long context is often assumed to beat RAG head-to-head, there is some evidence that long context models aren’t as performant as one might hope. It’s hard to say whether long context is better than RAG in every use case.
In other words, I see CAG as a way to make context windows much larger, but not as a way to replace RAG in many of the applications it’s currently used in.