Positional Encoding — Intuitively and Exhaustively Explained
How modern AI understands space and time

In this article we’ll form a thorough understanding of positional encoding, a fundamental tool that allows AI models to understand ideas of space and time.
We’ll start by exploring the transformer (the grandfather of today's modern AI models) and why it needs additional positional information to be provided along with the input. We’ll then work through increasingly sophisticated methods of introducing positional information to transformer style models to build a thorough conceptual understanding of the approach.
By the end of this article you’ll understand why positional encoding plays a fundamental role in the performance of modern AI systems, as well as several approaches to implementing it.
Who is this useful for? Anyone interested in forming a deep understanding of modern AI.
How advanced is this post? This article contains intuitive descriptions of complex ideas, and is thus relevant to readers of all levels. Some sections include implementations that may be more relevant to advanced readers.
Prerequisites: I’ll do my best to provide all foundational information so that you can forge a thorough conceptual understanding without prior AI experience. That said, this is a fairly nuanced topic. I’ll have references to supplementary information throughout the article should you find yourself lost.
The Transformer, How LLMs Work, and Why They Need Positional Information
Before we get into the various forms of positional encoding, let’s first discuss why LLMs need positional encoding in the first place. Imagine you feed in the following sequence to an LLM:
Joe is six foot. Bob is five seven. Who is taller?
The relative position of certain words is important. Imagine if we shuffled some of the words around:
Bob is six foot. Joe is five seven. taller who is?
The meaning completely changes. Even worse, if you really shuffle the words around, all meaning is completely lost:
Joe taller? foot. is Bob Who is is seven. five six
Thus, for any LLM to do pretty much anything with text, it has to have some understanding of the order of the input sequence.
To the uninitiated this might appear to be a simple problem. If you want the sequence to be in order, you just pass the sequence to your AI model in order. In fact, with many forms of AI models this is sufficient; many older and simpler AI strategies like recurrent neural networks, convolutional networks, and classic neural networks have a strong ability to understand the order of an input implicitly.

Modern LLMs don’t use these older strategies. Instead, they use a fairly new modeling architecture called the “transformer”. Transformers allow any word to interact with any other word, instead of only allowing nearby information to interact.

This arbitrary interaction allows the model to create a robust understanding of the entire input, but the added complexity makes it very hard for the model to implicitly understand the order of the input. To hammer home this idea, I’d like to take a moment to explore how a model like GPT works.
How Data Flows Through a Transformer, and Why That Breaks Position
Let’s imagine we have some LLM, like GPT.

Then imagine we pass GPT the following query:
Translate to French: I am a manager
Our input is first passed through a text embedding. This turns our sequence of words into a sequence of vectors. By turning words into vectors, we allow the LLM to reason mathematically about how words relate with other words.
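Text embedding is a deep topic in its own right, but here’s a rough sketch of the mechanic (the vocabulary size, dimensions, and token ids here are made up for illustration, not GPT’s actual values):
import torch
import torch.nn as nn

vocab_size, d_model = 10000, 512                 # made-up sizes for illustration
embedding = nn.Embedding(vocab_size, d_model)    # a lookup table: one vector per word/token
token_ids = torch.tensor([[17, 42, 256, 3]])     # a pretend tokenized sentence
word_vectors = embedding(token_ids)              # shape [1, 4, 512]: one vector per token
print(word_vectors.shape)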

Our input is also treated with a positional encoding, which we’ll talk about in a bit. For now, we’ll skip this step to explore why positional encoding is necessary.

Now that we have a vector that’s associated with each word, we can pass that through a system called “self-attention”. In self attention we first use a neural network to convert our input representation into three different representations.

Then, the representations called the query and key are multiplied together and used to create something called an “attention matrix”

This attention matrix is then multiplied by the value matrix, resulting in the output of self-attention

There’s a bunch of math we would need to cover to thoroughly understand this mechanism. It’s not incredibly complicated, and I have several articles on the subject that will allow you to completely understand it if you’re interested:
But, for our purposes, there’s one critical concept I’d like to discuss.
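If you’d like to see the whole flow in miniature, here’s a stripped-down sketch of that pipeline (single head, random made-up weights, no learned projections worth trusting, just the shape of the computation):
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(4, 8)            # 4 word vectors, each 8-dimensional
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))

q, k, v = x @ W_q, x @ W_k, x @ W_v          # three representations of the same input
attn = F.softmax(q @ k.T / 8 ** 0.5, dim=-1) # attention matrix: how much each word looks at each other word
out = attn @ v                               # each output row is a weighted blend of the value rows
print(attn.shape, out.shape)                 # torch.Size([4, 4]) torch.Size([4, 8])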
When you multiply the attention matrix by the value matrix, each row in the attention matrix essentially functions as a set of weights dictating how rows should be combined to create the output.
Here’s an example of multiplying an attention matrix by a value matrix from the “Multi-Headed Self Attention — By Hand” article:

The first row in the output is equivalent to the first row in the value matrix because the first row in attention is 100% the first value, and 0% all of the other values. If we move on to the next row in the output, it’s a combination of 39% of the first row and 61% of the second row in the value matrix, as specified by the attention matrix.
Recall that we calculated the attention matrix by multiplying the query and key together.

Because of the way matrix multiplication works, the values in the attention matrix are not governed at all by the location of a word in a sequence. Rather, they’re solely governed by the values of the word vectors in the query and key. If the query and key multiplied together to create an attention matrix that looked something like this:

Then multiplying this matrix by the value matrix would produce a perfectly reversed output.
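You can see this for yourself with a tiny example (the attention matrices here are hand-made to make the point, not learned):
import torch

value = torch.tensor([[1., 1.], [2., 2.], [3., 3.]])   # three "word" rows

identity_attn = torch.eye(3)                            # each word attends only to itself
reversed_attn = torch.flip(torch.eye(3), dims=[0])      # attention weights that reverse the order

print(identity_attn @ value)   # rows come out in order: 1, 2, 3
print(reversed_attn @ value)   # rows come out reversed: 3, 2, 1
Nothing about the value matrix’s row order forced either result; the attention weights alone decided it, and those weights come purely from the content of the query and key vectors.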
So, in allowing word vectors to arbitrarily interact with one another (one of the fundamental strengths of the transformer), the transformer has the ability to destroy all sense of order within the sequence. The thing that governs the order of the sequence, the attention matrix, is calculated by the values of the word vectors, not their location. Thus, LLMs can very easily destroy all sense of position within an input. Let’s explore how the original transformer got around this problem.
Additive Sinusoidal Positional Encoding
The creators of the original transformer architecture were confronted with the issue of their new model losing the order of inputs, and had to find a way to solve it. They elected to do that with a strategy commonly referred to as “additive sinusoidal positional encoding”.
Essentially, the idea is to bake positional information into the values of the vectors that represent each word. Recall that one of the first steps of feeding words into a transformer style model is to convert those words into vectors.

In the original transformer paper, they essentially generated wiggly lines for each spot in the input, where the lines got progressively more wiggly throughout locations in the input sequence. They added these wiggly lines to the word vectors, constructing vectors that represent both the values of words as well as their locations in the input sequence.

Now, because each vector has positional information, the model can use that information when constructing queries and keys,

which in turn dictate the creation of the attention matrix
Which, ultimately, dictates how words are contextualized together.

In this strategy, positional information is added to the word vectors that are input into the model. It’s then the model’s job to learn parameters which reason about that information effectively. Let’s get an idea of how this might work with a toy problem.
Exploring Additive Sinusoidal Positional Encoding With a Toy Problem
Imagine the following problem:
Given a sequence of random numbers, can we make a transformer style model
that can tell us the index of the largest number in the sequence?
Naturally, this requires a model to understand position, as the task requires the model to return the location of the largest value in the input sequence. Let’s try implementing a transformer style model with additive sinusoidal positional encoding, and see if it’s up to the job. Full code can be found here.
First, we’ll create a function that defines our modeling objective. This simply returns a list of random numbers, and the corresponding maximum index.
# --- Data: Predict index of max token ---
def generate_batch(num, seq_len):
    x = torch.randint(1, vocab_size, (num, seq_len))
    y = torch.argmax(x, dim=1)
    return x.to(device), y.to(device)
Notice how we generate numbers between 1 and vocab_size (torch.randint excludes the upper bound). Transformers require us to create a vector for each word in our vocabulary. Here, each word is a number between 1 and whatever vocab_size is specified to be, in this case 20. seq_len defines how long the sequence we generate will be.
Before we implement the transformer, let’s implement the positional encoding which the transformer will use.
# --- Sinusoidal Positional Encoding ---
def sinusoidal_encoding(seq_len, d_model, device):
    position = torch.arange(seq_len, dtype=torch.float, device=device).unsqueeze(1)
    div_term = torch.exp(torch.arange(0, d_model, 2, device=device) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model, device=device)
    pe[:, 0::2] = torch.sin(position * div_term)
    pe[:, 1::2] = torch.cos(position * div_term)
    return pe # [seq_len, d_model]
This function accepts seq_len, which is important because we need that to know how many squiggly lines we need to make. d_model, on the other hand, defines how long those squiggly lines need to be. That will be whatever length we use to represent the words (or, in this case, numbers) in our vocabulary. Having a squiggly line for each spot in the sequence, and having those squiggly lines represented as vectors that are the same length as the word vectors, will make applying the positional encoding as easy as adding the two vectors together.

To facilitate actually creating our positional encoding, we first generate a list of indexes with position = torch.arange(seq_len, dtype=torch.float, device=device).unsqueeze(1). This gets us an array of positions like:
[0, 1, 2, ..., seq_len-1]
The rest is excessively complicated given how conceptually simple the approach is, and I don’t want to waste a bunch of time unpacking all of the math. The positional encoding formula for this approach looks like this:
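Written out in plain text (this is the formula from the original “Attention Is All You Need” paper):
PE(pos, 2i)   = sin( pos / 10000^(2i / d_model) )
PE(pos, 2i+1) = cos( pos / 10000^(2i / d_model) )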

Which, if you’re familiar with trigonometry, you might realize are sine and cosine functions. Here, pos represents the location of a particular word within the sequence, and we’re modifying the frequency both by i, the location in the embedding vector, and d_model, the size of the model. The final positional encoding vectors look something like this:

In the image above, each row represents the squiggly line that would be added to a word, with subsequent rows representing subsequent words. Here’s the same plot, but with a smaller model dimension and smaller sequence length to aid in visibility.
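If you want to reproduce a plot like this yourself, here’s a minimal sketch (assuming the sinusoidal_encoding function defined above is in scope and matplotlib is installed):
# A quick way to visualize the encodings
import matplotlib.pyplot as plt

pe = sinusoidal_encoding(16, 8, "cpu")   # small sequence length and dimension for visibility
plt.imshow(pe.numpy(), aspect="auto", cmap="viridis")
plt.xlabel("embedding dimension")
plt.ylabel("position in sequence")
plt.colorbar()
plt.show()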
We can use our generated positional encodings within a transformer, like so:
# --- Config ---
device = "cuda" if torch.cuda.is_available() else "cpu"
vocab_size = 20
d_model = 16
n_heads = 4
train_seq_len = 64
num_train = 30000
num_test = 500
batch_size = 64
epochs = 10
num_layers = 2

# --- Transformer Model ---
class MaxIndexTransformer(nn.Module):
    def __init__(self, seq_len, num_layers):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_enc = sinusoidal_encoding(seq_len, d_model, device)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, x):
        tok = self.token_emb(x)
        pos = self.pos_enc[:x.size(1)]
        tok = tok + pos
        for layer in self.layers:
            tok = layer(tok)
        logits = self.classifier(tok).squeeze(-1) # [B, T]
        return logits
Here we’re defining a transformer style model with PyTorch. It has a word embedding for each word as defined by
self.token_emb = nn.Embedding(vocab_size, d_model)
and we also pre-calculate positional embeddings
self.pos_enc = sinusoidal_encoding(seq_len, d_model, device)
When we go to make an inference with the model, we add the positional information to the embeddings with
tok = tok + pos
before passing the tokens through the model.
Recall that our toy problem was to find the index of the largest value in the sequence. We can train our model and evaluate it on a few hundred examples, and see that it learns to reliably identify the index of the largest value in the input sequence.
# --- Train ---
def train(model, optimizer, criterion):
    model.train()
    for _ in range(epochs):
        for _ in range(num_train // batch_size):
            x, y = generate_batch(batch_size, train_seq_len)
            logits = model(x)
            loss = criterion(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# --- Evaluate ---
@torch.no_grad()
def evaluate(model, seq_len):
    model.eval()
    x, y = generate_batch(num_test, seq_len)
    logits = model(x)
    preds = torch.argmax(logits, dim=1)
    acc = (preds == y).float().mean().item()
    print(f"Eval acc at seq_len={seq_len}: {acc:.3f}")
    return acc

# --- Run ---
model = MaxIndexTransformer(seq_len=train_seq_len, num_layers=num_layers).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
train(model, optimizer, criterion)
evaluate(model, train_seq_len)
Thus, we’ve seen how the addition of additive sinusoidal positional encoding can impart a transformer style model with the ability to understand sequence order.
Learned Positional Encodings
OpenAI famously took ideas from the transformer and built on them with GPT. One of the many things they changed was how GPT handled positional encoding.
Instead of using some fancy math to create fancy squiggly lines for each location in a sequence, GPT simply created a vector for each spot in the sequence and had the model learn what values make sense in each spot.

Actually implementing this is super straightforward. Instead of doing a bunch of fancy math to make squiggly lines, we just use nn.Embedding to create a vector for each spot in our sequence (called pos_emb):
# --- Transformer with Learned Positional Embedding ---
class MaxIndexTransformer(nn.Module):
    def __init__(self, seq_len, num_layers):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(seq_len, d_model)
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=d_model, nhead=n_heads, batch_first=True)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, x):
        batch_size, seq_len = x.size()
        positions = torch.arange(seq_len, device=x.device).unsqueeze(0).expand(batch_size, seq_len)
        tok = self.token_emb(x) + self.pos_emb(positions)
        for layer in self.layers:
            tok = layer(tok)
        logits = self.classifier(tok).squeeze(-1) # [B, T]
        return logits
When we go to make an inference, we simply add the word and position embedding together with
tok = self.token_emb(x) + self.pos_emb(positions)
Running this model on our toy problem (finding the max index in a random sequence), we get almost 100% accuracy. I’m just playing around, and could almost certainly modify this training code to be 100% accurate, so this is functionally identical to the 100% accuracy we saw in sinusoidal positional encoding.
The authors of the original GPT paper claimed that learned positional encoding resulted in superior output to additive sinusoidal approaches. It’s also super easy to implement, which is why I used this approach when creating a transformer style model designed to solve a Rubik’s Cube.
Relative Positional Encoding
Both of the encoding strategies we’ve discussed previously are examples of “absolute” positional encoding, meaning the information added to the word vectors denotes the actual location of the vector in the sequence. One problem with this approach is that sequences that are longer than the model saw during training can significantly degrade the performance of the model, as the model has never seen information about those locations in the training set.
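As a quick illustration using the toy code from earlier (a sketch, not a benchmark): the sinusoidal table can be regenerated on the fly for any sequence length, while a learned position table is hard-capped at the number of positions it was built with.
import torch
import torch.nn as nn

# Sinusoidal encodings can be generated for positions the model never saw during training
longer_pe = sinusoidal_encoding(128, 16, "cpu")  # any length works

# A learned table like nn.Embedding(64, 16) only has rows for positions 0..63;
# asking for position 100 isn't just "degraded", it's an index error
pos_emb = nn.Embedding(64, 16)
# pos_emb(torch.tensor([100]))  # would raise an out-of-range error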

The paper Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context built on top of previous positional encoding strategies with a single objective: making transformer style models that can understand sequences that are larger than the ones they were trained on.
One of the central ideas of Transformer XL was to use relative positional encoding. Instead of tokens being enumerated based on how far they were from the beginning of the sequence, like in previous approaches:
token 0, token 1, token 2, token 3, .... token n
they instead enumerated the tokens based on how far they were from the last token:
token n ... token 3, token 2, token 1, token 0
This is incredibly powerful because, as a model generates new tokens, the most recently generated tokens in the sequence are the most important. If I gave an LLM the following phrase, for instance:
I don't know about you, but I think an elephant would
easily crush a can if it stepped on it. Elephants are
very _______The information at the end of the sequence is more relevant than the information at the beginning of the sequence when filling in the blank. If a model is especially good at understanding information at the end of the sequence, it will be better at generating output, even if the sequence becomes very large.
One practical issue with this approach is that, as you generate a sequence, the positional information for each token has to change. The first token generated is at position zero, then when another token is generated it’s kicked over to spot one, then two, etc.
0
1, 0
2, 1, 0
3, 2, 1, 0
...
Both of the previous approaches injected positional information about each word at the beginning of the model.

Then, these vectors (with their positional information) are passed through a big (and expensive to run) LLM. To re-calculate new values with new positional information, you would have to change the positional information at the input and recalculate everything downstream throughout the entire model, for every token generated. That would be ludicrously expensive.
To deal with this problem, the authors of the Transformer XL paper shifted where positional information is injected. Instead of injecting positional information at the input, then calculating how word vectors should attend to one another

They calculate scores based on content and position separately

And combine them together to calculate how tokens should attend to one another.

This allowed the authors of the Transformer XL paper to modify positional information after a token was already generated, meaning you can maintain relative positional information without needing to recalculate every token at every generated step.
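Roughly speaking, the attention score between a query at position i and a key at position j becomes a sum of content and position terms (this is a simplified rendering of the decomposition in the Transformer XL paper, with the projection matrices omitted):
score(i, j) =  q_i · k_j          (content interacting with content)
            +  q_i · r_(i-j)      (content interacting with relative position)
            +  u  · k_j           (a learned global content bias)
            +  v  · r_(i-j)       (a learned global position bias)
Here r_(i-j) is an encoding of the relative distance between the two tokens, and u and v are learned vectors. The four terms are summed before the softmax, so position influences attention without ever being added to the word vectors themselves.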
It’s worth noting that the actual positional information could be learned, or it could be sinusoidal. The authors of the Transformer XL paper favored sinusoidal encoding like the original transformer, as continuous functions allow the model to extrapolate its understanding of position further into the past.
I’ll probably cover Transformer XL, and similar approaches like DeBERTa, in isolation in other articles. The general approach is fascinating, but there’s a lot of subtlety in the approach and how it relates to sequence modeling, which is deserving of its own article. For now, I think it’s sufficient to understand the major contribution of Transformer XL: it encodes positional information relative to the most recent token instead of relative to the first token, and it does that by cleverly injecting positional information within the attention mechanism itself, rather than at the input of the model. We’ll discuss strategies that take this general idea and run with it.
Bucket Relative Bias
Because Transformer XL has the goal of working with arbitrarily large sequences, its authors invested a lot of computational resources into modeling position. This results in a model which is very expressive about positional information, but at the cost of complexity.
The authors of the T5 paper did not have this objective (The T5 paper is a metastudy about transfer learning, without an explicit focus on studying positional embedding). Thus, the T5 paper uses a positional encoding scheme which is inspired by the Transformer XL paper, but simplifies the approach significantly. There are two key deviations the T5 paper made from the original Transformer XL approach:
They represent positional information as a learned bias that gets added to attention
They group certain distances between queries and keys into buckets
This allowed the T5 paper to very efficiently model relative position. To make it happen, for each query/key pair you:
calculate the attention score like normal
calculate how far apart those two tokens are
grab the bias term corresponding to the bucket containing that distance
add the relevant biases to the attention scores, then softmax to calculate attention.
The only real subtlety is the way buckets are defined. The T5 Paper defines:
a set of buckets for small, exact differences for keys that are before the query in the sequence
a set of buckets for small, exact differences for keys that are after the query in the sequence
a set of buckets for large, imprecise differences for keys that are before the query in the sequence
a set of buckets for large, imprecise differences for keys that are after the query in the sequence (sketched below)
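To make the bucketing concrete, here’s a rough standalone sketch of the idea. It mirrors the logic of the module below for one direction (using half of the 32 buckets); the exact numbers aren’t important, just the shape of the behavior.
import math

def sketch_bucket(distance, num_buckets=16, max_exact=8, max_distance=128):
    # small distances each get their own exact bucket
    if distance < max_exact:
        return distance
    # larger distances share log-spaced buckets, clipped at the last bucket
    log_bucket = max_exact + int(
        math.log(distance / max_exact) / math.log(max_distance / max_exact) * (num_buckets - max_exact)
    )
    return min(log_bucket, num_buckets - 1)

print([sketch_bucket(d) for d in [1, 4, 7, 8, 20, 100, 500]])
# [1, 4, 7, 8, 10, 15, 15] -- exact for nearby tokens, increasingly coarse for distant ones
Keys on the other side of the query get their own matching set of buckets, which is the other half of the 32.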
Here’s an implementation of a PyTorch module with learnable biases corresponding to where relative positions fall into certain buckets.
# --- T5 Relative Positional Bias ---
class T5RelativePositionalBias(nn.Module):
    def __init__(self, n_heads, num_buckets=32, max_distance=128):
        super().__init__()
        self.n_heads = n_heads
        self.num_buckets = num_buckets
        self.max_distance = max_distance
        self.relative_attention_bias = nn.Embedding(num_buckets, n_heads)

    def _relative_position_bucket(self, relative_position):
        num_buckets = self.num_buckets
        max_distance = self.max_distance
        # Bidirectional bucketing
        num_buckets //= 2
        relative_buckets = (relative_position > 0).to(torch.long) * num_buckets
        relative_position = torch.abs(relative_position)
        # Small positions map directly to buckets
        max_exact = num_buckets // 2
        is_small = relative_position < max_exact
        # Large positions get log-scaled buckets
        relative_position_if_large = max_exact + (
            torch.log(relative_position.float() / max_exact) /
            math.log(max_distance / max_exact) * (num_buckets - max_exact)
        ).to(torch.long)
        relative_position_if_large = torch.min(
            relative_position_if_large,
            torch.full_like(relative_position, num_buckets - 1)
        )
        # Combine buckets
        relative_buckets += torch.where(is_small, relative_position, relative_position_if_large)
        return relative_buckets

    def forward(self, seq_len):
        q_pos = torch.arange(seq_len, dtype=torch.long, device=self.relative_attention_bias.weight.device)
        rel_pos = q_pos[:, None] - q_pos[None, :]
        buckets = self._relative_position_bucket(rel_pos)
        bias = self.relative_attention_bias(buckets) # [T, T, H]
        return bias.permute(2, 0, 1).unsqueeze(0) # [1, H, T, T]
We can then define a layer of a transformer style model that uses these relative attention biases like so:
# --- T5 Transformer Layer ---
import torch.nn.functional as F  # needed for the softmax below

class T5TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads, dropout=0.1):
        super().__init__()
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Attention projections
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)
        self.out_proj = nn.Linear(d_model, d_model)
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads

    def forward(self, x, relative_attention_bias):
        # Pre-LayerNorm attention
        residual = x
        x_norm = self.norm1(x)
        # Project to query/key/value
        B, T, _ = x_norm.shape
        q = self.q_proj(x_norm).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.k_proj(x_norm).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        v = self.v_proj(x_norm).view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        # Attention scores
        attn_scores = torch.matmul(q, k.transpose(-2, -1)) / math.sqrt(self.head_dim)
        attn_scores = attn_scores + relative_attention_bias
        # Attention weights
        attn_weights = F.softmax(attn_scores, dim=-1)
        # Context
        context = torch.matmul(attn_weights, v)
        context = context.transpose(1, 2).contiguous().view(B, T, -1)
        context = self.out_proj(context)
        # Residual connection
        x = residual + context
        # Final LayerNorm (no FFN)
        x = x + self.norm2(x)
        return x
This is a minimal implementation of an encoder block. Usually you’d also have a feed forward network in here as well, but we’re keeping it simple in this example.
You might notice that relative_attention_bias is accepted as an input to the forward function. If you wanted each layer of the model to learn its own positional parameters, you could define a separate learnable T5RelativePositionalBias for each layer, but the authors of the T5 paper elected for all layers in the model to share the same bias information. As a result, the biases are passed to the encoder block, rather than stored within the block directly.
As we previously discussed, these biases are added to the attention scores calculated by the multiplication of the query and key, and those biased attention scores are then softmaxed to calculate attention.
We can then build a model that uses these relative biases and encoder blocks.
# --- Transformer Model ---
class MaxIndexTransformer(nn.Module):
    def __init__(self, seq_len, num_layers):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.relative_bias = T5RelativePositionalBias(n_heads)
        self.layers = nn.ModuleList([
            T5TransformerEncoderLayer(d_model, n_heads)
            for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, x):
        tok = self.token_emb(x)
        T = x.size(1)
        rel_bias = self.relative_bias(T)
        for layer in self.layers:
            tok = layer(tok, rel_bias)
        tok = self.norm(tok)
        logits = self.classifier(tok).squeeze(-1)
        return logits
We can then train this model on our task of finding the index of the largest value in the sequence.
# --- Train ---
def train(model, optimizer, criterion):
    model.train()
    for _ in range(epochs):
        for _ in range(num_train // batch_size):
            x, y = generate_batch(batch_size, train_seq_len)
            logits = model(x)
            loss = criterion(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# --- Evaluate ---
@torch.no_grad()
def evaluate(model, seq_len):
    model.eval()
    x, y = generate_batch(num_test, seq_len)
    logits = model(x)
    preds = torch.argmax(logits, dim=1)
    acc = (preds == y).float().mean().item()
    print(f"Eval acc at seq_len={seq_len}: {acc:.3f}")
    return acc

# --- Run ---
model = MaxIndexTransformer(seq_len=train_seq_len, num_layers=num_layers).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
train(model, optimizer, criterion)
evaluate(model, train_seq_len)
And we see that our performance has dropped relative to previous approaches, which makes sense. We’re using buckets that include multiple positions to encode position; the model might know approximately where the largest number in the sequence is, but if the largest number falls in a bucket corresponding to five locations then the model has to guess which of those locations contains the largest number.
For this task, bucketed relative bias encoding is a horrible positional encoding strategy. However, for language modeling, this approach is very performant. The T5 paper was state of the art in 2020, despite using a positional encoding strategy that is drastically simpler than previous positional encoding strategies. I find it fascinating that, given such a complex subject as natural human language, position can be modeled so sparsely and imprecisely while still achieving incredible performance.
Rotary Positional Encoding
Rotary Positional Encoding (RoPE) is a significant deviation from previous approaches, and is also one of the most widely used positional encoding schemes in modern AI. It borrows some ideas from previous positional encoding schemes, but merges them together in a way that’s fundamentally distinct from the approaches we’ve previously discussed.
Recall that, in all of the previous encoding schemes, we calculated some positional information then somehow got that positional information to interact with information about the content of words.
In the original transformer, we added squiggly lines to vectors representing words. Different squiggles represent different locations
In GPT, we learned vectors and added them to words
In Transformer XL we separated out positional information, used it to calculate attention scores, then added those attention scores to content based attention scores to calculate attention
In T5 we calculated the delta between queries and keys, and assigned a bias to the attention scores based on that offset
Rotary positional encoding is different because position is represented as a function which modifies the word vectors directly, instead of values that interact with content in some way.
Recall that, in a transformer for instance, each word (or, more accurately, token) is represented as a vector in high dimensional space. Previously, we added squiggly lines to these vectors to impart positional information.

RoPE treats each of these vectors as a collection of small, 2D vectors. Then, RoPE rotates these sub-vectors depending on where they are in the input sequence.

Naturally, if you rotate a vector all the way around, you end up with the same vector. If we want to encode long sequences with a lot of tokens, we need a way to encode a lot of positions without creating duplicates. We also, though, don’t want to make rotations so minuscule as to make it difficult for the model to understand the positional information we’re trying to encode.

RoPE deals with this by rotating different sub-vectors within each word vector by different amounts. Under the hood, the actual rotation is done by multiplying each individual sub-vector by a specially formed rotation matrix.

This matrix has the parameter θ (theta), which specifies how much the vector is rotated. In RoPE, θ changes not only as a function of where a vector is within the sequence, but also which sub-component we’re operating on within each vector. Some sub-vectors are rotated quickly, while others are rotated slowly. This allows the model to be exposed to rotations that change quickly over time (communicating a lot of information, but repeating often) and rotations that change slowly over time (not communicating a lot of granular information, but also never repeating).

Here, θ is fed into the rotation matrix to govern how much a particular sub-vector is rotated. p represents the index of a word vector within a sequence. ωk represents the speed at which the kth sub-vector rotates, which is defined by 10000^(-2k/d). Here, k is which sub-component of the vector we’re operating on, and d is the total length of the vector.
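Written out plainly, the rotation applied to the kth 2D sub-vector of the word at position p is the standard 2D rotation matrix, with the angle set by position and frequency:
θ = p · ω_k,   where ω_k = 10000^(-2k/d)

[ x'_2k   ]   [ cos(θ)  -sin(θ) ] [ x_2k   ]
[ x'_2k+1 ] = [ sin(θ)   cos(θ) ] [ x_2k+1 ]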

This has strong conceptual ties to the additive sinusoidal positional encoding strategy we discussed previously.

RoPE is relatively straightforward conceptually, but it (and derivative approaches like XPos and YaRN) contain subtle qualities which are worth discussing in their own article. I’ll likely be doing dedicated IAEE and By-Hand articles on the subject in the near future. For now, we can cap off our understanding by studying an implementation of the approach.
import torch
import torch.nn as nn
import math

# --- Config ---
device = "cuda" if torch.cuda.is_available() else "cpu"
vocab_size = 20
d_model = 16
n_heads = 4
train_seq_len = 64
num_train = 30000
num_test = 500
batch_size = 256
epochs = 10
num_layers = 2

# --- Data: Predict index of max token ---
def generate_batch(num, seq_len):
    x = torch.randint(1, vocab_size, (num, seq_len))
    y = torch.argmax(x, dim=1)
    return x.to(device), y.to(device)

# --- RoPE ---
def apply_rope(x):
    b, seq_len, d = x.shape
    assert d % 2 == 0
    half_d = d // 2
    x = x.view(b, seq_len, half_d, 2)
    freqs = torch.arange(half_d, device=x.device).float()
    theta = 10000 ** (-freqs / half_d)
    pos = torch.arange(seq_len, device=x.device).float()
    angles = torch.einsum("s,f->sf", pos, theta)
    cos = torch.cos(angles).unsqueeze(0).unsqueeze(-1)
    sin = torch.sin(angles).unsqueeze(0).unsqueeze(-1)
    x1 = x[..., 0:1] * cos - x[..., 1:2] * sin
    x2 = x[..., 0:1] * sin + x[..., 1:2] * cos
    x_rot = torch.cat([x1, x2], dim=-1)
    return x_rot.view(b, seq_len, d)

# --- RoPE-aware Multihead Attention ---
class RoPEAttention(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        residual = x
        x = apply_rope(x)  # rotation happens before attention, so q, k, and v all see rotated inputs
        x, _ = self.attn(x, x, x)
        return self.norm(x + residual)

# --- Transformer Encoder Layer ---
class RoPEEncoderLayer(nn.Module):
    def __init__(self, d_model, n_heads):
        super().__init__()
        self.self_attn = RoPEAttention(d_model, n_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_model * 4),
            nn.ReLU(),
            nn.Linear(d_model * 4, d_model)
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        x = self.self_attn(x)
        x = x + self.ff(self.norm(x))
        return x

# --- Transformer Model with RoPE ---
class MaxIndexTransformer(nn.Module):
    def __init__(self, seq_len, num_layers):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, d_model)
        self.layers = nn.ModuleList([
            RoPEEncoderLayer(d_model, n_heads) for _ in range(num_layers)
        ])
        self.norm = nn.LayerNorm(d_model)
        self.classifier = nn.Linear(d_model, 1)

    def forward(self, x):
        tok = self.token_emb(x)
        for layer in self.layers:
            tok = layer(tok)
        tok = self.norm(tok)
        logits = self.classifier(tok).squeeze(-1) # [B, T]
        return logits

# --- Train ---
def train(model, optimizer, criterion):
    model.train()
    for _ in range(epochs):
        for _ in range(num_train // batch_size):
            x, y = generate_batch(batch_size, train_seq_len)
            logits = model(x)
            loss = criterion(logits, y)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# --- Evaluate ---
@torch.no_grad()
def evaluate(model, seq_len):
    model.eval()
    x, y = generate_batch(num_test, seq_len)
    logits = model(x)
    preds = torch.argmax(logits, dim=1)
    acc = (preds == y).float().mean().item()
    print(f"Eval acc at seq_len={seq_len}: {acc:.3f}")
    return acc

# --- Run ---
model = MaxIndexTransformer(seq_len=train_seq_len, num_layers=num_layers).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
train(model, optimizer, criterion)
evaluate(model, train_seq_len)
Here, we see a lot of the usual suspects. We’re applying a transformer style model (this time, one that utilizes RoPE for positional encoding) to our task of finding the index of the largest number in the sequence.
The main addition here is the apply_rope function:
# --- RoPE ---
def apply_rope(x):
    b, seq_len, d = x.shape
    assert d % 2 == 0
    half_d = d // 2
    x = x.view(b, seq_len, half_d, 2)
    freqs = torch.arange(half_d, device=x.device).float()
    theta = 10000 ** (-freqs / half_d)
    pos = torch.arange(seq_len, device=x.device).float()
    angles = torch.einsum("s,f->sf", pos, theta)
    cos = torch.cos(angles).unsqueeze(0).unsqueeze(-1)
    sin = torch.sin(angles).unsqueeze(0).unsqueeze(-1)
    x1 = x[..., 0:1] * cos - x[..., 1:2] * sin
    x2 = x[..., 0:1] * sin + x[..., 1:2] * cos
    x_rot = torch.cat([x1, x2], dim=-1)
    return x_rot.view(b, seq_len, d)
I don’t want to get lost in the sauce in this article, but a bit of googling and GPTing can get you far if you want a thorough understanding of what every single line is doing. In essence, this function:
calculates key parameters based on the input sequence
creates the rotation matrix for each position
applies those rotations
returns the result
After running this code, we can see that the approach performs well on our toy problem.
Conclusion
In this article we covered positional encoding. We first discussed why it’s necessary, and how the original transformer tackled the problem with additive sinusoidal positional encoding. We then discussed a variety of different positional encoding strategies: learned positional encoding, relative positional encoding, bucketed relative positional encoding, and RoPE.
There are countless other positional encoding schemes out there, but I think the most important thing is to understand the general approach, why it’s necessary, and a few ways it can be done. In future articles we’ll be exploring strategies that build on top of RoPE, so stay tuned!