7 Comments
Gordon Erlebacher:

I like the article. Nonetheless, I must have a misconception: if the method depends on the draft token being checked against the target token, where is the speed gain? Thanks again for your insight.

Daniel Warfield:

That was my misconception as well. The core idea unlocks a much deeper understanding of LLMs, so I'm excited that you're asking!

LLMs, when trained, are parallelized to predict *every* next word in a sequence. So, if you train a model on the phrase "<start> this is a training example <end>", it will simultaneously attempt to output "this", "is", "a", "training", "example", and "<end>". This is possible because of the causal mask within the attention mechanism, which ensures that the prediction at any position is influenced only by the preceding tokens.
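To make that concrete, here is a rough sketch (my own illustration using GPT-2 through the Hugging Face transformers library, not code from the article) of how a single forward pass yields a next-word guess at every position:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Any causal language model works here; GPT-2 is just a small, convenient stand-in.
tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("this is a training example", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # shape: [1, seq_len, vocab_size]

# The argmax at position i is the model's guess for the token that follows position i,
# so one forward pass gives a "next word" prediction for every prefix at once.
for pos, pred_id in enumerate(logits[0].argmax(dim=-1)):
    prefix = tok.decode(inputs["input_ids"][0][: pos + 1])
    print(f"{prefix!r} -> {tok.decode(int(pred_id))!r}")
```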

I have an article on attention behind the scenes which might support this intuition a bit:

https://iaee.substack.com/p/multi-headed-self-attention-by-hand?utm_source=publication-search

and I have an article on transformers in general which might help as well:

https://open.substack.com/pub/iaee/p/transformers-intuitively-and-exhaustively-explained-58a5c5df8dbb?r=2i9k1a&utm_campaign=post&utm_medium=web

This tendency of LLMs to predict every next word persists at inference. 99% of the time we don't care; we only want the very last next-word prediction. But it means the target model can check numerous predictions from the draft model simultaneously, by treating every word in its input as a candidate "next" word. At the first disagreement, you simply cut the draft off at that point.

For more information, I recommend re-reading the following section:

https://iaee.substack.com/i/144704324/the-secret-outputs-of-transformers-and-how-speculative-sampling-uses-them

Gordon Erlebacher:

Hi Daniel, thanks for the reply. I understand all you are writing, and have programmed my own attention and transformer, but obviously I am still missing the punchline.

Both models, target and draft, are already trained, but you run the target model in train mode (which predicts many words in parallel) and the draft model in inference mode (where causality is respected).

You write: "So, if you train a model on the phrase "<start> this is a training example <end>", it will simultaneously attempt to output "this", "is", "a", "training", "example", and "<end>". This is possible because of the causal mask within the attention mechanism, which ensures that the prediction at any position is influenced only by the preceding tokens." My question is: when you train WHICH model? I think we need a more detailed blog on this, with a step-by-step example similar in spirit to "The Annotated Transformer" by Jay Alammar. I love your work!

Daniel Warfield:

Ah, I see, let's roll back.

For a transformer, the structure of the output is identical in training and in inference. When you train a model to output information in a certain format, it outputs data in that same format at inference time.

When training a transformer you do something like this:

```
| model input | model output |
|-------------|--------------|
| <start>     | this         |
| this        | is           |
| is          | error        |
| data        | <end>        |
| <end>       |              |
```

Here, the model thought the word after "is" should be "error" instead of "data", so the model's weights would be updated based on the loss from that mistake.
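As a rough sketch of that update (standard next-token cross-entropy with GPT-2 as a stand-in; nothing here is specific to the article's setup):

```python
import torch
import torch.nn.functional as F
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

ids = tok("this is data", return_tensors="pt")["input_ids"]  # [1, seq_len]
logits = model(ids).logits                                   # [1, seq_len, vocab_size]

# Shift by one: the prediction at each position is scored against the token that
# actually follows it, so a wrong guess (e.g. "error" instead of "data") raises the loss.
loss = F.cross_entropy(logits[0, :-1], ids[0, 1:])
loss.backward()
```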

When using the model, we usually only care about the final next-word prediction. So we might give the model the input "Predict the next word in this" and expect it to predict the word "sentence". However, because of the way the model is trained, it also outputs a next-word prediction at every position in the sequence, each computed as if the later words didn't exist.

```
| model input | model output |
|-------------|--------------|
| <start>     | Predict      |
| Predict     | the          |
| the         | next         |
| next        | word         |
| word        | in           |
| in          | this         |
| this        | *sentence*   |
```

That happens, essentially, for all transformer-style language models predicting text autoregressively: both the target and the draft model output all next-word predictions every time they run. Speculative sampling exploits this by feeding the draft model's output in as the target model's input. The target model then predicts, in one shot, what it thinks every next word in the entire sequence should be. Wherever there is a disagreement, we ignore what the draft model said from that point on and use what the target model predicted instead.
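In code, a rough sketch of that check might look like the following (a greedy simplification of my own; the actual method uses a probabilistic accept/reject rule rather than exact argmax agreement):

```python
import torch

def accept_draft(target_model, prompt_ids, draft_ids):
    """prompt_ids: the prompt so far; draft_ids: the tokens proposed by the draft model."""
    full = torch.cat([prompt_ids, draft_ids], dim=-1)   # draft output becomes target input
    with torch.no_grad():
        logits = target_model(full).logits              # all next-word guesses in one shot

    # The logits at position i predict token i + 1, so the target's guesses for the
    # draft region sit just before each draft token.
    target_guess = logits[0, prompt_ids.shape[-1] - 1 : -1].argmax(dim=-1)

    accepted = []
    for draft_tok, target_tok in zip(draft_ids[0], target_guess):
        if draft_tok != target_tok:       # first disagreement: drop the rest of the draft...
            accepted.append(target_tok)   # ...and keep the target's own prediction instead
            break
        accepted.append(draft_tok)        # agreement: a "free" verified token
    return torch.stack(accepted)
```

If every draft token is accepted, the target's guess for one additional token comes along for free, so each expensive target forward pass can emit several tokens instead of just one.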

Gordon Erlebacher:

That makes more sense, Daniel, and I am closer to understanding. I will think more on it. I don't think your blog made this clear, or perhaps it is just me. I think more math would be useful. I appreciate the feedback.

Jesus:

>

Daniel Warfield:

I would be thrilled to answer any questions or thoughts you might have about the article. An article is one thing, but an article combined with thoughts, ideas, and considerations holds much more educational power!
