GPT — Intuitively and Exhaustively Explained
Exploring the architecture of OpenAI’s Generative Pre-trained Transformers.
In this article we’ll be exploring the evolution of OpenAI’s GPT models. We’ll briefly cover the transformer, describe variations of the transformer which led to the first GPT model, then we’ll go through GPT-1, GPT-2, GPT-3, and GPT-4 to build a complete conceptual understanding of the state of the art.
Who is this useful for? Anyone interested in natural language processing (NLP) or cutting-edge AI advancements.
How advanced is this post? This post should be accessible to all experience levels.
Pre-requisites: I’ll briefly cover transformers in this article, but you can refer to my dedicated article on the subject for more information.
A Brief Introduction to Transformers
Before we get into GPT I want to briefly go over the transformer. In its most basic sense, the transformer is an encoder-decoder style model.
The encoder converts an input into an abstract representation which the decoder uses to iteratively generate output.
Both the encoder and the decoder employ abstract representations of text, which are created using multi-headed self attention.
There are a few steps multi-headed self attention goes through to construct this abstract representation. In a nutshell, a dense neural network constructs three representations of the input, usually referred to as the query, key, and value.
The query and key are multiplied together. Thus, some representation of every word is combined with a representation of every other word.
The value is then multiplied by this abstract combination of the query and key, constructing the final output of multi-headed self attention.
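If it helps to see that in code, here is a minimal single-head sketch of the query/key/value computation in PyTorch. The scaling and softmax steps are standard scaled dot-product attention; the names and sizes are illustrative assumptions, not taken from any particular GPT implementation.

```python
import torch
import torch.nn.functional as F

def single_head_attention(x, w_q, w_k, w_v):
    """A single attention head over a sequence of word vectors.

    x: (seq_len, d_model) embedded input words
    w_q, w_k, w_v: (d_model, d_head) learned projection matrices
    """
    q = x @ w_q                                  # queries, one per word
    k = x @ w_k                                  # keys, one per word
    v = x @ w_v                                  # values, one per word

    # Every word's query is compared against every word's key...
    scores = (q @ k.T) / (k.shape[-1] ** 0.5)    # (seq_len, seq_len)
    weights = F.softmax(scores, dim=-1)

    # ...and the values are blended together according to those comparisons.
    return weights @ v                           # (seq_len, d_head)

# Toy usage: 5 words, 16-dimensional embeddings, an 8-dimensional head.
x = torch.randn(5, 16)
w_q, w_k, w_v = (torch.randn(16, 8) for _ in range(3))
print(single_head_attention(x, w_q, w_k, w_v).shape)  # torch.Size([5, 8])
```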
The encoder uses multi-headed self attention to create abstract representations of the input, and the decoder uses multi-headed self attention to create abstract representations of the output.
That was a super quick rundown on transformers. I tried to cover the high points without getting too far into the weeds; feel free to refer to my article on transformers for more information. Now that we vaguely understand the essentials, we can start talking about GPT.
GPT-1 (Released June 2018)
The paper Improving Language Understanding by Generative Pre-Training introduced the GPT style model. This is a fantastic paper, with a lot of cool details. We’ll summarize this paper into the following key concepts:
A decoder-only style architecture
Unsupervised pre-training
Supervised fine tuning
Task-specific input transformation
Let’s unpack each of these ideas one by one.
Decoder-Only Transformers
As we previously discussed, the transformer is an encoder-decoder style architecture. The encoder converts some input into an abstract representation, and the decoder iteratively generates the output.
You might notice that the encoder and the decoder are remarkably similar to one another. Since the transformer was published back in 2017, researchers have played around with each of these subcomponents and have found that both of them are phenomenally good at language representation. Models which use only an encoder, or only a decoder, have been popular ever since.
Generally speaking, encoder-only style models are good at extracting information from text for tasks like classification and regression, while decoder-only style models focus on generating text. GPT, being a model focused on text generation, is a decoder-only style model.
The decoder-only style of model used in GPT has very similar components to the traditional transformer, but also some important and subtle distinctions. Let’s run through the key ideas of the architecture.
GPT-1 uses a text and position embedding, which converts a given input word into a vector that encodes both the word’s general meaning and the position of the word within the sequence.
Like the original transformer, GPT uses a “learned word embedding.” Essentially, a vector for each word in the vocabulary of the model is randomly initialized, then updated throughout model training.
Unlike the original transformer, GPT uses a “learned positional encoding.” GPT learns a vector for each input location, which it adds to the learned vector embedding. This results in an embedding which contains information about the word, and where the word is in the sequence.
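A minimal sketch of what that looks like in code, assuming illustrative sizes rather than GPT-1’s exact hyperparameters:

```python
import torch
import torch.nn as nn

vocab_size, max_seq_len, d_model = 40000, 512, 768   # illustrative sizes

# Both tables start out random and are learned during training.
token_embedding = nn.Embedding(vocab_size, d_model)      # one vector per word
position_embedding = nn.Embedding(max_seq_len, d_model)  # one vector per position

token_ids = torch.tensor([[12, 845, 3, 901]])             # a toy 4-word input
positions = torch.arange(token_ids.shape[1]).unsqueeze(0) # [[0, 1, 2, 3]]

# The final embedding carries both what each word is and where it sits.
x = token_embedding(token_ids) + position_embedding(positions)
print(x.shape)  # torch.Size([1, 4, 768])
```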
These embedded words, with positional information, get passed through masked multi-headed self attention. For the purposes of this article we’ll stick with the simplification that this mechanism combines every word vector with every other word vector to create an abstract matrix encoding the whole input in some meaningful way.
A lot of math happens in self attention, which could result in super big or super small values. This has a tendency to make models perform poorly, so all the values are squashed down into a reasonable range with layer normalization.
The data is then passed through a dense network (your classic neural network), then through another layer normalization. This all happens in several decoder blocks which are stacked on top of each other, allowing the GPT model to do some pretty complex manipulations to the input text.
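To tie those pieces together, here is a rough sketch of a single GPT-style decoder block, leaning on PyTorch’s built-in multi-head attention with a causal mask. The layer sizes, the residual connections, and the exact ordering of operations are simplifying assumptions on my part rather than a faithful reproduction of GPT-1’s code.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Masked multi-headed self attention -> layer norm -> dense network -> layer norm."""

    def __init__(self, d_model=768, n_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.dense = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each word may only attend to itself and earlier words.
        seq_len = x.shape[1]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)       # residual connection, then layer norm
        x = self.norm2(x + self.dense(x))  # dense network, then layer norm
        return x

# Several blocks stacked on top of each other form the body of the model.
blocks = nn.Sequential(*[DecoderBlock() for _ in range(12)])
print(blocks(torch.randn(1, 4, 768)).shape)  # torch.Size([1, 4, 768])
```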
I wanted to take a moment to describe how decoder-style models actually go about making inferences, a topic which, for whatever reason, a lot of people don’t seem to take the time to explain. The typical encoder-decoder style transformer encodes the entire input, then recurrently constructs the output.
Decoder-only transformers, like GPT, don’t have an encoder to work with. Instead, they simply concatenate each new output to the input sequence and pass the entire input, along with all previous outputs, through the model on every inference step, a style referred to in the literature as “autoregressive generation.”
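A minimal sketch of that loop, with a toy stand-in model so the snippet runs on its own (the real model would be the full embedding-plus-decoder-block stack described above):

```python
import torch
import torch.nn as nn

def generate(model, prompt_ids, n_new_tokens):
    """Autoregressive generation: feed the prompt plus everything generated so
    far back through the model and take one new token per step (greedy here)."""
    ids = list(prompt_ids)
    for _ in range(n_new_tokens):
        x = torch.tensor([ids])                 # the full sequence generated so far
        logits = model(x)                       # (1, seq_len, vocab_size)
        next_id = int(logits[0, -1].argmax())   # the model's guess for the next token
        ids.append(next_id)                     # concatenate it and go again
    return ids

# Toy stand-in for a trained decoder stack: embedding + linear next-token head.
toy_model = nn.Sequential(nn.Embedding(1000, 32), nn.Linear(32, 1000))
print(generate(toy_model, prompt_ids=[5, 17, 42], n_new_tokens=4))
```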
This idea of treating outputs similarly to inputs is critical in one of the main deviations GPT took from the transformer to reach cutting-edge performance.
Semi-Supervised Pre-Training, Then Supervised Fine Tuning
If you research GPT, or language modeling as a whole, you might find the term “language modeling objective.” This term refers to the act of predicting the next word given an input sequence, which essentially models textual language. The idea is, if you can get really, really good at predicting the next word in a sequence of text, in theory you can keep predicting the next word over and over until you’ve output a whole book.
Because the traditional transformer requires the concept of an explicit “input” and “output” (an input for the encoder and an output from the decoder), the idea of next word prediction doesn’t really make a ton of sense. Decoder models, like GPT, only do next word prediction, so the idea of training them on language modeling makes a ton of sense.
This opens up a ton of options in terms of training strategies. I talk a bit about pre-training and fine-tuning in my article on LoRA. In a nutshell, pre-training is the idea of training on a bulk dataset to get a general understanding of some domain; that model can then be fine-tuned on a specific task.
GPT is an abbreviation of “Generative Pre-Trained Transformer” for a reason. GPT is pre-trained on a vast amount of text using language modeling (next word prediction). It essentially learns “given an input sequence X, the next word should be Y.”
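In code, that objective amounts to shifting the text one position over, so every word’s target is simply the word that follows it. The model below is a toy stand-in (an embedding plus a linear next-word head) just to make the loss computation concrete; the real model would be the full decoder stack.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

vocab_size, d_model = 1000, 64
# Stand-in "language model": embedding + linear head. The real GPT stack of
# decoder blocks would sit in the middle; this is just enough to show the loss.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))

# A toy tokenized sentence, e.g. "the cat sat on the mat"
token_ids = torch.tensor([[21, 87, 433, 16, 21, 902]])

inputs  = token_ids[:, :-1]   # "the cat sat on the"
targets = token_ids[:, 1:]    # "cat sat on the mat" (each input's next word)

logits = model(inputs)        # (1, seq_len - 1, vocab_size) next-word scores

# The language modeling objective: cross entropy between the predicted
# next-word distribution and the word that actually comes next.
loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()               # pre-training is this, repeated over a huge corpus
```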
This form of training is often described as “self-supervision”: the labels (the next words) come from the text itself rather than from human annotators, and the GPT paper describes the combination of this pre-training with supervised fine-tuning as a semi-supervised approach. Here’s a quote from another article which highlights how self-supervised learning deviates from other training strategies:
Supervised Learning is the process of training a model based on labeled information. When training a model to predict if images contain cats or dogs, for instance, one curates a set of images which are labeled as having a cat or a dog, then trains the model (using gradient descent) to understand the difference between images with cats and dogs.
Unsupervised Learning is the process of giving some sort of model unlabeled information, and extracting useful inferences through some sort of transformation of the data. A classic example of unsupervised learning is clustering; where groups of information are extracted from un-grouped data based on local position.
Self-supervised learning is somewhere in between. Self-supervision uses labels that are generated programmatically, not by humans. In some ways it’s supervised because the model learns from labeled data, but in other ways it’s unsupervised because no labels are provided to the training algorithm. Hence self-supervised.
This semi-supervised approach, with language modeling as the pre-training step, allows GPT to be trained on an unprecedented volume of training data, letting the model build robust linguistic representations. The resulting model, with a firm understanding of language, can then be fine-tuned on more specific datasets for more specific tasks.
We use the BooksCorpus dataset [71] for training the language model. It contains over 7,000 unique unpublished books from a variety of genres including Adventure, Fantasy, and Romance. — The GPT paper on pre-training
We perform experiments on a variety of supervised tasks including natural language inference, question answering, semantic similarity, and text classification. — The GPT paper on fine tuning
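Mechanically, fine-tuning on one of those supervised tasks looks roughly like the sketch below: keep the pre-trained backbone, put a small task-specific head on top of its final hidden state, and keep training on labeled examples. The backbone here is a hypothetical stand-in, not the actual pre-trained GPT weights, and the label set is made up.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical placeholders: a pre-trained decoder stack that maps token ids to
# hidden states of size d_model, plus one labeled example from some task.
d_model, n_classes = 768, 2
pretrained_backbone = nn.Sequential(nn.Embedding(40000, d_model))  # stand-in
classifier_head = nn.Linear(d_model, n_classes)                    # new, task-specific

optimizer = torch.optim.Adam(
    list(pretrained_backbone.parameters()) + list(classifier_head.parameters()), lr=1e-5
)

token_ids = torch.tensor([[12, 845, 3, 901]])    # one labeled example
label = torch.tensor([1])                        # e.g. "entailment" vs. "not"

hidden = pretrained_backbone(token_ids)          # (1, seq_len, d_model)
logits = classifier_head(hidden[:, -1])          # classify from the last position
loss = F.cross_entropy(logits, label)
loss.backward()
optimizer.step()                                 # both backbone and head get updated
```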
Task-Specific Input Transformation
Language models, like GPT, are currently known to be incredibly powerful “in-context learners”: you can give them information in their input as context, and they can use that context to construct better responses. I used this concept in my article on retrieval augmented generation, my article on visual question answering, and my article on parsing the output from language models. It’s a pretty powerful concept.
At the time of the first GPT paper, in-context learning was not nearly so well understood. The paper dedicates an entire section to “Task-specific input transformations”. Basically, instead of adding special components to the model to make it work for a specific task, the authors of GPT opted to format those tasks textually.
As an example, for textual entailment (the task of predicting whether one piece of text logically follows from another), the GPT paper simply concatenated both pieces of text together, with a delimiter token (a dollar sign) in between. This allowed them to fine-tune GPT on textual entailment without adding any new parameters to the model. Other objectives, like text similarity, question answering, and commonsense reasoning, were all fine-tuned in a similar way.
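A tiny illustration of that idea in code; the literal “$” string here stands in for the paper’s learned delimiter token, and the sentences are made up:

```python
def format_entailment(premise: str, hypothesis: str) -> str:
    """Turn a textual entailment example into one flat text sequence, so the
    generic language model can be fine-tuned on it with no new architecture."""
    return f"{premise} $ {hypothesis}"

# Toy example of the transformation.
print(format_entailment("A man is playing a guitar on stage.",
                        "Someone is making music."))
# -> "A man is playing a guitar on stage. $ Someone is making music."
```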
GPT-2 (Released February 2019)
The paper Language Models are Unsupervised Multitask Learners introduced the world to GPT-2, which is essentially identical to GPT-1 save two key differences: