CLIP - Intuitively and Exhaustively Explained
Creating strong image and language representations for general machine learning tasks.
In this post you’ll learn about “contrastive language-image pre-training” (CLIP), a strategy for creating vision and language representations so good they can be used to make highly specific and performant classifiers without any task-specific training data. We’ll go over the theory, how CLIP differs from more conventional methods, then we’ll walk through the architecture step by step.
Who is this useful for? Anyone interested in computer vision, natural language processing (NLP), or multimodal modeling.
How advanced is this post? This post should be approachable to novice data scientists. Some of the later sections are a bit more advanced (particularly when we dig into the loss function).
Pre-requisites: Some cursory knowledge of computer vision and natural language processing.
The Typical Image Classifier
When training a model to detect if an image is of a cat or a dog, a common approach is to present a model with images of both cats and dogs, then incrementally adjust the model based on its errors until it learns to distinguish between the two.
This traditional form of supervised learning is perfectly acceptable for many use cases, and is known to perform well in a variety of tasks. However, this strategy is also known to result in highly specialized models which only perform well within the bounds of their initial training.
To resolve the issue of over-specialization, CLIP approaches classification in a fundamentally different way: by learning the association between images and their captions through contrastive learning. We’ll explore what that means in the next section.
CLIP, in a Nutshell
What if, instead of creating a model that can predict whether an image belongs to one of some fixed list of classes, we create a model which predicts whether an image matches some arbitrary caption? This is a subtle shift in thinking which opens the doors to completely new training strategies and model applications.
The core idea of CLIP is to use captioned images scraped from the internet to create a model which can predict if text is compatible with an image or not.
CLIP does this by learning to encode images and text in such a way that, when the text and image encodings are compared to each other, matching pairs produce a high value and non-matching pairs produce a low value. In essence, the model learns to map images and text into a landscape such that matching pairs are close together and non-matching pairs are far apart. This strategy of learning to predict whether things belong together is commonly referred to as “contrastive learning”.
In CLIP, contrastive learning is done by training a text encoder and an image encoder, each of which learns to map its input to some position in a shared vector space. CLIP then compares these positions during training and tries to maximize the closeness of positive pairs and minimize the closeness of negative pairs.
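To make that concrete, here is a minimal PyTorch-style sketch of the contrastive comparison, loosely following the pseudocode in the CLIP paper. It is an illustration, not a faithful reproduction of the full training setup; the batch of encoder outputs and the temperature value are placeholders.

```python
# A minimal sketch of CLIP-style contrastive comparison (not the full training loop).
import torch
import torch.nn.functional as F

def clip_loss(image_features, text_features, temperature=0.07):
    # Normalize so that dot products become cosine similarities
    image_features = F.normalize(image_features, dim=-1)
    text_features = F.normalize(text_features, dim=-1)

    # Pairwise similarity between every image and every text in the batch
    logits = image_features @ text_features.T / temperature  # [batch, batch]

    # Matching (positive) pairs lie on the diagonal of the similarity matrix
    targets = torch.arange(logits.shape[0])

    # Pull matching pairs together and push non-matching pairs apart,
    # symmetrically over images and texts
    loss_over_images = F.cross_entropy(logits, targets)
    loss_over_texts = F.cross_entropy(logits.T, targets)
    return (loss_over_images + loss_over_texts) / 2
```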
The general strategy CLIP employs allows us to do all sorts of things:
We can build image classifiers by simply asking the model which of several captions, such as "a photo of a cat" or "a photo of a dog", is most likely to be associated with an image (see the sketch after this list)
We can build an image search system to find the image most related to some input text. For instance, we can look at a variety of images and find which one is most likely to correspond to the text "a photo of a dog"
We can use the image encoder by itself to extract abstract information about an image which is relevant to text. The encoder positions images in space based on their content; that information can be used by other machine learning models.
We can use the text encoder by itself to extract abstract information about text which is relevant to images. The encoder positions text in space based on its entire content; that information can be used by other machine learning models.
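As an illustration of the first two use cases, here is a sketch of zero-shot classification using the publicly available openai/clip-vit-base-patch32 checkpoint through the Hugging Face transformers library. The image path and captions are arbitrary examples, and this is not code from the original CLIP release.

```python
# A sketch of zero-shot classification with a public CLIP checkpoint.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("pet.jpg")  # any image you want to classify
captions = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=captions, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# One similarity score per caption; softmax turns them into "probabilities"
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(captions, probs[0].tolist())))
```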
While zero-shot classification is pretty cool (zero-shot meaning the ability to perform well on an unseen type of data; for instance, asking the model "is this person happy?" when it was never explicitly trained to detect happiness), extracting and using just the text or image encoder within CLIP has become even more popular. Because CLIP models are trained to create subtle and powerful encodings of text and images which can represent complex relationships, the high-quality embeddings from the CLIP encoders can be co-opted for other tasks; I have an article which uses the image encoder from CLIP to enable language models to understand images, for instance.
So, now we have a high level understanding of CLIP. Don’t worry if you don’t completely understand; in the next section we’ll break down CLIP component by component to build an intuitive understanding of how it functions.
The Components of CLIP
CLIP is a high-level architecture which can use a variety of different sub-components to achieve the same general results. We’ll follow the CLIP paper and break down one of the possible approaches.
The Text Encoder
At its highest level, the text encoder converts input text into a vector (a list of numbers) that represents the text’s meaning.
The text encoder within CLIP is a standard transformer encoder, which I cover intuitively and exhaustively in another post. For the purposes of this article, a transformer can be thought of as a system which takes an entire input sequence of words, then re-represents and compares those words to create an abstract, contextualized representation of the entire input. The self attention mechanism within a transformer is the main mechanism that creates that contextualized representation.
One modification CLIP makes to the general transformer strategy is that its output is a single vector, not a matrix, which is meant to represent the entire input sequence. It does this by simply extracting the vector for the last token in the input sequence. This works because the self attention mechanism is designed to contextualize each input with every other input. Thus, after passing through multiple layers of self attention, the transformer can learn to encode all the necessary meaning of the sequence into that single vector.
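Here is an illustrative sketch of that idea using PyTorch’s built-in transformer encoder. The layer sizes, number of layers, and random stand-in embeddings are arbitrary; real CLIP models have their own tokenization and embedding details.

```python
# Illustrative only: reduce a transformer's per-token output to one vector
# by keeping the last token, as described above. Sizes are made up.
import torch
import torch.nn as nn

embed_dim, num_heads, seq_len, batch = 512, 8, 16, 4

encoder_layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads, batch_first=True)
encoder = nn.TransformerEncoder(encoder_layer, num_layers=6)

token_embeddings = torch.randn(batch, seq_len, embed_dim)  # stand-in for embedded text
hidden_states = encoder(token_embeddings)                  # [batch, seq_len, embed_dim]

# The single vector used to represent the whole sequence
text_features = hidden_states[:, -1, :]                    # [batch, embed_dim]
print(text_features.shape)  # torch.Size([4, 512])
```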
Feel free to refer to my article on transformers for more in-depth information. In the next section we’ll talk about image encoders, which convert an image into a representative vector.
The Image Encoder
At its highest level, the image encoder converts an image into a vector (a list of numbers) that represents the image’s meaning.
There are a few approaches to image encoders which are discussed in the CLIP paper. In this article we’ll consider ResNet-50, a time-tested convolutional approach which has been applied to several general image tasks. I’ll be covering ResNet in a future post, but for the purposes of this article we can just think of ResNet as a classic convolutional neural network.
A convolutional neural network is a strategy of image modeling which filters an image with a small matrix of values called a kernel. It sweeps the kernel across the image and calculates a new value for each position based on the kernel and the pixels it currently covers.
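A toy numpy sketch of that sweep, with a made-up 6×6 “image”, a simple edge-detection kernel, no padding, and a stride of one:

```python
# A toy example of sweeping a 3x3 kernel over a small grayscale image.
import numpy as np

image = np.random.rand(6, 6)           # stand-in for a grayscale image
kernel = np.array([[-1, -1, -1],
                   [-1,  8, -1],
                   [-1, -1, -1]])       # a simple edge-detection kernel

out_h, out_w = image.shape[0] - 2, image.shape[1] - 2
output = np.zeros((out_h, out_w))
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i+3, j:j+3]            # the pixels the kernel covers
        output[i, j] = np.sum(patch * kernel)  # new value for this position

print(output.shape)  # (4, 4)
```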
The whole idea behind a convolutional network is, by doing a combination of convolutions and downsampling of an image, you can extract more and more subtle feature representations. Once an image has been condensed down to a small number of high quality abstract features, a dense network can then be used to turn those features into some final output. I talk about this more in depth in another post, specifically the role of the dense network at the end, called a projection head.
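As a rough sketch of how this looks as an image encoder, we could take torchvision’s ResNet-50, drop its classification layer, and add a linear projection head. The 512-dimensional embedding size here is an arbitrary choice, and the actual CLIP ResNet variants differ in their details.

```python
# A sketch of a ResNet-50 backbone plus projection head as an image encoder.
import torch
import torch.nn as nn
from torchvision.models import resnet50

backbone = resnet50(weights=None)        # the convolutional feature extractor
backbone.fc = nn.Identity()              # drop the original classification head
projection = nn.Linear(2048, 512)        # projection head into the embedding space

images = torch.randn(4, 3, 224, 224)     # a batch of stand-in images
features = backbone(images)              # [4, 2048] pooled convolutional features
image_embeddings = projection(features)  # [4, 512] vectors summarizing each image
print(image_embeddings.shape)
```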
From the perspective of CLIP, the end result is a vector which can be thought of as a summary of the input image. This vector, along with the summary vectors for text, is then used to construct a multi-modal embedding space, which we’ll discuss in the next section.
The Multi-Modal Embedding Space, and CLIP Training
In the previous two sections we went over modeling strategies which can summarize text and images into vectors. In this section we’ll go over how CLIP uses those vectors to build strong image and language representations.
The idea of summarizing something complicated into an abstract vector is generally referred to as an “embedding”. We “embed” things like images and text into a vector as a way to summarize their general meaning.
We can think of these embedding vectors as representing the input as some point in high dimensional space. For demonstrative purposes we can imagine creating encoders which embed their input into a vector of length two. These vectors could then be considered as points in a two dimensional space, and we could draw their positions.
We can think of this space as the Multi-Modal Embedding Space, and we can train CLIP (by training the image and text encoders) to put these points in spots such that positive pairs are close to each other.
There are a lot of ways to define “close” in machine learning. Arguably the most common approach is cosine similarity, which is what CLIP employs. The idea behind cosine similarity is that we can say two vectors are similar if the angle between them is small.
The term “cosine” comes from the cosine function, a trigonometry function which calculates the ratio of the adjacent leg of a right triangle to the hypotenuse for a given angle. If that sounds like gibberish, no big deal: if the angle between two vectors is small, the cosine of that angle will be close to 1. If the vectors are 90 degrees apart, the cosine will be zero. If the vectors point in opposite directions, the cosine will be -1. As a result, cosine gives you big numbers when the vectors point in the same direction and small numbers when they don't.
The cosine of the angle between two vectors can be calculated by measuring the angle between them and then passing that angle through the cosine function. Printing out all the vectors and measuring the angle between them using a protractor might slow down our training time, though. Luckily, we can use the following identity to calculate the cosine of the angle between two vectors:
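That identity is the standard cosine-similarity formula: the dot product of the two vectors divided by the product of their lengths, cos(θ) = (A · B) / (‖A‖ ‖B‖). A quick numpy check with two arbitrary example vectors shows both routes give the same answer:

```python
# Checking that the identity matches the "measure the angle" approach.
import numpy as np

A = np.array([1.0, 2.0])
B = np.array([2.0, 1.0])

# Route 1: measure each vector's angle, subtract, then take the cosine
angle_A = np.arctan2(A[1], A[0])
angle_B = np.arctan2(B[1], B[0])
print(np.cos(angle_B - angle_A))                               # 0.8

# Route 2: the identity, no angle measurement required
print(np.dot(A, B) / (np.linalg.norm(A) * np.linalg.norm(B)))  # 0.8
```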
If you were already daunted by math, you might feel even more daunted now. But I’ll break it down: