YOLO — Intuitively and Exhaustively Explained
The genesis of the most widely used object detection models.
In this post we’ll discuss YOLO, the landmark paper that laid the groundwork for modern real-time computer vision. We’ll start with a brief chronology of some relevant concepts, then go through YOLO step by step to build a thorough understanding of how it works.
Who is this useful for? Anyone interested in computer vision or cutting-edge AI advancements.
How advanced is this post? This article should be accessible to technology enthusiasts, and interesting to even the most skilled data scientists.
Prerequisites: A good working understanding of standard neural networks. Some cursory experience with convolutional networks may also be useful.
A Brief Chronology of Computer Vision Before YOLO
The following sections contain useful concepts and technologies to know before getting into YOLO. Feel free to skip ahead if you feel confident.
Types of Computer Vision Problems
Computer vision is a family of related problems, all of which involve enabling computers to “see” things in some way. Typically, computer vision is broken up into the following tasks:
Image Classification: the task of trying to classify an entire image. For instance, one might classify an entire image as containing a cat or a dog.
Object Detection: the task of finding instances of an object within an image, and where those instances are.
Image Segmentation: the task of identifying the individual pixels within an image that correspond to a specific object. So, for instance, identifying all the pixels within an image that correspond to dogs.
Convolutional Neural Networks
YOLO employs a form of model called a “Convolutional Neural Network”. A convolutional neural network (CNN for short) is a style of neural network that applies a filter, called a “kernel”, over an image.
These kernels are simply blocks of numbers. If the numbers in a kernel change, the result of the filtering process changes.
The actual filtering process consists of the kernel being swept across various parts of an image. At a given location, the kernel’s values are multiplied by the corresponding values in the image, and the results are added together to produce a single output value. This process of “sweeping” is how CNNs get their name: in math, sweeping a filter over data in this way is called “convolving”.
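To make the sweeping concrete, here’s a minimal sketch of a single convolution in plain NumPy (the image and kernel values are arbitrary, purely for illustration; in a real CNN the kernel values are learned):

```python
import numpy as np

# a tiny grayscale "image" and an arbitrary 3x3 kernel, purely for illustration
image = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.array([[1., 0., -1.],
                   [2., 0., -2.],
                   [1., 0., -1.]])  # in a real CNN these values are learned

out_h = image.shape[0] - kernel.shape[0] + 1
out_w = image.shape[1] - kernel.shape[1] + 1
output = np.zeros((out_h, out_w))

# sweep the kernel across the image: multiply element-wise, then sum
for i in range(out_h):
    for j in range(out_w):
        patch = image[i:i + 3, j:j + 3]
        output[i, j] = np.sum(patch * kernel)

print(output.shape)  # (4, 4)
```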
For computer vision tasks, CNNs typically apply convolution and information compression over successive steps to break down an image into some dense and meaningful representation. This representation is then used by a classic neural network to achieve some final task.
The most common way a CNN compresses an image down into a meaningful representation is by employing “max pooling”. Basically, you break the image up into N by N squares, and from each square you keep only the highest value.
After a model has filtered (convolved) and downsampled (max pooled) an image over numerous iterations, the result is a compressed representation that contains key information about the image. This is often passed through a dense network (a classic neural network) to produce the final output.
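As a rough sketch, 2x2 max pooling with a stride of 2 can be expressed in a few lines of NumPy (assuming the input’s height and width are divisible by 2):

```python
import numpy as np

features = np.random.rand(8, 8)  # e.g. the output of a convolutional layer

# 2x2 max pooling with stride 2: keep only the largest value in each 2x2 square
pooled = features.reshape(4, 2, 4, 2).max(axis=(1, 3))

print(features.shape, "->", pooled.shape)  # (8, 8) -> (4, 4)
```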
If you want to learn more about convolutional networks, I wrote a whole article on the topic:
If you’re interested in the structure of CNNs, and how backbones and heads can be used in advanced training processes, you might be interested in this article:
Early Object Detection with Sliding Window
Before approaches like YOLO, “sliding window” was the go-to strategy in object detection. Recall that the goal of object detection is to detect instances of some object within an image.
In the sliding window approach, the idea is to sweep a window across an image and classify the contents of the window with a classification model.
Once classifications have been calculated, a final bounding box can be defined by simply combining all the classified windows.
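A rough sketch of the strategy, assuming some hypothetical classify function that returns a label and a score for a given patch (the window size, stride, and threshold below are arbitrary):

```python
import numpy as np

def classify(patch):
    # hypothetical stand-in for a trained image classifier
    return "dog", float(np.random.rand())

def sliding_window_detect(image, window=64, stride=32, threshold=0.9):
    """Sweep a fixed-size window over the image and keep confident classifications."""
    hits = []
    height, width = image.shape[:2]
    for y in range(0, height - window + 1, stride):
        for x in range(0, width - window + 1, stride):
            label, score = classify(image[y:y + window, x:x + window])
            if score > threshold:
                hits.append((x, y, window, window, label, score))
    return hits  # overlapping hits would then be combined into final boxes

detections = sliding_window_detect(np.zeros((256, 256, 3)))
```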
There are a few tricks one can use to get this process working better. However, the sliding window strategy of object detection still suffers from two key problems:
It’s very computationally intensive (you may have to run a model tens, hundreds, or even thousands of times per image)
The resulting bounding boxes are imprecise, since they’re pieced together from coarse windows
Selective Search and R-CNN
Instead of arbitrarily sweeping some window through an image, the idea of selective search is to find better windows based on the content of the image itself. In selective search, small regions within an image that contain many similar pixels are found first; then similar neighboring regions are merged together over successive iterations to build larger regions. These large regions can be used to recommend bounding boxes.
With selective search, instead of finding random windows based on sweeping, bounding boxes are suggested by the image itself. Several approaches have used selective search to drastically improve object detection.
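As a heavily simplified sketch of that merging idea (real selective search uses richer similarity measures over color, texture, size, and fill; this toy version, with a made-up name, just merges neighboring grid cells with similar average color):

```python
import numpy as np

def toy_region_proposals(image, cell=16, color_threshold=20.0):
    """Merge neighboring grid cells with similar average color into regions,
    then return each region's bounding box as a proposal."""
    h, w, _ = image.shape
    gh, gw = h // cell, w // cell
    means = image[:gh * cell, :gw * cell].reshape(gh, cell, gw, cell, 3).mean(axis=(1, 3))

    # union-find over grid cells: merge neighbors whose mean colors are close
    parent = list(range(gh * gw))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for r in range(gh):
        for c in range(gw):
            for dr, dc in ((0, 1), (1, 0)):  # right and down neighbors
                nr, nc = r + dr, c + dc
                if nr < gh and nc < gw and np.linalg.norm(means[r, c] - means[nr, nc]) < color_threshold:
                    parent[find(r * gw + c)] = find(nr * gw + nc)

    # the bounding box of each merged region, in pixel coordinates
    boxes = {}
    for r in range(gh):
        for c in range(gw):
            root = find(r * gw + c)
            x0, y0, x1, y1 = boxes.get(root, (w, h, 0, 0))
            boxes[root] = (min(x0, c * cell), min(y0, r * cell),
                           max(x1, (c + 1) * cell), max(y1, (r + 1) * cell))
    return list(boxes.values())

proposals = toy_region_proposals(np.random.randint(0, 256, (128, 128, 3)).astype(float))
```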
One of the most famous models to use this trick is R-CNN, which trained a tailored convolutional network based on proposed regions in order to enable high quality object detection.
R-CNN was a mainstay in computer vision for a while, and spawned many derivative ideas. However, it’s still very computationally intensive.
YOLO blew the paradigm of R-CNN out of the water, and inspired a fundamentally new way of thinking about image processing that remains relevant to this day. Let’s get into it.
YOLO: You Only Look Once
The idea of YOLO is to do everything in one pass of a CNN, hence why it’s called “You Only Look Once”. That means a single CNN, in a single pass, has to somehow find numerous different instances of objects, correctly classify them, and draw bounding boxes around them.
To achieve this, the authors of YOLO broke down the task of object detection into two sub-tasks, and built a model to do those sub-tasks simultaneously.
Subtask 1) Regionalized Classification
YOLO breaks images up into some arbitrary number of regions, and then classifies all those regions at the same time. It does this by modifying the output structure of a traditional CNN.
Normally a CNN compresses an image into a dense 2D representation, then a process called flattening is applied to that representation to turn it into a single vector, which can in turn be fed into a dense network to generate a classification.
Unlike this traditional approach, YOLO predicts classes for sub-regions of the image rather than the entire image.
In YOLO, the convolution output is flattened like normal, but then the output is converted back into a 2D representation of shape S x S x C, where S represents how finely the image is subdivided into regions and C represents the number of classes being predicted. Both S and C are configurable parameters which can be used to apply YOLO to different tasks.
Provided this model is trained correctly (we’ll cover that later), a model with this structure could classify numerous regions within an image in a single pass.
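As a minimal sketch of this output structure in PyTorch (the feature size and dense head here are placeholders, not the exact YOLO configuration; S = 7 and C = 20 are the values used for Pascal VOC in the paper):

```python
import torch
import torch.nn as nn

S, C = 7, 20  # 7x7 grid of regions, 20 classes

# stand-in for the convolutional backbone's flattened output
flattened_features = torch.randn(1, 1024)

# a dense head mapping the flattened features to one class distribution per grid cell
head = nn.Linear(1024, S * S * C)
class_grid = head(flattened_features).view(1, S, S, C)

print(class_grid.shape)  # torch.Size([1, 7, 7, 20])
```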
Not only is this more efficient than R-CNN (a single inference generates classes for an entire image), it also helps with accuracy. When R-CNN makes a prediction it only has access to the region proposed by selective search, which can make it prone to mistaking irrelevant background patches for objects.
CNNs have something called a “receptive field”, meaning a particular location within a CNN’s feature maps can only see a limited subset of the image. In theory this might cause YOLO to suffer from a similar issue and make poor predictions.
However, YOLO passes the dense 2D representation through a fully connected neural network, allowing the information within each receptive field to interact before the final prediction is made.
This allows YOLO to reason about the entire image before making the final prediction, a key difference that makes YOLO more robust than R-CNN in terms of contextual awareness.
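A small sketch of why the fully connected layer helps: once the spatial grid is flattened, every output value is a weighted combination of every spatial location, so each prediction can draw on the whole image (the shapes below are illustrative):

```python
import torch
import torch.nn as nn

# a 7x7 grid of 1024-dimensional features from a convolutional backbone (illustrative shape)
features = torch.randn(1, 1024, 7, 7)

flat = features.flatten(start_dim=1)  # every grid cell's features, concatenated together
fc = nn.Linear(flat.shape[1], 4096)   # each output mixes information from all grid cells
globally_aware = fc(flat)

print(flat.shape, "->", globally_aware.shape)  # [1, 50176] -> [1, 4096]
```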
Subtask 2) Bounding Box Prediction
In theory we could use the predicted regions from the previous step to draw bounding boxes, but the results wouldn’t be very good. Also, what if there were two dogs next to each other? It would be impossible to distinguish two dogs from one wide dog, because both would just look like a bunch of squares labeled dog.
To alleviate these issues YOLO predicts bounding boxes as well as class predictions.
YOLO assigns a “responsibility” to each square in the S x S grid. Basically, if a square contains the center of an object, then that square is responsible for creating the bounding box for that object.
The square responsible for a given object is in charge of drawing the bounding box for that object.
On top of the S x S x C tensor for class prediction (which we covered in the previous section), YOLO also predicts an S x S x B x 5 tensor for bounding box prediction. In this tensor, S represents the divisions of the image (as before), and B represents the number of bounding boxes each grid square can create. The 5 values represent:
Bounding box width
Bounding box height
Bounding box horizontal offset
Bounding box vertical offset
Bounding box confidence
So, in essence, YOLO creates a bunch of bounding boxes for each square in the S x S grid. Specifically, YOLO creates B bounding boxes per square.
If we only keep the bounding boxes with high confidence scores, along with the classes of the grid squares those bounding boxes correspond to, we get the final output of YOLO.
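A hedged sketch of how those two tensors might be combined into final detections, assuming the raw predictions already exist (the threshold is arbitrary, and a complete pipeline would also merge heavily overlapping boxes):

```python
import torch

S, B, C = 7, 2, 20
class_probs = torch.rand(S, S, C)  # S x S x C class predictions (stand-in values)
boxes = torch.rand(S, S, B, 5)     # S x S x B x 5 box predictions (stand-in values)

detections = []
for row in range(S):
    for col in range(S):
        cls = class_probs[row, col].argmax().item()  # the grid square's predicted class
        for b in range(B):
            w, h, x, y, conf = boxes[row, col, b].tolist()  # the 5 values listed above
            if conf > 0.5:  # keep only confident boxes
                detections.append((row, col, cls, x, y, w, h, conf))

print(len(detections))
```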
We’ll re-visit the idea of “confidence” when we explore how YOLO is trained. For now, let's take a step back and look at YOLO from a higher level.
The Architecture of YOLO
The cool thing about YOLO is that it does object detection in “one look”. In one pass of the model both subtasks of class prediction and bounding box prediction are done simultaneously.
We unify the separate components of object detection into a single neural network. — The YOLO paper
Essentially, this is done by YOLO outputting the S x S x C class predictions and the S x S x B x 5 bounding box predictions all in one shot, meaning YOLO outputs S x S x (B x 5 + C) values.
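For the configuration used in the YOLO paper (S = 7, B = 2, C = 20 for Pascal VOC), that works out to a surprisingly small output:

```python
S, B, C = 7, 2, 20            # the configuration used in the YOLO paper for Pascal VOC
values_per_cell = B * 5 + C   # 2 boxes x 5 numbers each, plus 20 class scores = 30
total_outputs = S * S * values_per_cell
print(total_outputs)          # 7 x 7 x 30 = 1470 output values per image
```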
Now that we understand the subtasks YOLO solves, and how it formats an output to solve those problems, we can start making sense of the actual architecture of YOLO.
As we’ve discussed, YOLO is a convolutional network that distills an image into a dense representation, then uses a fully connected network to construct the output.
In reality, the diagram above is somewhat of a simplification of the actual architecture, which is written out below the diagram. Let’s go through a few layers of YOLO to build a more thorough understanding.
First of all, the input image is an RGB image, meaning it has some width and height and three color channels. It looks like YOLO is designed to receive a square RGB image of width 448 and height 448. If you want to run YOLO on a smaller or larger image, you can simply resize the image to 448 x 448.
The first layer of YOLO is listed as 7x7x64-s-2, which means it’s a convolutional layer with 64 kernels of size 7x7 and a stride of 2.
When a convolutional model has multiple kernels, each of those kernels consists of different learned parameters, and they work together to make the final output.
In this particular layer, each kernel has a width and height of 7, and instead of moving one space at a time, it moves two spaces at a time.
So, the first convolutional layer consists of 64 kernels of size 7x7 and a stride of 2.
After the first convolutional layer, the data is passed through a max pool of size 2x2 with a stride of 2.
The stride-2 kernel reduces the spatial dimensions of the input by half, the stride-2 max pool reduces them by another half, and the 64 kernels convert our 3 color channels into 64 feature channels. This results in a tensor of shape 112 x 112 x 64.
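A sketch of those first two layers in PyTorch (the original was implemented in Darknet, and the padding here is chosen so the spatial math works out to exactly half, so treat this as illustrative):

```python
import torch
import torch.nn as nn

x = torch.randn(1, 3, 448, 448)  # a batch of one 448x448 RGB image

conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3)  # "7x7x64-s-2"
pool1 = nn.MaxPool2d(kernel_size=2, stride=2)                 # 2x2 max pool, stride 2

x = conv1(x)  # 448 -> 224 spatially, 3 -> 64 channels
x = pool1(x)  # 224 -> 112 spatially

print(x.shape)  # torch.Size([1, 64, 112, 112])
```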
The YOLO architecture has many layers, many of which behave fundamentally similarly. I won’t bore you with an exploration of every single layer, but there are a few design details which are worth highlighting.
The idea of a 1 x 1 convolution is interesting, and kind of flies in the face of the normal intuition around convolution. If a kernel is only looking at one pixel, what’s the point?
Recall that a convolution applies a kernel not only to some n x n region of the image, but also across all input channels.
So a 1 x 1 convolution is essentially a filter that operates over only the channel dimension, and not the spatial dimensions.
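A quick sketch of a 1 x 1 convolution in PyTorch, mixing and reducing channels while leaving the spatial dimensions untouched (the channel counts are illustrative):

```python
import torch
import torch.nn as nn

features = torch.randn(1, 512, 28, 28)  # 512 channels at 28x28 (illustrative)

conv_1x1 = nn.Conv2d(512, 256, kernel_size=1)  # mixes channels at each spatial location
reduced = conv_1x1(features)

print(reduced.shape)  # torch.Size([1, 256, 28, 28]): same width and height, fewer channels
```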
You may also wonder “why did the researchers who made YOLO settle on all of these numbers? Why 192 filters vs 200 here? Why a stride of 2 here and a stride of 1 there?”
The honest truth is that researchers usually combine what has worked for others with ideas that seem promising to them. YOLO could probably have some of its architectural details changed without a major impact on performance. If you choose to build a model like this yourself, you often start with a baseline model and experiment with different parameters to see if you can get something better.
One design constraint that YOLO does inherit from many CNNs is the concept of an information bottleneck. Throughout successive layers, the total amount of information is reduced. This is a pervasive concept in machine learning: by passing data through a bottleneck, you force a model to trim away irrelevant information and distill an input into its essence.
YOLO very heavily reduces the spatial dimensions while expanding the channel dimension, essentially meaning that YOLO heavily compresses each region of the image spatially while increasing the number of features representing that region.
Training YOLO
We’ve covered the nature of the output, as well as the structure of the model. Now let’s explore how the model is trained.