YOLO is a landmark object detection model which can quickly classify and localize numerous objects within an image. This summary goes over all critical mathematical operations within a YOLO model. If you’d like to learn more about the intuition behind YOLO, check out the IAEE article.
Step 1: Defining the Input
To use a YOLO model, an RGB image must first be converted into a 448 x 448 x 3 tensor.
We’ll be using a simplified 5 x 5 x 1 tensor so the math takes up less space.
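Here's a minimal numpy sketch of such an input. The actual pixel values don't matter for the walkthrough, so random placeholders stand in for the example matrix:

```python
import numpy as np

# A real YOLO input is a 448 x 448 x 3 RGB tensor. For the walkthrough we
# use a small 5 x 5 single-channel stand-in; the values are arbitrary.
rng = np.random.default_rng(0)
x = rng.uniform(0, 255, size=(5, 5))
print(x.round(1))
```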
Step 2: Layer Normalization
Neural networks typically perform better on normalized data. We can normalize the input by first computing the average value in the matrix (µ).
Next, the absolute difference between each element and the average can be calculated.
Then, the standard deviation can be calculated by squaring all values from the previous step, adding them together, dividing by the number of values, and taking the square root.
Once the standard deviation has been calculated, the input can be layer normalized by subtracting the mean and dividing by the standard deviation.
The mean is the average value of the input image, and the standard deviation is how broad the distribution of values is in the original image. By subtracting the mean and dividing by the standard deviation, we “normalize” the image.
Note: the original YOLO paper used batch norm, which normalizes the same values across different images in a batch; here we calculated the layer norm instead. The conceptual difference between the two is negligible.
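As a sketch, the whole normalization step looks like this in numpy (the input values are placeholders, not the ones from the example):

```python
import numpy as np

x = np.random.default_rng(0).uniform(0, 255, size=(5, 5))  # stand-in input

mu = x.mean()                            # average value (µ)
sigma = np.sqrt(((x - mu) ** 2).mean())  # square, average, square root
x_norm = (x - mu) / sigma                # subtract mean, divide by std

print(round(x_norm.mean(), 6), round(x_norm.std(), 6))  # ~0 and ~1
```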
Step 3: Convolution
Now that our input has been normalized, we’ll pass it through a convolutional network. We’ll idealize YOLO as a single convolutional layer with two kernels.
To ensure the output tensor has the same spatial dimensions as the input, a padding of zeros is applied to our normalized input.
Then both kernels can be convolved over the image via element-wise multiplication (⊙) and cumulative summation (Σ).
After both kernels have been convolved over the input, we get two arrays of equal size. These are typically represented as a 3D tensor where different kernels exist across a dimension called the “filter” or “kernel” dimension.
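A minimal sketch of this step, assuming 3 x 3 kernels with placeholder values:

```python
import numpy as np

x_norm = np.random.default_rng(0).standard_normal((5, 5))      # normalized input
kernels = np.random.default_rng(1).standard_normal((2, 3, 3))  # two 3 x 3 kernels

padded = np.pad(x_norm, 1)  # ring of zeros keeps the output 5 x 5

out = np.zeros((2, 5, 5))   # one map per kernel, stacked on the kernel dimension
for k in range(2):
    for i in range(5):
        for j in range(5):
            window = padded[i:i + 3, j:j + 3]
            out[k, i, j] = (window * kernels[k]).sum()  # ⊙ then Σ

print(out.shape)  # (2, 5, 5)
```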
Step 4: Max Pooling
Now that we’ve convolved our input, we can apply max pooling. In this example we max pool each convolved matrix with a 2 x 2 window of stride 2, pooling partial sections at the edges as well. In this case I used an implementation of max pooling that sets the value to zero if all values in the window are negative; practically, I think this has little impact.
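Here's a sketch of that pooling scheme in numpy, including the partial edge windows and the zero-if-all-negative behavior:

```python
import numpy as np

fmap = np.random.default_rng(0).standard_normal((5, 5))  # one convolved map

pooled = np.zeros((3, 3))  # 5 / 2 rounds up to 3 because partial windows count
for i in range(3):
    for j in range(3):
        window = fmap[2 * i:2 * i + 2, 2 * j:2 * j + 2]  # truncated at the edges
        pooled[i, j] = max(window.max(), 0.0)  # zero if the whole window is negative

print(pooled.round(2))
```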
Step 5: Non-Linear Activation
Virtually all ML models, YOLO included, employ non-linear “activation functions” throughout the model. As all math before this point has been linear (multiplication and addition), the previous steps can only model linear relationships. Adding a function that maps values within the model non-linearly allows the model to learn non-linear relationships. In this example we’re using the sigmoid activation function, but ReLU is more common.
Note: it’s slightly more efficient to apply activation functions after max pooling, since there are fewer values to pass through the function.
This function can be applied to all max pooled matrices element-wise.
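A sketch of the element-wise sigmoid:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

pooled = np.random.default_rng(0).standard_normal((2, 3, 3))  # two pooled maps
activated = sigmoid(pooled)  # applied element-wise; every value lands in (0, 1)
print(activated.min().round(3), activated.max().round(3))
```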
Step 6: Flattening
Now that the input image has been filtered into an abstract representation better suited to the end modeling task (in reality by several convolutional layers rather than the single convolutional layer in this example), it can be converted into a vector via flattening.
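Flattening is a single reshape:

```python
import numpy as np

activated = np.random.default_rng(0).uniform(size=(2, 3, 3))  # two 3 x 3 maps
flat = activated.flatten()  # row-major unrolling into a single vector

print(flat.shape)  # (18,) -- the length L used in the next step
```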
Step 7: Output Projection
A dense network (which is just a matrix multiplication) can be used to project the flattened vector into the final output. The final output of YOLO consists of S x S x C class predictions and S x S x B x 5 bounding box predictions, where S is the grid size, C is the number of classes, and B is the number of bounding boxes per grid cell. Thus, the output must be of shape S x S x (C + B x 5). Assuming the output of flattening in the previous step has a length L, the weight matrix of the dense network must be of shape L x (S x S x (C + B x 5)).
In this example we’ll assume S is 1, C is 2, and B is 1. L is the length of the flattened vector, which is 18. Therefore the shape of the weight matrix should be 18 x 7.
Note: matrices denoted with ` are visually transposed.
Class probabilities per grid square should sum to 1. Thus, the predicted classes for each grid square are softmaxed.
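A sketch of the projection with the example's dimensions. I'm assuming the C class slots come first in the output vector, per the S x S x (C + B x 5) layout above; the weights here are untrained placeholders:

```python
import numpy as np

S, C, B = 1, 2, 1
L = 18

flat = np.random.default_rng(0).uniform(size=L)  # flattened features
W = np.random.default_rng(1).standard_normal((L, S * S * (C + B * 5)))  # 18 x 7

out = flat @ W  # the dense layer is just a matrix multiplication

out[:C] = np.exp(out[:C]) / np.exp(out[:C]).sum()  # softmax the class slots
print(out.round(3))  # [p_class1, p_class2, x, y, w, h, confidence]
```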
Step 8: Inferencing
The final output of YOLO, in this example a 7-element vector, has been calculated. Now we can use those values to generate our final inference. Because S = 1, there is only one grid cell. Because B = 1, there is only one bounding box. Because C = 2, there are two class predictions.
YOLO predicted that the grid cell (the entire image in this case) contains a dog. The bounding box is centered 50% of the way from the left wall and 50% of the way from the top wall of the cell. Its width is 30% of the width of the cell, and its height is 70% of the height of the cell. Also, YOLO is 90% confident that this is a good bounding box.
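Here's how that readout could look in code, assuming the [classes, box] layout from the previous step (the second class label is made up for illustration):

```python
import numpy as np

# The 7-element output from the example above. The layout
# [p_dog, p_cat, x, y, w, h, conf] and the "cat" label are assumptions.
pred = np.array([0.9, 0.1, 0.5, 0.5, 0.3, 0.7, 0.9])

classes = ["dog", "cat"]
label = classes[int(np.argmax(pred[:2]))]
x, y, w, h, conf = pred[2:]

if conf > 0.5:  # 0.5 is an arbitrary threshold, not part of the model
    print(f"{label}: center ({x}, {y}), size ({w} x {h}), confidence {conf}")
```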
In a less trivial example, where there are four grid cells (S = 2, giving a 2 x 2 grid) with one bounding box per cell:
The confidence for the top-left and top-right cells is too low, so their bounding boxes aren’t used; the other two are. Note that the height (and width) can be greater than 1, as a bounding box can be a multiple of the grid cell’s size.
Conclusion
And that’s it. In this article we covered the major steps to computing the output of YOLO:
Defining the input
Normalizing the input
Applying convolution
Applying max pooling
Applying non-linear activation
Flattening
Projecting into the output shape
Composing the final inference
If you’re interested in understanding the theory, check out the IAEE article.
If you’ve enjoyed this article, subscribe for the highest quality Data Science content on the internet.
Feel free to share it with friends and colleagues. This article is free and publicly accessible to all readers.
Level up your AI skills by joining the IAEE discord. Get access to weekly lectures and next-level learning.
I would be thrilled to answer any questions or thoughts you might have about the article. An article is one thing, but an article combined with thoughts, ideas, and considerations holds much more educational power!