Convolutional Networks — Intuitively and Exhaustively Explained
Unpacking a cornerstone modeling strategy
Convolutional neural networks are a mainstay in computer vision, signal processing, and a massive number of other machine learning tasks. They’re fairly straightforward and, as a result, many people take them for granted without really understanding them. In this article we’ll go over the theory of convolutional networks, intuitively and exhaustively, and we’ll explore their application within a few use cases.
Who is this useful for? Anyone interested in computer vision, signal analysis, or machine learning.
How advanced is this post? This is a very powerful, but very simple concept; great for beginners. This also might be a good refresher for seasoned data scientists, particularly in considering convolutions in various dimensions.
Pre-requisites: A general familiarity with backpropagation and dense neural networks might be useful, but is not required. I cover both of those in a previous post.
The Reason Convolutional Networks Exist
The first topic many fledgling data scientists explore is a dense neural network. This is the classic neural network consisting of nodes and edges which have certain learnable parameters. These parameters allow the model to learn subtle relationships about the topics they’re trained on.
As the number of neurons grows within the network, the connections between layers become more and more abundant. This can allow complex reasoning, which is great, but the “denseness” of dense networks presents a problem when dealing with images.
Let’s say we wanted to train a dense neural network to predict if an image contains a dog or not. We might create a dense network which looks at each pixel of the image, then boil that information down to some final output.
Already we’re experiencing a big problem. Skipping through some math to get to the point, for this tiny little network we would need 1,544 learnable parameters. For a larger image we would need a larger network. Say we have 64 neurons in the first layer and we want to learn to classify images that are 256x256 pixels. The weights of the first layer alone would number 256 × 256 × 64 = 4,194,304 parameters. That’s a lot of parameters for what is still a pretty tiny image.
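To make that scaling concrete, here’s a small sketch in PyTorch that counts the parameters in such a first layer (the layer sizes here are just the numbers from the example above, not a recommendation):

```python
import torch.nn as nn

# A 256x256 grayscale image flattened into a single vector
input_size = 256 * 256          # 65,536 input values
hidden_size = 64                # 64 neurons in the first layer

first_layer = nn.Linear(input_size, hidden_size)

# weights: 65,536 * 64 = 4,194,304, plus 64 biases
num_params = sum(p.numel() for p in first_layer.parameters())
print(num_params)               # 4194368
```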
Another problem with dense networks is their sensitivity to minor changes in an image. Say we made two versions of our dog image: one with the dog at the top of the image, and one with the dog at the bottom.
Even though these images are very similar to the human eye, their values from the perspective of a neural network are very different. The neural network not only has to logically define a dog, but also needs to make that logical definition of a dog robust to all sorts of changes in the image. This might be possible, but that means we need to feed the network a lot of training data, and because dense networks have such a large number of parameters, each of those training steps is going to take a long time to compute.
So, dense networks aren’t good at images; they’re too big and too sensitive to minor changes. In the next sections we’ll go over how convolutional networks address both of these issues, first by defining what a convolution is, then by describing how convolution is done within a neural network.
Convolution in a Nutshell
At the center of the Convolutional Network is the operation of “convolution”. A “convolution” is the act of “convolving” a “kernel” over some “target” in order to “filter” it. That’s a lot of words you may or may not be familiar with, so let’s break it down. We’ll use edge detection within an image as a sample use case.
A kernel, in the context of convolution, is a small array of numbers.
This kernel can be used to transform an input image into another image. The act of using a standard operation to transform an input into an output is typically called “filtering” (think Instagram filters used to modify images).
The filtering actually gets done with “convolution”. The kernel, which is much smaller than the input image, is placed at every possible position within the image. At each position, the values of the kernel are multiplied by the overlapping values of the input image, and the results are summed to produce a single value in the output image.
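As a concrete sketch, here’s what convolving a simple vertical edge-detection kernel over a tiny grayscale “image” might look like with NumPy and SciPy (the image and kernel values are illustrative assumptions):

```python
import numpy as np
from scipy.signal import convolve2d

# A tiny "image": dark on the left, bright on the right
image = np.array([
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
    [0, 0, 0, 9, 9, 9],
], dtype=float)

# A simple vertical edge-detection kernel
kernel = np.array([
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
], dtype=float)

# Slide the kernel over every valid position, multiply and sum
edges = convolve2d(image, kernel, mode="valid")
print(edges)  # large-magnitude values where the dark-to-bright edge is
```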
In machine learning, convolution is most often applied to images, but it works perfectly well in other domains. You can convolve a wavelet over a one dimensional signal, or convolve a three dimensional kernel over a three dimensional space. Convolution can take place in an arbitrary number of dimensions.
We’ll stay in two dimensions for most of this article, but it’s important to keep the general aspect of convolutions in mind; they can be used for many problem types outside of computer vision.
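For instance, the one dimensional case is just a kernel slid along a signal. A minimal NumPy sketch, with a made-up signal and a simple smoothing kernel:

```python
import numpy as np

# A noisy one dimensional signal (values are arbitrary)
signal = np.array([0.0, 0.2, 1.1, 0.9, 1.0, 0.1, 0.0, 0.1])

# A small smoothing kernel
kernel = np.array([1/3, 1/3, 1/3])

# Slide the kernel along the signal, multiplying and summing at each position
smoothed = np.convolve(signal, kernel, mode="valid")
print(smoothed)  # each output value is a local average of three inputs
```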
So, now we know what convolution is and how it works. In the next section we’ll explore how this idea can be used to build models.
Convolutional Neural Networks in a Nutshell
The whole idea of a convolutional network is to use a combination of convolutions and downsampling to incrementally break down an image into a smaller and more meaningful representation. Typically this broken down representation of the image is then passed to a dense network to generate the final inference.
Similarly to a fully connected neural network which learns weights between connections to get better at a task, convolutional neural networks learn the values of kernels within the convolutional layers to get better at a task.
There are many ways to downsample in a convolutional network, but the most common approach is max pooling. Max pooling is similar to convolution, in that a window is swept across an entire input. Unlike convolution, max pooling only preserves the maximum value from the window, not some combination of the entire window.
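A minimal sketch of 2x2 max pooling on a small feature map (the values are arbitrary):

```python
import torch
import torch.nn as nn

feature_map = torch.tensor([
    [1., 3., 2., 0.],
    [4., 6., 1., 2.],
    [5., 2., 9., 7.],
    [1., 0., 3., 8.],
])

# MaxPool2d expects (batch, channels, height, width)
pool = nn.MaxPool2d(kernel_size=2, stride=2)
pooled = pool(feature_map.reshape(1, 1, 4, 4))
print(pooled.squeeze())
# tensor([[6., 2.],
#         [5., 9.]])
```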
So, through layers of convolution and max pooling, an image is incrementally filtered and downsampled. With each successive layer the representation becomes more and more abstract, and smaller and smaller in size, until what remains is an abstract and condensed representation of the original image.
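Putting those pieces together, a toy convolutional network might look like the following PyTorch sketch (the layer sizes, number of kernels, and image size are arbitrary assumptions, not a recommended architecture):

```python
import torch
import torch.nn as nn

class TinyConvNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, kernel_size=3, padding=1),   # learnable kernels
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(8, 16, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        # The condensed representation is handed to a small dense network
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 16 * 16, 1),   # e.g. "dog" vs "not dog"
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = TinyConvNet()
x = torch.randn(1, 3, 64, 64)        # one 64x64 RGB image
print(model(x).shape)                # torch.Size([1, 1])
```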
And that’s where a lot of people stop in terms of theory. However, convolutional neural networks have some more critical concepts which people often disregard; particularly, the feature dimension and how convolution relates to it.
⚠️ Epilepsy Warning: The following sections contain rapidly moving animations⚠️
The Feature Dimension
You might have noticed that, in some of the previous examples, we used grayscale images. In reality images typically have three color channels: red, green, and blue. In other words, an image has two spatial dimensions (width and height) and one feature dimension (color).
This idea of the feature dimension is critical to the thorough understanding of convolutional networks, so let’s look at a few examples:
Example 1) RGB images
Because an image contains two spatial dimensions (height and width) and one feature dimension (color), an image can be conceptualized as three dimensional.
Generally, convolutional networks move their kernel along all spatial dimensions, but not along the feature dimension. With a two dimensional input like an image, one usually uses a three dimensional kernel which spans the full depth of the feature dimension but has a smaller width and height. This kernel is then swept along the spatial dimensions.
Typically, instead of doing one convolution, it’s advantageous to do multiple convolutions, each with different kernels. This allows the convolutional network to create multiple representations of the image. Each of these convolutions uses its own learnable kernel, and the representations are concatenated together along the feature axis.
As you may have inferred, you can have an arbitrary number of kernels, and can thus create a feature dimension of arbitrary depth. Many convolutional neural networks use a different number of features at various points within the model.
Max pooling typically only considers a single feature layer at a time. In essence, we just do max pooling on each individual feature layer.
Those are the two main operations, convolution and max pooling, on a two dimensional RGB image.
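To make the shapes concrete, here’s a sketch of how multiple kernels turn a 3-channel image into a deeper feature dimension, and how max pooling shrinks only the spatial dimensions while treating each feature channel independently (the sizes are arbitrary assumptions):

```python
import torch
import torch.nn as nn

image = torch.randn(1, 3, 32, 32)   # (batch, features=RGB, height, width)

# 16 separate kernels, each spanning all 3 input channels
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1)
pool = nn.MaxPool2d(kernel_size=2)   # applied to each feature channel independently

features = conv(image)
print(features.shape)                # torch.Size([1, 16, 32, 32])

pooled = pool(features)
print(pooled.shape)                  # torch.Size([1, 16, 16, 16])

# Each kernel is a learnable (3, 3, 3) block of parameters
print(conv.weight.shape)             # torch.Size([16, 3, 3, 3])
```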
Example 2) Stereo Audio
While time series signals like audio are often thought of as one dimensional, they typically have two dimensions: one representing time, and another representing multiple values at each point in time. For instance, stereo audio has two separate channels, one for the left ear and one for the right.
This can be conceptualized as a signal with one spatial dimension (time) and one feature dimension (which ear).
Applying convolutions and max pooling to this data is very similar to images, except instead of iterating over two dimensions, we only iterate over one.
Max pooling is also similar to the image approach discussed previously. We treat each feature channel separately, slide a window along the time axis, and preserve the maximum value within that window.
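A sketch of the same two operations in one dimension, treating stereo audio as a signal with two feature channels (the sample rate, kernel sizes, and number of kernels are illustrative assumptions):

```python
import torch
import torch.nn as nn

# (batch, features=left/right channel, time)
audio = torch.randn(1, 2, 16000)     # one second of stereo audio at 16 kHz

# 8 kernels, each spanning both input channels and 9 samples in time
conv = nn.Conv1d(in_channels=2, out_channels=8, kernel_size=9, padding=4)
pool = nn.MaxPool1d(kernel_size=4)   # max over windows along the time axis only

features = conv(audio)
print(features.shape)                # torch.Size([1, 8, 16000])

pooled = pool(features)
print(pooled.shape)                  # torch.Size([1, 8, 4000])
```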
Example 3) MRI/CT scans
Depending on the application, scan data can be conceptualized as either three dimensional or two dimensional.
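For the fully three dimensional case, where the kernel is swept along depth, height, and width, a minimal sketch might look like this (the volume size and channel counts are assumptions):

```python
import torch
import torch.nn as nn

# (batch, features, depth, height, width) -- a single-channel volumetric scan
scan = torch.randn(1, 1, 64, 128, 128)

# The kernel is swept along all three spatial dimensions
conv = nn.Conv3d(in_channels=1, out_channels=8, kernel_size=3, padding=1)
pool = nn.MaxPool3d(kernel_size=2)

features = pool(conv(scan))
print(features.shape)                # torch.Size([1, 8, 32, 64, 64])
```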