
“Dropout” is a fundamental technique that involves randomly deactivating parts of an AI model throughout the training process. As we’ll discuss, this random deactivation can substantially improve how well AI models generalize, which is why it features in the training process of many cutting-edge AI systems.
We’ll begin our exploration of dropout by reviewing how neural networks are trained. Once we do that, we’ll discuss how dropout can be applied to the training process and how it can lead to more generalized learning in AI systems. After we’ve formed a thorough conceptual understanding of dropout, we’ll go through an example using PyTorch.
Who is this useful for? Anyone who wants to form a complete understanding of the state of the art of AI, especially those interested in training AI models.
How advanced is this post? This article is great for beginners and serves as a good refresher for more experienced data scientists.
Pre-requisites: None
A Brief Review of the Neural Network
The following is a brief summary from my article on neural networks.
Neural networks take direct inspiration from the human brain, which is made up of billions of incredibly complex cells called neurons.

When we use certain neurons more frequently, their connections become stronger. When we don’t use certain neurons, those connections weaken. This general rule has inspired the phrase “Neurons that fire together, wire together” and is the high-level quality of the brain that is responsible for the learning process.

Neural networks are, essentially, a mathematically convenient and simplified version of neurons within the brain. A neural network is made up of elements called “perceptrons”, which are directly inspired by neurons.

A neural network can be conceptualized as a big network of these perceptrons, just like the brain is a big network of neurons.

One of the fundamental ideas of AI is that you can “train” a model. This is done by asking a neural network (which starts its life as a big pile of random parameters) to do some task. Then, you somehow update the model based on how the model’s output compares to a known good answer.

This happens via a process called backpropagation. You feed some input into the model and observe how the model's output deviates from the expected output.

You then look backward through the model and adjust each parameter a tiny bit so that the desired output is more likely.

The training process, then, is an iterative process of feeding large amounts of data through an AI model, calculating how different the output is from a desired answer, and adjusting the parameters of the model accordingly.
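To make that loop concrete, here’s a minimal sketch of it in PyTorch. The model, data, and hyperparameters here are placeholders chosen purely for illustration, not anything from a real problem:

```python
import torch
import torch.nn as nn

# A tiny placeholder model and some made-up data, purely for illustration
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

inputs = torch.randn(16, 4)   # a batch of example data
targets = torch.randn(16, 1)  # the "known good answers"

for step in range(100):
    predictions = model(inputs)           # forward pass
    loss = loss_fn(predictions, targets)  # how far off is the output?
    optimizer.zero_grad()
    loss.backward()                       # backpropagation
    optimizer.step()                      # nudge each parameter a tiny bit
```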
If you want to understand this process in depth, feel free to check out my article on neural networks. For our purposes, though, there’s one important detail I would like to discuss.
Because of how the training process works out mathematically, when two perceptrons “fire” together (have an output greater than zero), they are updated together. This is somewhat similar to the idea that “Neurons that fire together wire together” within the human brain.
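You can see a hint of this in PyTorch: with a ReLU activation, a perceptron whose output is zero receives no gradient on that pass, so only the perceptrons that “fired” get updated. Here’s a small sketch, with weights hand-picked (purely for illustration) so that one perceptron fires and one doesn’t:

```python
import torch
import torch.nn as nn

layer = nn.Linear(2, 2, bias=False)
# Hand-picked weights so that, for this input, one perceptron fires and one doesn't
with torch.no_grad():
    layer.weight.copy_(torch.tensor([[1.0, 0.0], [-1.0, 0.0]]))

x = torch.tensor([[2.0, 3.0]])
out = torch.relu(layer(x))  # tensor([[2., 0.]]) - only the first perceptron "fires"

out.sum().backward()
print(layer.weight.grad)
# The row for the perceptron that didn't fire is all zeros, so only the
# firing perceptron's weights get updated on this pass.
```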

A single perceptron is an incredibly simple learner, but a large cluster of them can learn to interact with each other in such a way that incredibly complex inferences can be made. This is chiefly the result of certain perceptrons learning to, or not to, fire together.
This is great, but it has some pitfalls when applying neural networks to real-world problems.
Overfitting
Overfitting is possibly the most fundamental problem in artificial intelligence and is essentially the reason why dropout exists. Recall that when you train a model, you update that model based on some example training data.

Let’s say we have a few pictures of cats and dogs, and we want to use them to train a model to be able to distinguish between cats and dogs.

The question is, how thoroughly do we want our model to learn from this dataset? That might sound like a silly question, but it’s one of great profundity where AI is concerned. One way our model could learn to distinguish between these photos is by learning the macroscopic features of cats and dogs, which is great.
Another way our model could learn to distinguish cats from dogs is by looking at the top left pixel and memorizing which colors come from dog photos and which come from cat photos.

The model's job is to learn which image is of a cat or a dog. Nothing more and nothing less. Memorizing which top left pixel color corresponds to a cat or a dog could result in a high level of accuracy when applying our model to our training data, so it’s a perfectly valid approach as far as our model is concerned.
When a model that learns features like this is applied to new problems, however, we’d likely have some problems.

This is why, when training an AI model, data scientists often employ a “holdout set”. This is a set of data that the model is not trained on and thus can be used to evaluate how well an AI model can generalize to data it’s never seen before.

Often, when training an AI model, the model will initially get better on both the training set and the holdout set as it learns fundamental aspects of the problem it’s being trained on. However, as the AI model looks at the training data more and more, it might begin to over-memorize the training data, which causes a reduction in performance on the holdout set.
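In code, a holdout set is usually nothing more than a slice of the data that the training loop never touches. A rough sketch, again with placeholder tensors rather than the cat/dog images from the example:

```python
import torch
import torch.nn as nn

# Placeholder data: 100 examples with 4 features each
data, labels = torch.randn(100, 4), torch.randn(100, 1)

# Train on the first 80 examples; hold out the last 20 for evaluation only
train_data, train_labels = data[:80], labels[:80]
holdout_data, holdout_labels = data[80:], labels[80:]

model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
loss_fn = nn.MSELoss()

# ... train on train_data / train_labels as in the earlier sketch ...

# Periodically check the loss on data the model has never been trained on.
# If this loss starts climbing while the training loss keeps falling,
# the model is beginning to memorize the training data.
with torch.no_grad():
    holdout_loss = loss_fn(model(holdout_data), holdout_labels)
```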

When a model begins to memorize the training data rather than learn about the underlying problem, we call that “overfitting”. There are a lot of ways to deal with overfitting, but one of them is dropout.
Dropout
The idea of dropout is to randomly turn off neurons throughout a neural network during the training process. Recall that when training a neural network, we have a forward pass, which results in a prediction, and a backward pass, where we update the model based on how that prediction compares to a desired result.

When doing this process with dropout, you deactivate a small, random subset of perceptrons on the forward pass. You then update the parameters of the model based on how its output deviates from the desired output.

When you train a model, you feed it data over and over again in a loop. In dropout, you randomly deactivate different perceptrons in the neural network at every iteration.
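Here’s a stripped-down sketch of that idea. This is just the concept, not PyTorch’s actual implementation (which we’ll use later); the drop probability and activations are arbitrary placeholders:

```python
import torch

drop_probability = 0.2
activations = torch.randn(1, 8)  # hypothetical outputs of 8 perceptrons

for iteration in range(3):
    # Draw a fresh random mask every iteration: 1 = keep, 0 = deactivate
    mask = (torch.rand_like(activations) > drop_probability).float()
    print(activations * mask)  # a different subset is zeroed out each time
```

In practice (and in PyTorch’s nn.Dropout), the surviving activations are also scaled up by 1 / (1 - drop_probability) during training so the expected output stays the same, and dropout is switched off entirely at inference time.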

This approach of randomly deactivating perceptrons has proven to be fundamental when training complex AI systems, and I think there are some intuitive reasons why. Here are three potential explanations:
Why Dropout Works 1) Reduced Co-Adaptation
As previously mentioned, when training a neural network, perceptrons that fire together tend to update together. This can cause a feedback loop where certain neurons that fired together update together and, thus, are more likely to fire together in the next training iteration.
Once a certain group of perceptrons in a neural network has been sufficiently updated together, it can be very hard for the training process to break up that group of neurons. This can be good because neurons firing together is a big part of neural network learning, but it’s easy for this phenomenon to result in a sub-optimal rut that is hard to train the model out of.
If we randomly deactivate a few of the neurons in these clusters throughout the training process, it can allow new forms of learning to emerge.
Why Dropout Works 2) Redundant Representations
Imagine you had a pretty good set of neurons that magically wire together to correctly identify a dog 90% of the time.

Throughout training, it might be difficult for the model to learn some other representation of a dog because learning that representation might somehow damage the super good one we’re relying on. When you train with dropout, every once in a while, you’ll turn off a very important section of the model, forcing the model to consider how the remaining perceptrons can be used to solve the problem.

This has the effect of encouraging multiple, slightly different representations of the problem within the neural network, allowing the model to still perform well when the odd perceptron is deactivated.
This is essentially the same as the idea of reduced co-adaptation, but in my mind, “reduced co-adaptation” applies to breaking the model out of a rut, while “redundant representations” applies to thinking of the model as many small learners working together within the model. This is purely conceptual, though, so feel free to reflect and build your own intuition.
Why Dropout Works 3) Noise
A simple but fundamental characteristic of dropout is that it’s random. The model has no way of knowing which neurons will deactivate. As a result, if you feed the same image multiple times, each instance of feeding the image into the model can look very different to the neural network.
This makes it harder for the neural network to rely on simplistic memorization because there is a fundamental randomness in the training process. Thus, the model is forced to learn more general characteristics of the problem.
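You can see this randomness directly with PyTorch’s torch.nn.Dropout layer (which we’ll use in the next section): feed the same input in twice during training and a different subset of values is zeroed out each time.

```python
import torch
import torch.nn as nn

dropout = nn.Dropout(p=0.5)
dropout.train()    # dropout is only active in training mode

x = torch.ones(1, 6)
print(dropout(x))  # a different random pattern of zeros each call
print(dropout(x))  # (surviving values are scaled by 1 / (1 - p), here 2.0)

dropout.eval()     # at evaluation time dropout is disabled
print(dropout(x))  # the input passes through unchanged
```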
Using Dropout in PyTorch
If you’re unfamiliar with PyTorch, I have an article that covers the subject in depth. The implementation section of this article will assume basic PyTorch knowledge:
Full code can be found here:
Let’s define a toy problem for us to play with. Here, I’m defining a function that takes in two inputs and results in a single output.
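As a hypothetical illustration of what such a setup might look like (the specific function, model, and dropout placement below are my own stand-ins, not necessarily the ones used in the full post):

```python
import torch
import torch.nn as nn

# A made-up function of two inputs with a single output
def toy_function(x1, x2):
    return torch.sin(x1) + torch.cos(x2)

# Generate some training data from it
inputs = torch.rand(1000, 2) * 6 - 3  # two inputs in [-3, 3]
targets = toy_function(inputs[:, 0], inputs[:, 1]).unsqueeze(1)

# A small network with dropout layers between the hidden layers
model = nn.Sequential(
    nn.Linear(2, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),  # randomly deactivates 20% of activations while training
    nn.Linear(64, 64),
    nn.ReLU(),
    nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

model.train()  # dropout active during training
# ... training loop as sketched earlier ...
model.eval()   # dropout disabled when evaluating the trained model
```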