Neural Networks — Intuitively and Exhaustively Explained
An in-depth exploration of the most fundamental architecture in modern AI

In this article we’ll form a thorough understanding of the neural network, a cornerstone technology underpinning virtually all cutting-edge AI systems. We’ll first explore neurons in the human brain, and then see how they formed the fundamental inspiration for neural networks in AI. We’ll then explore back-propagation, the algorithm used to train neural networks to do cool stuff. Finally, after forging a thorough conceptual understanding, we’ll implement a neural network ourselves from scratch and train it to solve a toy problem.
Who is this useful for? Anyone who wants to form a complete understanding of the state of the art of AI.
How advanced is this post? This article is designed to be accessible to beginners, and also contains thorough information which may serve as a useful refresher for more experienced readers.
Pre-requisites: None
Inspiration From the Brain
Neural networks take direct inspiration from the human brain, which is made up of billions of incredibly complex cells called neurons.

The process of thinking within the human brain is the result of communication between neurons. You might receive stimulus in the form of something you saw, then that information is propagated to neurons in the brain via electrochemical signals.
The first neurons in the brain receive that stimulus, then each neuron may choose whether or not to “fire” based on how much stimulus it received. “Firing”, in this case, is a neuron’s decision to send signals to the neurons it’s connected to.
Then the neurons which those neurons are connected to may or may not choose to fire.

Thus, a “thought” can be conceptualized as a large number of neurons choosing to, or not to fire based on the stimulus from other neurons.
As one navigates the world, one might have certain thoughts more often than another person does. A cellist might use some neurons more than a mathematician, for instance.
When we use certain neurons more frequently, the connections between them become stronger, increasing the intensity of those connections. When we don’t use certain neurons, those connections weaken. This general rule has inspired the phrase “Neurons that fire together, wire together”, and is the high-level quality of the brain that is responsible for the learning process.
I’m not a neurologist, so of course this is a tremendously simplified description of the brain. However, it’s enough to understand the fundamental idea of a neural network.
The Intuition of Neural Networks
Neural networks are, essentially, a mathematically convenient and simplified version of neurons within the brain. A neural network is made up of elements called “perceptrons”, which are directly inspired by neurons.
Perceptrons take in data, like a neuron does,

aggregate that data, like a neuron does,

then output a signal based on the input, like a neuron does.
A neural network can be conceptualized as a big network of these perceptrons, just like the brain is a big network of neurons.
When a neuron in the brain fires, it does so as a binary decision. Or, in other words, neurons either fire or they don’t. Perceptrons, on the other hand, don’t “fire” per se, but output a range of numbers based on the perceptron’s input.
Neurons within the brain can get away with their relatively simple binary inputs and outputs because thoughts exist over time. Neurons essentially pulse at different rates, with slower and faster pulses communicating different information.
So, neurons have simple inputs and outputs in the form of on or off pulses, but the rate at which they pulse can communicate complex information. Perceptrons only see an input once per pass through the network, but their input and output can be a continuous range of values. If you’re familiar with electronics, you might reflect on how this is similar to the relationship between digital and analogue signals.
The way the math for a perceptron actually shakes out is pretty simple. A standard neural network consists of a bunch of weights connecting the perceptrons of different layers together.
You can calculate the value of a particular perceptron by adding up all the inputs, multiplied by their respective weights.

Many Neural Networks also have a “bias” associated with each perceptron, which is added to the sum of the inputs to calculate the perceptron’s value.

Calculating the output of a neural network, then, is just doing a bunch of addition and multiplication to calculate the value of all the perceptrons.
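As a small sketch of that arithmetic, here’s the value of a single perceptron computed in NumPy. All the numbers here are hypothetical, chosen just to illustrate the weighted sum:

```python
import numpy as np

# Hypothetical example: a perceptron with three inputs
inputs = np.array([0.5, -1.0, 2.0])   # outputs of the previous layer
weights = np.array([0.1, 0.4, -0.2])  # one weight per input connection
bias = 0.3                            # the perceptron's bias

# The perceptron's value: the weighted sum of its inputs, plus its bias
value = np.dot(inputs, weights) + bias
print(value)  # 0.5*0.1 + (-1.0)*0.4 + 2.0*(-0.2) + 0.3 ≈ -0.45
```

The dot product is just the “multiply each input by its weight and add them up” step written compactly.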
Sometimes data scientists refer to this general operation as a “linear projection”, because we’re mapping an input into an output via linear operations (addition and multiplication). One problem with this approach is that, even if you daisy-chain a billion of these layers together, the resulting model will still only express a linear relationship between the input and output, because it’s all just addition and multiplication.
This is a serious problem because not all relationships between an input and output are linear. To get around this, data scientists employ something called an “activation function”. These are non-linear functions which can be injected throughout the model to, essentially, sprinkle in some non-linearity.

By interweaving non-linear activation functions between linear projections, neural networks are capable of learning very complex functions.

There are many activation functions in AI, but the industry has largely converged on three popular ones: ReLU, Sigmoid, and Softmax, which are used in a variety of different applications. Out of all of them, ReLU is the most common due to its simplicity and to how well networks built with it can approximate almost any other function.
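As a sketch, all three of these activation functions can be written in a few lines of NumPy each. These are the standard textbook definitions, not code from this article:

```python
import numpy as np

def relu(x):
    # Zeroes out negative values, passes positive values through unchanged
    return np.maximum(0, x)

def sigmoid(x):
    # Squashes any real number into the range (0, 1)
    return 1 / (1 + np.exp(-x))

def softmax(x):
    # Turns a vector of scores into probabilities that sum to 1
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-1.0, 0.0, 2.0])
print(relu(x))            # [0. 0. 2.]
print(sigmoid(0))         # 0.5
print(softmax(x).sum())   # 1.0
```

Each takes the linear output of a layer and bends it non-linearly before the next layer sees it.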

So, that’s the essence of how AI models make predictions. It’s a bunch of addition and multiplication with some nonlinear functions sprinkled in between.
Another defining characteristic of neural networks is that they can be trained to be better at solving a certain problem, which we’ll explore in the next section.
Back Propagation
One of the fundamental ideas of AI is that you can “train” a model. This is done by asking a neural network (which starts its life as a big pile of random numbers) to do some task. Then, you somehow update the model based on how the model’s output compares to a known good answer.

For this section, let’s imagine a neural network with an input layer, a hidden layer, and an output layer.

Each of these layers are connected together with, initially, completely random weights.
And we’ll use a ReLU activation function on our hidden layer.
Let’s say we have some training data, in which the desired output is the average value of the input.
And we pass an example of our training data through the model, generating a prediction.

To make our neural network better at the task of calculating the average of the input, we first compare the predicted output to what our desired output is.

Now that we know that the output should increase in size, we can look back through the model to calculate how our weights and biases might change to promote that increase.
First, let’s look at the weights leading immediately into the output: w₇, w₈, w₉. Because the output of the third hidden perceptron was -0.46, the activation from ReLU was 0.00.
As a result, there’s no change to w₉ that could bring us closer to our desired output, because every value of w₉ would result in a change of zero in this particular example.
The second hidden neuron, however, does have an activated output which is greater than zero, and thus adjusting w₈ will have an impact on the output for this example.
The way we actually calculate how much w₈ should change is by multiplying how much the output should change, times the input to w₈.

The easiest explanation of why we do it this way is “because calculus”, but if we look at how all weights get updated in the last layer, we can form a fun intuition.
Notice how the two perceptrons that “fire” (have an output greater than zero) are updated together. Also, notice how the stronger a perceptron’s output is, the more its corresponding weight is updated. This is somewhat similar to the idea that “Neurons that fire together, wire together” within the human brain.
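This update rule can be sketched numerically. The numbers below are hypothetical stand-ins (the article doesn’t list every activation), but they show how a weight fed by a zero activation gets no update:

```python
# Hypothetical numbers: the output should change by 0.3, and these are
# the activated outputs of the three hidden perceptrons feeding it
delta_output = 0.3
hidden_activations = [0.8, 1.2, 0.0]  # the third perceptron's ReLU output is zero

# Each weight's proposed change: (input to the weight) * (desired output change)
deltas = [a * delta_output for a in hidden_activations]
print(deltas)  # the weight fed by the zero activation gets no update
```

The stronger an activation, the larger the proposed change to its weight, which is exactly the “fire together, wire together” flavor described above.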
Calculating the change to the output bias is super easy. In fact, we’ve already done it. Because the bias is added directly to the perceptron’s output, the change in the bias is just the desired change in the output. So, Δb₄ = 0.3.
Now that we’ve calculated how the weights and bias of the output perceptron should change, we can “back propagate” our desired change in output through the model. Let’s start with back propagating so we can calculate how we should update w₁.
First, we calculate how the activated output of the first hidden neuron should change. We do that by multiplying the desired change in the output by w₇.

For values that are greater than zero, ReLU simply multiplies those values by 1. So, for this example, the change we want in the un-activated value of the first hidden neuron is equal to the desired change in its activated output.

Recall that we calculated how to update w₇ by multiplying its input by the desired change in its output. We can do the same thing to calculate the change in w₁.

It’s important to note, we’re not actually updating any of the weights or biases throughout this process. Rather, we’re taking a tally of how we should update each parameter, assuming no other parameters are updated.
So, we can do those calculations to calculate all parameter changes.
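Putting the pieces together, here’s a sketch of tallying every parameter change for a toy network like the one described above: two inputs, three ReLU hidden perceptrons, one output, with the target being the average of the inputs. The weights and inputs are randomly generated stand-ins, not the article’s actual numbers:

```python
import numpy as np

def relu(x):
    # ReLU activation: zeroes out negative values
    return np.maximum(0, x)

# Hypothetical toy network: 2 inputs -> 3 hidden perceptrons (ReLU) -> 1 output
np.random.seed(0)
W1 = np.random.uniform(-1, 1, size=(2, 3))  # weights w1..w6
b1 = np.zeros(3)
W2 = np.random.uniform(-1, 1, size=(3, 1))  # weights w7..w9
b2 = np.zeros(1)

x = np.array([0.2, 0.8])
target = x.mean()  # the desired output is the average of the inputs

# Forward pass, keeping intermediate values for back propagation
z1 = x @ W1 + b1   # un-activated hidden values
a1 = relu(z1)      # activated hidden values
y = a1 @ W2 + b2   # the network's prediction

# How much the output should change
delta_out = target - y

# Tally the proposed change for every parameter
dW2 = np.outer(a1, delta_out)   # input to each weight times the output change
db2 = delta_out
delta_a1 = W2 @ delta_out       # back-propagate the change through w7..w9
delta_z1 = delta_a1 * (z1 > 0)  # ReLU only passes the change where z1 > 0
dW1 = np.outer(x, delta_z1)
db1 = delta_z1
print(dW1.shape, dW2.shape)  # (2, 3) (3, 1)
```

Note that nothing is applied to the weights yet; `dW1`, `db1`, `dW2`, and `db2` are just the tally of proposed changes for this one example.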

A fundamental idea of back propagation is called “Learning Rate”, which concerns the size of the changes we make to neural networks based on a particular batch of data. To explain why this is important, I’d like to use an analogy.
Imagine you went outside one day, and everyone wearing a hat gave you a funny look. You probably don’t want to jump to the conclusion that wearing hat = funny look, but you might be a bit skeptical of people wearing hats. After three, four, five days, a month, or even a year, if it seems like the vast majority of people wearing hats are giving you a funny look, you may start considering that a strong trend.
Similarly, when we train a neural network, we don’t want to completely change how the neural network thinks based on a single training example. Rather, we want each batch to only incrementally change how the model thinks. As we expose the model to many examples, we would hope that the model would learn important trends within the data.
After we’ve calculated how each parameter should change as if it were the only parameter being updated, we can multiply all those changes by a small number, like 0.001, before applying those changes to the parameters. This small number is commonly referred to as the “learning rate”, and the exact value it should have depends on the model we’re training. This effectively scales down our adjustments before applying them to the model.
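In code, applying a tallied change with a learning rate is a one-liner. The weight and the change below are hypothetical numbers:

```python
# Hypothetical numbers: one weight and the full change we tallied for it
learning_rate = 0.001
weight = 0.5
proposed_change = 0.36

# Scale the change down by the learning rate before applying it
weight += learning_rate * proposed_change
print(weight)  # ≈ 0.50036
```

Each parameter nudges only slightly per example, so important trends accumulate over many examples while one-off flukes wash out.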
At this point we’ve covered pretty much everything one needs to know to implement a neural network. Let’s give it a shot!
Implementing a Neural Network from Scratch
Typically, a data scientist would just use a library like PyTorch to implement a neural network in a few lines of code, but we’ll be defining a neural network from the ground up using NumPy, a numerical computing library.
First, let’s start with a way to define the structure of the neural network.
"""Blocking out the structure of the Neural Network
"""
import numpy as np
class SimpleNN:
def __init__(self, architecture):
self.architecture = architecture
self.weights = []
self.biases = []
# Initialize weights and biases
np.random.seed(99)
for i in range(len(architecture) - 1):
self.weights.append(np.random.uniform(
low=-1, high=1,
size=(architecture[i], architecture[i+1])
))
self.biases.append(np.zeros((1, architecture[i+1])))
architecture = [2, 64, 64, 64, 1] # Two inputs, two hidden layers, one output
model = SimpleNN(architecture)
print('weight dimensions:')
for w in model.weights:
print(w.shape)
print('\nbias dimensions:')
for b in model.biases:
print(b.shape)
While we typically draw neural networks as a dense web, in reality we represent the weights of their connections as matrices. This is convenient because matrix multiplication, then, is equivalent to passing data through a neural network.

We can make our model make a prediction based on some input by passing the input through each layer.
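As a sketch of what that forward pass might look like: the standalone `forward` function and the tiny layer sizes below are illustrative assumptions, not necessarily the article’s eventual implementation, but they show the core loop of matrix multiply, add bias, then activate:

```python
import numpy as np

def relu(x):
    # ReLU activation: zeroes out negative values
    return np.maximum(0, x)

def forward(x, weights, biases):
    # Pass the input through each layer: linear projection, then ReLU.
    # The final layer is left linear so the output can be any value.
    for i, (W, b) in enumerate(zip(weights, biases)):
        x = x @ W + b
        if i < len(weights) - 1:
            x = relu(x)
    return x

# Hypothetical tiny network: 2 inputs -> 3 hidden perceptrons -> 1 output
np.random.seed(0)
weights = [np.random.uniform(-1, 1, size=(2, 3)),
           np.random.uniform(-1, 1, size=(3, 1))]
biases = [np.zeros((1, 3)), np.zeros((1, 1))]

x = np.array([[0.25, 0.75]])
print(forward(x, weights, biases).shape)  # (1, 1)
```

Each matrix multiplication carries the data across one set of connections, which is exactly the web of weights from the diagrams, written compactly.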