Machine Learning algorithms are breaking down barriers everywhere. Increasingly, terms such as “Generative AI” dominate newspaper headlines.
Meanwhile, amid this earthquake, a piece of news has gone almost unnoticed: Geoffrey Hinton (born in Wimbledon, UK) has resigned from his privileged position at Google.
This article is a tribute to Geoffrey Hinton and his key contribution to what Machine Learning is nowadays: the back-propagation algorithm.
The back-propagation algorithm.
But what is the back-propagation algorithm? And why is it so crucial to the current success of Machine Learning?
In my article “How Machine Learning learns”, published on this amazing platform, we dove into the 4 main building blocks of every single Machine Learning algorithm. Let’s recall them in the following list (a minimal code sketch of the training loop follows the list):
- Forward pass: we feed a batch of our training data into our model and get an output. Then, we compare the result of our model with the actual data using a loss function.
- Backward pass: this is the step that computes the gradients that allow the loss function to be minimized.
- Optimization: here we modify the weights and biases of our model in order to bring our results closer to the actual data and, therefore, minimize the loss.
- Repetition: we repeat the algorithm as many times as the number of epochs defines.
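To make these four building blocks concrete, here is a minimal training-loop sketch in Python using PyTorch. It assumes a model, a loss function, an optimizer and a DataLoader already exist; the names are purely illustrative, not a prescription.

```python
# A minimal sketch of the four building blocks, assuming `model`, `loss_fn`,
# `optimizer` and `train_loader` have already been defined elsewhere.
def train(model, loss_fn, optimizer, train_loader, num_epochs=10):
    for epoch in range(num_epochs):               # 4) Repetition: one loop per epoch
        for batch_x, batch_y in train_loader:
            predictions = model(batch_x)          # 1) Forward pass: feed the batch to the model
            loss = loss_fn(predictions, batch_y)  #    ...and compare its output with the actual data
            optimizer.zero_grad()                 # reset gradients accumulated in the previous step
            loss.backward()                       # 2) Backward pass: back-propagation computes the gradients
            optimizer.step()                      # 3) Optimization: update weights and biases to reduce the loss
```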
Today, we’ll walk through the math that underlies the back-propagation algorithm by solving the forward pass and the backward pass of a pretty simple computation graph.
The computation graph.
We can define a computation graph as the representation of a series of mathematical functions, along with the dependencies among them.
It’s key to highlight the last part of the sentence: “among them”. Neural networks owe their powerful results to their capacity to combine the outputs of neurons as inputs to other neurons.
In a computation graph, three main components must be defined:
- Nodes: Each node represents a mathematical operation or function, such as addition, multiplication, or more complex operations like neural network layers or activation functions.
- Edges: They connect the nodes to represent the flow of data or output from one node to the input of another node or nodes.
- Inputs and Outputs: The input data and the final output are typically represented as nodes without incoming or outgoing edges, respectively.
For the sake of clarity, let’s define a simple computation graph with just three nodes and three features, as shown in the following picture:
Our computation graph with three nodes and three features.
Let’s take a glance at these three nodes, pointing out their inputs and outputs:
- Node 1 receives features b and c as inputs and returns u as output.
- Node 2 receives feature a and previous output u as input and returns v as output.
- Node 3 receives previous output v as input and returns j as final output.
We can observe that a computation graph is composed of a sequence of calculations, where the outputs of the previous nodes serve as the inputs of the following ones.
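To keep the idea tangible, we can sketch these three nodes in Python. The exact operation each node performs is defined in the figure above, so the functions below (u = b·c, v = a·u, J = 3·v) are only assumptions I make for illustration purposes.

```python
# Illustrative node functions; the real operations are defined in the figure.
def node_1(b, c):   # Node 1: receives features b and c, returns u
    return b * c

def node_2(a, u):   # Node 2: receives feature a and the previous output u, returns v
    return a * u

def node_3(v):      # Node 3: receives the previous output v, returns the final output J
    return 3 * v
```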
This graph is pretty small. If we want to scale it up, it makes sense to mathematically formalize the way we calculate its final output. And here is the math tool that does this work: function composition.
The function composition.
What is function composition? We can quote the definition from Wikipedia (https://en.wikipedia.org/wiki/Function_composition) as follows:
“In mathematics, function composition is an operation ∘ that takes two functions f and g, and produces a function h = g ∘ f such that h(x) = g(f(x)).”
And here comes the key takeaway provided by Wikipedia:
“In this operation, the function g is applied to the result of applying the function f to x.”
This powerful property of function composition allows us to sequentially chain all the nodes we need and, if we are building neural networks, function composition is the math tool that scales them up to deep neural networks.
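As a tiny illustration of the definition quoted above, this is how the composition h = g ∘ f looks in Python (f and g are arbitrary example functions):

```python
# Function composition in Python: g ∘ f, so that h(x) = g(f(x)).
def compose(g, f):
    return lambda x: g(f(x))

f = lambda x: x + 1
g = lambda x: 2 * x
h = compose(g, f)   # h(x) = g(f(x)) = 2 * (x + 1)

print(h(3))         # prints 8
```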
We can note down our computation graph as a sequence of function compositions such as:
Our computation graph as sequence of function compositions
Function composition is always associative. Therefore, there are several ways to solve our computation graph. Let me note down the most straightforward way to do it.
Solving our computation graph.
As you might have figured out, solving the computation graph is the task that Machine Learning algorithms do in the forward pass.
Our extremely powerful math tool, function composition, allows us to build graphs as complex as we want: from the simplest forward pass of a linear regression to the Transformer graphs ChatGPT has in its engine room.
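Sticking to the illustrative node functions sketched earlier, solving the graph in the forward direction is just a matter of evaluating the nodes in order. The value a = 5 is the one used later in the article; b = 3 and c = 2 are arbitrary values I pick here just to run the example.

```python
# Forward pass over the graph, reusing the illustrative node functions above.
# a = 5 comes from the article; b = 3 and c = 2 are made up for this sketch.
a, b, c = 5, 3, 2

u = node_1(b, c)   # u = b * c = 6
v = node_2(a, u)   # v = a * u = 30
J = node_3(v)      # J = 3 * v = 90
print(J)           # 90
```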
Our goal.
Our main goal is to calculate the derivative of the graph function that yields the final output, but with respect to the variables fed into the first node of our graph.
Therefore, we want to calculate the following derivatives to obtain the gradient of the final function J with respect to b and c:
Our target: the gradient of function J with respect to variables b and c.
But, reviewing function J, variables b and c don’t appear explicitly. Here is where the back-propagation algorithm enters the scene.
The back-propagation algorithm.
So we can define the back-propagation algorithm as the sequential calculation of the successive derivatives at every node of the graph, beginning at the last node and finishing at the first one. Then we apply the amazing math tool of the chain rule to obtain the derivatives of the last node’s output with respect to the variables of the first node.
If you got lost with the definition given above, let’s break it down, walking through the steps the back-propagation algorithm performs.
First step: the derivative of function J with respect to variable v:
First step of the backward propagation algorithm.
We calculate the derivative:
First derivative in Node 3.
Second step: the derivative of function V with respect to u:
The second step of the backward propagation algorithm.
We calculate the derivative:
The derivative of function V with respect to u.
Remember, variable a is a feature and we have its value, 5.
Third step: the derivative of function U with respect to b.
The third and final step of the backward propagation algorithm over our computation graph.
We calculate the derivative:
The derivative of function U with respect to b.
We put it all together with the chain rule and here comes the partial derivative we were looking for:
The partial derivative of J with respect to b
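Under the illustrative node functions assumed earlier, the chain rule for ∂J/∂b boils down to multiplying the three local derivatives computed in the steps above. The real formulas live in the figures; the numbers below only follow my assumed example.

```python
# Chain rule for dJ/db under the illustrative functions u = b*c, v = a*u, J = 3*v
# and the example values a = 5, b = 3, c = 2.
dJ_dv = 3   # local derivative at Node 3: dJ/dv
dv_du = 5   # local derivative at Node 2: dv/du = a
du_db = 2   # local derivative at Node 1: du/db = c

dJ_db = dJ_dv * dv_du * du_db   # chain rule: 3 * 5 * 2 = 30
```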
Let’s finish our third step by calculating the partial derivative of function U with respect to variable c:
The partial derivative of function U with respect to c.
We again put it all together with the chain rule to get the partial derivative of function J with respect to c.
The partial derivative of J with respect to c
We conclude our work by building the gradient of function J with respect to b and c:
Our final gradient, a vector with two components.
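If you want to double-check the result, automatic differentiation frameworks perform exactly this backward walk for us. Here is a sketch with PyTorch, still under the illustrative node functions and example values assumed above:

```python
import torch

# Sanity check of the gradient with automatic differentiation, still assuming
# the illustrative functions u = b*c, v = a*u, J = 3*v and the values a=5, b=3, c=2.
a = torch.tensor(5.0)
b = torch.tensor(3.0, requires_grad=True)
c = torch.tensor(2.0, requires_grad=True)

u = b * c
v = a * u
J = 3 * v

J.backward()            # back-propagation through the graph
print(b.grad, c.grad)   # tensor(30.) tensor(45.)  ->  the gradient (dJ/db, dJ/dc)
```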
So, we’ve managed to walk through our computation graph in the forward direction and then figure out the gradient of the final function with respect to the initial variables using the back-propagation algorithm.
Maybe the real power of this algorithm is dawning on you. The back-propagation algorithm can be generalized to the most complex computation graph you need. It’s all about doing more steps and finally applying the chain rule to put all the derivatives together.
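For a purely sequential chain of nodes like ours, that generalization can be sketched in a few lines of Python: compute each node’s local derivative and multiply them together while walking the graph backwards. Branching graphs also need to sum the contributions of every path, which this toy sketch deliberately ignores.

```python
# Generic sketch of reverse accumulation over a chain of nodes: given the local
# derivative of each node's output with respect to its input, the chain rule is
# just a running product taken from the last node back to the first.
def backpropagate(local_derivatives):
    """local_derivatives: list ordered from the first node to the last node."""
    gradient = 1.0
    for local_derivative in reversed(local_derivatives):
        gradient *= local_derivative
    return gradient

# Our three-step example: du/db = 2, dv/du = 5, dJ/dv = 3 (illustrative values)
print(backpropagate([2, 5, 3]))   # 30
```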
About the author:
At Somosierra Tech (www.somosierratech.com), Pedro Anquela applies Machine Learning to enhance clients’ business workflows, driving efficiency and innovation.
He also teaches this amazing technology at Universidad Francisco de Vitoria (www.ufv.es).
You can connect with him at https://www.linkedin.com/in/pedro-anquela-0944531/