Making the Back Propagation Routine

One of the more complicated aspects of a neural network is the back propagation routine. This is the routine that tunes the synapses (weights) of the network to optimize the system and reduce the error from the computed output to an ideal output.

For me, the difficulty in understanding this process was largely based on the technical terms and mathematical notations that are used to describe them, so I will not be using those in my article and I will simply go over my implementation of the back propagation routine from a theory and code perspective.

My test example is based off a very popular online article “A Step by Step Backpropagation Example” by Matt Mazur.

Matt’s article is widely referenced online with YouTube videos referencing the steps and examples as well as pretty good amount of comments on his blog dating back to it’s original posting date in 2015.

In order to prove out my back propagation method I used the structure, weights and ideals from his example (image below) and setup unit tests in my project to ensure that the values were coming out to about the right value. Utilizing floating point numbers results in less than exact numbers – so precision was an important part of validating the results.

Setting up my network with these specific layers, biases and inputs was a great way to make sure that as I refactor and enhance my system I will have proof that I have not broken something fundamental in it..

One of the struggles when I was working with the example images was just making sure that I had the proper weights associated with the proper neurons – something about the way they were illustrated in the image made me attribute the weights to the wrong neurons.

I spent a lot of time just comparing Matt’s examples to make sure that I interpreted the weights properly. I think that Matt struggled with that as well because there is an error in his example (that cost me a couple of days). All my math was adding up and only a couple of values were not lining up with his numbers. The weights in his example were going in opposite directions for w6 and w7, so I think he flipped the error value he used to compute those.

I have uploaded an Excel file with the math that is used in this example, and with it you can try and proof your work, and see if maybe I have made a mistake!

Here is what the file looks like.

One of the hardest things I had to grasp when working on implementing this routine based on this example was that there didn’t appear to be a pattern to the way the weights were identified in the step by step example. To find the contribution to the error for the weights in the last layer, a separate set of variables were being used than the first layers weights. To find a pattern I took his example and added another layer to it to see if I could find a pattern with the additional complexity, his example may have been too simple to find a pattern.

It took me a while, but what I broke it down to was:

  1. Get the Synapse (weight) target Neuron
  2. Get the Error relative to the target neuron’s output
  3. Multiply that by the derivative of output of the target neuron
  4. Multiple that by the derivative of the net of the source layer

I have made a particular focus to the target and source neuron in my bullet points above because it became very confusing in the code to differentiate things when everything was called “neuron”. Once I setup a source and a target neuron naming convention I was able to iron out a lot of my issues.

In my implementation I never use the existing weights value to identify the contribution of the preceding weights like Matt does in his example. In my routine everything is relative to the partial derivative of the error with respect to the output of a neuron. What that means is that when moving back through the network I always take it to the point of the target neurons output. Going from that output to a synapses contribution is just the derivative of the output of the target neuron and the derivative of that to the net input.

  • The net input is always the source neurons output.
  • The derivative of the output is the derivative of the activation function of the output value.

The pattern that I learned in this process was how to work backwards through functions. Derivatives are always described using gradient and slope and graphical illustrations, but I understood it a lot better when I realized that a derivative allows you to reverse through the network, and allow you to identify how much that specific network element (neuron input, neuron output, weight) contributed to the error. Because of the “chain rule” you can take all the previous contributions and use that to identify the additional contribution a previous synapse had on the error. Everything just builds on the previous values because the downstream contributions are part of the upstream contributions.

Understanding that aspect and really “knowing” that allowed me to finally put my back propagation routine together.

I have unit test for all the values in the step by step example, and just to make sure that my network can tune the synapses (weights) to the proper value and it can learn how to compute the proper output, I have a while loop iterating over as many examples as it needs to train the accuracy within a 0.00001 error rate and my network trains this simple example in ~40 ms.

With the network wired up, I can now run tests on my proposed architecture where I have a middle layer that is the output of the system – as an action, and the last layer being a goal that the system is optimized for.

I’ll keep you updated…