
2005-02-18

## Introduction

This is a derivation of a popular neural network training algorithm known as backpropagation. It is commonly used with feed-forward multilayer perceptrons.

To see backpropagation in action, check out the applet BackpropXOR in the CroftSoft Collection.

## Definitions and Identities

1. Sigmoid Function

$\sigma(x) = \frac{1}{1 + e^{-x}}$

2. Derivative of the Sigmoid Function

$\sigma'(x) = \sigma(x) * \left[1 - \sigma(x)\right]$
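As a quick sanity check, the sigmoid function and the derivative identity above can be sketched in a few lines of Python (the variable names here are mine, not from the derivation):

```python
import math

def sigmoid(x):
    """Sigmoid squashing function: 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """Derivative via the identity sigma'(x) = sigma(x) * (1 - sigma(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

# Compare the identity against a centered finite difference at x = 0.5.
eps = 1e-6
numeric = (sigmoid(0.5 + eps) - sigmoid(0.5 - eps)) / (2.0 * eps)
analytic = sigmoid_derivative(0.5)
```

The two estimates agree to many decimal places, which is a useful habit to carry into the gradient derivations below.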

## Simplest Case

This is a derivation of the backpropagation neural network training algorithm for the simplest case: a two-layer network with one neuron in each layer. In this derivation, all of the variables are scalars. Deriving the learning rule for the simplest case makes backpropagation easier to understand when vectors and matrices are introduced later.

s → v → b → σ → h → w → a → σ → r

A signal (s) is the input to the hidden neuron in the first layer. It is multiplied by the first-layer weight (v). The weighted input (b) is then squashed to fall between zero and one by the sigmoid function (σ). The output of the hidden neuron (h) is multiplied by the second-layer weight (w), and the weighted input (a) is fed to the squashing function of the output neuron. The difference between the desired response (d) and the actual response (r) of the output neuron is the error (e). The objective function (l) is the scaled square of the error (e).

1. Objective Function

$l = \frac{1}{2} * e^2$

2. Error Function

$e=d-r$

3. Response from Output Neuron

$r=\sigma \left(a\right)$

4. Weighted Input from Hidden Neuron to Output Neuron

$a=w*h$

5. Hidden Neuron Output

$h=\sigma \left(b\right)$

6. Weighted Input to Hidden Neuron

$b=v*s$
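The six definitions above can be read off directly as a forward pass. A minimal sketch, with made-up values for the signal, weights, and desired response:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values for the signal, weights, and desired response.
s, v, w, d = 1.0, 0.5, -0.3, 1.0

b = v * s          # weighted input to the hidden neuron
h = sigmoid(b)     # hidden neuron output
a = w * h          # weighted input to the output neuron
r = sigmoid(a)     # response from the output neuron
e = d - r          # error
l = 0.5 * e ** 2   # objective function
```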

## Output Layer

We seek to minimize the objective function (l) by modifying the weight (w) to the output layer. This is done by taking the partial derivative of the objective function with respect to the weight. The derivative is then used to modify the weight with a gradient descent algorithm.

1. Gradient for Output Layer Weight, Chain Rule

$\frac{\partial l\left(w\right)}{\partial w}=\frac{\partial l\left(e\right)}{\partial e}*\frac{\partial e\left(r\right)}{\partial r}*\frac{\partial r\left(a\right)}{\partial a}*\frac{\partial a\left(w\right)}{\partial w}$

2. Output Layer Weight Gradient, First Term

$\frac{\partial l\left(e\right)}{\partial e}=\frac{\partial \left(\frac{1}{2}*{e}^{2}\right)}{\partial e}=e=d-r$

3. Output Layer Weight Gradient, Second Term

$\frac{\partial e\left(r\right)}{\partial r}=\frac{\partial \left(d-r\right)}{\partial r}=-1$

4. Output Layer Weight Gradient, Third Term

$\frac{\partial r\left(a\right)}{\partial a}=\frac{\partial \sigma \left(a\right)}{\partial a}=\sigma \left(a\right)*\left(1-\sigma \left(a\right)\right)=r*\left(1-r\right)$

5. Output Layer Weight Gradient, Fourth Term

$\frac{\partial a\left(w\right)}{\partial w}=\frac{\partial \left(w*h\right)}{\partial w}=h$

6. Gradient for Output Layer Weight, Combined Terms

$\frac{\partial l\left(w\right)}{\partial w}=\left[d-r\right]*\left[-1\right]*\left[r*\left(1-r\right)\right]*\left[h\right]$
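One way to check the combined expression is to compare it against a finite-difference estimate of the gradient. A sketch, with hypothetical scalar values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def objective(w, s=1.0, v=0.5, d=1.0):
    """l = 0.5 * e^2 for the two-neuron network, as a function of w."""
    h = sigmoid(v * s)
    r = sigmoid(w * h)
    return 0.5 * (d - r) ** 2

w, s, v, d = -0.3, 1.0, 0.5, 1.0
h = sigmoid(v * s)
r = sigmoid(w * h)

# Combined terms: [d - r] * [-1] * [r * (1 - r)] * [h]
analytic = (d - r) * -1.0 * (r * (1.0 - r)) * h

# Centered finite-difference estimate of dl/dw.
eps = 1e-6
numeric = (objective(w + eps) - objective(w - eps)) / (2.0 * eps)
```

The analytic and numeric values agree to roughly ten decimal places.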

## Hidden Layer

We also seek to minimize the objective function (l) by modifying the weight (v) to the hidden layer. The derivations of the first three terms in the gradient for the hidden layer weight are skipped, as they are identical to the first three terms in the gradient for the output layer weight derived above.

1. Gradient for Hidden Layer Weight, Chain Rule

$\frac{\partial l\left(v\right)}{\partial v}=\frac{\partial l\left(e\right)}{\partial e}*\frac{\partial e\left(r\right)}{\partial r}*\frac{\partial r\left(a\right)}{\partial a}*\frac{\partial a\left(h\right)}{\partial h}*\frac{\partial h\left(b\right)}{\partial b}*\frac{\partial b\left(v\right)}{\partial v}$

2. Hidden Layer Weight Gradient, Fourth Term

$\frac{\partial a\left(h\right)}{\partial h}=\frac{\partial \left(w*h\right)}{\partial h}=w$

3. Hidden Layer Weight Gradient, Fifth Term

$\frac{\partial h\left(b\right)}{\partial b}=\frac{\partial \sigma \left(b\right)}{\partial b}=\sigma \left(b\right)*\left(1-\sigma \left(b\right)\right)=h*\left(1-h\right)$

4. Hidden Layer Weight Gradient, Sixth Term

$\frac{\partial b\left(v\right)}{\partial v}=\frac{\partial \left(v*s\right)}{\partial v}=s$

5. Gradient for Hidden Layer Weight, Combined Terms

$\frac{\partial l\left(v\right)}{\partial v}=\left[d-r\right]*\left[-1\right]*\left[r*\left(1-r\right)\right]*\left[w\right]*\left[h*\left(1-h\right)\right]*\left[s\right]$
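The six-term product for the hidden layer weight can be checked the same way as the output layer gradient, again with hypothetical scalar values:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def objective(v, s=1.0, w=-0.3, d=1.0):
    """l = 0.5 * e^2 for the two-neuron network, as a function of v."""
    h = sigmoid(v * s)
    r = sigmoid(w * h)
    return 0.5 * (d - r) ** 2

v, s, w, d = 0.5, 1.0, -0.3, 1.0
h = sigmoid(v * s)
r = sigmoid(w * h)

# Combined terms: [d - r] * [-1] * [r * (1 - r)] * [w] * [h * (1 - h)] * [s]
analytic = (d - r) * -1.0 * (r * (1.0 - r)) * w * (h * (1.0 - h)) * s

# Centered finite-difference estimate of dl/dv.
eps = 1e-6
numeric = (objective(v + eps) - objective(v - eps)) / (2.0 * eps)
```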

We noted that the first three terms of the gradients for the output layer and the hidden layer are the same. The negative product of these three terms is the local gradient. The local gradient can be computed for the output layer and then reused for the hidden layer.

1. Local Gradient for Output Neuron

${\delta }_{o}=-\frac{\partial l\left(a\right)}{\partial a}=-\frac{\partial l\left(e\right)}{\partial e}*\frac{\partial e\left(r\right)}{\partial r}*\frac{\partial r\left(a\right)}{\partial a}=e*r*\left(1-r\right)$

2. Output Layer Gradient Abbreviated

$\frac{\partial l\left(w\right)}{\partial w}=-{\delta }_{o}*h$

3. Hidden Layer Gradient Abbreviated

$\frac{\partial l\left(v\right)}{\partial v}=-{\delta }_{o}*w*h*\left(1-h\right)*s$
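The abbreviations amount to computing the local gradient once and reusing it for both weight gradients. A sketch showing that the abbreviated forms agree with the fully expanded combined-terms forms (values are hypothetical):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values for the signal, weights, and desired response.
s, v, w, d = 1.0, 0.5, -0.3, 1.0
h = sigmoid(v * s)
r = sigmoid(w * h)
e = d - r

delta_o = e * r * (1.0 - r)  # local gradient for the output neuron

# Abbreviated gradients, reusing delta_o.
grad_w = -delta_o * h
grad_v = -delta_o * w * h * (1.0 - h) * s

# Fully expanded combined-terms forms from the earlier sections.
grad_w_full = (d - r) * -1.0 * (r * (1.0 - r)) * h
grad_v_full = (d - r) * -1.0 * (r * (1.0 - r)) * w * (h * (1.0 - h)) * s
```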

## Backpropagation

The local gradient shows how the objective function decreases as the weighted input to the neuron increases. The local gradient for an output neuron depends on the error (e). The local gradients for hidden layer neurons, however, depend on the local gradients of the following layer. Thus the computation of the local gradients must be propagated from the back layer of the network to the front. The weight gradients can then be expressed as functions of the local gradients and the unweighted inputs.

1. Local Gradient for Hidden Neuron

${\delta }_{h}=-\frac{\partial l\left(b\right)}{\partial b}=-\frac{\partial l\left(e\right)}{\partial e}*\frac{\partial e\left(r\right)}{\partial r}*\frac{\partial r\left(a\right)}{\partial a}*\frac{\partial a\left(h\right)}{\partial h}*\frac{\partial h\left(b\right)}{\partial b}$

2. Substituting the Local Gradient for the Output Neuron

${\delta }_{h}={\delta }_{o}*\frac{\partial a\left(h\right)}{\partial h}*\frac{\partial h\left(b\right)}{\partial b}$

3. The Pattern for Hidden Layers

${\delta }_{h}={\delta }_{o}*w*h*\left(1-h\right)$

4. Hidden Layer Gradient Abbreviated Further

$\frac{\partial l\left(v\right)}{\partial v}=-{\delta }_{h}*s$

Now that we have the gradients of the objective function with respect to the weights in terms of the local gradients, we can incrementally decrease the objective function, and therefore the overall error, by shifting the weights.

1. Weight Update Rule

$w\left(t+1\right)=w\left(t\right)+\Delta w$

2. Weight Change from Gradient Descent

$\Delta w=-\eta *\frac{\partial l\left(w\right)}{\partial w}$

3. A Negative Becomes a Positive

$\Delta w=\eta *{\delta }_{o}*h$

4. The Weight Update Rule Revised

$w\left(t+1\right)=w\left(t\right)+\eta *{\delta }_{o}*h$

5. Similar Pattern for Hidden Weights

$v\left(t+1\right)=v\left(t\right)+\eta *{\delta }_{h}*s$
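Putting the update rules together gives a complete, if tiny, training loop for the two-neuron network. The input, target, initial weights, and learning rate below are all arbitrary choices for illustration:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical training setup: one input/target pair and a learning rate.
s, d = 1.0, 0.8
v, w = 0.1, 0.1
eta = 0.5

for _ in range(5000):
    # Forward pass.
    b = v * s
    h = sigmoid(b)
    a = w * h
    r = sigmoid(a)
    e = d - r
    # Backward pass: local gradient for the output neuron first,
    # then propagated back to the hidden neuron.
    delta_o = e * r * (1.0 - r)
    delta_h = delta_o * w * h * (1.0 - h)
    # Weight updates: w(t+1) = w(t) + eta * delta_o * h, and similarly for v.
    w += eta * delta_o * h
    v += eta * delta_h * s
```

After training, the response r has moved close to the desired response d, which is the whole point of the algorithm: repeated small steps down the gradient of the objective function.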