• the algorithm was introduced in the 1970s, but its importance wasn't fully appreciated until the famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
  • "workhorse of learning in neural networks"
  • at the heart of it is an expression that tells us how quickly the cost function changes when we change the weights and biases.
[figure /projects/ml/dl/neural-nets/ch2/activations.svg: activation diagram of a single neuron in matrix notation]

notation

  • \(w_{jk}^l\) means the weight of the connection from the k\(^{th}\) neuron in layer \(l-1\) to the j\(^{th}\) neuron in layer \(l\)

    • similarly, \(b_j^l\) is the j\(^{th}\) bias in the l\(^{th}\) layer
    • and \(a_j^l\) is the j\(^{th}\) activation in the l\(^{th}\) layer

obviously the activation \(a_j^l\) is computed with:

\begin{equation} \label{eq:ac_sum} a_j^l = \sigma(\sum_k w_{jk}^l a_k^{l-1} + b_j^l) \end{equation}

this says that the activation of any given neuron in layer \(l\) is the sigmoid of the weighted sum of all activations from the previous layer, plus the bias.

we can then vectorise this:

\begin{equation} \label{eq:ac_vec} a^l = \sigma(w^la^{l-1} + b^l) \end{equation}

the 'juicy bit' inside the sigmoid above is worth naming: \(z^l = w^la^{l-1} + b^l\), the weighted input to layer \(l\).
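as a quick sanity check, here's a minimal numpy sketch of the vectorised feedforward step for a single layer (the sizes and random values are made-up illustration, not anything from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy layer: 3 inputs feeding 2 neurons
rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3))       # w[j, k]: weight from neuron k in layer l-1 to neuron j in layer l
b = rng.standard_normal((2, 1))       # one bias per neuron in layer l
a_prev = rng.standard_normal((3, 1))  # activations a^{l-1} of the previous layer

z = w @ a_prev + b   # the weighted input z^l -- the 'juicy bit'
a = sigmoid(z)       # the activations a^l = sigma(z^l)
```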

cost function

as before, we have:

\begin{equation} \label{eq:cost} C = \cfrac{1}{2n}\sum_x \|y(x) -a^L(x) \|^2 \end{equation}

we make two assumptions about this cost function.

  1. the cost function can be written as an average \(C = \frac{1}{n}\sum_x C_x\) over cost functions \(C_x\) for individual training examples, \(x\).
  2. the cost function can be written as a function of the outputs from the neural network.
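the quadratic cost above satisfies both: for a single training example \(x\) we can take

\[ C_x = \tfrac{1}{2}\|y(x) - a^L(x)\|^2, \qquad C = \frac{1}{n}\sum_x C_x \]

assumption 1 matters because backprop actually computes \(\partial C_x/\partial w\) and \(\partial C_x/\partial b\) for a single example, and we recover \(\partial C/\partial w\) by averaging; assumption 2 holds because, with \(x\) fixed, \(C_x\) depends only on the output activations \(a^L\).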

hadamard product

it's just the element-wise multiplication of vectors.

good matrix libraries provide fast implementations of the Hadamard product, since it's just element-wise multiplication (much cheaper than a full matrix multiplication), and that comes in handy when implementing back-prop.
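for example, in numpy the Hadamard product is just `*` on arrays (versus `@`/`np.dot` for matrix multiplication):

```python
import numpy as np

s = np.array([1, 2, 3])
t = np.array([4, 5, 6])

print(s * t)   # element-wise (Hadamard) product: [ 4 10 18]
print(s @ t)   # ordinary dot product, for comparison: 32
```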

four fundamental equations

\[\require{bbox}\require{color}\definecolor{myred}{RGB}{128,0,0} \fcolorbox{red}{lightblue}{$\begin{align} \color{myred}{\boldsymbol{\delta^L}} &\color{myred}{\boldsymbol{= \nabla_a C \odot \sigma'(z^L)}} \label{eq:bp1}\\ \color{myred}{\boldsymbol{\delta^l}} &\color{myred}{\boldsymbol{= ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)}} \label{eq:bp2}\\ \color{myred}{\boldsymbol{\cfrac{\partial C}{\partial b_j^l}}} &\color{myred}{\boldsymbol{= \delta_j^l}} \label{eq:bp3}\\ \color{myred}{\boldsymbol{\cfrac{\partial C}{\partial w_{jk}^l}}} &\color{myred}{\boldsymbol{= a_k^{l-1} \delta_j^l}} \label{eq:bp4} \end{align}$}\]

where:

  • \(\delta_j^l = \cfrac{\partial C}{\partial z_j^l}\) is the error of neuron j in layer l.
  • L is the last layer
  • and then, component-wise, \(\delta_j^L = \cfrac{\partial C}{\partial a_j^L}\sigma'(z^L_j)\); in vector form \(\delta^L = \nabla_a C \odot \sigma'(z^L)\), which for the quadratic cost becomes \((a^L -y) \odot \sigma'(z^L)\)

note that saturation means the neuron's activation is close to 0 or 1, where \(\sigma'(z)\) is near zero; by the equations above its error \(\delta\), and hence its weight and bias gradients, are then tiny, so its output barely changes for varying inputs, i.e. it learns slowly or not at all.
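a tiny numeric sketch of BP1 and that saturation effect, assuming the quadratic cost so that \(\nabla_a C = a^L - y\) (the numbers are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

# weighted inputs of the output layer: one 'healthy' neuron, two saturated ones
z_L = np.array([[0.5], [8.0], [-8.0]])
a_L = sigmoid(z_L)
y = np.array([[0.0], [0.0], [1.0]])

# BP1 with quadratic cost: delta^L = (a^L - y) * sigma'(z^L)
delta_L = (a_L - y) * sigmoid_prime(z_L)
print(delta_L)  # the saturated neurons get a near-zero error, so by BP3/BP4 they barely learn
```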

pseudocode

backprop

  1. Input x: Set the corresponding activation \(a^1\) for the input layer
  2. Feedforward: For each \(l=2,3,\ldots,L\) compute \(z^l = w^la^{l-1}+b^l\) and \(a^l = \sigma(z^l)\).
  3. Output Error \(\delta^L\): Compute the vector \(\delta^L = \nabla_a C \odot \sigma'(z^L)\)
  4. Backpropagate the error: For each \(l=L-1,L-2,\ldots,2\) compute \(\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)\)
  5. Output: The gradient of the cost function is given by \(\cfrac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta^l_j\) and \(\cfrac{\partial C}{\partial b_j^l} = \delta_j^l\).
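a minimal numpy sketch of those five steps for a fully-connected net, assuming the quadratic cost (so \(\nabla_a C_x = a^L - y\)); storing `weights` and `biases` as per-layer lists is just one convenient representation, not the only one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Return (nabla_b, nabla_w), the gradients of C_x for one training example."""
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]

    # 1-2. input + feedforward, storing every z^l and a^l
    activation = x
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # 3. output error (BP1, quadratic cost)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta                      # BP3
    nabla_w[-1] = delta @ activations[-2].T  # BP4

    # 4. backpropagate the error (BP2), then apply BP3/BP4 at each layer
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = delta @ activations[-l - 1].T

    # 5. output: the per-example gradient
    return nabla_b, nabla_w
```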

sgd

given a minibatch, we step in the direction of steepest descent with the stochastic gradient descent algorithm:

  1. Input minibatch
  2. For each training example x: Set the corresponding input activation \(a^{x,1}\), and perform the following steps:

    1. Feedforward: For each \(l = 2, 3, ..., L\) compute \(z^{x,l} = w^l a^{x, l-1} + b^l\) and \(a^{x,l} = \sigma(z^{x,l})\)
    2. Output Error \(\delta^{x,L}\): Compute the vector \(\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})\).
    3. Backpropagate the Error: For each \(l = L-1, L-2, ..., 2\) compute \(\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})\).
  3. Gradient descent: For each \(l = L, L-1, ..., 2\) update the weights according to the rule \(w^l \rightarrow w^l - \frac{\eta}{m}\sum_x \delta^{x,l}(a^{x,l-1})^T\), and the biases according to the rule \(b^l \rightarrow b^l - \frac{\eta}{m}\sum_x \delta^{x,l}\).
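and a matching sketch of the minibatch update, reusing numpy and the hypothetical `backprop` helper sketched above; `eta` is the learning rate and `mini_batch` a list of `(x, y)` pairs:

```python
import numpy as np

def update_mini_batch(mini_batch, weights, biases, eta):
    """Apply one SGD step, averaging the per-example gradients from backprop."""
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]

    # accumulate the gradient over the minibatch, one backprop pass per example
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = backprop(x, y, weights, biases)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

    # gradient-descent step: w -> w - (eta/m) sum_x delta^{x,l} (a^{x,l-1})^T, similarly for b
    m = len(mini_batch)
    weights = [w - (eta / m) * nw for w, nw in zip(weights, nabla_w)]
    biases = [b - (eta / m) * nb for b, nb in zip(biases, nabla_b)]
    return weights, biases
```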

speed of backprop

  • the total cost of backpropagation is roughly the same as making just two forward passes through the network.
  • the speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve.
  • as such it is possible to use backprop to train deep neural nets.