chapter 2: how the backpropagation algorithm works
- the algorithm was introduced in the 1970s, but its importance wasn't fully appreciated until the famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
- "workhorse of learning in neural networks"
- at the heart of it is an expression that tells us how quickly the cost function changes when we change the weights and biases.
notation
- \(w_{jk}^l\) means the weight for the connection from the k\(^{th}\) neuron in the \((l-1)^{th}\) layer to the j\(^{th}\) neuron in the \(l^{th}\) layer
- similarly for \(b_j^l\) we say it is the j\(^{th}\) bias in the l\(^{th}\) layer
- and \(a_j^l\) we say is the j\(^{th}\) activation in the l\(^{th}\) layer
the activation \(a_j^l\) is computed with:
\begin{equation} \label{eq:ac_sum} a_j^l = \sigma(\sum_k w_{jk}^l a_k^{l-1} + b_j^l) \end{equation}which says that the activation of a given neuron in layer \(l\) is the sigmoid of the weighted sum of the previous layer's activations, plus the bias.
we can then vectorise this:
\begin{equation} \label{eq:ac_vec} a^l = \sigma(w^la^{l-1} + b^l) \end{equation}the argument of the sigmoid is worth naming: \(z^l \equiv w^la^{l-1} + b^l\), the weighted input to layer \(l\).
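a minimal numpy sketch of this vectorised feedforward pass (the layer sizes, the `sigmoid` helper, and the random initialisation are illustrative assumptions, not taken from the chapter):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# illustrative layer sizes: 784 inputs, 30 hidden neurons, 10 outputs
sizes = [784, 30, 10]
rng = np.random.default_rng(0)
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

def feedforward(a):
    """Compute a^l = sigma(w^l a^{l-1} + b^l) layer by layer."""
    for w, b in zip(weights, biases):
        z = w @ a + b      # the weighted input z^l
        a = sigmoid(z)     # the activation a^l
    return a

x = rng.standard_normal((784, 1))   # dummy input column vector
print(feedforward(x).shape)         # -> (10, 1)
```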
cost function
as before, we have:
\begin{equation} \label{eq:cost} C = \cfrac{1}{2n}\sum_x \|y(x) -a^L(x) \|^2 \end{equation}we make two assumptions about this cost function.
- the cost function can be written as an average \(C = \frac{1}{n}\sum_x C_x\) over cost functions \(C_x\) for individual training examples, \(x\).
- the cost function can be written as a function of the outputs from the neural network.
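a small sketch of the per-example quadratic cost and the output-layer gradient \(\nabla_a C_x = a^L - y\) it leads to (function names are just illustrative):

```python
import numpy as np

def cost_single(a_L, y):
    """Quadratic cost C_x = 0.5 * ||y - a_L||^2 for one training example."""
    return 0.5 * np.sum((y - a_L) ** 2)

def cost_derivative(a_L, y):
    """Gradient of C_x with respect to the output activations: a_L - y."""
    return a_L - y

# assumption 1 then says the full cost is the average over the training set:
# C = (1/n) * sum_x C_x
```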
hadamard product
it's just the element-wise multiplication of vectors.
good matrix libraries provide fast implementations of element-wise multiplication, which comes in handy when implementing backprop.
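in numpy the Hadamard product is just `*` on arrays, e.g.:

```python
import numpy as np

s = np.array([1, 2, 3])
t = np.array([4, 5, 6])
print(s * t)   # element-wise (Hadamard) product -> [ 4 10 18]
```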
four fundamental equations
\begin{align}
\delta^L &= \nabla_a C \odot \sigma'(z^L) \label{eq:bp1}\tag{BP1}\\
\delta^l &= ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l) \label{eq:bp2}\tag{BP2}\\
\cfrac{\partial C}{\partial b_j^l} &= \delta_j^l \label{eq:bp3}\tag{BP3}\\
\cfrac{\partial C}{\partial w_{jk}^l} &= a_k^{l-1} \delta_j^l \label{eq:bp4}\tag{BP4}
\end{align}
where:
- \(\delta_j^l = \cfrac{\partial C}{\partial z_j^l}\) is the error of neuron j in layer l.
- L is the last layer
- for the output layer the component form is \(\delta_j^L = \cfrac{\partial C}{\partial a_j^L}\sigma'(z^L_j)\), or as a vector \(\delta^L = \nabla_a C \odot \sigma'(z^L)\), which for the quadratic cost is \((a^L - y) \odot \sigma'(z^L)\) (a short chain-rule derivation follows this list)
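to see where BP1 comes from, here is the one-line chain-rule derivation, using the fact that \(a_k^L = \sigma(z_k^L)\) depends on \(z_j^L\) only when \(k = j\):
\begin{equation*}
\delta_j^L = \cfrac{\partial C}{\partial z_j^L} = \sum_k \cfrac{\partial C}{\partial a_k^L}\cfrac{\partial a_k^L}{\partial z_j^L} = \cfrac{\partial C}{\partial a_j^L}\,\sigma'(z_j^L)
\end{equation*}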
note that saturation means the neuron's activation is close to 0 or 1, where the sigmoid is nearly flat, so \(\sigma'(z_j^l) \approx 0\); the gradient terms above then become small and the neuron learns slowly, i.e. it has effectively stopped learning.
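a quick numeric check of why saturation stalls learning: \(\sigma'(z) = \sigma(z)(1 - \sigma(z))\) shrinks rapidly as \(|z|\) grows (printed values below are approximate):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    # derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    return sigmoid(z) * (1 - sigmoid(z))

for z in [0.0, 2.0, 5.0, 10.0]:
    print(z, sigmoid_prime(z))
# 0.0   0.25
# 2.0   ~0.105
# 5.0   ~0.0066
# 10.0  ~0.000045
```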
pseudocode
backprop
- Input x: Set the corresponding activation \(a^1\) for the input layer
- Feedforward: For each \(l=2,3,\ldots,L\) compute \(z^l = w^la^{l-1}+b^l\) and \(a^l = \sigma(z^l)\).
- Output Error \(\delta^L\): Compute the vector \(\delta^L = \nabla_a C \odot \sigma'(z^L)\)
- Backpropagate the error: For each \(l=L-1,L-2,\ldots,2\) compute \(\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)\)
- Output: The gradient of the cost function is given by \(\cfrac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta^l_j\) and \(\cfrac{\partial C}{\partial b_j^l} = \delta_j^l\) (see the sketch after this list).
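a numpy sketch of these four steps for a single training example, assuming the quadratic cost (so \(\nabla_a C = a^L - y\)); the function and variable names are my own, not the chapter's:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Return (nabla_b, nabla_w): per-layer gradients of the cost for one example."""
    # 1. input: set the activation for the input layer
    activation = x
    activations = [x]     # a^1, a^2, ..., a^L
    zs = []               # z^2, ..., z^L
    # 2. feedforward: compute and store z^l and a^l for each layer
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)
    # 3. output error: delta^L = (a^L - y) ⊙ sigma'(z^L)   (BP1, quadratic cost)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    nabla_b[-1] = delta                        # BP3
    nabla_w[-1] = delta @ activations[-2].T    # BP4
    # 4. backpropagate: delta^l = ((w^{l+1})^T delta^{l+1}) ⊙ sigma'(z^l)   (BP2)
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = delta @ activations[-l - 1].T
    return nabla_b, nabla_w
```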
sgd
given a minibatch, we step in the direction of steepest descent with the stochastic gradient descent algorithm:
- Input minibatch
- For each training example x: Set the corresponding input activation \(a^{x,1}\), and perform the following steps:
- Feedforward: For each \(l = 2, 3, ..., L\) compute \(z^{x,l} = w^l a^{x, l-1} + b^l\) and \(a^{x,l} = \sigma(z^{x,l})\)
- Output Error \(\delta^{x,L}\): Compute the vector \(\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})\).
- Backpropagate the Error: For each \(l = L-1, L-2, ..., 2\) compute \(\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})\).
- Gradient descent: For each \(l = L, L-1, ..., 2\) update the weights according to the rule \(w^l \rightarrow w^l - \frac{\eta}{m}\sum_x \delta^{x,l}(a^{x,l-1})^T\), and the biases according to the rule \(b^l \rightarrow b^l - \frac{\eta}{m}\sum_x \delta^{x,l}\).
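a sketch of one mini-batch update, reusing the `backprop` sketch above; `eta`, the batch format, and the names are assumptions:

```python
import numpy as np

def update_mini_batch(mini_batch, weights, biases, eta):
    """Apply one SGD step using gradients averaged over the mini-batch of (x, y) pairs."""
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = backprop(x, y, weights, biases)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
    m = len(mini_batch)
    # w^l -> w^l - (eta/m) * sum_x delta^{x,l} (a^{x,l-1})^T
    weights = [w - (eta / m) * nw for w, nw in zip(weights, nabla_w)]
    # b^l -> b^l - (eta/m) * sum_x delta^{x,l}
    biases = [b - (eta / m) * nb for b, nb in zip(biases, nabla_b)]
    return weights, biases
```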
speed of backprop
- the total cost of backpropagation is roughly the same as making just two forward passes through the network.
- the speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve.
- this is a large part of what makes it practical to train deep neural nets with backprop.
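to see why "roughly two forward passes" is a big deal, compare with estimating the gradient by finite differences, \(\partial C/\partial w_j \approx (C(w + \epsilon e_j) - C(w))/\epsilon\), which needs an extra forward pass for every single weight; a toy sketch (the tiny one-layer network and all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# a tiny illustrative one-layer "network": cost as a function of a flat weight vector
rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))
y = rng.standard_normal((2, 1))

def cost(w_flat):
    w = w_flat.reshape(2, 3)              # 3 inputs -> 2 outputs
    return 0.5 * np.sum((y - sigmoid(w @ x)) ** 2)

w_flat = rng.standard_normal(6)
eps = 1e-6
base = cost(w_flat)                        # one forward pass
grad_fd = np.array([(cost(w_flat + eps * np.eye(6)[j]) - base) / eps
                    for j in range(6)])    # one extra forward pass per weight
# with a million weights this would mean ~a million forward passes,
# whereas backprop delivers all the partials for roughly the cost of
# one forward plus one backward pass.
print(grad_fd)
```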