• the algorithm was introduced in the 1970s, but its importance wasn't fully appreciated until the famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams.
  • "workhorse of learning in neural networks"
  • at the heart of it is an expression that tells us how quickly the cost function changes when we change the weights and biases.
[figure /projects/ml/dl/neural-nets/ch2/activations.svg: activation diagram of a single neuron in matrix notation]

notation

  • \(w_{jk}^l\) means the weight of the connection from the k\(^{th}\) neuron in layer \(l-1\) to the j\(^{th}\) neuron in layer \(l\)

    • similarly, \(b_j^l\) is the j\(^{th}\) bias in the l\(^{th}\) layer
    • and \(a_j^l\) is the j\(^{th}\) activation in the l\(^{th}\) layer

obviously the activation \(a_j^l\) is computed with:

\begin{equation} \label{eq:ac_sum} a_j^l = \sigma(\sum_k w_{jk}^l a_k^{l-1} + b_j^l) \end{equation}

this says that the activation of any given neuron in layer \(l\) is the sigmoid of the weighted sum of all activations from the previous layer, plus the bias.

we can then vectorise this:

\begin{equation} \label{eq:ac_vec} a^l = \sigma(w^la^{l-1} + b^l) \end{equation}

the 'juicy bit' inside the sigmoid above is worth naming: \(z^l = w^la^{l-1} + b^l\), the weighted input to layer \(l\).
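as a quick sanity check, here's a minimal numpy sketch of the vectorised feedforward step for a single layer (the sizes and random values are made-up illustration, not anything from the text):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# toy layer: 3 inputs feeding 2 neurons
rng = np.random.default_rng(0)
w = rng.standard_normal((2, 3))       # w[j, k]: weight from neuron k in layer l-1 to neuron j in layer l
b = rng.standard_normal((2, 1))       # one bias per neuron in layer l
a_prev = rng.standard_normal((3, 1))  # activations a^{l-1} of the previous layer

z = w @ a_prev + b   # the weighted input z^l -- the 'juicy bit'
a = sigmoid(z)       # the activations a^l = sigma(z^l)
```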

cost function

as before, we have:

\begin{equation} \label{eq:cost} C = \cfrac{1}{2n}\sum_x \|y(x) -a^L(x) \|^2 \end{equation}

we make two assumptions about this cost function.

  1. the cost function can be written as an average \(C = \frac{1}{n}\sum_x C_x\) over cost functions \(C_x\) for individual training examples, \(x\).
  2. the cost function can be written as a function of the outputs from the neural network.
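the quadratic cost above satisfies both: for a single training example \(x\) we can take

\[ C_x = \tfrac{1}{2}\|y(x) - a^L(x)\|^2, \qquad C = \frac{1}{n}\sum_x C_x \]

assumption 1 matters because backprop actually computes \(\partial C_x/\partial w\) and \(\partial C_x/\partial b\) for a single example, and we recover \(\partial C/\partial w\) by averaging; assumption 2 holds because, with \(x\) fixed, \(C_x\) depends only on the output activations \(a^L\).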

hadamard product

it's just the element-wise multiplication of vectors.

good matrix libraries provide fast implementations of the Hadamard product, since it's just element-wise multiplication (much cheaper than a full matrix multiplication), and that comes in handy when implementing back-prop.
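for example, in numpy the Hadamard product is just `*` on arrays (versus `@`/`np.dot` for matrix multiplication):

```python
import numpy as np

s = np.array([1, 2, 3])
t = np.array([4, 5, 6])

print(s * t)   # element-wise (Hadamard) product: [ 4 10 18]
print(s @ t)   # ordinary dot product, for comparison: 32
```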

four fundamental equations

\[\require{bbox}\require{color}\definecolor{myred}{RGB}{128,0,0} \fcolorbox{red}{lightblue}{$\begin{align} \color{myred}{\boldsymbol{\delta^L}} &\color{myred}{\boldsymbol{= \nabla_a C \odot \sigma'(z^L)}} \label{eq:bp1}\\ \color{myred}{\boldsymbol{\delta^l}} &\color{myred}{\boldsymbol{= ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)}} \label{eq:bp2}\\ \color{myred}{\boldsymbol{\cfrac{\partial C}{\partial b_j^l}}} &\color{myred}{\boldsymbol{= \delta_j^l}} \label{eq:bp3}\\ \color{myred}{\boldsymbol{\cfrac{\partial C}{\partial w_{jk}^l}}} &\color{myred}{\boldsymbol{= a_k^{l-1} \delta_j^l}} \label{eq:bp4} \end{align}$}\]

where:

  • \(\delta_j^l = \cfrac{\partial C}{\partial z_j^l}\) is the error of neuron j in layer l.
  • L is the last layer
  • and then, component-wise, \(\delta_j^L = \cfrac{\partial C}{\partial a_j^L}\sigma'(z^L_j)\); in vector form \(\delta^L = \nabla_a C \odot \sigma'(z^L)\), which for the quadratic cost becomes \((a^L -y) \odot \sigma'(z^L)\)

note that saturation means the neuron's activation is close to 0 or 1, where \(\sigma'(z)\) is near zero; by the equations above its error \(\delta\), and hence its weight and bias gradients, are then tiny, so its output barely changes for varying inputs, i.e. it learns slowly or not at all.
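a tiny numeric sketch of BP1 and that saturation effect, assuming the quadratic cost so that \(\nabla_a C = a^L - y\) (the numbers are made up):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

# weighted inputs of the output layer: one 'healthy' neuron, two saturated ones
z_L = np.array([[0.5], [8.0], [-8.0]])
a_L = sigmoid(z_L)
y = np.array([[0.0], [0.0], [1.0]])

# BP1 with quadratic cost: delta^L = (a^L - y) * sigma'(z^L)
delta_L = (a_L - y) * sigmoid_prime(z_L)
print(delta_L)  # the saturated neurons get a near-zero error, so by BP3/BP4 they barely learn
```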

pseudocode

backprop

  1. Input x: Set the corresponding activation \(a^1\) for the input layer
  2. Feedforward: For each \(l=2,3,\ldots,L\) compute \(z^l = w^la^{l-1}+b^l\) and \(a^l = \sigma(z^l)\).
  3. Output Error \(\delta^L\): Compute the vector \(\delta^L = \nabla_a C \odot \sigma'(z^L)\)
  4. Backpropagate the error: For each \(l=L-1,L-2,\ldots,2\) compute \(\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)\)
  5. Output: The gradient of the cost function is given by \(\cfrac{\partial C}{\partial w_{jk}^l} = a_k^{l-1}\delta^l_j\) and \(\cfrac{\partial C}{\partial b_j^l} = \delta_j^l\).
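a minimal numpy sketch of those five steps for a fully-connected net, assuming the quadratic cost (so \(\nabla_a C_x = a^L - y\)); storing `weights` and `biases` as per-layer lists is just one convenient representation, not the only one:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

def backprop(x, y, weights, biases):
    """Return (nabla_b, nabla_w), the gradients of C_x for one training example."""
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]

    # 1-2. input + feedforward, storing every z^l and a^l
    activation = x
    activations = [x]
    zs = []
    for w, b in zip(weights, biases):
        z = w @ activation + b
        zs.append(z)
        activation = sigmoid(z)
        activations.append(activation)

    # 3. output error (BP1, quadratic cost)
    delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
    nabla_b[-1] = delta                      # BP3
    nabla_w[-1] = delta @ activations[-2].T  # BP4

    # 4. backpropagate the error (BP2), then apply BP3/BP4 at each layer
    for l in range(2, len(weights) + 1):
        delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
        nabla_b[-l] = delta
        nabla_w[-l] = delta @ activations[-l - 1].T

    # 5. output: the per-example gradient
    return nabla_b, nabla_w
```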

sgd

given a minibatch, we step in the direction of steepest descent with the stochastic gradient descent algorithm:

  1. Input minibatch
  2. For each training example x: Set the corresponding input activation \(a^{x,1}\), and perform the following steps:

    1. Feedforward: For each \(l = 2, 3, ..., L\) compute \(z^{x,l} = w^l a^{x, l-1} + b^l\) and \(a^{x,l} = \sigma(z^{x,l})\)
    2. Output Error \(\delta^{x,L}\): Compute the vector \(\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})\).
    3. Backpropagate the Error: For each \(l = L-1, L-2, ..., 2\) compute \(\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})\).
  3. Gradient descent: For each \(l = L, L-1, ..., 2\) update the weights according to the rule \(w^l \rightarrow w^l - \frac{\eta}{m}\sum_x \delta^{x,l}(a^{x,l-1})^T\), and the biases according to the rule \(b^l \rightarrow b^l - \frac{\eta}{m}\sum_x \delta^{x,l}\).
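and a matching sketch of the minibatch update, reusing numpy and the hypothetical `backprop` helper sketched above; `eta` is the learning rate and `mini_batch` a list of `(x, y)` pairs:

```python
import numpy as np

def update_mini_batch(mini_batch, weights, biases, eta):
    """Apply one SGD step, averaging the per-example gradients from backprop."""
    nabla_b = [np.zeros(b.shape) for b in biases]
    nabla_w = [np.zeros(w.shape) for w in weights]

    # accumulate the gradient over the minibatch, one backprop pass per example
    for x, y in mini_batch:
        delta_nabla_b, delta_nabla_w = backprop(x, y, weights, biases)
        nabla_b = [nb + dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
        nabla_w = [nw + dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]

    # gradient-descent step: w -> w - (eta/m) sum_x delta^{x,l} (a^{x,l-1})^T, similarly for b
    m = len(mini_batch)
    weights = [w - (eta / m) * nw for w, nw in zip(weights, nabla_w)]
    biases = [b - (eta / m) * nb for b, nb in zip(biases, nabla_b)]
    return weights, biases
```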

speed of backprop

  • the total cost of backpropagation is roughly the same as making just two forward passes through the network.
  • the speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve.
  • as such it is possible to use backprop to train deep neural nets.