chapter 5: why are deep neural networks hard to train?
-
given the findings of the previous chapter (universality), why would we concern ourselves with learning deep neural nets?
- especially given that we are guaranteed to be able to approximate any function with just a single layer of hidden neurons?
well, just because something is possible doesn't mean it's a good idea!
when solving problems with computers, it's usually a good idea to break the problem down into smaller sub-problems, solve those, and then combine the pieces to solve the main problem.
this kind of decomposition maps naturally onto successive layers of abstraction, not onto the 1 or 2 hidden layers that the universality theorem guarantees are sufficient.
and the argument isn't just conceptual: in addition to the completeness theorems, there is circuit-complexity literature showing that some functions require exponentially more circuit elements when computed by very shallow circuits.
a famous series of papers in the early 1980s (the history is surveyed in Johan Håstad's 2012 paper on the correlation of parity and small-depth circuits) showed that computing the parity of a set of bits requires exponentially many gates when the circuit depth is small.
thus, deep circuits can be intrinsically much more powerful than shallow circuits.
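to see the cheap-with-depth side of that trade-off concretely, here is a minimal sketch (my own illustration, not from the source) of computing parity with a balanced tree of two-input XOR gates: the depth grows like \(\log_2 n\) and only \(n - 1\) gates are needed, in contrast with the exponential blow-up the shallow-circuit results describe.

```python
# parity of n bits via a balanced tree of two-input XOR "gates".
# depth ~ log2(n), total gate count n - 1: cheap, but only because we are
# allowed to stack layers. illustrative sketch, not from the source.

def parity_tree(bits):
    """XOR adjacent pairs repeatedly until a single value remains."""
    layer = list(bits)
    depth = 0
    while len(layer) > 1:
        paired = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:                # an odd element out is carried up unchanged
            paired.append(layer[-1])
        layer = paired
        depth += 1
    return layer[0], depth

print(parity_tree([1, 0, 1, 1, 0, 1, 0, 1]))   # (1, 3): parity 1, depth 3 for 8 bits
```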
note that in a deep architecture we can delegate tasks across the layers: individual neurons in early layers learn particular edges or simple features, and subsequent hidden layers refine these coarse filters into progressively more abstract ones.
the principal problem
now that we are sold on the conceptual and theoretical benefits of deep neural nets, in practice we hit a road-block:
- the layers of our network learn at vastly different speeds!
- there is an intrinsic instability associated with learning by gradient descent in deep, many-layer neural nets.
vanishing gradients
- when earlier hidden layers learn much more slowly than later hidden layers.
- plotting the speed of learning of each hidden layer across training epochs makes this visible (one way to measure it is sketched below).

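one way to see this concretely (a minimal numpy sketch with my own toy setup, not the book's code) is to treat the norm of the gradient of each layer's biases, \(\|\partial C / \partial b^l\|\), as that layer's speed of learning and print it for a randomly initialised sigmoid network; with \(N(0, 1)\) weights the earlier layers' norms typically come out far smaller.

```python
# measure the "speed of learning" of each layer as the norm of the gradient of
# its biases, for a sigmoid network with N(0, 1) weights and a quadratic cost.
# minimal sketch with an arbitrary toy input, not the book's code.
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 30, 30, 30, 30, 10]            # input, four hidden layers, output
weights = [rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [rng.standard_normal((m, 1)) for m in sizes[1:]]

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
sigmoid_prime = lambda z: sigmoid(z) * (1.0 - sigmoid(z))

x = rng.random((784, 1))                     # one random "image" with pixels in [0, 1]
y = np.zeros((10, 1)); y[3] = 1.0            # arbitrary target

# forward pass, remembering the weighted inputs z and activations a
activations, zs = [x], []
for w, b in zip(weights, biases):
    zs.append(w @ activations[-1] + b)
    activations.append(sigmoid(zs[-1]))

# backward pass for the quadratic cost: delta at the output, then push it back
delta = (activations[-1] - y) * sigmoid_prime(zs[-1])
grad_b = [delta]
for l in range(2, len(sizes)):
    delta = (weights[-l + 1].T @ delta) * sigmoid_prime(zs[-l])
    grad_b.insert(0, delta)

for l, g in enumerate(grad_b, start=1):
    print(f"layer {l}: ||dC/db|| = {np.linalg.norm(g):.2e}")
```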
exploding gradients
- when earlier hidden layers end up with much larger gradients than later hidden layers.
generalising the problem
it turns out both of the above are symptoms of the same thing: the gradients in the early layers of a deep network are inherently unstable.
we can investigate the cause of the problem by computing \(\frac{\partial C}{\partial b_1}\), the gradient of the cost with respect to the bias of the first hidden neuron, in a simple chain of single-neuron layers.
-
the exact expression for this gradient depends on the structure of the network, but what stays the same is that the gradient at an early layer is a product of terms contributed by every later layer:
\begin{equation} \frac{\partial C}{\partial b_1} = \sigma'(z_1)w_2\sigma'(z_2)w_3\sigma'(z_3)w_4\sigma'(z_4)\frac{\partial C}{\partial a_4} \end{equation}
the derivative \(\sigma'(z)\) attains its maximum value of \(\cfrac{1}{4}\), at \(z = 0\). so if we initialise the weights with mean 0 and standard deviation 1, each \(|w_j\sigma'(z_j)|\) term will usually be smaller than \(\cfrac{1}{4}\), and the product of many such terms makes the early gradients shrink exponentially with depth: this is the vanishing gradient problem.
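spelling that step out (my own working, under the stated initialisation): since \(\sigma'(z) \le \cfrac{1}{4}\) everywhere and \(|w_j|\) is typically of order 1,
\begin{equation} \left|\frac{\partial C}{\partial b_1}\right| = \left|\sigma'(z_1)\right| \prod_{j=2}^{4} \left|w_j\,\sigma'(z_j)\right| \left|\frac{\partial C}{\partial a_4}\right| \lesssim \left(\frac{1}{4}\right)^{4} \left|\frac{\partial C}{\partial a_4}\right| \end{equation}
and more generally a chain of \(L\) layers suppresses the early gradient roughly like \(4^{-L}\).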
-
there are two steps to getting exploding gradients (see the numerical sketch after this list):
- choose all the weights in the network to be large;
- choose the biases so that the \(\sigma'(z_j)\) terms are not too small (e.g. so that each \(z_j\) stays near 0, where \(\sigma'\) is largest).

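here is a minimal numerical sketch of both regimes (the specific numbers, 30 layers and weights of 100, are my own arbitrary choices): multiply up the \(w_j\sigma'(z_j)\) factors along a chain, once with \(N(0,1)\) weights and once with large weights and biases arranged so that each \(z_j\) stays near 0, where \(\sigma'(z_j) = \cfrac{1}{4}\).

```python
# product of the per-layer factors w_j * sigmoid'(z_j) along a 30-layer chain.
# N(0, 1) weights -> the product collapses towards 0 (vanishing gradients);
# large weights with every z_j near 0 -> each factor is ~ w / 4 >> 1 (exploding).
# illustrative sketch only; the numbers are arbitrary choices.
import numpy as np

rng = np.random.default_rng(1)
L = 30

def sigmoid_prime(z):
    s = 1.0 / (1.0 + np.exp(-z))
    return s * (1.0 - s)

# vanishing regime: standard-normal weights, weighted inputs of order 1
w_small = rng.standard_normal(L)
z_small = rng.standard_normal(L)
vanishing = np.prod(w_small * sigmoid_prime(z_small))

# exploding regime: large weights, biases arranged so every z_j is ~ 0
w_large = np.full(L, 100.0)
z_large = np.zeros(L)                                  # sigmoid_prime(0) = 0.25
exploding = np.prod(w_large * sigmoid_prime(z_large))  # (100 * 0.25) ** 30

print(f"vanishing-regime product: {vanishing:.3e}")    # tiny in magnitude
print(f"exploding-regime product: {exploding:.3e}")    # 25 ** 30, about 8e41
```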
—
fundamentally, the problem is not so much exploding or vanishing gradients as the fact that the gradient in an early layer is the product of terms from all the later layers.
- with many layers, the only way every layer learns at a similar speed is if all those products happen to balance out; absent a mechanism to enforce that balance, learning is intrinsically unstable.
- as such, we need to find mechanisms that keep the gradients balanced across layers.
- with sigmoid neurons and standard weight initialisation, the gradient usually vanishes rather than explodes.
- this slowdown of learning is not an inconvenience, nor an accident: it is a fundamental consequence of our gradient-based approach to learning.
- making good choices makes a substantial difference in our ability to train deep networks (think of how successive AlphaFold competition entries improved over the years).
in the next chapter we implement these fixes to train a network on MNIST to upwards of 99% accuracy. we will:
- initialise weights sensibly (one standard option is sketched below);
- choose good activation functions;
- tune hyperparameters;
- choose a better network architecture;
- and regularise appropriately.
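as a taste of the first item, one standard option (a sketch of a common technique, not necessarily the exact scheme the next chapter uses) is to scale each weight's standard deviation by \(1/\sqrt{n_{\text{in}}}\), so that a neuron's weighted input keeps a modest variance however many inputs feed into it and sigmoid neurons are less likely to start out saturated.

```python
# fan-in-scaled weight initialisation: drawing w ~ N(0, 1/n_in) keeps the
# variance of z = w . x roughly independent of the number of inputs.
# sketch of a standard technique, not claimed to be the next chapter's exact scheme.
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out):
    """Weights scaled by 1/sqrt(n_in); biases left as N(0, 1)."""
    w = rng.standard_normal((n_out, n_in)) / np.sqrt(n_in)
    b = rng.standard_normal((n_out, 1))
    return w, b

x = rng.standard_normal((784, 1))              # a standard-normal "input"
w_naive = rng.standard_normal((30, 784))       # unscaled N(0, 1) weights
w_scaled, _ = init_layer(784, 30)
print(np.std(w_naive @ x), np.std(w_scaled @ x))   # roughly 28 vs roughly 1
```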