chapter 5: why are deep neural networks hard to train?

  • given the findings of the previous chapter (universality), why would we concern ourselves with learning deep neural nets?
    • especially given that we are guaranteed to be able to approximate any continuous function with just a single hidden layer of neurons?

well, just because something is possible, it doesn’t mean it’s a good idea!

when we solve problems with computers, it's usually a good idea to break the problem down into smaller sub-problems, solve those, and then combine the solutions to tackle the main problem.

this kind of decomposition is naturally expressed through successive layers of abstraction, not through the 1 or 2 layers that are theoretically guaranteed to produce the right answer.

in addition to these universality theorems, there is a circuit-complexity literature showing that some functions require exponentially more circuit elements when computed with very shallow circuits.

a famous series of results (see, for example, Johan Håstad's 2012 paper) showed that computing the parity of a set of bits with a shallow circuit requires exponentially many gates, whereas with more depth a small tree of XOR gates suffices.

thus, deep circuits can be intrinsically much more powerful than shallow circuits.
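as a toy illustration of that claim (this example is mine, not from the chapter): a depth-2, sum-of-products style circuit for parity needs one AND term for every odd-parity input pattern, i.e. \(2^{n-1}\) terms for \(n\) bits, whereas a deeper tree of XOR gates needs only \(n-1\) gates arranged in roughly \(\log_2 n\) layers.

```python
from itertools import product

def parity_shallow(bits):
    """Depth-2, sum-of-products style circuit: one AND term per
    odd-parity input pattern, OR-ed together (2^(n-1) terms)."""
    odd_patterns = [p for p in product([0, 1], repeat=len(bits)) if sum(p) % 2 == 1]
    # each AND term fires iff the input equals its pattern; the OR over all terms is the output
    return int(any(tuple(bits) == p for p in odd_patterns))

def parity_deep(bits):
    """Tree of pairwise XOR gates: n - 1 gates, ~log2(n) layers deep."""
    layer = list(bits)
    while len(layer) > 1:
        nxt = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        if len(layer) % 2:          # odd element out is passed up unchanged
            nxt.append(layer[-1])
        layer = nxt
    return layer[0]

bits = [1, 0, 1, 1, 0, 1]
assert parity_shallow(bits) == parity_deep(bits) == sum(bits) % 2
print(2 ** (len(bits) - 1), "AND terms (shallow) vs", len(bits) - 1, "XOR gates (deep)")
```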

{{< tikz >}}

\begin{tikzpicture}[x=4.7cm,y=1.6cm]

% Define colors
\colorlet{myred}{red!80!black}
\colorlet{myblue}{blue!80!black}
\colorlet{mygreen}{green!60!black}
\colorlet{myorange}{orange!70!red!60!black}
\colorlet{mydarkred}{red!30!black}
\colorlet{mydarkblue}{blue!40!black}
\colorlet{mydarkgreen}{green!30!black}

% Define TikZ styles
\tikzset{
  >=latex, % for default LaTeX arrow head
  node/.style={thick,circle,draw=myblue,minimum size=22,inner sep=0.5,outer sep=0.6},
  node in/.style={node,green!20!black,draw=mygreen!30!black,fill=mygreen!25},
  node hidden/.style={node,blue!20!black,draw=myblue!30!black,fill=myblue!20},
  node convol/.style={node,orange!20!black,draw=myorange!30!black,fill=myorange!20},
  node out/.style={node,red!20!black,draw=myred!30!black,fill=myred!20},
  connect/.style={thick,mydarkblue}, %,line cap=round
  connect arrow/.style={-{Latex[length=4,width=3.5]},thick,mydarkblue,shorten <=0.5,shorten >=1}
}

% Define layers and nodes
\def\layerNodes{{6,10,4}} % Number of nodes in each layer
\def\layerX{1,2,3} % X positions of layers

% Loop through layers
\foreach \l [count=\lay] in \layerX {
% Get number of nodes for this layer
\pgfmathsetmacro\nodes{\layerNodes[\lay-1]}

% Determine node style based on layer position
\ifnum\lay=1
  \def\nodestyle{node in}
\else
  \ifnum\lay=3
    \def\nodestyle{node out}
  \else
    \def\nodestyle{node hidden}
  \fi
\fi

% Draw nodes for this layer
\foreach \i in {1,...,\nodes} {
  \pgfmathsetmacro\y{\nodes/2-\i}
  \node[\nodestyle] (N\lay-\i) at (\l,\y) {};

  % Connect to previous layer if not the first layer
  \ifnum\lay>1
    \pgfmathsetmacro\prevnodes{\layerNodes[\lay-2]}
    \foreach \j in {1,...,\prevnodes} {
      \draw[connect,white,line width=1.2] (N\the\numexpr\lay-1\relax-\j) -- (N\lay-\i);
      \draw[connect] (N\the\numexpr\lay-1\relax-\j) -- (N\lay-\i);
    }
  \fi
}

}

\end{tikzpicture}

{{< /tikz >}}

{{< tikz >}}

\begin{tikzpicture}[x=2.3cm,y=1.0cm]

% Define colors if not already defined
\colorlet{myred}{red!80!black}
\colorlet{myblue}{blue!80!black}
\colorlet{mygreen}{green!60!black}
\colorlet{myorange}{orange!70!red!60!black}
\colorlet{mydarkred}{red!30!black}
\colorlet{mydarkblue}{blue!40!black}
\colorlet{mydarkgreen}{green!30!black}

% Define TikZ styles
\tikzset{
  >=latex, % for default LaTeX arrow head
  node/.style={thick,circle,draw=myblue,minimum size=22,inner sep=0.5,outer sep=0.6},
  node in/.style={node,green!20!black,draw=mygreen!30!black,fill=mygreen!25},
  node hidden/.style={node,blue!20!black,draw=myblue!30!black,fill=myblue!20},
  node convol/.style={node,orange!20!black,draw=myorange!30!black,fill=myorange!20},
  node out/.style={node,red!20!black,draw=myred!30!black,fill=myred!20},
  connect/.style={thick,mydarkblue}, %,line cap=round
  connect arrow/.style={-{Latex[length=4,width=3.5]},thick,mydarkblue,shorten <=0.5,shorten >=1},
  node 1/.style={node in}, % node styles, numbered for easy mapping with \n
  node 2/.style={node hidden},
  node 3/.style={node out}
}

\message{^^JNeural network large}

% Define layers and nodes
\def\layerNodes{{6,7,7,7,7,7,4}} % Number of nodes in each layer
\def\totalLayers{7} % total number of layers

\message{^^J Layer}

% Loop over layers
\foreach \lay in {1,...,\totalLayers} {
% Get number of nodes for this layer
\pgfmathsetmacro\N{\layerNodes[\lay-1]}
\pgfmathsetmacro\prev{int(\lay-1)} % number of previous layer

% Determine node style based on layer position
\pgfmathsetmacro\n{int(\lay==1 ? 1 : (\lay==\totalLayers ? 3 : 2))}

\message{\lay,}
\foreach \i in {1,...,\N} { % loop over nodes
  % Calculate y-position
  \pgfmathsetmacro\y{\N/2-\i}

  % NODES as coordinates (initially)
  \coordinate (N\lay-\i) at (\lay,\y);

  % CONNECTIONS
  \ifnum\lay>1 % connect to previous layer
    \pgfmathsetmacro\prevN{\layerNodes[\prev-1]} % nodes in previous layer
    \pgfmathsetmacro\nprev{int(\prev<\totalLayers?min(2,\prev):3)}

    \foreach \j in {1,...,\prevN} { % loop over nodes in previous layer
      \draw[connect,white,line width=1.2] (N\prev-\j) -- (N\lay-\i);
      \draw[connect] (N\prev-\j) -- (N\lay-\i);

      % Draw node over lines for previous layer
      \node[node \nprev,minimum size=18] at (N\prev-\j) {};
    }

    % Draw last node over lines
    \ifnum\lay=\totalLayers
      \node[node \n,minimum size=18] at (N\lay-\i) {};
    \fi
  \else
    % For first layer, just draw nodes
    \node[node \n,minimum size=18] at (N\lay-\i) {};
  \fi
}

} \end{tikzpicture}

{{< /tikz >}}

note that in the second (deeper) architecture we can delegate sub-tasks to individual neurons: early layers can learn particular edges / simple features, and later hidden layers can combine and refine these coarse filters into progressively more abstract concepts.
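purely for illustration (using the layer sizes the two diagrams above happen to show), a quick sketch counting parameters in each fully connected architecture: the deeper network uses narrower layers but gets five stages of composition to build features out of features, rather than a single wide hidden layer.

```python
def parameter_count(layer_sizes):
    """Total weights + biases in a fully connected network with these layer sizes."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]))

shallow = [6, 10, 4]              # first diagram: one hidden layer of 10 neurons
deep    = [6, 7, 7, 7, 7, 7, 4]   # second diagram: five hidden layers of 7 neurons

print("shallow:", parameter_count(shallow), "parameters,", len(shallow) - 2, "hidden layer")
print("deep:   ", parameter_count(deep), "parameters,", len(deep) - 2, "hidden layers")
```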

the principal problem

now that we are sold on the conceptual and theoretical benefits of deep neural nets, in practice we hit a road-block:

  • the layers of our network learn at vastly different speeds!
  • there is an intrinsic instability associated to learning by gradient descent in deep, many-layer neural nets.

vanishing gradients

  • when the earlier hidden layers learn much more slowly than the later hidden layers.
  • comparing the speed of learning of each hidden layer across training epochs:

{{< img "training_speed_2_layers.png" >}}
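a rough way to see this for ourselves: the sketch below (a from-scratch sigmoid MLP with quadratic cost; the layer sizes are arbitrary and the code is mine, not the book's) computes \(\|\partial C / \partial b^l\|\) for each layer on a single random sample, the same kind of norm the book uses as a proxy for how fast a layer learns. the earliest layers typically come out orders of magnitude smaller.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_gradient_norms(sizes, x, y):
    """One forward + backward pass for a single sample; return ||dC/db^l|| per layer.

    Quadratic cost, sigmoid activations, weights and biases drawn from N(0, 1)."""
    weights = [rng.standard_normal((n_out, n_in))
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [rng.standard_normal((n_out, 1)) for n_out in sizes[1:]]

    # forward pass, keeping the weighted inputs z and activations a
    activations, zs = [x], []
    for w, b in zip(weights, biases):
        z = w @ activations[-1] + b
        zs.append(z)
        activations.append(sigmoid(z))

    # backward pass: delta^L = (a^L - y) * sigma'(z^L), then push back layer by layer
    delta = (activations[-1] - y) * sigmoid(zs[-1]) * (1 - sigmoid(zs[-1]))
    norms = [np.linalg.norm(delta)]
    for w, z in zip(reversed(weights[1:]), reversed(zs[:-1])):
        delta = (w.T @ delta) * sigmoid(z) * (1 - sigmoid(z))
        norms.append(np.linalg.norm(delta))
    return list(reversed(norms))   # earliest layer first

sizes = [784, 30, 30, 30, 10]
x = rng.random((784, 1))
y = np.zeros((10, 1)); y[3] = 1.0
for layer, norm in enumerate(layer_gradient_norms(sizes, x, y), start=1):
    print(f"layer {layer}: ||dC/db|| ~ {norm:.2e}")
```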

exploding gradients

  • when the earlier hidden layers have much larger gradients than the later hidden layers.

generalising the problem

it turns out both of the above stem from the same underlying cause: the instability of gradients in deep networks.

we can investigate the cause of the problem by writing out \(\frac{\partial C}{\partial b_1}\) for a toy network with a single neuron in each layer (pictured below).

  • the exact expression depends on the structure of the network, but the general pattern is always the same: the gradient for an early layer is a product of terms contributed by every later layer:

    \begin{equation} \frac{\partial C}{\partial b_1} = \sigma'(z_1)w_2\sigma'(z_2)w_3\sigma'(z_3)w_4\sigma'(z_4)\frac{\partial C}{\partial a_4} \end{equation}

{{< tikz >}}

\begin{tikzpicture}[>=stealth, every node/.style={circle, draw, minimum size=1cm}]
% Input node
\node (n0) at (0,0) {$x_1$};

% Chain of neurons with biases $b_1,\dots,b_4$
\foreach \i in {1,2,3,4} {
  \node (n\i) at (\i*2.5, 0) {$b_{\i}$};
}

% Connections and weight labels (plain text, no visible node outline)
\foreach \i/\w in {0/1,1/2,2/3,3/4} {
  \pgfmathtruncatemacro{\j}{\i+1}
  \draw[->] (n\i) -- (n\j) node[midway, above, draw=none, fill=none] {$w_{\w}$};
}

% Output arrow to the cost C
\draw[->] (n4) -- ++(2,0) node[right, draw=none, fill=none] {$C$};

\end{tikzpicture}

{{< /tikz >}}
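for completeness, here is the chain-rule bookkeeping behind that expression for the toy chain above, using \(z_j = w_j a_{j-1} + b_j\) and \(a_j = \sigma(z_j)\) (with \(a_0 = x_1\)):

\begin{align} \frac{\partial C}{\partial b_1} &= \frac{\partial C}{\partial a_4}\,\frac{\partial a_4}{\partial z_4}\,\frac{\partial z_4}{\partial a_3}\,\frac{\partial a_3}{\partial z_3}\,\frac{\partial z_3}{\partial a_2}\,\frac{\partial a_2}{\partial z_2}\,\frac{\partial z_2}{\partial a_1}\,\frac{\partial a_1}{\partial z_1}\,\frac{\partial z_1}{\partial b_1} \\ &= \sigma'(z_1)\,w_2\,\sigma'(z_2)\,w_3\,\sigma'(z_3)\,w_4\,\sigma'(z_4)\,\frac{\partial C}{\partial a_4}, \end{align}

since \(\frac{\partial z_1}{\partial b_1} = 1\), \(\frac{\partial a_j}{\partial z_j} = \sigma'(z_j)\) and \(\frac{\partial z_{j+1}}{\partial a_j} = w_{j+1}\).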

the derivative \(\sigma'(z)\) reaches its maximum of \(\tfrac{1}{4}\) at \(z = 0\). so if we initialise the weights with mean 0 and standard deviation 1, the factors \(|w_j\sigma'(z_j)|\) will usually be smaller than \(\tfrac{1}{4}\), and the product of many such terms makes the gradient shrink exponentially with the number of layers!
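a quick numerical check of that argument (a sketch with made-up settings: standard-normal weights, and each \(\sigma'(z_j)\) generously taken at its maximum value of \(\tfrac{1}{4}\)):

```python
import numpy as np

rng = np.random.default_rng(1)

for depth in [5, 10, 20, 40]:
    # |dC/db_1| is proportional to prod_j |w_j| * sigma'(z_j); take sigma' = 0.25 throughout
    samples = [np.prod(np.abs(rng.standard_normal(depth)) * 0.25) for _ in range(10_000)]
    print(f"{depth:>3} layers: median size of the product ~ {np.median(samples):.1e}")
```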

  • there are 2 steps to getting exploding gradients
    1. choose all the weights in the network to be large
    2. choose the biases so that the \(\sigma'(z_j)\) terms are not too small (e.g. pick each bias so that \(z_j = 0\), where \(\sigma'(z_j) = \tfrac{1}{4}\)).

{{< img "exploding-grad.gif" >}}
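and the mirror image: a sketch of the two-step recipe above, with every weight set to 100 (an arbitrary "large" choice) and every bias chosen so that \(z_j = 0\), giving \(\sigma'(z_j) = \tfrac{1}{4}\) and a factor of 25 per layer:

```python
# each layer contributes a factor |w_j| * sigma'(z_j) = 100 * 0.25 = 25 to dC/db_1
w, sigma_prime = 100.0, 0.25
for depth in [5, 10, 20]:
    print(f"{depth:>3} layers: size of the product = {(w * sigma_prime) ** depth:.1e}")
```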


fundamentally, the problem is not so much exploding or vanishing gradients per se, but rather that the gradient in an early layer is the product of terms contributed by all the later layers.

  • thus, once there are many layers, this product makes learning intrinsically unstable.
  • as such, we need mechanisms that keep the gradients balanced across layers.
  • with sigmoid neurons, the gradient will usually vanish rather than explode.

  • this slowdown (of learning) is not an inconvenience, nor is it an accident. it is a fundamental consequence of the approach we are taking to learning.

  • making good choices makes a substantial difference in our ability to train deep networks (think of how entries to the CASP protein-folding competition, such as AlphaFold, improved year on year).

in the next chapter we implement these fixes to train a network on MNIST with upwards of 99% accuracy.

  • we will initialise weights sensibly (see the sketch after this list);
  • choose good activation functions;
  • tune hyperparameters;
  • revise the network architecture;
  • and regularise appropriately.
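as a taster of the first item, a minimal sketch (assuming the common \(1/\sqrt{n_{\text{in}}}\) scaling; the numbers are made up) of why sensible initialisation helps: it stops the weighted input \(z\) from being huge, so the sigmoid doesn't start out saturated with \(\sigma'(z) \approx 0\).

```python
import numpy as np

rng = np.random.default_rng(2)
n_in, trials = 1000, 5000
a = np.ones(n_in)              # pretend every incoming activation is 1

# weighted input z = w . a (bias omitted) for many independent draws of the weight vector
z_naive  = np.array([rng.standard_normal(n_in) @ a for _ in range(trials)])  # w_i ~ N(0, 1)
z_scaled = z_naive / np.sqrt(n_in)   # same draws, rescaled: w_i ~ N(0, 1/n_in)

print("std of z with N(0,1) weights:      ", round(float(z_naive.std()), 1))   # ~ sqrt(1000) ~ 31.6
print("std of z with N(0,1/n_in) weights: ", round(float(z_scaled.std()), 2))  # ~ 1.0
```

with \(|z|\) in the tens the sigmoid saturates immediately and the neuron barely learns; with \(|z|\) of order 1 the derivative \(\sigma'(z)\) stays a reasonable size.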