Parameter swarm in neural nets.
Part 1 gave us a broad overview of the founding ideas of deep learning. Whether it be MLPs, convolutional networks or even RNNs, every type of neural net suffers from the huge computational cost attached to its training. The number of parameters required to train a net increases quadratically with the size of its layers, rendering dense neural networks extremely hard to train.
Parameter sharing in convolutional networks and RNNs (see part 1) is an efficient method to reduce the training complexity in a smart way (capturing patterns), and greatly diminishes the number of parameters. Nevertheless, training neural nets remains a very difficult task.
In an MLP, the number of parameters required to connect a layer of size A to a layer of size B is A×B (weights) + B (biases). For the whole net, we sum this formula over every pair of adjacent layers. The example MLP above uses 8×10 + 9×10 + 9×10 + 9×5 = 305 parameters. Modern neural nets have over 10 hidden layers with more than 1000 neurons each; that is no less than 10 million parameters to be tuned.
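The A×B + B counting rule is easy to sketch in code (the layer sizes below are illustrative, not taken from the figure):

```python
def mlp_param_count(layer_sizes):
    """Total parameters of a dense MLP: A*B weights + B biases
    for every pair of adjacent layers of sizes A and B."""
    return sum(a * b + b for a, b in zip(layer_sizes, layer_sizes[1:]))

# A net with 10 hidden layers of 1000 neurons each (plus a 1000-unit input)
# already tops 10 million parameters:
print(mlp_param_count([1000] * 11))  # 10 * (1000*1000 + 1000) = 10010000
```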
Convolutional networks mitigate this issue using parameter sharing. If two layers of 1000 neurons are connected via a convolutional filter of size 5, only 5 parameters need to be estimated during training instead of a million (1000×1000). However, convolutional networks most often use many feature maps*; the number of parameters to be estimated is then the product of the number of shared parameters and the number of feature maps.
*feature map: a group of neurons in a given layer encoding the result of the convolution of the layer below with a specific filter. A layer with 20 feature maps will actually contain the results of convolving the input image with 20 different filters.
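To make the counts concrete, here is a rough tally of shared parameters in a convolutional layer (the helper name and sizes are illustrative, and biases are ignored):

```python
def conv_param_count(filter_size, num_feature_maps):
    """Shared weights of a convolutional layer: one filter of
    `filter_size` weights per feature map (biases ignored here)."""
    return filter_size * num_feature_maps

dense = 1000 * 1000               # fully connecting two layers of 1000 neurons
shared = conv_param_count(5, 1)   # a single shared filter of size 5
print(dense, shared)              # 1000000 5

# With 20 feature maps the count becomes 20 * 5 = 100, still tiny.
print(conv_param_count(5, 20))    # 100
```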
Computing the gradient
Neural nets are usually trained with gradient descent. The gradient indicates how the loss function will vary in response to little changes of the parameters. All we have to do is tweak the parameters according to the gradient to minimize the loss function.
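As a minimal sketch of that idea (a toy one-parameter loss and an illustrative learning rate), gradient descent repeatedly nudges the parameter against the gradient:

```python
# Toy loss L(w) = (w - 3)^2, whose gradient is dL/dw = 2 * (w - 3).
w = 0.0     # initial parameter value (illustrative)
lr = 0.1    # learning rate (illustrative)
for _ in range(100):
    grad = 2 * (w - 3)   # how the loss reacts to a small change in w
    w -= lr * grad       # tweak w against the gradient to lower the loss
print(round(w, 4))  # approaches the minimum at w = 3
```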
Given the explicit relation between the loss function and the parameters, it is easy to compute the gradient, using partial derivatives of the loss function with respect to the parameters. However, for neural nets, especially the deep ones, expressing the loss as a function of the parameters is far too complicated.
In fact, the complex structure of neural nets creates tangled dependencies between the parameters and the overall cost function. The outputs of the neurons in a given layer – which are computed using the network's parameters – are then re-used by the next layer and the layer after it, etc. all the way to the end. The more layers the outputs go through, the harder it is to get the explicit relation between these outputs – and therefore the parameters – and the value of the cost.
The backpropagation algorithm
An elegant way around the tangled-dependencies issue is to break the problem of computing the relations through the whole net down into computing the relations from one layer to the next. Since the output of the last layer is simply related to the cost function, we start from the end and find the update increment of the last layer that minimizes the cost function. Then, recursively, given the relation between layer l and layer l+1, and given the update increment at layer l+1 (the updates of layer l+1's parameters), we can easily back-propagate that update increment to layer l.
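The layer-by-layer recursion can be sketched with NumPy on a tiny two-layer net (sigmoid activations and a squared-error cost are assumptions made for this example; shapes and data are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 1))          # input vector
y = np.array([[1.0]])                # target
W1 = rng.normal(size=(3, 4)) * 0.5   # first-layer weights
W2 = rng.normal(size=(1, 3)) * 0.5   # last-layer weights

sig = lambda z: 1 / (1 + np.exp(-z))

# Forward pass: each layer's output feeds the next.
a1 = sig(W1 @ x)
a2 = sig(W2 @ a1)

# Backward pass: start at the output, which is simply related to the cost,
# then propagate the error signal layer by layer back toward the input.
delta2 = (a2 - y) * a2 * (1 - a2)          # error at the last layer
delta1 = (W2.T @ delta2) * a1 * (1 - a1)   # back-propagated to layer 1
grad_W2 = delta2 @ a1.T
grad_W1 = delta1 @ x.T
print(grad_W1.shape, grad_W2.shape)  # (3, 4) (1, 3)
```

A small step along −grad_W1 and −grad_W2 then lowers the cost, without ever having written the cost as an explicit function of every parameter.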
This algorithm, known as back-propagation, was introduced in the 1970s, but only became famous and widely used after David Rumelhart, Geoffrey Hinton and Ronald Williams published the founding paper Learning representations by back-propagating errors in 1986. Back-propagation proved to be considerably faster than earlier approaches, allowing more complex architectures to be trained. New types of problems could now be dealt with, and at last neural networks were able to compete with rival machine learning algorithms.
However, one last obstacle would soon temper the enthusiasm around neural nets and hinder their development…
Although the universal approximation theorem states that a single-layer neural net can approximate any function provided the hidden layer is big enough, in practice deep architectures are strongly preferred. Deep neural nets easily capture complex dependencies between input variables (see part 1), and there are mathematical proofs that, for some families of functions, shallow neural nets require exponentially more neurons to perform as well.
While performing backpropagation in deep neural nets, researchers came across an unexpected issue: the vanishing gradient problem. Simply put, it arises because of the multiple projections (one per step of back-propagation, going from layer l to layer l-1) the gradient undergoes through the network's layers during backpropagation. Each projection tends to lower the gradient's norm, and multiple projections lower it exponentially. In deep networks, after a few steps of back-propagation the gradient has become very small and the weights are barely updated: neurons in the first layers don't learn much.
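A back-of-the-envelope simulation (sigmoid f, illustrative weight and activation values) shows how quickly repeated projection shrinks the gradient:

```python
import numpy as np

sig = lambda z: 1 / (1 + np.exp(-z))
dsig = lambda z: sig(z) * (1 - sig(z))  # sigmoid derivative

w, a = 1.0, 0.5   # illustrative weight and activation
grad = 1.0        # gradient norm at the output layer
for _ in range(30):
    grad *= w * dsig(w * a)  # one projection per layer crossed
print(grad)  # vanishingly small after 30 layers
```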
Recurrent neural networks face the same difficulty. Because of the loops between units, back-propagation is performed slightly differently: the RNN is unfolded through time, becoming a very deep feedforward net on which back-propagation can be used.
Prevalence of the vanishing gradient issue
At each step of backpropagation, the gradient coming from the layer above is multiplied by a factor of the form wf'(h), where w is the weight, f the activation function, h = wa the neuron's input and a the neuron's activation.
- If |wf'(wa)| > 1, the backpropagated gradient will explode
- If |wf'(wa)| < 1, the backpropagated gradient will vanish
Since f' is bounded (by 1/4 for the sigmoid, by 1 for tanh), |wf'(wa)| can only be large if w is large. However, if you take a close look at the sigmoid function for example, when w is large, h = wa is also large and f'(wa) is tiny.
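Checking this trade-off numerically (with an illustrative activation a = 1):

```python
import numpy as np

sig = lambda z: 1 / (1 + np.exp(-z))
dsig = lambda z: sig(z) * (1 - sig(z))  # sigmoid derivative

a = 1.0  # illustrative activation
factors = {w: w * dsig(w * a) for w in [0.5, 1, 5, 20, 100]}
for w, fac in factors.items():
    print(w, round(fac, 6))
# Larger w drives f'(wa) down faster than w grows: every factor stays below 1.
```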
Therefore, the most common form of unstable gradient is the vanishing one.
Patches for the vanishing gradient issue
Feedforward neural networks can be deep, but they generally have fewer than 20 layers. Recent speedups in learning include:
- changing the neurons’ activation functions (using ReLU or softplus instead of sigmoid or tanh)
- using GPUs to perform more iterations in the same period of time
- using smart weight initialization
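Two of these fixes are easy to illustrate (the numbers below are a sketch; the 1/sqrt(fan_in) scaling shown is a Glorot/Xavier-style initialization, one common choice of "smart" initialization):

```python
import numpy as np

# ReLU keeps its derivative at exactly 1 on the active side, so the per-layer
# factor w * f'(h) no longer shrinks just because |h| is large:
relu_grad = lambda h: (h > 0).astype(float)
sig = lambda z: 1 / (1 + np.exp(-z))
sig_grad = lambda z: sig(z) * (1 - sig(z))
print(relu_grad(np.array(5.0)), round(sig_grad(5.0), 4))  # 1.0 0.0066

# Smart initialization: scaling initial weights by 1/sqrt(fan_in) keeps
# activations and gradients at comparable magnitudes across layers.
fan_in, fan_out = 1000, 1000
W = np.random.randn(fan_out, fan_in) / np.sqrt(fan_in)
print(round(float(W.std()), 3))  # close to 1/sqrt(1000), i.e. about 0.032
```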
Recurrent networks, on the other hand, are often unfolded over 100+ timesteps. Consequently, the vanishing gradient problem is much more of an issue for RNNs than for FNNs. In 1997, German researchers Sepp Hochreiter and Jürgen Schmidhuber came up with an improved version of RNNs: the Long Short-Term Memory unit (LSTM). LSTM networks rely on additive updates of their hidden state, instead of multiplicative (projection) updates, as new inputs from a sequence are fed in. In 2014, a variation of LSTMs called Gated Recurrent Units (GRUs) was presented by Cho et al. These networks are significantly easier to train than LSTMs, and yield similar results on mid- to long-range dependencies.
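The contrast between multiplicative and additive state updates can be caricatured in a few lines (a toy illustration with made-up numbers, not an actual LSTM cell):

```python
# Gradient flowing back through T timesteps:
T = 100
w = 0.9   # illustrative recurrent weight below 1

# Multiplicative (vanilla-RNN-like) path: the gradient picks up a factor
# of w at every step, so it decays like w**T.
mult_path = w ** T

# Additive (LSTM-like) path: the cell state carries an identity connection
# through time, contributing a factor of 1 per step.
add_path = 1.0 ** T

print(mult_path, add_path)  # the multiplicative path has all but vanished
```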