The Race for AI.
Since 2011, leading tech firms including Google, Facebook, Apple, IBM, Twitter and Microsoft have acquired over 140 AI companies, according to cbinsights.com. More than 40 of these acquisitions took place in 2016 alone, and huge amounts of money were spent: Google paid $600M for DeepMind, Twitter spent $150M on Magic Pony, an image-processing startup, and Microsoft bought Equivio for $200M.
This frenzy for AI is backed by recent technological breakthroughs and by the stunning feats computers have been able to accomplish. Machine translation and automatic language processing have become far more convincing than the previous “word-to-word” translation algorithms. Image recognition has also been a very active research field, yielding astonishing results: vision algorithms used by the leading tech companies can now organize and describe entire collections of unlabeled pictures. Medical data scientists hope to use machines to analyze medical images (X-rays, MRIs, …) or to diagnose diseases with less invasive methods. Driverless cars are on the verge of taking over our roads…
Neural networks are the cornerstone of that success. Surprisingly though, neural nets are decades old and were out of favor until the early 2000s. What happened? How did such amazing tech go largely unnoticed for decades after Frank Rosenblatt created the first neural net, the perceptron, in 1958?
From Neurons to Artificial Neural Networks
Artificial Neural Nets (ANNs) were built to replicate the way human brains were thought to work. In the human brain, neurons are connected and interact with each other. A neuron receives multiple electrical inputs from neighboring neurons and determines its output accordingly: an electrical impulse – 0 or 1.
Artificial Neural Nets operate in a similar manner: neurons are the basic computation units and are arranged in layers. The only rule is that a neuron can only be connected to neurons in the adjacent layers: it receives its inputs from the previous layer and sends its output to the next one (there are no connections between neurons in the same layer). An artificial neuron computes a weighted sum of its inputs and passes that sum through a function (called the activation function); the result, a continuous variable, is its output.
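To make the weighted sum and the activation function concrete, here is a minimal sketch of a single artificial neuron in plain Python. The sigmoid activation, the bias term and the numeric values are assumptions chosen for illustration; the text above does not prescribe them.

```python
import math

def sigmoid(z):
    """Squash any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias=0.0, activation=sigmoid):
    """Weighted sum of the inputs, passed through an activation function.

    The bias term is an assumption (it is common in practice but is not
    mentioned in the text); it defaults to 0 so it has no effect here.
    """
    weighted_sum = sum(w * x for w, x in zip(weights, inputs)) + bias
    return activation(weighted_sum)

# Three inputs coming from the previous layer (illustrative values).
inputs = [1.0, 0.5, -1.0]
weights = [0.5, -1.0, 0.0]

# Weighted sum: 0.5*1.0 + (-1.0)*0.5 + 0.0*(-1.0) = 0.0, and sigmoid(0) = 0.5
output = neuron_output(inputs, weights)
print(output)
```

Unlike the 0-or-1 impulse of a biological neuron, the output here is continuous: changing a weight slightly changes the output slightly, which is what makes training by small adjustments possible.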
Hidden Layers and deep learning
Any layer that isn’t the first or the last is a hidden layer. The first layer, called the input layer, encodes the input features from which a prediction will be made; the last layer, called the output layer, encodes the output. The input contains very low-level information: it is raw data that has undergone no transformation. The first hidden layer performs a fairly simple computation on the input (the input has only gone through one layer of neurons, which are basic computation units) and is comparable to feature engineering in machine learning. The second hidden layer can capture cross dependencies between the input features, and so on. Each layer applies a transformation to the data and therefore contains higher-level information than the layer below. The last layer outputs our prediction, which is the highest level of information.
Deep neural nets can capture very complicated dependencies and it is nowadays common to train nets with 5 hidden layers or more.
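The layer-by-layer transformation described above can be sketched as a small forward pass. This is only an illustration: the layer sizes, the hand-picked weights and the ReLU activation are assumptions, not values from any trained network.

```python
def relu(z):
    """Keep positive values, replace negative ones with 0."""
    return max(0.0, z)

def identity(z):
    return z

def dense_layer(inputs, weights, biases, activation):
    """One layer: each row of `weights` holds the weights of one neuron."""
    return [
        activation(sum(w * x for w, x in zip(row, inputs)) + b)
        for row, b in zip(weights, biases)
    ]

def forward(inputs, layers):
    """Feed the input through every layer in order.

    Returns the list of activations, starting with the raw input and
    ending with the output layer.
    """
    activations = [inputs]
    for weights, biases, activation in layers:
        activations.append(dense_layer(activations[-1], weights, biases, activation))
    return activations

# Hypothetical network: 2 inputs -> 2 hidden -> 2 hidden -> 1 output.
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu),   # hidden layer 1
    ([[1.0, 1.0], [1.0, -1.0]], [0.0, 0.0], relu),    # hidden layer 2
    ([[1.0, 1.0]], [0.0], identity),                  # output layer
]

activations = forward([2.0, 1.0], layers)
for depth, values in enumerate(activations):
    print(depth, values)
```

Each entry of `activations` is the representation of the same input at a different depth, from raw data (index 0) to the prediction (last index).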
Multi Layer Perceptron
The first neural net models were “fully-connected”, meaning that each neuron was connected to all the neurons in the next layer. These vanilla neural nets, called Multi Layer Perceptrons (MLPs), quickly fell into disfavor because of the huge number of parameters that had to be tuned during the training phase (one parameter per connection, so each pair of consecutive layers contributes the product of their sizes) and because of poor performance on pattern recognition. In 1986, Geoffrey Hinton (a Canadian researcher and pioneer in the field of deep learning) found a way to train multilayer neural nets, the backpropagation algorithm popularized with David Rumelhart and Ronald Williams; however, due to a lack of computational power at that time, MLPs were still outperformed by traditional machine learning algorithms like SVMs, Random Forests or boosted trees.
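To get a feel for how quickly the parameter count of a fully-connected net adds up, the following sketch counts one weight per connection between consecutive layers. The layer sizes (a 28×28 image flattened to 784 inputs, two hidden layers of 100 neurons, 10 outputs) are hypothetical, and the optional bias per neuron is an assumption not mentioned in the text.

```python
def count_parameters(layer_sizes, include_biases=True):
    """Count the trainable parameters of a fully-connected network.

    Every neuron is connected to every neuron of the next layer, so two
    consecutive layers of sizes n_in and n_out contribute n_in * n_out
    weights (plus, optionally, one bias per neuron of the second layer).
    """
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out
        if include_biases:
            total += n_out
    return total

# Hypothetical MLP for small grayscale images.
sizes = [784, 100, 100, 10]

n_weights = count_parameters(sizes, include_biases=False)  # 78400 + 10000 + 1000
n_params = count_parameters(sizes)                         # weights + 210 biases
print(n_weights, n_params)
```

Even this small network has close to ninety thousand parameters, most of them in the first layer, where every pixel is wired to every hidden neuron.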
Convolutional Networks – Resolving the “spatial” pattern issue
Spatial patterns: patterns within a sample/similarity between features.
In 1998, Yann LeCun (a French researcher working at AT&T’s Bell Labs at the time) bested every other algorithm on handwritten digit classification tasks with a new type of neural net called a convolutional network (abbreviated ConvNet). The convolutional network used by Yann LeCun in 1998 (LeNet98) imposes specific constraints on its parameters that make it robust to small distortions and translation invariant: line thickness, slight rotations of the digits, translations and so on have very little effect on the classification. This patches the “spatial” pattern detection issue. Moreover, because of those constraints (weight sharing, see part 2 – coming soon), these nets have far fewer parameters to train.
What does convolution have to do with all this?
Let’s assume we’re given a 20×20 image and a ConvNet that tries to determine whether or not there is a 5×5 image of a circle contained in the larger image. An intuitive approach would be to shift the 5×5 image pixel by pixel across the 20×20 image to see if there is a match somewhere.
A convolutional network does exactly the same thing. In technical terms, it learns a 5×5 weight matrix (called a filter, with the same dimensions as the subimage the network is trying to detect). The filter is multiplied pointwise with a 5×5 subimage and the products are added up (nothing more than a weighted sum); the weights are such that this sum is large when the subimage is “close” to a 5×5 circle and small when it is “far” from one. To know whether a 5×5 circle is present in the 20×20 image, all we have to do is compute this weighted sum for every 5×5 subimage of the original 20×20 image.
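The sliding weighted sum can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the crude 5×5 “circle”, the hand-written filter (+1 on the ring, -1 elsewhere) and the position where the circle is pasted. In a real ConvNet the filter weights are learned, not written by hand.

```python
def slide_filter(image, kernel):
    """Weighted sum of `kernel` with every kernel-sized subimage of `image`."""
    k = len(kernel)
    out_rows = len(image) - k + 1
    out_cols = len(image[0]) - k + 1
    return [
        [
            sum(kernel[i][j] * image[r + i][c + j]
                for i in range(k) for j in range(k))
            for c in range(out_cols)
        ]
        for r in range(out_rows)
    ]

# A crude 5x5 "circle": 12 lit pixels forming a ring.
circle = [
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
    [0, 1, 1, 1, 0],
]

# Hand-written filter: reward lit pixels on the ring, penalize them elsewhere.
kernel = [[1 if pixel else -1 for pixel in row] for row in circle]

# Blank 20x20 image with the circle pasted at row 7, column 3 (arbitrary).
image = [[0] * 20 for _ in range(20)]
for i in range(5):
    for j in range(5):
        image[7 + i][3 + j] = circle[i][j]

response = slide_filter(image, kernel)  # 16x16 map, one score per subimage

best_score, best_pos = max(
    (response[r][c], (r, c))
    for r in range(len(response))
    for c in range(len(response[0]))
)
print(best_score, best_pos)
```

The response map has 16×16 entries because a 5×5 window fits in 16 positions along each side of a 20×20 image. The score peaks at 12 (the number of lit pixels in the ring) exactly where the circle was pasted, and the same 25 weights are reused at every position.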
Ok, but what about the convolution? The convolution appears when we compute the weighted sum of the filter with all the subimages 😉 If this isn’t clear, write down the convolution formula for two functions, or look at this nice GIF on an eight:
Recurrent Neural Networks – Resolving the “time” pattern issue
Time patterns: patterns within a sequence of samples/similarity between samples.
Around the same time (1997), Sepp Hochreiter, a German researcher, managed to add memory features to neural nets, patching the “time” pattern recognition issue. Vanilla neural nets with memory are called Recurrent Neural Networks (RNNs): information from previous training samples is “remembered” by the network when it processes the following samples, so patterns in sequences of training samples can be detected by the neural net. They are called recurrent because the output of a layer, called the hidden state, is kept in memory and used, when the next sample is fed in, to compute the output of that same layer.
Memory in recurrent neural networks
Feedforward neural networks (FNNs) are neural nets in which connections between units do not form a cycle, as opposed to recurrent neural nets. The cycles in recurrent neural nets allow them to memorize past information and capture “time” patterns.
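A minimal sketch of this memory mechanism, assuming a single recurrent unit with scalar weights and a tanh activation (all hypothetical simplifications; real recurrent layers use weight matrices and many units):

```python
import math

def rnn_step(x, h_prev, w_x, w_h, b):
    """New hidden state computed from the current input and the previous state."""
    return math.tanh(w_x * x + w_h * h_prev + b)

def run_rnn(sequence, w_x=1.0, w_h=0.5, b=0.0, h0=0.0):
    """Process a sequence one sample at a time, carrying the hidden state along."""
    h = h0
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_x, w_h, b)  # the cycle: h feeds back into itself
        states.append(h)
    return states

# Two sequences that differ only in their first sample.
states_a = run_rnn([1.0, 0.0, 0.0])
states_b = run_rnn([0.0, 0.0, 0.0])

print(states_a)
print(states_b)
```

The last two inputs are identical in both sequences, yet the hidden states differ: the trace of the first sample survives in `states_a` and fades at each step. A feedforward net, which has no `h_prev` term, would produce the same output for those last two samples in both cases.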
Note: The neural nets described by Hochreiter in 1997 are called LSTMs (Long Short-Term Memory) and are a bit more sophisticated than vanilla RNNs, as they address the vanishing gradient problem (see part 2).
Computational power with GPU acceleration and the big data era
Despite these major discoveries, which remain the building blocks of most of today’s deep learning algorithms, neural nets fell back into disfavor in the late 90s because the lack of computational power was still unresolved. In the 2000s, available computing power increased dramatically, especially thanks to GPU acceleration.
Computers now had enough computational power, and what researchers needed next was data to perform the training. Neural nets are very powerful but require huge amounts of data to be trained effectively. Unsurprisingly, companies like Google, Amazon, Facebook and Microsoft, which probably have the largest databases in the world, were involved in most of the deep learning applications that have been deployed over the past 10 years.