Neurons
A neuron can be thought of as a cell that holds a number between 0 and 1 and performs a simple computation.
This number is also called the activation.
Sigmoid function
σ(x) = 1/(1 + exp(-x))
For all x, the sigmoid function produces a value between 0 and 1.
A particularly useful property of the sigmoid function is that its derivative is
σ'(x) = σ(x) * (1-σ(x))
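A minimal sketch of the sigmoid and its derivative in Python (numpy assumed):
```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + exp(-x)); maps any real x into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # σ'(x) = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
```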
Convolutional neural network (CNN)
A convolutional neural network uses a convolution operation rather than a matrix multiplication in at least one of its layers.
CNNs are typically used to analyze visual imagery.
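A rough sketch of the convolution operation itself (numpy; no batches, channels, padding or strides, which real CNN layers add):
```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" cross-correlation, which deep learning libraries call convolution:
    # slide the kernel over the image and take a dot product at each position.
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, -1.0]])      # crude horizontal edge detector
print(conv2d(image, kernel).shape)    # (5, 4)
```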
Recurrent neural network (RNN)
Recurrent neural networks are a class of neural networks in which connections between nodes can form a cycle.
Such cycles are essential for remembering past events when a sequence is processed (i.e. for implementing a memory).
RNNs can be used for
- unsegmented, connected handwriting recognition
- speech recognition
RNNs are theoretically Turing complete.
By 2010, RNNs were already widely used for text prediction (usually referred to as language modeling).
A potential drawback of RNNs is that they have an inherently serial structure that prevents them from being run in parallel along the sequence length during training and evaluation.
Also, forward and backward signals need to traverse the full length of the serial path to get from one token in the sequence to another (see also Hochreiter et al., 2001).
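A minimal sketch of a vanilla RNN forward pass (weight names are illustrative). The loop makes the serial dependency mentioned above explicit: h_t cannot be computed before h_{t-1}.
```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    # Vanilla RNN: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
    h = np.zeros(W_hh.shape[0])
    hidden_states = []
    for x in xs:                       # strictly sequential over the input
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]   # sequence of 5 input vectors
W_xh = rng.normal(size=(4, 3))                # input-to-hidden weights
W_hh = rng.normal(size=(4, 4))                # hidden-to-hidden weights (the cycle)
print(len(rnn_forward(xs, W_xh, W_hh, np.zeros(4))))  # 5 hidden states
```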
Bidirectional RNN (BiRNN)
A BiRNN consists of forward and backward RNNs.
The forward RNN reads the input sequence and calculates a sequence of forward hidden states; the backward RNN reads the sequence in reverse order, resulting in a sequence of backward hidden states.
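A sketch of the bidirectional idea (the step functions are toy stand-ins for trained forward/backward RNN cells):
```python
import numpy as np

def birnn(xs, step_fwd, step_bwd, h0):
    # Forward RNN reads xs left-to-right, backward RNN right-to-left;
    # each position gets both hidden states, commonly concatenated.
    hf, hb, fwd, bwd = h0, h0, [], []
    for x in xs:
        hf = step_fwd(x, hf)
        fwd.append(hf)
    for x in reversed(xs):
        hb = step_bwd(x, hb)
        bwd.append(hb)
    bwd.reverse()                      # align backward states with positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

step = lambda x, h: np.tanh(x + h)     # toy stand-in for an RNN cell
xs = [np.ones(2) * t for t in range(4)]
states = birnn(xs, step, step, np.zeros(2))
print(len(states), states[0].shape)    # 4 (4,)
```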
Long short-term memory (LSTM) RNNs
In principle, a sufficiently large RNN should be able to generate arbitrarily complex sequences.
It turns out, however, that standard RNNs do not store past inputs for very long, which makes them prone to instability and mistakes from which the RNN cannot easily recover.
The long short-term memory (LSTM) architecture is designed to improve storing and accessing information.
LSTMs were proposed by S. Hochreiter and J. Schmidhuber in 1997.
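A sketch of one LSTM step in a common modern formulation (with a forget gate, which was added after the 1997 paper); the gate layout here is an assumption, not any particular library's convention:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    c = f * c_prev + i * np.tanh(g)               # cell state stores information
    h = o * np.tanh(c)                            # hidden state exposes it
    return h, c

n_in, n_hidden = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hidden, n_in + n_hidden))
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden),
                 W, np.zeros(4 * n_hidden))
print(h.shape, c.shape)  # (4,) (4,)
```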
Generative Adversarial Network (GAN)
A Generative Adversarial Network (GAN) is used to create high-resolution, realistic images.
A GAN consists of two neural networks that compete against each other:
- a generator and
- a discriminator.
Because GANs are difficult to train effectively, Arjovsky et al. (2017) proposed the Wasserstein GAN (WGAN) as an alternative.
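A toy sketch of the adversarial objective only (G and D are placeholder functions, not a real training loop): the discriminator is pushed to output 1 on real samples and 0 on generated ones, while the generator is pushed the other way.
```python
import numpy as np

def D(x):                 # stand-in discriminator: probability that x is "real"
    return 1.0 / (1.0 + np.exp(-x.mean()))

def G(z):                 # stand-in generator: maps noise z to a fake sample
    return 2.0 * z

rng = np.random.default_rng(0)
real = rng.normal(loc=3.0, size=8)
fake = G(rng.normal(size=8))

d_loss = -(np.log(D(real)) + np.log(1.0 - D(fake)))  # discriminator objective
g_loss = -np.log(D(fake))                            # non-saturating generator objective
print(d_loss, g_loss)
```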
Autoencoders
An autoencoder is a type of neural network used to learn efficient data codings in an unsupervised manner.
An autoencoder tries to learn the identity function (that is, to reconstruct its input). This entails that the autoencoder has the same number of input and output neurons.
An additional requirement is that the number of neurons in the hidden layer must be less than the number of input/output neurons. This second requirement forces the autoencoder to learn only the most important features of the input.
An autoencoder consists of
- The encoding function (encoder)
- The decoding function (decoder)
- A distance function (loss function)
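A minimal linear sketch of the three components listed above (6 input/output neurons, a 2-neuron bottleneck, mean squared error as the distance function; the weights are random, not trained):
```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 2
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_dec = rng.normal(scale=0.1, size=(n_in, n_hidden))

def encoder(x):
    return np.tanh(W_enc @ x)        # compress the input into the bottleneck code

def decoder(code):
    return W_dec @ code              # reconstruct the input from the code

x = rng.normal(size=n_in)
x_hat = decoder(encoder(x))
loss = np.mean((x - x_hat) ** 2)     # distance between input and reconstruction
print(x_hat.shape, loss)
```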
TODO
ByteNet
The ByteNet is a one-dimensional convolutional neural network that is composed of two parts, one to encode the source sequence and the other to decode the target sequence.
Minimal neural network libraries
tinn (Tiny neural network) is a dependency-free neural network library written in C99 in fewer than 200 lines of code.
Geoffrey Hinton
Geoffrey Hinton is considered a pioneer in the field of Artificial Intelligence and neural networks.
Hinton co-invented the Boltzmann machine with David Ackley and Terry Sejnowski in 1985.
Hinton was a co-author of the influential 1986 paper on backpropagation (Learning internal representations by error propagation).
Activation functions
An important architecture decision is the choice of the activation function.
Activation functions are divided into
- Linear, and
- Non-linear functions (some of which are differentiable, a property needed for backpropagation)
In some literature, activation functions are also referred to as transfer functions.
Some activation functions include:
Sigmoid function | The sigmoid function (sometimes referred to as logistic function or squashing function) is mostly used in feedforward neural networks. |
hyperbolic tangent (tanh) | tanh has a gradient of 1 only when the input is 0, which makes the function produce dead neurons during computation. This limitation led to ReLU. |
SiLU (Sigmoid-Weighted Linear Units) | Can only be used in hidden layers of a deep neural network and only for reinforcement learning systems. |
dSiLU (derivative of SiLU) | |
Softmax | Used to compute a probability distribution (i.e. values between 0 and 1 that sum to 1) |
ELU | |
GELU | The preferred activation function for GPT-2 |
ReLU (Rectified Linear Units) | Although ReLU is much simpler (relu(x) = max(0, x)) than the sigmoid function, it turned out to be a much better activation function because the signals don't die out when travelling through the layers (as they do with sigmoid). ReLU was proposed by Nair and Hinton in 2010 and has since been the most widely used activation function for deep learning applications with state-of-the-art results. Because ReLU requires no computation of exponentials etc., it is also much cheaper to execute. |
LReLU (Leaky ReLU) | Introduces a small negative slope to ReLU to sustain and keep the weight updates alive during the entire propagation process |
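Sketches of several of the functions listed above (elementwise except softmax):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope instead of 0

def silu(x):
    return x * sigmoid(x)                  # sigmoid-weighted linear unit

def softmax(x):
    e = np.exp(x - np.max(x))              # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))            # [0. 0. 3.]
print(softmax(x).sum())   # sums to 1
```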