Neurons
A neuron can be thought of as a cell that holds a number between 0 and 1 and performs a simple computation.
This number is also called the activation.
Sigmoid function
σ(x) = 1/(1 + exp(-x))
For all x, the sigmoid function produces a value between 0 and 1.
A particularly useful property of the sigmoid function is that its derivative is
σ'(x) = σ(x) * (1-σ(x))
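A minimal sketch of the sigmoid and its derivative in Python (numpy assumed):
```python
import numpy as np

def sigmoid(x):
    # σ(x) = 1 / (1 + exp(-x)); maps any real x into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_derivative(x):
    # σ'(x) = σ(x) * (1 - σ(x))
    s = sigmoid(x)
    return s * (1.0 - s)

print(sigmoid(0.0))             # 0.5
print(sigmoid_derivative(0.0))  # 0.25
```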
Convolutional neural network (CNN)
A convolutional neural network uses a convolution operation rather than a matrix multiplication in at least one of its layers.
CNNs are typically used to analyze visual imagery.
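A rough sketch of the convolution operation itself (numpy; no batches, channels, padding or strides, which real CNN layers add):
```python
import numpy as np

def conv2d(image, kernel):
    # "Valid" cross-correlation, which deep learning libraries call convolution:
    # slide the kernel over the image and take a dot product at each position.
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.arange(25, dtype=float).reshape(5, 5)
kernel = np.array([[1.0, -1.0]])      # crude horizontal edge detector
print(conv2d(image, kernel).shape)    # (5, 4)
```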
Recurrent neural network (RNN)
Recurrent neural networks are a class of neural networks in which connections between nodes can form a cycle.
Such cycles are essential for remembering past events when a sequence is processed (i.e. for implementing a memory).
RNNs can be used for
- unsegmented, connected handwriting recognition
- speech recognition
RNNs are theoretically Turing complete.
By 2010, RNNs were already widely used for text prediction (usually referred to as language modeling).
A potential drawback of RNNs is that they have an inherently serial structure that prevents them from being run in parallel along the sequence length during training and evaluation.
Also, forward and backward signals need to traverse the full length of the serial path to get from one token in the sequence to another (see also Hochreiter et al., 2001).
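A minimal sketch of a vanilla RNN forward pass (weight names are illustrative). The loop makes the serial dependency mentioned above explicit: h_t cannot be computed before h_{t-1}.
```python
import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h):
    # Vanilla RNN: h_t = tanh(W_xh @ x_t + W_hh @ h_{t-1} + b_h)
    h = np.zeros(W_hh.shape[0])
    hidden_states = []
    for x in xs:                       # strictly sequential over the input
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

rng = np.random.default_rng(0)
xs = [rng.normal(size=3) for _ in range(5)]   # sequence of 5 input vectors
W_xh = rng.normal(size=(4, 3))                # input-to-hidden weights
W_hh = rng.normal(size=(4, 4))                # hidden-to-hidden weights (the cycle)
print(len(rnn_forward(xs, W_xh, W_hh, np.zeros(4))))  # 5 hidden states
```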
Bidirectional RNN (BiRNN)
A BiRNN consists of forward and backward RNNs.
The forward RNN reads the input sequence and calculates a sequence of forward hidden states; the backward RNN reads the sequence in reverse order, resulting in a sequence of backward hidden states.
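A sketch of the bidirectional idea (the step functions are toy stand-ins for trained forward/backward RNN cells):
```python
import numpy as np

def birnn(xs, step_fwd, step_bwd, h0):
    # Forward RNN reads xs left-to-right, backward RNN right-to-left;
    # each position gets both hidden states, commonly concatenated.
    hf, hb, fwd, bwd = h0, h0, [], []
    for x in xs:
        hf = step_fwd(x, hf)
        fwd.append(hf)
    for x in reversed(xs):
        hb = step_bwd(x, hb)
        bwd.append(hb)
    bwd.reverse()                      # align backward states with positions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]

step = lambda x, h: np.tanh(x + h)     # toy stand-in for an RNN cell
xs = [np.ones(2) * t for t in range(4)]
states = birnn(xs, step, step, np.zeros(2))
print(len(states), states[0].shape)    # 4 (4,)
```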
Long short-term memory (LSTM) RNNs
In principle, a sufficiently large RNN should be able to generate arbitrarily complex sequences.
It turns out, however, that standard RNNs do not store past inputs for very long, which makes them prone to instability and mistakes from which the RNN cannot easily recover.
The long short-term memory (LSTM) architecture is designed to improve storing and accessing information.
LSTMs were proposed by S. Hochreiter and J. Schmidhuber in 1997.
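A sketch of one LSTM step in a common modern formulation (with a forget gate, which was added after the 1997 paper); the gate layout here is an assumption, not any particular library's convention:
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    z = W @ np.concatenate([x, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)  # input, forget, output gates
    c = f * c_prev + i * np.tanh(g)               # cell state stores information
    h = o * np.tanh(c)                            # hidden state exposes it
    return h, c

n_in, n_hidden = 3, 4
rng = np.random.default_rng(0)
W = rng.normal(size=(4 * n_hidden, n_in + n_hidden))
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hidden), np.zeros(n_hidden),
                 W, np.zeros(4 * n_hidden))
print(h.shape, c.shape)  # (4,) (4,)
```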
Generative Adversarial Network (GAN)
A Generative Adversarial Network (GAN) is used to create high-resolution, realistic images.
A GAN consists of two neural networks that compete against each other:
- a generator and
- a discriminator.
Because GANs are difficult to train effectively, Arjovsky et al. (2017) proposed the Wasserstein GAN (WGAN) as an alternative.
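A toy sketch of the adversarial objective only (G and D are placeholder functions, not a real training loop): the discriminator is pushed to output 1 on real samples and 0 on generated ones, while the generator is pushed the other way.
```python
import numpy as np

def D(x):                 # stand-in discriminator: probability that x is "real"
    return 1.0 / (1.0 + np.exp(-x.mean()))

def G(z):                 # stand-in generator: maps noise z to a fake sample
    return 2.0 * z

rng = np.random.default_rng(0)
real = rng.normal(loc=3.0, size=8)
fake = G(rng.normal(size=8))

d_loss = -(np.log(D(real)) + np.log(1.0 - D(fake)))  # discriminator objective
g_loss = -np.log(D(fake))                            # non-saturating generator objective
print(d_loss, g_loss)
```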
Autoencoders
An autoencoder is a type of neural network used to learn efficient data codings in an unsupervised manner.
An autoencoder tries to learn the identity function (that is, to reconstruct its input). This entails that the autoencoder has the same number of input and output neurons.
An additional requirement is that the number of neurons in the hidden layer must be less than the number of input/output neurons. This second requirement forces the autoencoder to learn only the most important features of the input.
An autoencoder consists of
- The encoding function (encoder)
- The decoding function (decoder)
- A distance function (loss function)
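A minimal linear sketch of the three components listed above (6 input/output neurons, a 2-neuron bottleneck, mean squared error as the distance function; the weights are random, not trained):
```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 6, 2
W_enc = rng.normal(scale=0.1, size=(n_hidden, n_in))
W_dec = rng.normal(scale=0.1, size=(n_in, n_hidden))

def encoder(x):
    return np.tanh(W_enc @ x)        # compress the input into the bottleneck code

def decoder(code):
    return W_dec @ code              # reconstruct the input from the code

x = rng.normal(size=n_in)
x_hat = decoder(encoder(x))
loss = np.mean((x - x_hat) ** 2)     # distance between input and reconstruction
print(x_hat.shape, loss)
```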
TODO
ByteNet
The ByteNet is a one-dimensional convolutional neural network that is composed of two parts, one to encode the source sequence and the other to decode the target sequence.
Minimal neural network libraries
tinn (Tiny neural network) is a dependency-free neural network library written in C99 in fewer than 200 lines of code.
Geoffrey Hinton
Geoffrey Hinton is considered a pioneer in the field of Artificial Intelligence and neural networks.
Hinton co-invented the Boltzmann machine with David Ackley and Terry Sejnowski in 1985.
Hinton was a co-author of the influential 1986 paper on backpropagation (Learning internal representations by error propagation).
Activation functions
An important architecture decision is the choice of the activation function.
Activation functions are divided into
- Linear, and
- Non-linear functions (some of which are differentiable, a property needed for backpropagation)
In some literature, activation functions are also referred to as transfer functions.
Some activation functions include:
Sigmoid function | The sigmoid function (sometimes referred to as logistic function or squashing function) is mostly used in feedforward neural networks. |
hyperbolic tangent (tanh) | tanh has a gradient of 1 only when the input is 0, which makes the function produce dead neurons during computation. This limitation led to ReLU. |
SiLU (Sigmoid-Weighted Linear Units) | Can only be used in hidden layers of a deep neural network and only for reinforcement learning systems. |
dSiLU (derivative of SiLU) | |
Softmax | Used to compute a probability distribution (i.e. values between 0 and 1 that sum to 1) |
ELU | |
GELU | The preferred activation function for GPT-2 |
ReLU (Rectified Linear Units) | Although ReLU is much simpler (relu(x) = max(0, x)) than the sigmoid function, it turned out to be a much better activation function because the signals don't die out when travelling through the layers (as they do with sigmoid). ReLU was proposed by Nair and Hinton in 2010 and has since been the most widely used activation function for deep learning applications with state-of-the-art results. Because ReLU requires no computation of exponentials etc., it is also much cheaper to execute. |
LReLU (Leaky ReLU) | Introduces a small negative slope to ReLU to sustain and keep the weight updates alive during the entire propagation process |
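Sketches of several of the functions listed above (elementwise except softmax):
```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    return np.maximum(0.0, x)

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)   # small negative slope instead of 0

def silu(x):
    return x * sigmoid(x)                  # sigmoid-weighted linear unit

def softmax(x):
    e = np.exp(x - np.max(x))              # subtract the max for numerical stability
    return e / e.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))            # [0. 0. 3.]
print(softmax(x).sum())   # sums to 1
```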