The unfinished history of Artificial Intelligence

1943: Theory of nets without circles

Warren McCulloch and Walter Pitts: A logical calculus of the ideas immanent in nervous activity.
The paper mentions a theory of nets without circles:
Especially the last point in combination with nets without circles rules out(?) the possibility of «memory».
Their ideas would later be called neural networks.

1950: Turing test

In his paper Computing Machinery and Intelligence, Alan Turing ponders the question «can machines think?» and devises the Turing test.
The Turing test states that a machine can be considered intelligent if a human communicating with it cannot tell whether they are communicating with a machine or with another human.

1956: Dartmouth workshop

… every aspect of learning or any other feature of intelligence can be so precisely described that a machine can be made to simulate it …
After McCarthy persuaded the attendees to accept Artificial Intelligence as the name of a field of study, the Dartmouth workshop is widely considered to be the founding event for the discipline of Artificial Intelligence.

1957-1962: Frank Rosenblatt - The perceptron

Between 1957 and 1962, Frank Rosenblatt published quite a few articles on the perceptron.

A probabilistic model for information storage and organization in the brain (1958)

In The Perceptron: A probabilistic model for information storage and organization in the brain, Rosenblatt states: We must first have answers to three fundamental questions:
  • How is information about the physical world sensed, or detected, by the biological system?
  • In what form is information stored, or remembered?
  • How does information contained in storage, or in memory, influence recognition and behavior?
This article primarily deals with the second and third questions, «which are still subject to a vast amount of speculation».
On the perceptron, Rosenblatt writes:
[ … ] a hypothetical nervous system, or machine, called a perceptron. The perceptron is designed to illustrate some of the fundamental properties of intelligent systems in general, without becoming too deeply enmeshed in the special, and frequently unknown, conditions which hold for particular biological organisms. The analogy between the perceptron and biological systems should be readily apparent to the reader.
In this article, Rosenblatt connects Donald Hebb's theory (The organization of behavior, 1949) with his own theory:
Hebb, however, has never actually achieved a model by which behavior (or any psychological data) can be predicted from the physiological system. His physiology is more a suggestion as to the sort of organic substrate which might underlie behavior, and an attempt to show the plausibility of a bridge between biophysics and psychology.
The present theory represents the first actual completion of such a bridge.
Rosenblatt concludes his article with
In the meantime, the theory reported here clearly demonstrates the feasibility and fruitfulness of a quantitative statistical approach to the organization of cognitive systems. By the study of systems such as the perceptron, it is hoped that those fundamental laws of organization which are common to all information handling systems, machines and men included, may eventually be understood.

A perceptron is a brain model

Rosenblatt and/or the idea of the perceptron received negative feedback, which Rosenblatt attributed to the following three reasons:
  • An (admitted) lack of mathematical rigor
  • Sensational headlines, such as «Frankenstein Monster Designed by Navy Robot That Thinks»
  • Failure to comprehend the difference in motivation between the perceptron program and other engineering programs concerned with automatic pattern recognition, artificial intelligence and advanced computers.
In Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms (1962), Rosenblatt writes
The term 'perceptron', originally intended as a generic name for a variety of theoretical nerve nets, has an unfortunate tendency to suggest a specific piece of hardware …
A perceptron is first and foremost a brain model, not an invention for pattern recognition.
Further down:
Perceptrons are not intended to serve as detailed copies of any actual nervous system. They are simplified networks, designed to permit the study of lawful relationships between the organization of a nerve net, the organization of its environment, and the "psychological" performances of which the network is capable.

1959: Cognitive Systems Research Program

The work of Rosenblatt seems to be related to the Cognitive Systems Research Program (CSRP).
Rosenblatt's final summary report on this program recaps:
This program was initiated in 1959, for the purpose of studying "intelligent systems", by means of theoretical models, computer simulation, and biological experiments. While the program has been primarily supported since that time by the Office of Naval Research, some assistance has also been provided by other agencies, particularly the National Science Foundation, and (for a period of one year) the National Institutes of Health.

1966: ELIZA, the first chatbot

Joseph Weizenbaum developed ELIZA, possibly the first chatbot (then called «chatterbot»).
Conversations with ELIZA were at times so realistic that users occasionally believed they were communicating with a human being and not a program (see Turing test).
Weizenbaum presented the following excerpt of a typical conversation in ELIZA - A Computer Program For the Study of Natural Language Communication Between Man and Machine (1966):
User:  I am unhappy.
User:  I need some help, that much seems certain.
User:  Perhaps I could learn to get along with my mother.
User:  My mother takes care of me.
User:  My father.
User:  You are like my father in some ways.

1969: Perceptrons (Minsky)

Minsky and Papert published Perceptrons: an introduction to computational geometry.
The book is thought to have contributed to the decline in neural network research in the 1970s and early 1980s (AI winter).
The book was dedicated to Frank Rosenblatt (Perceptron (1957)).

1986: Backpropagation

Rumelhart, Hinton and Williams published the article Learning internal representations by error propagation, which presented backpropagation as a practical method for training neural networks.
Prior to the publication of this article, neural networks were considered a somewhat obscure area of research.
Because backpropagation is both practical and effective, interest in neural networks grew rapidly.
This article has been cited approximately 28,000 times.
TODO: Paul Werbos' dissertation (1974), which first described the process of training neural networks through backpropagation of errors.
An early form of backpropagation was also proposed in 1985 by Yann LeCun (Une procédure d'apprentissage pour réseau a seuil asymmetrique/a Learning Scheme for Asymmetric Threshold Networks).

1997: Deep Blue beats Garry Kasparov

Deep Blue was the first computer to win against a reigning world chess champion: Garry Kasparov.

2012: AlexNet

2012 became a pivotal year for computer vision with the release of AlexNet, much as 2018 was for NLP with the introduction of GPT and BERT.
The success of deep learning in computer vision drove its adoption across AI.

2017: Attention Is All You Need

The paper Attention Is All You Need introduced the transformer model, which provides a more structured memory for handling long-term dependencies in text, compared to alternatives like RNNs.

Introduction of the transformer architecture

The paper Attention Is All You Need (Vaswani et al.) introduced the transformer architecture, which in 2018 became the foundation for GPT models:
We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
And further down:
In this work we propose the Transformer, a model architecture eschewing recurrence and instead relying entirely on an attention mechanism to draw global dependencies between input and output. The Transformer allows for significantly more parallelization and can reach a new state of the art in translation quality …
And also:
To the best of our knowledge, however, the Transformer is the first transduction model relying entirely on self-attention to compute representations of its input and output without using sequence-aligned RNNs or convolution.

Relation of transformers to encoder-decoder models

The transformer model is a specific instance of the encoder-decoder models that had become popular in the years 2014-2015.

Original transformer architecture

Both the encoder and the decoder of the original transformer architecture have 6 layers (see the illustrated transformer).
The structure (but not the weights) of each layer in the decoder is identical and likewise, the structure of each layer in the encoder is identical.
The output size of the encoder is 512 (the model dimension, called d_model in the paper).

Layers of the encoder

Each layer of the encoder has two sublayers:
  • a multi-head attention layer
  • a simple feedforward network
Both sublayers have
  • a residual connection
  • a layer normalization
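
The structure described above can be sketched in plain Python. This is a minimal illustration under loud assumptions, not the paper's implementation: attention uses a single head with identity projections instead of learned multi-head projections, the feedforward network is replaced by an element-wise ReLU stand-in, layer normalization has no learned gain or bias, and all six layers share the same (weightless) functions. The helper names (layer_norm, add_and_norm, self_attention, feed_forward, encoder_layer, encoder) are made up for this sketch. What it does show is the wiring: each sublayer is wrapped as LayerNorm(x + Sublayer(x)), and the layer is stacked 6 times.

```python
import math

def layer_norm(vec, eps=1e-6):
    """Normalize one position's vector to zero mean and unit variance.
    Assumption: no learned gain/bias, unlike a real implementation."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def add_and_norm(x, sublayer):
    """Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))."""
    out = sublayer(x)
    return [layer_norm([a + b for a, b in zip(row_x, row_out)])
            for row_x, row_out in zip(x, out)]

def self_attention(x):
    """Single-head scaled dot-product self-attention.
    Assumption: queries, keys and values are the inputs themselves (no learned projections)."""
    d = len(x[0])
    result = []
    for q in x:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        m = max(scores)                      # subtract max for numerical stability
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]  # softmax over all positions
        result.append([sum(w * v[j] for w, v in zip(weights, x)) for j in range(d)])
    return result

def feed_forward(x):
    """Stand-in for the position-wise feedforward network (assumption: ReLU only, no weights)."""
    return [[max(0.0, v) for v in row] for row in x]

def encoder_layer(x):
    """One encoder layer: attention sublayer, then feedforward sublayer, each with add-and-norm."""
    x = add_and_norm(x, self_attention)
    return add_and_norm(x, feed_forward)

def encoder(x, num_layers=6):
    """Stack of identical layers (6 in the original architecture)."""
    for _ in range(num_layers):
        x = encoder_layer(x)
    return x

if __name__ == "__main__":
    # Three positions with a toy dimension of 4 (the paper uses 512).
    tokens = [[1.0, 2.0, 3.0, 4.0], [0.5, -1.0, 2.0, 0.0], [3.0, 1.0, -2.0, 1.0]]
    for row in encoder(tokens):
        print([round(v, 3) for v in row])
```

The output keeps the input shape (one vector per position), which is what allows identical layers to be stacked.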


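The decoder has a third sublayer: another multi-head attention layer, which attends over the output of the encoder stack. In addition, the self-attention sublayer of the decoder is masked to prevent positions from attending to subsequent positions.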
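
Masking that prevents attention to subsequent positions can be illustrated with a small sketch. The function name causal_attention_weights is hypothetical, and the sketch only shows how a matrix of raw attention scores is turned into weights; it assumes a square score matrix where row i holds the scores of position i against every position.

```python
import math

def causal_attention_weights(scores):
    """Softmax each row over positions 0..i only; later positions get weight 0."""
    weights = []
    for i, row in enumerate(scores):
        visible = row[: i + 1]               # position i may only see itself and earlier positions
        m = max(visible)                     # subtract max for numerical stability
        exps = [math.exp(s - m) for s in visible]
        total = sum(exps)
        masked = [0.0] * (len(row) - i - 1)  # subsequent positions are masked out
        weights.append([e / total for e in exps] + masked)
    return weights

if __name__ == "__main__":
    uniform_scores = [[0.0] * 3 for _ in range(3)]
    for row in causal_attention_weights(uniform_scores):
        print(row)
```

With uniform scores, the first position attends only to itself, the second splits its weight evenly over the first two positions, and so on.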


The transformer model became the foundation for state-of-the-art NLP models such as
  • BERT (Bidirectional Encoder Representations from Transformers)
  • GPT
  • Meena
  • XLNet
  • RoBERTa
  • Transformer-XL
  • ETC


The code that was used to train and evaluate their models is in this GitHub repository.
The model architecture is released in the tensor2tensor library.
A guide to the architecture of the transformer is The Annotated Transformer.
The architecture of the PyTorch class torch.nn.Transformer is based on this paper.
Hugging Face produced the library transformers, which supplies transformer-based architectures and pretrained models.

2018: BERT / GPT

Shortly after the release of GPT, BERT was published.
Similar to how AlexNet ushered in the era of computer vision, the combination of BERT and GPT marked the advent of a new era in NLP.

2020: GPT-3

GPT-3 showed that (extremely large) autoregressive language models can be used for few-shot predictions.

2021 October: Google introduces Pathways

Google introduces Pathways, a next-generation AI architecture.
Pathways is intended to address three shortcomings of AIs:
The Pathways architecture was used, for example, to train PaLM.

2022 April: PaLM (Pathways Language Model): Potential tipping point

PaLM is able to explain why jokes are funny, which marks a potential tipping point in the history of AI. Apparently, this model «understands» language and passes a litmus test for whether a language model «knows» what's going on.
PaLM was announced in April 2022 and became public (for a limited number of developers) in March 2023.
PaLM was trained using the Pathways architecture (which is what the P in PaLM stands for).
PaLM uses SwiGLU activation for the MLP intermediate activations.
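
As a rough illustration of what SwiGLU computes, here is a minimal sketch in plain Python. It uses the common formulation SwiGLU(x) = Swish(xW) ⊗ (xV), where ⊗ is element-wise multiplication; the function names and the tiny weight matrices are made up for the example, biases are omitted, and nothing here reflects PaLM's actual code or dimensions.

```python
import math

def swish(z):
    """Swish (SiLU) activation: z * sigmoid(z). Fine for the moderate values used here."""
    return z / (1.0 + math.exp(-z))

def matvec(x, w):
    """Multiply a row vector x (length n) by a matrix w (n rows, m columns)."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]

def swiglu(x, w_gate, w_value):
    """Gate branch passes through Swish, then multiplies the linear value branch element-wise."""
    gate = [swish(g) for g in matvec(x, w_gate)]
    value = matvec(x, w_value)
    return [g * v for g, v in zip(gate, value)]

if __name__ == "__main__":
    # Hypothetical 2x2 weights chosen only to keep the arithmetic easy to follow.
    w_gate = [[1.0, 0.0], [0.0, 1.0]]
    w_value = [[2.0, 0.0], [0.0, 3.0]]
    print(swiglu([1.0, 2.0], w_gate, w_value))
```

In a full MLP block the result would then be projected back to the model dimension by a further weight matrix, which is left out of this sketch.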

2023 March: LLaMA Leak

At the beginning of March 2023, the weights of Meta's LLaMA model were leaked.
The leaked model had no instruction or conversation tuning, and no RLHF. Nevertheless, it sparked a surge of innovation in the open source community.
Among others, it led to the development of


Jürgen Schmidhuber: The road to modern AI - Annotated History of Modern AI and Deep Learning
