Context size/length: the maximum number of preceding tokens the model takes into account when predicting the next token.
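As a minimal sketch of what this means in practice (the 4,096-token window below is an arbitrary illustrative value, not tied to any particular model):

    # Only the most recent `context_size` tokens are visible to the model
    # when it predicts the next token; anything earlier is dropped.
    def visible_context(tokens: list[int], context_size: int) -> list[int]:
        """Keep only the tokens that fit in the context window."""
        return tokens[-context_size:]

    tokens = list(range(10_000))             # a long token-ID sequence
    context = visible_context(tokens, 4096)  # illustrative window size
    print(len(context))                      # 4096 -- earlier tokens are gone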
Models
Cerebras-GPT
Cerebras-GPT is a family of seven GPT models ranging from 111 million to 13 billion parameters.
These models were trained on Cerebras CS-2 systems (the Andromeda AI supercomputer) following the compute-optimal Chinchilla formula.
Their weights and checkpoints are available on Hugging Face and GitHub under the Apache 2.0 license.
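The Chinchilla formula is the compute-optimal rule of thumb from DeepMind's Chinchilla paper: train on roughly 20 tokens per parameter. A back-of-the-envelope sketch for three of the Cerebras-GPT sizes (the token counts are estimates derived from that ratio, not official Cerebras training figures):

    # Rough compute-optimal token counts per the Chinchilla rule of thumb
    # (~20 training tokens per parameter). Parameter counts are published
    # Cerebras-GPT sizes; token counts are estimates from the ratio only.
    TOKENS_PER_PARAM = 20

    for params in (111e6, 1.3e9, 13e9):
        tokens = TOKENS_PER_PARAM * params
        print(f"{params / 1e9:6.3f}B params -> ~{tokens / 1e9:.0f}B tokens")
    # 0.111B params -> ~2B tokens
    # 1.300B params -> ~26B tokens
    # 13.000B params -> ~260B tokens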
GPT-4
GPT-4 was released without any public information about its architecture, model size, training data, training hardware, or hyperparameters.
Chinchilla (DeepMind)
Chinchilla was trained with the same compute budget as Gopher but with 70B parameters and roughly 4 times more training data.
Chinchilla uniformly and significantly outperforms Gopher (280B), GPT-3 (175B), Jurassic-1 (178B), and Megatron-Turing NLG (530B) on a wide range of downstream evaluation tasks. It also uses substantially less compute for fine-tuning and inference, greatly facilitating downstream usage.
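The "same budget" claim can be checked with the standard approximation for dense-transformer training compute, C ≈ 6·N·D, where N is the parameter count and D the number of training tokens. Using the published figures (Gopher: 280B parameters on 300B tokens; Chinchilla: 70B parameters on 1.4T tokens):

    # Approximate training compute via C ≈ 6 * N * D (N = parameters,
    # D = training tokens), a common rule of thumb for dense transformers.
    def train_flops(params: float, tokens: float) -> float:
        return 6 * params * tokens

    gopher = train_flops(280e9, 300e9)       # ~5.0e23 FLOPs
    chinchilla = train_flops(70e9, 1.4e12)   # ~5.9e23 FLOPs
    print(f"Gopher:     {gopher:.2e} FLOPs")
    print(f"Chinchilla: {chinchilla:.2e} FLOPs")
    # Comparable budgets: 4x fewer parameters traded for roughly 4x more data.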
Gemini
Gemini was built from the ground up to be multimodal, to integrate efficiently with tools and APIs, and to enable future innovations such as memory and planning.