GPT-2

The GPT-2 model was developed by researchers at OpenAI to study how the capabilities of language models scale with model size (parameter count) when combined with a very large internet-scale dataset (WebText).

Models

There are 4 model versions:

params  layers  heads  embd  ctx   vocab
124 M   12      12     768   1024  50257
355 M   24      16     1024  1024  50257
774 M   36      20     1280  1024  50257
1.5 B   48      25     1600  1024  50257
50257 = 256 individual byte tokens + 50000 merged tokens + 1 special end-of-text token, <|endoftext|>.
GPT-1 had 12 layers, 12 heads, 768 embeddings and 117M parameters.
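
The parameter counts in the first column can be roughly reproduced from the other columns. A back-of-the-envelope Python sketch (the helper name and the per-layer breakdown into 12*n_embd^2 weights plus 13*n_embd bias/layer-norm terms are my own, not from the paper):

def gpt2_params(n_layer, n_embd, n_ctx=1024, n_vocab=50257):
    # token + position embeddings (the output head is tied to the token embedding)
    embeddings = n_vocab * n_embd + n_ctx * n_embd
    # per transformer block: attention + MLP weights (12*n_embd^2),
    # plus their biases and the two layer norms (13*n_embd)
    per_layer = 12 * n_embd ** 2 + 13 * n_embd
    # plus the final layer norm (2*n_embd)
    return embeddings + n_layer * per_layer + 2 * n_embd

print(gpt2_params(12, 768))    # 124,439,808   -> "124 M"
print(gpt2_params(24, 1024))   # 354,823,168   -> "355 M"
print(gpt2_params(36, 1280))   # 774,030,080   -> "774 M"
print(gpt2_params(48, 1600))   # 1,557,611,200 -> "1.5 B" (the 1558M download)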

Download

Files

checkpoint TensorFlow checkpoint metadata (a small text file recording which model.ckpt to load)
encoder.json The vocabulary: a JSON mapping from token strings to token ids
hparams.json The model's hyper-parameters: n_vocab (number of tokens in the vocabulary), n_ctx (the maximum input sequence length), n_embd (embedding dimension, the width of the network), n_head (number of attention heads; n_embd must be divisible by n_head) and n_layer (number of transformer blocks, the depth of the network). See the example after this list.
model.ckpt.data-00000-of-00001 The model weights (TensorFlow checkpoint data shard)
model.ckpt.index Maps each variable name to its location in the data shard
model.ckpt.meta The serialized TensorFlow graph
vocab.bpe Byte-pair merges
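
A small sketch of reading hparams.json, assuming the file is a flat JSON object with exactly the five fields described above; the commented values are the 124M row of the table:

import json

# For the 124M model the values should match the table above:
# n_vocab=50257, n_ctx=1024, n_embd=768, n_head=12, n_layer=12
with open('124M/hparams.json') as f:
    hparams = json.load(f)

# sanity check: the per-head dimension must be a whole number
assert hparams['n_embd'] % hparams['n_head'] == 0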

Download script

A runnable Python version of the download loop (standard library only); note that the second model directory is 355M, not 335M:

import os
import urllib.request

base = 'https://openaipublic.blob.core.windows.net/gpt-2/models'
files = ('checkpoint', 'encoder.json', 'hparams.json',
         'model.ckpt.data-00000-of-00001', 'model.ckpt.index',
         'model.ckpt.meta', 'vocab.bpe')

for model in ('124M', '355M', '774M', '1558M'):
    os.makedirs(model, exist_ok=True)          # one sub-directory per model size
    for filename in files:
        # fetch each file into the model's directory
        urllib.request.urlretrieve(f'{base}/{model}/{filename}',
                                   os.path.join(model, filename))

Dataset: WebText

All four model sizes were trained on WebText, and the paper also reports their perplexity on a held-out WebText test set (even the 1.5 B model still underfits it). WebText is OpenAI's internal corpus of roughly 8 million documents (about 40 GB of text) scraped from outbound Reddit links with at least 3 karma, with Wikipedia pages excluded.

Misc

Activation function

GPT-2 uses the GELU activation function in the feed-forward (MLP) part of each transformer block.
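
A minimal NumPy sketch of GELU using the common tanh approximation (the openai/gpt-2 reference code uses this approximate form):

import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

print(gelu(np.array([-1.0, 0.0, 1.0])))   # approx. [-0.159, 0.0, 0.841]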

See also

Generative pre-trained transformers (GPT)

Links

The openai/gpt-2 github repository is meant as a starting point for experiments with GPT-2. It contains code and models from the paper Language Models are Unsupervised Multitask Learners.
More about GPT-2 and its staged release
There is a GPT-2 output dataset that contains text samples generated by the GPT-2 models, released to support research on detecting machine-generated text.
