GPT-2

The GPT-2 model was developed by researchers at OpenAI to study how the capabilities of language models scale with model size (parameter count) when combined with a very large internet-scale dataset (WebText).

Models

There are 4 model versions:

params  layers  heads  embd  ctx   vocab
124 M   12      12     768   1024  50257
355 M   24      16     1024  1024  50257
774 M   36      20     1280  1024  50257
1.5 B   48      25     1600  1024  50257
50257 = 256 individual byte tokens + 50000 merged tokens + 1 special end-of-text token, <|endoftext|>.
GPT-1 had 12 layers, 12 heads, 768 embeddings and 117M parameters.
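
The parameter counts in the first column can be roughly reproduced from the other columns. A back-of-the-envelope Python sketch (the helper name and the per-layer breakdown into 12*n_embd^2 weights plus 13*n_embd bias/layer-norm terms are my own, not from the paper):

def gpt2_params(n_layer, n_embd, n_ctx=1024, n_vocab=50257):
    # token + position embeddings (the output head is tied to the token embedding)
    embeddings = n_vocab * n_embd + n_ctx * n_embd
    # per transformer block: attention + MLP weights (12*n_embd^2),
    # plus their biases and the two layer norms (13*n_embd)
    per_layer = 12 * n_embd ** 2 + 13 * n_embd
    # plus the final layer norm (2*n_embd)
    return embeddings + n_layer * per_layer + 2 * n_embd

print(gpt2_params(12, 768))    # 124,439,808   -> "124 M"
print(gpt2_params(24, 1024))   # 354,823,168   -> "355 M"
print(gpt2_params(36, 1280))   # 774,030,080   -> "774 M"
print(gpt2_params(48, 1600))   # 1,557,611,200 -> "1.5 B" (the 1558M download)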

Download

Files

checkpoint TensorFlow checkpoint metadata (a small text file recording which model.ckpt to load)
encoder.json The vocabulary: a JSON mapping from token strings to token ids
hparams.json The model's hyper-parameters: n_vocab (number of tokens in the vocabulary), n_ctx (the maximum input sequence length), n_embd (embedding dimension, the width of the network), n_head (number of attention heads; n_embd must be divisible by n_head) and n_layer (number of transformer blocks, the depth of the network). See the example after this list.
model.ckpt.data-00000-of-00001 The model weights (TensorFlow checkpoint data shard)
model.ckpt.index Maps each variable name to its location in the data shard
model.ckpt.meta The serialized TensorFlow graph
vocab.bpe Byte-pair merges
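
A small sketch of reading hparams.json, assuming the file is a flat JSON object with exactly the five fields described above; the commented values are the 124M row of the table:

import json

# For the 124M model the values should match the table above:
# n_vocab=50257, n_ctx=1024, n_embd=768, n_head=12, n_layer=12
with open('124M/hparams.json') as f:
    hparams = json.load(f)

# sanity check: the per-head dimension must be a whole number
assert hparams['n_embd'] % hparams['n_head'] == 0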

Download script

A runnable Python version of the download loop (standard library only); note that the second model directory is 355M, not 335M:

import os
import urllib.request

base = 'https://openaipublic.blob.core.windows.net/gpt-2/models'
files = ('checkpoint', 'encoder.json', 'hparams.json',
         'model.ckpt.data-00000-of-00001', 'model.ckpt.index',
         'model.ckpt.meta', 'vocab.bpe')

for model in ('124M', '355M', '774M', '1558M'):
    os.makedirs(model, exist_ok=True)          # one sub-directory per model size
    for filename in files:
        # fetch each file into the model's directory
        urllib.request.urlretrieve(f'{base}/{model}/{filename}',
                                   os.path.join(model, filename))

Dataset: WebText

All four model sizes were trained on WebText, and the paper also reports their perplexity on a held-out WebText test set (even the 1.5 B model still underfits it). WebText is OpenAI's internal corpus of roughly 8 million documents (about 40 GB of text) scraped from outbound Reddit links with at least 3 karma, with Wikipedia pages excluded.

Misc

Activation function

GPT-2 uses the GELU activation function in the feed-forward (MLP) part of each transformer block.
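
A minimal NumPy sketch of GELU using the common tanh approximation (the openai/gpt-2 reference code uses this approximate form):

import numpy as np

def gelu(x):
    # tanh approximation of the Gaussian Error Linear Unit
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x ** 3)))

print(gelu(np.array([-1.0, 0.0, 1.0])))   # approx. [-0.159, 0.0, 0.841]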

See also

Generative pre-trained transformers (GPT)

Links

The openai/gpt-2 github repository is meant as a starting point for experiments with GPT-2. It contains code and models from the paper Language Models are Unsupervised Multitask Learners.
More about GPT-2 and its staged release
There is a GPT-2 output dataset that contains text samples generated by the GPT-2 models, released to support research on detecting machine-generated text.
