Language Models are Unsupervised Multitask Learners
- Paper Link
- Jay Alammar’s Blog Post
- OpenAI GitHub Code
Overview
- Decoder-only language model - there is no encoder, so the decoder block drops the encoder-decoder (cross) attention sub-layer.
- Released 4 models - Small (768-dim, 12 layers, 117M params), Medium (1024-dim, 24 layers, 345M), Large (1280-dim, 36 layers, 762M) and XL (1600-dim, 48 layers, 1542M).
- Auto-regressive - outputs one token at a time, with each new token fed back in as input (like traditional RNN-based language models); a decoding sketch follows this list.
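To make the auto-regressive loop concrete, here is a minimal greedy-decoding sketch, assuming a hypothetical `model` callable that maps a batch of token ids to next-token logits (the name and signature are illustrative, not from the OpenAI code):

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=20):
    """Greedy auto-regressive decoding: predict one token at a time and
    append it to the input before the next step. `model` is a hypothetical
    decoder-only LM returning logits of shape [batch, seq, vocab]."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                    # [batch, seq, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1)    # most likely next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
    return input_ids
```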
Architecture
GPT2 vs GPT
- Layer normalization was moved to the input of each sub-block (one before attention and one before feed-forward), and an additional layer normalization was added after the final self-attention block; see the sketch after this list.
- Modified initialization - the weights of residual layers are scaled by 1/√N at initialization, where N is the number of residual layers, to account for accumulation along the residual path.
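A rough PyTorch sketch of this pre-norm layout, assuming illustrative `attn` and `mlp` sub-modules (and recall that one more layer normalization is applied after the last block):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """GPT-2 style block: LayerNorm is applied *before* each sub-block,
    unlike GPT-1, which normalized after the residual addition."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = attn   # masked multi-head self-attention sub-block
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = mlp     # position-wise feed-forward sub-block

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # normalize *before* attention
        x = x + self.mlp(self.ln2(x))   # normalize *before* feed-forward
        return x
```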
The decoder block has 2 components -
- Masked Multi-Head Self-Attention - same as in the original Transformer paper; nothing new.
- Feed-Forward Layer - project to a higher dimension (4 * dim) and then project back to dim; the OpenAI code implements these projections as Conv1D modules, which here act as linear layers (see the sketch after this list).
Other details -
- Dataset - WebText, a new scrape of outbound links from Reddit posts with at least 3 karma; the paper avoided Common Crawl because of its quality issues.
- Input - byte-level BPE token embeddings + learned position embeddings, summed.
- Non-Linearity - GELU
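A minimal sketch of the feed-forward sub-block and the input embeddings as described above, using plain nn.Linear in place of the OpenAI code's Conv1D (functionally equivalent here); the sizes match the Small model and GPT-2's 50,257-token byte-level BPE vocabulary:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: project dim -> 4*dim, apply GELU, project back."""
    def __init__(self, dim):
        super().__init__()
        self.fc_in = nn.Linear(dim, 4 * dim)   # expand to 4x width
        self.act = nn.GELU()                   # GPT-2's non-linearity
        self.fc_out = nn.Linear(4 * dim, dim)  # project back down

    def forward(self, x):
        return self.fc_out(self.act(self.fc_in(x)))

# Input to the decoder stack: byte-level BPE token embeddings plus
# learned position embeddings, summed element-wise.
vocab_size, max_len, dim = 50257, 1024, 768    # GPT-2 Small sizes
tok_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)
ids = torch.randint(0, vocab_size, (1, 16))    # a batch of 16 token ids
x = tok_emb(ids) + pos_emb(torch.arange(16))   # [1, 16, 768]
x = FeedForward(dim)(x)                        # same shape out
```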
Results
Model | PTB Perplexity (lower is better) |
---|---|
Prev. SOTA | 46.54 |
Small | 65.85 |
Medium | 47.33 |
Large | 40.31 |
XL | 35.76 |