Language Models are Unsupervised Multitask Learners


Paper Link
Jay Alammar’s Blog Post
OpenAI GitHub Code

Overview

  1. Decoder-only language model - since there is no encoder, the decoder block has no encoder-decoder (cross) attention.
  2. Released 4 models - Small (768-dim, 12 layers), Medium (1024-dim, 24 layers), Large (1280-dim, 36 layers) and XL (1600-dim, 48 layers).
  3. Auto-regressive - outputs one token at a time, each conditioned on the previously generated tokens (like traditional RNN-based language models); see the decoding sketch below.
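
To make the auto-regressive point concrete, here is a minimal greedy-decoding sketch. It uses the Hugging Face transformers port of the Small model purely for illustration (the prompt and the 20-token budget are arbitrary choices of mine); this is not the OpenAI reference implementation linked above.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the "Small" model (768-dim, 12 layers) via the Hugging Face port.
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

input_ids = tokenizer("Language models are", return_tensors="pt").input_ids

with torch.no_grad():
    for _ in range(20):                           # generate 20 tokens greedily
        logits = model(input_ids).logits          # (batch, seq_len, vocab_size)
        next_id = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)  # feed it back in

print(tokenizer.decode(input_ids[0]))
```

Each generated token is appended to the context and the whole sequence is fed back in, which is exactly the auto-regressive property above.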

Architecture

GPT2 vs GPT

The decoder block has 2 components -

  1. Masked Multi-Head Self-Attention - same as in the Transformer paper; nothing new.
  2. Feed-Forward Layer - projects up to a higher dimension (4 * dim) and then back down to dim. The OpenAI code implements these projections with a Conv1D layer, which acts like an ordinary linear layer here; see the sketch after this list.
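
A rough PyTorch sketch of one decoder block built from these two components (not the OpenAI code; the pre-attention layer norms, the GELU activation, and the 768-dim / 12-head "Small" defaults are my assumptions for illustration):

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, dim: int = 768, n_heads: int = 12):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(dim)
        # Feed-forward: project up to 4 * dim, then back down to dim.
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim),
            nn.GELU(),
            nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        T = x.size(1)
        # Causal mask: True above the diagonal = cannot attend to future tokens.
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool, device=x.device), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                    # residual connection
        x = x + self.mlp(self.ln2(x))       # residual connection
        return x

block = DecoderBlock()
out = block(torch.randn(2, 10, 768))        # (batch, seq_len, dim)
```

The 4 * dim expansion in the MLP and the causal mask are what the two bullets above describe.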

Results

Model        PTB Perplexity (lower is better)
Prev. SOTA   46.54
Small        65.85
Medium       47.33
Large        40.31
XL           35.76
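
For reference, perplexity is just the exponential of the average per-token cross-entropy (negative log-likelihood). A minimal sketch, not the paper's evaluation script:

```python
import torch
import torch.nn.functional as F

def perplexity(logits: torch.Tensor, targets: torch.Tensor) -> float:
    # logits: (batch, seq_len, vocab_size), targets: (batch, seq_len)
    nll = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    return torch.exp(nll).item()
```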

Kaushik Rangadurai
