Language Models are Unsupervised Multitask Learners
- Paper Link
- Jay Alammar’s Blog Post
- OpenAI GitHub Code
Overview
- Decoder-only language model - there is no encoder, so the decoder block drops the encoder-decoder (cross) attention sub-layer.
- Released 4 models - Small (768-dim, 12 layers, 117M params), Medium (1024-dim, 24 layers, 345M), Large (1280-dim, 36 layers, 762M) and XL (1600-dim, 48 layers, 1542M).
- Auto-regressive - outputs one token at a time, with each new token fed back in as input (like traditional RNN-based language models); a decoding sketch follows this list.
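To make the auto-regressive loop concrete, here is a minimal greedy-decoding sketch, assuming a hypothetical `model` callable that maps a batch of token ids to next-token logits (the name and signature are illustrative, not from the OpenAI code):

```python
import torch

@torch.no_grad()
def generate(model, input_ids, max_new_tokens=20):
    """Greedy auto-regressive decoding: predict one token at a time and
    append it to the input before the next step. `model` is a hypothetical
    decoder-only LM returning logits of shape [batch, seq, vocab]."""
    for _ in range(max_new_tokens):
        logits = model(input_ids)                    # [batch, seq, vocab]
        next_id = logits[:, -1, :].argmax(dim=-1)    # most likely next token
        input_ids = torch.cat([input_ids, next_id.unsqueeze(-1)], dim=-1)
    return input_ids
```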
Architecture
GPT2 vs GPT
- Layer normalization was moved to the input of each sub-block (one before attention and one before feed-forward), and an additional layer normalization was added after the final self-attention block; see the sketch after this list.
- Modified initialization - the weights of residual layers are scaled by 1/√N at initialization, where N is the number of residual layers, to account for accumulation along the residual path.
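A rough PyTorch sketch of this pre-norm layout, assuming illustrative `attn` and `mlp` sub-modules (and recall that one more layer normalization is applied after the last block):

```python
import torch.nn as nn

class PreNormBlock(nn.Module):
    """GPT-2 style block: LayerNorm is applied *before* each sub-block,
    unlike GPT-1, which normalized after the residual addition."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.ln1 = nn.LayerNorm(dim)
        self.attn = attn   # masked multi-head self-attention sub-block
        self.ln2 = nn.LayerNorm(dim)
        self.mlp = mlp     # position-wise feed-forward sub-block

    def forward(self, x):
        x = x + self.attn(self.ln1(x))  # normalize *before* attention
        x = x + self.mlp(self.ln2(x))   # normalize *before* feed-forward
        return x
```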
The decoder block has 2 components -
- Masked Multi-Head Self-Attention - same as in the original Transformer paper; nothing new.
- Feed-Forward Layer - project to a higher dimension (4 * dim) and then project back to dim; the OpenAI code implements these projections as Conv1D modules, which here act as linear layers (see the sketch after this list).
Other details -
- Dataset - WebText, a new scrape of outbound links from Reddit posts with at least 3 karma; the paper avoided Common Crawl because of its quality issues.
- Input - byte-level BPE token embeddings + learned position embeddings, summed.
- Non-Linearity - GELU
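A minimal sketch of the feed-forward sub-block and the input embeddings as described above, using plain nn.Linear in place of the OpenAI code's Conv1D (functionally equivalent here); the sizes match the Small model and GPT-2's 50,257-token byte-level BPE vocabulary:

```python
import torch
import torch.nn as nn

class FeedForward(nn.Module):
    """Position-wise MLP: project dim -> 4*dim, apply GELU, project back."""
    def __init__(self, dim):
        super().__init__()
        self.fc_in = nn.Linear(dim, 4 * dim)   # expand to 4x width
        self.act = nn.GELU()                   # GPT-2's non-linearity
        self.fc_out = nn.Linear(4 * dim, dim)  # project back down

    def forward(self, x):
        return self.fc_out(self.act(self.fc_in(x)))

# Input to the decoder stack: byte-level BPE token embeddings plus
# learned position embeddings, summed element-wise.
vocab_size, max_len, dim = 50257, 1024, 768    # GPT-2 Small sizes
tok_emb = nn.Embedding(vocab_size, dim)
pos_emb = nn.Embedding(max_len, dim)
ids = torch.randint(0, vocab_size, (1, 16))    # a batch of 16 token ids
x = tok_emb(ids) + pos_emb(torch.arange(16))   # [1, 16, 768]
x = FeedForward(dim)(x)                        # same shape out
```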
Results
Model | PTB Perplexity (lower is better) |
---|---|
Prev. SOTA | 46.54 |
Small | 65.85 |
Medium | 47.33 |
Large | 40.31 |
XL | 35.76 |