Single Headed Attention RNN

Core Comments

Paper Link

Overview

  1. The goal is to build a simple language model that performs well while training and running on a single GPU.
  2. To that end, the paper avoids Transformer architectures and tests whether the traditional LSTM can still compete.

Architecture

SHA-RNN

Single Headed Attention
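
The model uses exactly one attention head, with the query coming from the current LSTM output and the keys/values coming from a memory of past hidden states. As a rough illustration, here is a minimal single-headed scaled dot-product attention in PyTorch. This is a simplified sketch, not the paper's exact code: the paper's version additionally gates q/k/v with learned sigmoid vectors and drops most projection matrices to save parameters, and the class and argument names below are my own.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F


class SingleHeadAttention(nn.Module):
    """One attention head over a memory of LSTM hidden states (illustrative sketch)."""

    def __init__(self, d_model: int):
        super().__init__()
        self.query = nn.Linear(d_model, d_model)  # only the query gets a full projection here
        self.scale = 1.0 / math.sqrt(d_model)

    def forward(self, hidden: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
        # hidden: (batch, q_len, d_model) -- current LSTM outputs
        # memory: (batch, m_len, d_model) -- past hidden states acting as keys and values
        q = self.query(hidden)
        scores = torch.bmm(q, memory.transpose(1, 2)) * self.scale  # (batch, q_len, m_len)
        attn = F.softmax(scores, dim=-1)
        return torch.bmm(attn, memory)  # (batch, q_len, d_model)


# usage sketch
# attn = SingleHeadAttention(512)
# out = attn(hidden_states, past_states)  # (B, T, 512) attending over (B, M, 512)
```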

Boom Layer
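
The Boom layer stands in for the Transformer feed-forward block: one matrix projects from d up to a multiple of d with a GELU, and instead of a second down-projection matrix the result is split into chunks of size d and summed, which cuts the parameter count. A minimal sketch of that idea follows; the expansion factor of 4 and the dropout placement are assumptions on my part.

```python
import torch
import torch.nn as nn


class Boom(nn.Module):
    """Boom feed-forward layer (illustrative sketch): up-project, GELU,
    then sum the chunks back down instead of using a second matrix."""

    def __init__(self, d_model: int, expansion: int = 4, dropout: float = 0.1):
        super().__init__()
        self.up = nn.Linear(d_model, d_model * expansion)
        self.act = nn.GELU()
        self.drop = nn.Dropout(dropout)
        self.expansion = expansion

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (..., d_model)
        z = self.drop(self.act(self.up(x)))            # (..., expansion * d_model)
        z = z.view(*z.shape[:-1], self.expansion, -1)  # split into `expansion` chunks of size d_model
        return z.sum(dim=-2)                           # sum chunks back to (..., d_model)


# usage sketch
# boom = Boom(512)
# y = boom(torch.randn(8, 35, 512))  # -> (8, 35, 512)
```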

Results

Model                  Bits Per Character
LSTM                   1.182
SHA-RNN                1.100
Adaptive Transformer   1.04
