Single Headed Attention RNN
- The goal is to build a simple language model that can run on a single GPU and still do well.
- To that end, the goal is to avoid Transformer architectures and see whether a traditional LSTM can still be competitive.
- The model consists of a trainable embedding layer and an LSTM layer with single-headed attention, followed by a dense softmax layer.
- The weights of the trainable embedding layer and the dense softmax layer are shared (tied).
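The forward pass described above can be sketched at the shape level. This is a minimal NumPy illustration of weight tying (the softmax head reuses the embedding matrix), not the paper's implementation; the dimensions and the identity stand-in for the LSTM/attention body are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, seq = 100, 16, 8

# Trainable embedding matrix; the softmax output layer reuses (ties) it.
E = rng.normal(size=(vocab, d_model))

tokens = rng.integers(0, vocab, size=seq)
x = E[tokens]                      # (seq, d_model) embedded input

# Stand-in for the LSTM + single-headed-attention body: identity pass here.
h = x

# Tied softmax head: project back to the vocabulary with the same weights E.
logits = h @ E.T                   # (seq, vocab)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
```

Tying halves the number of vocabulary-sized parameter matrices, which matters when the vocabulary is large relative to the rest of the model.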
Single Headed Attention
- Similar to the Transformer's multi-head attention (MHA), but with just 1 head :P
- 2 feed-forward layers with a GELU in between (project up to a bigger dimension, then back down).
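The two pieces above can be sketched in a few lines of NumPy: scaled dot-product attention with a single head, and a two-layer feed-forward block that projects up and back down. This is an illustrative sketch, not the paper's code; all weight names and dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_big = 8, 16, 64

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def single_head_attention(x, Wq, Wk, Wv):
    # One head only: no splitting into heads, no concat/output mixing.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product
    return softmax(scores) @ v

def boom(x, W1, W2):
    # Two feed-forward layers with GELU: project big, then back down.
    return gelu(x @ W1) @ W2

x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1 = rng.normal(size=(d_model, d_big))
W2 = rng.normal(size=(d_big, d_model))

out = boom(single_head_attention(x, Wq, Wk, Wv), W1, W2)  # (seq, d_model)
```

Dropping to one head removes the per-head split/concat bookkeeping of MHA, which is most of the simplification here.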
| Model | Bits Per Char |
|-------|---------------|