Single Headed Attention RNN
- The goal is to build a simple language model that can run on a single GPU and still do well.
- To that end, the goal is to avoid Transformer architectures and see whether a traditional LSTM can still be competitive.
- The model consists of a trainable embedding layer and an LSTM layer with single-headed attention, followed by a dense softmax layer.
- The weights of the trainable embedding layer and the dense softmax layer are shared (tied).
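The forward pass described above can be sketched at the shape level. This is a minimal NumPy illustration of weight tying (the softmax head reuses the embedding matrix), not the paper's implementation; the dimensions and the identity stand-in for the LSTM/attention body are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, d_model, seq = 100, 16, 8

# Trainable embedding matrix; the softmax output layer reuses (ties) it.
E = rng.normal(size=(vocab, d_model))

tokens = rng.integers(0, vocab, size=seq)
x = E[tokens]                      # (seq, d_model) embedded input

# Stand-in for the LSTM + single-headed-attention body: identity pass here.
h = x

# Tied softmax head: project back to the vocabulary with the same weights E.
logits = h @ E.T                   # (seq, vocab)
probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
probs /= probs.sum(axis=-1, keepdims=True)
```

Tying halves the number of vocabulary-sized parameter matrices, which matters when the vocabulary is large relative to the rest of the model.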
Single Headed Attention
- Similar to the Transformer's multi-head attention (MHA), but with just 1 head :P
- 2 feed-forward layers with a GELU in between (project up to a bigger dimension, then back down).
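The two pieces above can be sketched in a few lines of NumPy: scaled dot-product attention with a single head, and a two-layer feed-forward block that projects up and back down. This is an illustrative sketch, not the paper's code; all weight names and dimensions are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
seq, d_model, d_big = 8, 16, 64

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def single_head_attention(x, Wq, Wk, Wv):
    # One head only: no splitting into heads, no concat/output mixing.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])  # scaled dot-product
    return softmax(scores) @ v

def boom(x, W1, W2):
    # Two feed-forward layers with GELU: project big, then back down.
    return gelu(x @ W1) @ W2

x = rng.normal(size=(seq, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
W1 = rng.normal(size=(d_model, d_big))
W2 = rng.normal(size=(d_big, d_model))

out = boom(single_head_attention(x, Wq, Wk, Wv), W1, W2)  # (seq, d_model)
```

Dropping to one head removes the per-head split/concat bookkeeping of MHA, which is most of the simplification here.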
| Model | Bits Per Char |
|-------|---------------|