Attention Is All You Need


Paper Link: https://arxiv.org/abs/1706.03762

Overview

Architecture


Scaled Dot-Product Attention

Compute attention scores between the queries and keys (each of dimension d_k) and use them to form a weighted sum of the values (each of dimension d_v).

\[\begin{align*} Attention(Q, K, V) = softmax(\dfrac{QK^T}{\sqrt{d_k}})V \end{align*}\]
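
Below is a minimal NumPy sketch of this equation (the function and variable names are mine, not from the paper's code):

```python
import numpy as np

def softmax(x, axis=-1):
    # Shift by the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # (n_q, n_k): each query scored against each key
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # (n_q, d_v): weighted sum of the values
```

Dividing by sqrt(d_k) keeps the dot products from growing with the key dimension, which would otherwise push the softmax into regions with very small gradients.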

Multi-Head Attention

\[\begin{align*} MultiHead(Q, K, V) = concat(head_1, \ldots, head_h)W^O \end{align*}\] \[\begin{align*} head_i = Attention(QW_i^Q, KW_i^K, VW_i^V) \end{align*}\]
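
A sketch of the multi-head layer, reusing scaled_dot_product_attention from above. Passing the per-head projections as lists of matrices is my own layout choice; in the paper h = 8 and d_k = d_v = d_model / h = 64:

```python
def multi_head_attention(Q, K, V, Wq, Wk, Wv, Wo):
    # Wq[i], Wk[i]: (d_model, d_k); Wv[i]: (d_model, d_v); Wo: (h * d_v, d_model).
    heads = [scaled_dot_product_attention(Q @ Wq[i], K @ Wk[i], V @ Wv[i])
             for i in range(len(Wq))]
    return np.concatenate(heads, axis=-1) @ Wo  # concat heads, project back to d_model
```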

Attention Used In This Paper

The Transformer uses multi-head attention in three ways: encoder-decoder attention (queries come from the previous decoder layer; keys and values come from the encoder output), self-attention in the encoder, and masked self-attention in the decoder, where each position may attend only to earlier positions.

Position-wise Feed-Forward Networks

\[\begin{align*} FFN(x) = max(0, xW_1 + b_1)W_2 + b_2 \end{align*}\]
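
A direct NumPy transcription of the formula (variable names are mine); in the paper d_model = 512 and the inner dimension d_ff = 2048:

```python
def position_wise_ffn(x, W1, b1, W2, b2):
    # x: (seq_len, d_model); W1: (d_model, d_ff); W2: (d_ff, d_model).
    # The same two affine maps are applied independently at every position.
    hidden = np.maximum(0.0, x @ W1 + b1)  # ReLU: max(0, xW1 + b1)
    return hidden @ W2 + b2
```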

Positional Encoding
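
The paper adds fixed sinusoidal encodings to the input embeddings so the model can make use of token order:

\[\begin{align*} PE_{(pos, 2i)} = sin(pos/10000^{2i/d_{model}}) \end{align*}\] \[\begin{align*} PE_{(pos, 2i+1)} = cos(pos/10000^{2i/d_{model}}) \end{align*}\]

A minimal sketch (the function name and max_len parameter are mine, and it assumes an even d_model):

```python
def positional_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    two_i = np.arange(0, d_model, 2)[None, :]          # even dimension indices 2i
    angles = pos / np.power(10000.0, two_i / d_model)  # (max_len, d_model / 2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)  # sine on even dimensions
    pe[:, 1::2] = np.cos(angles)  # cosine on odd dimensions
    return pe
```

Because each dimension is a sinusoid of a different wavelength, PE(pos + k) is a linear function of PE(pos) for any fixed offset k, which the paper hypothesizes makes it easy to attend by relative position.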

Miscellaneous Things

Results

BLEU scores on the WMT 2014 English-to-German (EN-DE) and English-to-French (EN-FR) translation tasks:

Model               EN-DE  EN-FR
MoE                 26.03  40.56
Transformer (base)  27.3   38.1
Transformer (big)   28.4   41.0

References

  1. The Annotated Transformer (Harvard NLP): http://nlp.seas.harvard.edu/2018/04/03/attention.html
