Bi-directional Attention Flow for Machine Comprehension
- The motivation is that attention should flow both ways (from context to question and from question to context).
- The core idea of the paper is the bi-directional attention layer between the context (playing the role of the decoder in standard attention) and the question (the encoder).
Assume we have context hidden states $c_1, \dots, c_N \in \mathbb{R}^{2h}$ and question hidden states $q_1, \dots, q_M \in \mathbb{R}^{2h}$. We compute the similarity matrix $S \in \mathbb{R}^{N \times M}$, which contains a similarity score $S_{ij}$ for each pair $(c_i, q_j)$:

$$S_{ij} = w_{\text{sim}}^\top \, [c_i \,;\, q_j \,;\, c_i \circ q_j]$$

where $w_{\text{sim}} \in \mathbb{R}^{6h}$ is a trainable weight vector, $[\,;\,]$ denotes concatenation and $\circ$ denotes element-wise multiplication.
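As a concrete illustration, here is a minimal NumPy sketch of this similarity computation. The helper name `similarity_matrix`, the toy sizes, and the random inputs are my own assumptions, not the paper's code.

```python
import numpy as np

def similarity_matrix(C, Q, w_sim):
    """S[i, j] = w_sim . [c_i ; q_j ; c_i * q_j] for every (context, question) pair.

    C: (N, 2h) context hidden states, Q: (M, 2h) question hidden states,
    w_sim: (6h,) trainable weight vector.
    """
    N, M = C.shape[0], Q.shape[0]
    S = np.empty((N, M))
    for i in range(N):
        for j in range(M):
            feat = np.concatenate([C[i], Q[j], C[i] * Q[j]])  # shape (6h,)
            S[i, j] = feat @ w_sim
    return S

# Toy example with made-up sizes: N=4 context words, M=3 question words, 2h=6.
rng = np.random.default_rng(0)
C = rng.normal(size=(4, 6))
Q = rng.normal(size=(3, 6))
w_sim = rng.normal(size=18)
S = similarity_matrix(C, Q, w_sim)   # shape (4, 3)
```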
Context-to-Question Attention (C2Q)
We take a row-wise softmax of S to obtain attention distributions $\alpha_i = \mathrm{softmax}(S_{i,:})$, which are used to take a weighted sum of the question hidden states, yielding the C2Q attention outputs $a_i = \sum_j \alpha_{ij} \, q_j$.

This is very similar to standard attention, except that instead of a dot product we use the similarity scores in S. The intuition is that, for every word in the context, we compute its similarity to every word in the question, take a softmax over those scores, and use the result to form a weighted sum of the question hidden states. We do this for every word/token in the context.
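Under the same assumptions as the sketch above, C2Q attention is then just a row-wise softmax followed by a matrix product with the question states; `c2q_attention` is an illustrative name, not from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def c2q_attention(S, Q):
    """Context-to-question attention.

    S: (N, M) similarity matrix, Q: (M, 2h) question hidden states.
    Returns A: (N, 2h), one attended question vector a_i per context word.
    """
    alpha = softmax(S, axis=1)   # row-wise: a distribution over question words for each c_i
    return alpha @ Q             # weighted sum of question hidden states

# Continuing the toy example above:
# A = c2q_attention(S, Q)   # shape (4, 6)
```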
Question-To-Context Attention (Q2C)
Similarly, we take the maximum of each row of S, $m_i = \max_j S_{ij}$, apply a softmax over the context positions to obtain an attention distribution over the context words, and use it to take a weighted sum of the context hidden states, yielding a single Q2C attention output $c'$.
The intuition is that, for every word in the context, we find the question word it is most similar to; a softmax over these maximum scores gives a weight for every context word, and the weighted sum of the context hidden states gives $c'$.
Bi-directional Attention Flow
Finally, each context hidden state is combined with both attention outputs, $g_i = [c_i \,;\, a_i \,;\, c_i \circ a_i \,;\, c_i \circ c']$, and this blended representation is passed on to the modeling and output layers (see the sketch below).
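Here is a rough sketch of Q2C attention and the blended output under the same assumptions, reusing `softmax` and the toy arrays from the snippets above; `q2c_attention` and `bidaf_output` are names I made up for illustration.

```python
import numpy as np

def q2c_attention(S, C):
    """Question-to-context attention.

    For each context word, take its maximum similarity to any question word,
    softmax those scores over the context positions, and take a weighted sum
    of the context hidden states.  Returns a single vector c' of shape (2h,).
    """
    m = S.max(axis=1)            # (N,): best matching question word per context word
    beta = softmax(m, axis=0)    # (N,): distribution over context words (softmax from the C2Q sketch)
    return beta @ C              # (2h,)

def bidaf_output(C, A, c_prime):
    """Blend both attention outputs with the context states:
    g_i = [c_i ; a_i ; c_i * a_i ; c_i * c'], giving an (N, 8h) matrix."""
    Cp = np.broadcast_to(c_prime, C.shape)   # tile c' across context positions
    return np.concatenate([C, A, C * A, C * Cp], axis=1)

# Continuing the toy example:
# G = bidaf_output(C, A, q2c_attention(S, C))   # shape (4, 24)
```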
Results (EM / F1 on SQuAD):

| Model | EM | F1 |
|---|---|---|
| Dynamic Co-attention Networks | 66.2 | 75.9 |
| No char embedding | 65 | 74.4 |
| No C2Q Attention | 57.2 | 67.7 |
| No Q2C Attention | 63.6 | 73.7 |