Bi-directional Attention Flow for Machine Comprehension

Overview

• The motivation is that attention should flow both ways (from context to question and from question to context).

Architecture

• The core idea of the paper is the bi-directional attention layer between the context (which plays the role of the decoder in standard attention) and the question (the encoder).

Assume we have the context hidden states $c_1, \dots, c_N \in \mathbb{R}^{2h}$ and question hidden states $q_1, \dots, q_M \in \mathbb{R}^{2h}$. We compute the similarity matrix $S \in \mathbb{R}^{N \times M}$, which contains a similarity score $S_{ij}$ for each pair $(c_i, q_j)$:

$$S_{ij} = w_{sim}^\top \, [c_i \,;\, q_j \,;\, c_i \circ q_j]$$

where $w_{sim} \in \mathbb{R}^{6h}$ is a learned weight vector, $[\,;\,]$ denotes concatenation, and $\circ$ denotes element-wise multiplication.
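As a concrete illustration, here is a minimal NumPy sketch of this similarity computation with toy dimensions and random placeholder values for the hidden states and $w_{sim}$ (my own sketch, not the paper's code):

```python
import numpy as np

N, M, h = 5, 3, 4                      # toy context length, question length, hidden size
rng = np.random.default_rng(0)
c = rng.standard_normal((N, 2 * h))    # context hidden states c_1..c_N (placeholders)
q = rng.standard_normal((M, 2 * h))    # question hidden states q_1..q_M (placeholders)
w_sim = rng.standard_normal(6 * h)     # similarity weight vector

# S[i, j] = w_sim . [c_i ; q_j ; c_i * q_j]
S = np.empty((N, M))
for i in range(N):
    for j in range(M):
        feat = np.concatenate([c[i], q[j], c[i] * q[j]])  # 6h-dimensional feature
        S[i, j] = w_sim @ feat
print(S.shape)  # (5, 3)
```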

Context-to-Question Attention (C2Q)

We take a row-wise softmax of $S$ to obtain the attention distributions $\alpha^i$, which are used to take a weighted sum of the question hidden states $q_j$, yielding the C2Q attention output $a_i$.

This is very similar to standard attention, except that instead of a dot product we use the similarity scores in $S$. The intuition is that for every word in the context, we compute its similarity to every word in the question, take a softmax over these scores, and use the resulting weights to form a weighted sum of the question hidden states. We do this for every word/token in the context.
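Continuing the same toy sketch (reusing `S` and `q` from above), the C2Q step could look like this:

```python
def softmax(x, axis=-1):
    # numerically stable softmax along the given axis
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

alpha = softmax(S, axis=1)   # (N, M): row i is alpha^i, a distribution over question words
a = alpha @ q                # (N, 2h): C2Q attention outputs a_1, ..., a_N
```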

Question-To-Context Attention (Q2C)

Similarly, for each context word we take the maximum of its row of $S$, i.e. $m_i = \max_j S_{ij}$, and then take a softmax over these $N$ scores to obtain a single attention distribution $\beta$ over context positions, which is used to take a weighted sum of the context hidden states $c_i$, yielding the Q2C attention output $c'$.

Bi-directional Attention Flow

The intuition is that, for every word in the context, we find the question word most similar to it, and then take a softmax over these scores to get one weight per context word. We then take a weighted sum of the context hidden states to get $c'$.
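Continuing the toy sketch once more (reusing `S`, `c`, and the `softmax` helper above), a Q2C step along these lines:

```python
m = S.max(axis=1)            # (N,): for each context word, its best-matching question-word score
beta = softmax(m, axis=0)    # (N,): a single distribution over context positions
c_prime = beta @ c           # (2h,): Q2C attention output c'
```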

Results

| Model | EM | F1 |
| --- | --- | --- |
| Dynamic Coattention Networks | 66.2 | 75.9 |
| BiDAF | 67.7 | 77.3 |
| BiDAF, no char embedding | 65.0 | 74.4 |
| BiDAF, no C2Q attention | 57.2 | 67.7 |
| BiDAF, no Q2C attention | 63.6 | 73.7 |