# Bi-directional Attention Flow for Machine Comprehension

### Overview

- The motivation is that attention should flow both ways (from context to question and from question to context).

### Architecture

- The core idea of the paper is the bi-directional attention layer between the context and the question: attention flows in both directions rather than summarizing either sequence into a single fixed vector.

Assume we have context hidden states $c_1, \dots, c_N \in \mathbb{R}^{2h}$ and question hidden states $q_1, \dots, q_M \in \mathbb{R}^{2h}$. We compute the similarity matrix $S \in \mathbb{R}^{N \times M}$, which contains a similarity score $S_{ij}$ for each pair $(c_i, q_j)$:

$$S_{ij} = w_{\text{sim}}^{T}\,[c_i;\, q_j;\, c_i \circ q_j]$$

where $w_{\text{sim}} \in \mathbb{R}^{6h}$ is a learned weight vector, $\circ$ is element-wise multiplication, and $[;]$ denotes vector concatenation.
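A minimal NumPy sketch of the similarity computation (the function name, random states, and loop structure are illustrative, not from the paper's code):

```python
import numpy as np

def similarity_matrix(C, Q, w_sim):
    """S[i, j] = w_sim . [c_i; q_j; c_i * q_j].

    C: (N, 2h) context states, Q: (M, 2h) question states,
    w_sim: (6h,) learned weight vector (random here for illustration).
    """
    N, M = C.shape[0], Q.shape[0]
    S = np.empty((N, M))
    for i in range(N):
        for j in range(M):
            feat = np.concatenate([C[i], Q[j], C[i] * Q[j]])  # shape (6h,)
            S[i, j] = w_sim @ feat
    return S

# Toy example with random states (2h = 4, so w_sim has 6h = 12 entries)
rng = np.random.default_rng(0)
C = rng.normal(size=(5, 4))   # N = 5 context words
Q = rng.normal(size=(3, 4))   # M = 3 question words
S = similarity_matrix(C, Q, rng.normal(size=12))
print(S.shape)  # (5, 3)
```

In practice the double loop is vectorized, but the explicit form makes the per-pair feature vector easy to see.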

**Context-to-Question Attention (C2Q)**

We take a row-wise softmax of **S** to obtain attention distributions $\alpha_i$, which are used to take a weighted sum of the question hidden states, yielding the C2Q attention output $a_i$:

$$\alpha_i = \text{softmax}(S_{i,:}) \in \mathbb{R}^{M}, \qquad a_i = \sum_{j=1}^{M} \alpha_{ij}\, q_j \in \mathbb{R}^{2h}$$

This is very similar to standard attention (instead of a dot product, we use the similarity matrix **S**). The intuition is that for every word in the context, we compute its similarity to every word in the question, take a softmax over those scores, and use the resulting weights to form a weighted sum of the question hidden states. We do this for every word/token in the context.
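Continuing the sketch, C2Q attention is a row-wise softmax over the similarity matrix followed by a weighted sum of the question states (the helper names are my own):

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def c2q_attention(S, Q):
    """A[i] = sum_j alpha_ij * q_j, where alpha_i = softmax(S[i, :])."""
    alpha = softmax(S, axis=1)  # (N, M): one distribution per context word
    return alpha @ Q            # (N, 2h): one attended question vector per context word

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))   # toy similarity matrix, N = 5, M = 3
Q = rng.normal(size=(3, 4))   # question states, 2h = 4
A = c2q_attention(S, Q)
print(A.shape)  # (5, 4)
```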

**Question-To-Context Attention (Q2C)**

For Q2C, the paper takes the maximum of each row of **S** (the similarity of context word $c_i$ to its most similar question word), applies a softmax over these per-word scores to obtain a single attention distribution over the context, and takes a weighted sum of the context hidden states, yielding the Q2C attention output $c'$:

$$m_i = \max_{j} S_{ij}, \qquad \beta = \text{softmax}(m) \in \mathbb{R}^{N}, \qquad c' = \sum_{i=1}^{N} \beta_i\, c_i \in \mathbb{R}^{2h}$$

**Bi-directional Attention Flow**

The intuition is that, for every word in the context, we find its most similar question word, take a softmax over these per-word scores to get one weight per context word, and then take a weighted sum of the context hidden states to get $c'$. The C2Q outputs $a_i$ and the Q2C output $c'$ are then combined with the context states, $g_i = [c_i;\, a_i;\, c_i \circ a_i;\, c_i \circ c']$, and fed to the modeling layer.
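The max-then-softmax Q2C step can be sketched the same way (again, the helper name is hypothetical):

```python
import numpy as np

def q2c_attention(S, C):
    """c' = sum_i beta_i * c_i, where beta = softmax over max_j S[i, j]."""
    m = S.max(axis=1)        # (N,): best question-word match per context word
    z = np.exp(m - m.max())  # softmax with the usual max-subtraction for stability
    beta = z / z.sum()       # (N,): a single distribution over the context
    return beta @ C          # (2h,): one attended context vector for the whole question

rng = np.random.default_rng(0)
S = rng.normal(size=(5, 3))   # toy similarity matrix, N = 5, M = 3
C = rng.normal(size=(5, 4))   # context states, 2h = 4
c_prime = q2c_attention(S, C)
print(c_prime.shape)  # (4,)
```

Note the asymmetry: C2Q produces one vector per context word, while Q2C produces a single vector that is tiled across all context positions downstream.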

### Results

Paper | EM | F1 |
---|---|---|
Dynamic Co-attention Networks | 66.2 | 75.9 |
BiDAF | 67.7 | 77.3 |
BiDAF, no char embedding | 65.0 | 74.4 |
BiDAF, no C2Q attention | 57.2 | 67.7 |
BiDAF, no Q2C attention | 63.6 | 73.7 |

#### Kaushik Rangadurai

Code. Learn. Explore