BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
Overview
- BERT stands for Bidirectional Encoder Representations from Transformers.
- Jointly conditioning on both the left and right context in all layers.
- A pre-trained BERT model can be fine-tuned with just one additional output layer for a wide range of NLP tasks (see the classifier sketch after this list).
- Two strategies for applying pre-trained language representations:
- Feature-based (ELMo): task-specific architectures that use the pre-trained representations as additional features.
- Fine-tuning (GPT): all pre-trained parameters are fine-tuned with minimal task-specific architecture changes.
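A minimal sketch of the fine-tuning setup described above, assuming a generic `pretrained_encoder` module that returns per-token hidden states; the only task-specific addition is a single linear classification layer on top of the [CLS] vector (names here are illustrative, not from the paper's code).

```python
import torch.nn as nn

class BertClassifier(nn.Module):
    """Pre-trained encoder + one new output layer (the fine-tuning recipe)."""
    def __init__(self, pretrained_encoder, hidden_size=768, num_labels=2):
        super().__init__()
        self.encoder = pretrained_encoder                     # pre-trained BERT body (assumed interface)
        self.classifier = nn.Linear(hidden_size, num_labels)  # the single added layer

    def forward(self, input_ids, attention_mask=None):
        hidden_states = self.encoder(input_ids, attention_mask)  # (batch, seq_len, hidden)
        cls_vector = hidden_states[:, 0]                          # final hidden state of [CLS]
        return self.classifier(cls_vector)                        # task logits
```

During fine-tuning, all parameters (encoder and classifier) are updated end-to-end, which is what distinguishes this from the feature-based approach.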
Architecture

- Unified architecture across different tasks.
- Multi-layer bidirectional Transformer encoder based on Vaswani et al.
- BERT-Base = (L=12, H=768, A=12, P=110M)
- BERT-Large = (L=24, H=1024, A=16, P=340M), where L = layers, H = hidden size, A = attention heads, P = total parameters (see the config sketch below).
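The two published configurations can be reproduced, for example, with the Hugging Face `transformers` configuration API (a sketch assuming the `transformers` package is installed; parameter counts are approximate).

```python
from transformers import BertConfig, BertModel

# BERT-Base: L=12 layers, H=768 hidden size, A=12 heads (~110M parameters)
base = BertConfig(num_hidden_layers=12, hidden_size=768,
                  num_attention_heads=12, intermediate_size=3072)

# BERT-Large: L=24 layers, H=1024 hidden size, A=16 heads (~340M parameters);
# the feed-forward (intermediate) size is 4H in both configurations.
large = BertConfig(num_hidden_layers=24, hidden_size=1024,
                   num_attention_heads=16, intermediate_size=4096)

model = BertModel(base)                                # randomly initialized Base-shaped encoder
print(sum(p.numel() for p in model.parameters()))      # roughly 110M
```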
Input/Output Representation
- Represent both a single sentence and a pair of sentences as one token sequence.
- The first token of every sequence is always [CLS]; its final hidden state is used as the aggregate representation of the entire sequence.
- A [SEP] token separates the two sentences, and a learned segment embedding is added to every token to indicate whether it belongs to sentence A or sentence B.
- The input representation of each token is the sum of its token, segment, and position embeddings (a minimal sketch follows below).
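A minimal sketch of this input representation, using illustrative WordPiece ids and omitting the LayerNorm and dropout that BERT applies after the sum:

```python
import torch
import torch.nn as nn

class BertEmbeddings(nn.Module):
    """Token + segment (A/B) + learned position embeddings, summed per token."""
    def __init__(self, vocab_size=30522, hidden_size=768, max_position=512):
        super().__init__()
        self.token_emb = nn.Embedding(vocab_size, hidden_size)
        self.segment_emb = nn.Embedding(2, hidden_size)               # sentence A vs. sentence B
        self.position_emb = nn.Embedding(max_position, hidden_size)   # learned positions

    def forward(self, input_ids, segment_ids):
        positions = torch.arange(input_ids.size(1), device=input_ids.device)
        return (self.token_emb(input_ids)
                + self.segment_emb(segment_ids)
                + self.position_emb(positions))                       # broadcasts over the batch

# "[CLS] sentence-A tokens [SEP] sentence-B tokens [SEP]" with matching segment ids
input_ids = torch.tensor([[101, 7592, 102, 2088, 102]])   # hypothetical WordPiece ids
segment_ids = torch.tensor([[0, 0, 0, 1, 1]])
vectors = BertEmbeddings()(input_ids, segment_ids)        # shape (1, 5, 768)
```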

Masked LM
- Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.
- In MLM, we mask some tokens (replace with [MASK]) in a sequence and then predict those tokens.
- 15% of the token positions in each sequence are selected at random for prediction.
- Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged (see the sketch after this list).
- As in a standard LM, the final hidden vectors at the masked positions are fed into a softmax over the vocabulary to predict the original tokens.
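A toy sketch of the 80/10/10 corruption rule on word-level tokens (the helper name and vocabulary are made up for illustration; real BERT operates on WordPiece ids):

```python
import random

MASK = "[MASK]"
VOCAB = ["the", "dog", "ran", "home", "quickly"]   # toy vocabulary

def mask_tokens(tokens, mask_prob=0.15):
    """Select ~15% of positions; corrupt them 80/10/10 and record the originals as targets."""
    corrupted, targets = list(tokens), {}
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            targets[i] = tok                          # the model must predict the original token here
            roll = random.random()
            if roll < 0.8:
                corrupted[i] = MASK                   # 80%: replace with [MASK]
            elif roll < 0.9:
                corrupted[i] = random.choice(VOCAB)   # 10%: replace with a random token
            # remaining 10%: keep the original token unchanged
    return corrupted, targets

corrupted, targets = mask_tokens(["the", "dog", "ran", "home", "quickly"])
```

The MLM loss is computed only at the positions recorded in `targets`, not over the whole sequence.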
Next Sentence Prediction
- When choosing sentences A and B, 50% of the time B is the next sentence and 50% of the time it is not.
- The final hidden vector C of the [CLS] token is used to predict whether B actually follows A (a small sketch follows below).
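A small sketch of how the sentence pairs could be sampled (an assumed helper, not the paper's code; a real implementation would also avoid drawing the negative sentence from the same document):

```python
import random

def make_nsp_pair(document, index, corpus):
    """Return (sentence A, sentence B, label) with a 50/50 IsNext / NotNext split."""
    sent_a = document[index]
    if random.random() < 0.5 and index + 1 < len(document):
        return sent_a, document[index + 1], "IsNext"      # B really follows A
    random_doc = random.choice(corpus)                     # otherwise pick B at random
    return sent_a, random.choice(random_doc), "NotNext"

corpus = [["Sentence one.", "Sentence two."], ["An unrelated sentence."]]
print(make_nsp_pair(corpus[0], 0, corpus))
```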
Data
- English Wikipedia (2,500M words) and the BooksCorpus (800M words).
Results
GLUE test results reported in the paper:

| Model | MNLI | QQP | QNLI | SST-2 | CoLA | STS-B | MRPC | RTE | Average |
|---|---|---|---|---|---|---|---|---|---|
| BERT Large | 86.7/85.9 | 72.1 | 92.7 | 94.9 | 60.5 | 86.5 | 89.3 | 70.1 | 82.1 |
| OpenAI GPT | 82.1/81.4 | 70.3 | 87.4 | 91.3 | 45.4 | 80.0 | 82.3 | 56.0 | 75.1 |