BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding
- BERT stands for Bidirectional Encoder Representations from Transformers.
- It pre-trains deep bidirectional representations by jointly conditioning on both the left and right context in all layers.
- A pre-trained BERT model can be fine-tuned with just one additional output layer for a wide range of NLP tasks.
- Two strategies for applying pre-trained language representations:
- Feature-based (ELMo): Task-specific architectures that use the pre-trained embeddings as additional features.
- Fine-tuning (GPT): Simply fine-tune all the pre-trained parameters with minimal architecture change.
- Unified architecture across different tasks.
- Multi-layer bidirectional Transformer encoder based on Vaswani et al.
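- BERT-Base = (L=12, H=768, A=12, P=110M), where L is the number of layers, H the hidden size, A the number of self-attention heads, and P the total number of parameters.
- BERT-Large = (L=24, H=1024, A=16, P=340M)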
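As a rough sanity check on the parameter totals, the counts can be approximated from L and H alone. This is a minimal sketch, not the paper's accounting: the vocabulary size (30,522), the maximum sequence length (512), the feed-forward size of 4H, and the pooler layer are assumptions taken from the commonly released uncased models, and `bert_param_count` is a hypothetical helper name. The number of heads A does not change the count, because the heads split the same H-by-H projections.

```python
def bert_param_count(num_layers, hidden, vocab_size=30522,
                     max_positions=512, num_segments=2):
    """Approximate weights + biases of a BERT-style encoder (assumed layout)."""
    ffn = 4 * hidden                      # assumed feed-forward inner size
    layer_norm = 2 * hidden               # scale and bias vectors
    # Token, position and segment embedding tables plus one LayerNorm.
    embeddings = (vocab_size + max_positions + num_segments) * hidden + layer_norm
    # Query, key, value and output projections, each H x H with a bias.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: H -> 4H and 4H -> H, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + layer_norm + feed_forward + layer_norm
    pooler = hidden * hidden + hidden     # assumed dense layer on top of [CLS]
    return embeddings + num_layers * per_layer + pooler


base = bert_param_count(num_layers=12, hidden=768)
large = bert_param_count(num_layers=24, hidden=1024)
print(f"BERT-Base:  {base / 1e6:.1f}M parameters")
print(f"BERT-Large: {large / 1e6:.1f}M parameters")
```

Under these assumptions the totals land close to the rounded 110M and 340M figures quoted for the two model sizes.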
- The input can represent both a single ‘sentence’ and a ‘pair of sentences’ in one token sequence.
- The first token of every sequence is always [CLS]. Its final hidden state is used as the aggregate representation of the whole sequence for classification tasks.
- A [SEP] token separates the two sentences. In addition, a learned segment embedding is added to every token to indicate whether it belongs to sentence A or sentence B.
- The input representation of a token is the sum of its token embedding, its segment embedding, and its position embedding (see Figure 2 in the paper).
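The input construction described above can be sketched in a few lines. This is an illustrative toy, not the reference implementation: `pack_inputs` and `sum_embeddings` are hypothetical helper names, the embedding tables are small random lists rather than learned parameters, and the trailing [SEP] after the second sentence follows the paper's input figure rather than anything stated in these notes.

```python
import random


def pack_inputs(tokens_a, tokens_b=None):
    """Build the token sequence, segment ids and position ids for one input."""
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)               # sentence A (incl. [CLS], first [SEP])
    if tokens_b is not None:
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)  # sentence B and its closing [SEP]
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


def sum_embeddings(token_ids, segment_ids, position_ids,
                   tok_table, seg_table, pos_table):
    """Element-wise sum of token, segment and position embeddings per token."""
    return [
        [t + s + p for t, s, p in zip(tok_table[ti], seg_table[si], pos_table[pi])]
        for ti, si, pi in zip(token_ids, segment_ids, position_ids)
    ]


tokens, segment_ids, position_ids = pack_inputs(
    ["my", "dog", "is", "cute"], ["he", "likes", "play", "##ing"]
)

# Toy vocabulary and random embedding tables (stand-ins for learned weights).
toy_vocab = {tok: i for i, tok in enumerate(sorted(set(tokens)))}
token_ids = [toy_vocab[tok] for tok in tokens]
rng = random.Random(0)
dim = 4


def make_table(rows):
    return [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(rows)]


vectors = sum_embeddings(
    token_ids, segment_ids, position_ids,
    make_table(len(toy_vocab)), make_table(2), make_table(len(tokens)),
)
print(tokens)
print(segment_ids)
print(len(vectors), "vectors of size", len(vectors[0]))
```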
- Unfortunately, standard conditional language models can only be trained left-to-right or right-to-left, since bidirectional conditioning would allow each word to indirectly “see itself”, and the model could trivially predict the target word in a multi-layered context.
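- In the masked language model (MLM) objective, some tokens in a sequence are masked and the model is trained to predict the original tokens.
- 15% of the tokens in each sequence are selected at random for prediction.
- Of the selected tokens, 80% are replaced with [MASK], 10% are replaced with a random token, and 10% are left unchanged.
- As in a standard LM, the final hidden vectors at the selected positions are fed into a softmax over the vocabulary to predict the original tokens.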
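The 15% selection and the 80/10/10 replacement rule can be sketched as below. This is a simplified illustration under stated assumptions: `mask_tokens` is a hypothetical helper, skipping [CLS] and [SEP] and always selecting at least one token are choices made for this sketch, and the "vocabulary" is just the set of words in the example sentence.

```python
import random

SPECIAL = {"[CLS]", "[SEP]"}


def mask_tokens(tokens, vocab, rng, mask_prob=0.15):
    """Return the corrupted sequence and a {position: original token} label map."""
    # Assumption: special tokens are never selected for prediction.
    candidates = [i for i, tok in enumerate(tokens) if tok not in SPECIAL]
    num_to_predict = min(len(candidates),
                         max(1, round(len(candidates) * mask_prob)))
    chosen = sorted(rng.sample(candidates, num_to_predict))

    masked = list(tokens)
    labels = {}
    for i in chosen:
        labels[i] = tokens[i]              # the model must recover this token
        roll = rng.random()
        if roll < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif roll < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token in place
    return masked, labels


tokens = ("[CLS] the man went to the store [SEP] "
          "he bought a gallon of milk [SEP]").split()
vocab = sorted(set(tokens) - SPECIAL)
masked, labels = mask_tokens(tokens, vocab, random.Random(0))
print(masked)
print(labels)
```

Only the positions in `labels` contribute to the MLM loss; every other position is left untouched and is not predicted.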
Next Sentence Prediction
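- When choosing sentences A and B for each pre-training example, 50% of the time B is the actual sentence that follows A (labeled IsNext) and 50% of the time it is a random sentence from the corpus (labeled NotNext).
- C, the final hidden vector of the [CLS] token, is used to predict whether or not B is the next sentence.
- Pre-training data: English Wikipedia and BooksCorpus.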
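The 50/50 sampling used to build next sentence prediction examples can be sketched as follows. This is a minimal illustration, not the original data pipeline: `make_nsp_pairs` is a hypothetical helper, and drawing the random sentence from a different document (so that it cannot accidentally be the true next sentence) is an assumption of this sketch.

```python
import random


def make_nsp_pairs(documents, rng):
    """documents: list of documents, each a non-empty list of sentences."""
    if len(documents) < 2:
        raise ValueError("need at least two documents to draw negatives")
    pairs = []
    for d, doc in enumerate(documents):
        for i in range(len(doc) - 1):
            if rng.random() < 0.5:
                # Positive example: B really follows A.
                pairs.append((doc[i], doc[i + 1], "IsNext"))
            else:
                # Negative example: B comes from another document.
                other = rng.choice([k for k in range(len(documents)) if k != d])
                pairs.append((doc[i], rng.choice(documents[other]), "NotNext"))
    return pairs


documents = [
    ["the man went to the store",
     "he bought a gallon of milk",
     "then he walked home"],
    ["penguins are flightless birds",
     "they live in the southern hemisphere"],
]
pairs = make_nsp_pairs(documents, random.Random(0))
for sentence_a, sentence_b, label in pairs:
    print(f"{label:8s} A={sentence_a!r} B={sentence_b!r}")
```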