BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension
Introduction
- BART: a denoising autoencoder for pretraining sequence-to-sequence models.
- BART is trained by
  - corrupting text with an arbitrary noising function, and
  - learning a sequence-to-sequence model to reconstruct the original text (see the sketch after this list).
- Can be seen as generalizing BERT (bidirectional encoder) and GPT (left-to-right autoregressive decoder).
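
As a toy illustration of this objective, here is a minimal sketch assuming the Hugging Face `transformers` library and the `facebook/bart-base` checkpoint; the word-level `corrupt` helper below is a hypothetical stand-in for the paper's token-level noising functions, not BART's actual corruption code.

```python
import random
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

def corrupt(text, mask_frac=0.3, mask_token="<mask>"):
    """Toy noising function: replace one random contiguous span of words
    with a single mask token (a word-level stand-in for text infilling)."""
    words = text.split()
    span_len = max(1, int(len(words) * mask_frac))
    start = random.randrange(0, len(words) - span_len + 1)
    return " ".join(words[:start] + [mask_token] + words[start + span_len:])

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

original = "BART is trained by corrupting text and learning to reconstruct it."
corrupted = corrupt(original)

# The encoder sees the corrupted text; the decoder is trained to emit the original.
inputs = tokenizer(corrupted, return_tensors="pt")
labels = tokenizer(original, return_tensors="pt").input_ids
loss = model(**inputs, labels=labels).loss  # cross-entropy reconstruction loss
loss.backward()
```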
Architecture
It is implemented as a sequence-to-sequence model with a bidirectional encoder over corrupted text and a left-to-right autoregressive decoder.

Differences from BERT
- Each layer of the decoder additionally performs cross-attention over the final hidden layer of the encoder (as in the Transformer sequence-to-sequence model); see the sketch after this list.
- BERT uses an additional feed-forward network before word prediction, whereas BART does not.
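
The encoder/decoder split can be sketched with plain PyTorch modules. This is only an illustrative sketch, not the actual BART implementation; the BART-base-like sizes are assumptions.

```python
import torch
import torch.nn as nn

d_model, nhead, layers, vocab = 768, 12, 6, 50265  # BART-base-like sizes (assumed)

embed = nn.Embedding(vocab, d_model)
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=layers)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=layers)
lm_head = nn.Linear(d_model, vocab)  # direct word prediction, no extra FFN as in BERT

src = torch.randint(0, vocab, (1, 16))  # corrupted input tokens
tgt = torch.randint(0, vocab, (1, 16))  # right-shifted original tokens

memory = encoder(embed(src))  # bidirectional: no attention mask on the encoder
causal = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))
hidden = decoder(embed(tgt), memory, tgt_mask=causal)  # each layer cross-attends to the encoder output
logits = lm_head(hidden)
```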
Pre-training BART
- BART is pre-trained by corrupting documents and optimizing a reconstruction loss - the cross-entropy between the decoder's output and the original document.
- Noising transformations studied in the paper: token masking, token deletion, text infilling (spans with lengths drawn from a Poisson distribution with λ = 3 are each replaced by a single mask token; sketched below), sentence permutation, and document rotation.
- Text infilling combined with sentence permutation works best and is used for the final models.
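
A toy sketch of the text-infilling transformation; the `text_infilling` helper is hypothetical (real BART operates on subword tokens and controls the overall masking budget more carefully).

```python
import numpy as np

def text_infilling(tokens, mask_prob=0.3, lam=3.0, mask="<mask>"):
    """Toy text infilling: walk the token list and, with probability mask_prob,
    replace a Poisson(lam)-length span with a single mask token.
    A 0-length span amounts to inserting a mask between tokens."""
    out, i = [], 0
    while i < len(tokens):
        if np.random.rand() < mask_prob:
            span = int(np.random.poisson(lam))
            out.append(mask)  # the whole span becomes one mask token
            i += span
        else:
            out.append(tokens[i])
            i += 1
    return out

print(text_infilling("BART is trained by corrupting documents with a noising function".split()))
```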
Fine-tuning BART
- Sequence Classification - the same input is fed to both the encoder and the decoder, and the final hidden state of the final decoder token is fed into a linear classifier (analogous to BERT's CLS representation); see the sketch after this list.
- Token Classification - the complete document is fed to the encoder and decoder, and the decoder's top hidden state for each token is used as that token's representation for classification.
- Sequence Generation - BART is fine-tuned directly as a standard encoder-decoder: the encoder reads the input sequence and the decoder generates the output autoregressively (e.g. summarization, abstractive QA).
- Machine Translation - the entire BART model is used as a single pretrained decoder, and a new, randomly initialized encoder replaces BART's embedding layer to map foreign-language tokens into representations BART can denoise into English; the new encoder is trained first, then all parameters are fine-tuned.
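
A minimal sequence-classification sketch, assuming the Hugging Face `transformers` API; `BartForSequenceClassification` feeds the final decoder token's hidden state to a freshly initialized classification head, so this is untrained output, not a fine-tuned model.

```python
import torch
from transformers import BartTokenizer, BartForSequenceClassification

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForSequenceClassification.from_pretrained("facebook/bart-base", num_labels=2)

# The same sequence is fed to the encoder and (shifted right) to the decoder;
# the hidden state of the final decoder token goes into the classification head.
inputs = tokenizer("BART works well for classification.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(logits.shape)  # torch.Size([1, 2])
```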
Results
- BART matches RoBERTa's performance on discriminative tasks such as GLUE and SQuAD.
- It achieves new state-of-the-art results on abstractive summarization (CNN/DailyMail, XSum), dialogue, and abstractive question answering.
- With the machine-translation setup above, BART improves over a strong back-translation baseline on WMT Romanian-English.