SentencePiece
Introduction
- A language-independent subword tokenizer and detokenizer designed for neural network-based text processing.
- Moses, the de-facto standard toolkit for statistical machine translation (SMT), implements a reasonably useful pre- and postprocessor.
- SentencePiece implements two subword segmentation algorithms, BPE and the unigram language model, with the extension of direct training from raw sentences.
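Training is a single call in the Python API. A minimal sketch, where corpus.txt (one raw sentence per line), the model prefix, and the vocabulary size are placeholder choices:

```python
import sentencepiece as spm

# Train directly from raw sentences; no pre-tokenization step is needed.
# model_type may be 'unigram' (the default) or 'bpe'.
spm.SentencePieceTrainer.train(
    input='corpus.txt',         # placeholder corpus, one sentence per line
    model_prefix='sp_unigram',  # writes sp_unigram.model and sp_unigram.vocab
    vocab_size=8000,            # final vocabulary size, tuned directly
    model_type='unigram',
)
```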
Architecture
Four main components:
Normalizer
- normalizes semantically equivalent Unicode characters into canonical forms (NFKC-based rules by default).
- Also escapes whitespace to the meta symbol ▁ (U+2581).
- The goal is to remain lossless: enough information is kept to reconstruct the normalized text exactly.
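For intuition, the effect of NFKC-style normalization can be previewed with Python's standard unicodedata module; this is only an illustration of canonical forms, not SentencePiece's internal code:

```python
import unicodedata

# Full-width Latin letters and the ideographic space (U+3000)
# collapse to their canonical ASCII forms under NFKC.
print(unicodedata.normalize('NFKC', 'Ｈｅｌｌｏ　ｗｏｒｌｄ'))  # -> 'Hello world'
print(unicodedata.normalize('NFKC', '①②③'))                      # -> '123'
```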
Trainer
- trains the subword segmentation model from the normalized corpus.
- Similar to BPE, but the merge step is sped up from O(N^2) to O(N log N) by keeping merge candidates in a binary heap (priority queue); see the sketch after this list.
- Vocabulary size - the parameter to tune is the final vocab_size directly, not the number of merge operations (as in the original BPE).
- Customizable character normalization - you can provide a TSV file in which a sequence of Unicode code points in the first column is rewritten to the sequence in the second column (example below).
- Self-contained - the model file bundles the vocabulary and normalization rules; it is all you need to encode and decode.
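A toy sketch of the priority-queue idea referenced above: candidate pairs live in a max-heap (Python's min-heap with negated counts), so the best merge is popped in O(log N) and only the counts of pairs adjacent to a merge are updated, instead of rescanning all pairs each round. This illustrates the data structure only; it is not SentencePiece's actual C++ implementation:

```python
import heapq
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE over symbol tuples, e.g. [('l','o','w'), ('l','o','w','e','r')]."""
    pairs = Counter()
    for w in words:
        for a, b in zip(w, w[1:]):
            pairs[(a, b)] += 1
    # Max-heap of (negated count, pair); entries go stale when counts change.
    heap = [(-c, p) for p, c in pairs.items()]
    heapq.heapify(heap)
    merges = []
    while len(merges) < num_merges and heap:
        neg_c, best = heapq.heappop(heap)
        if pairs[best] <= 0:
            continue                                    # pair fully consumed
        if -neg_c != pairs[best]:
            heapq.heappush(heap, (-pairs[best], best))  # refresh a stale entry
            continue
        merges.append(best)
        new_words = []
        for w in words:                                 # apply the merge
            out, i = [], 0
            while i < len(w):
                if i + 1 < len(w) and (w[i], w[i + 1]) == best:
                    merged = w[i] + w[i + 1]
                    if out:                             # left neighbour pair changes
                        pairs[(out[-1], w[i])] -= 1
                        pairs[(out[-1], merged)] += 1
                        heapq.heappush(heap, (-pairs[(out[-1], merged)], (out[-1], merged)))
                    if i + 2 < len(w):                  # right neighbour pair changes
                        pairs[(w[i + 1], w[i + 2])] -= 1
                        pairs[(merged, w[i + 2])] += 1
                        heapq.heappush(heap, (-pairs[(merged, w[i + 2])], (merged, w[i + 2])))
                    out.append(merged)
                    i += 2
                else:
                    out.append(w[i])
                    i += 1
            new_words.append(tuple(out))
        words = new_words
        pairs[best] = 0                                 # no occurrences remain
    return merges
```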
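And a minimal sketch of plugging in custom normalization rules at training time, assuming a hypothetical rules.tsv in the format described above (hex code points of the source sequence, a tab, then the target sequence):

```python
import sentencepiece as spm

# rules.tsv, one rule per line, e.g.:
#   41 302 300<TAB>1EA6   (rewrites U+0041 U+0302 U+0300 to U+1EA6)
spm.SentencePieceTrainer.train(
    input='corpus.txt',
    model_prefix='sp_custom',
    vocab_size=8000,
    normalization_rule_tsv='rules.tsv',  # replaces the default NFKC-based rules
)
```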
Encoder
- executes the Normalizer to normalize the input text, then tokenizes it into a subword sequence with the subword model trained by the Trainer.
Decoder
- converts the subword sequence into the normalized text.
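A minimal usage sketch tying the Encoder and Decoder together, reusing the sp_unigram.model file trained in the earlier snippet (the pieces shown in the comments are illustrative, not guaranteed outputs):

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file='sp_unigram.model')

# Encoder: normalize, then segment into subword pieces (or their ids).
print(sp.encode('Hello world.', out_type=str))  # e.g. ['▁Hello', '▁world', '.']
print(sp.encode('Hello world.'))                # the corresponding ids

# Decoder: map the subword sequence back to the normalized text.
print(sp.decode(sp.encode('Hello world.')))     # 'Hello world.'
```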
Lossless Tokenization
- Decode(Encode(Normalize(text))) = Normalize(text).
- The input text is treated purely as a sequence of Unicode characters.
- Even whitespace is handled as a normal symbol, escaped to the visible meta character ▁ (U+2581).
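This roundtrip can be verified directly; a sketch reusing the processor from above:

```python
text = 'Hello world.'                  # already NFKC-normalized, single spaces
pieces = sp.encode(text, out_type=str)
print(pieces)                          # e.g. ['▁Hello', '▁world', '.']

# Detokenization is just concatenation plus '▁' -> ' ', so no
# language-specific rules are needed to restore the spacing.
manual = ''.join(pieces).replace('\u2581', ' ').lstrip(' ')
assert manual == text
assert sp.decode(pieces) == text       # Decode(Encode(Normalize(text))) == Normalize(text)
```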
Results
