- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
- `ByteLevelBPETokenizer` now has dropout (thanks @colinclement with #149)
- New `Strip` normalizer
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers (especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- `__len__` on `Encoding` (cf #139)
- `BertWordPieceTokenizer`
- `BPETokenizer`
- Fixed a bug in the `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained (cf #137)
- Replaced the `.new()` class methods by a proper `__new__` implementation (huge thanks to @ljos with #131)
- `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (works like `.split(delimiter)`)
- `WordLevel`: a new model that simply maps tokens to their ids
- Overflowing parts are returned as `Encoding` that are ready to be processed by a language model, just as the main `Encoding`:

```python
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
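The new `WordLevel` model is described as a simple token-to-id mapping. A minimal pure-Python sketch of that lookup behavior (the class and parameter names here are illustrative, not the library's actual API):

```python
class WordLevelSketch:
    """Toy stand-in for a WordLevel-style model: a plain token -> id lookup."""

    def __init__(self, vocab: dict, unk_token: str = "[UNK]"):
        # vocab maps token strings to integer ids; unk_token must be in vocab.
        self.vocab = vocab
        self.unk_id = vocab[unk_token]

    def token_to_id(self, token: str) -> int:
        # Unknown tokens fall back to the [UNK] id.
        return self.vocab.get(token, self.unk_id)


vocab = {"[UNK]": 0, "hello": 1, "world": 2}
model = WordLevelSketch(vocab)
print([model.token_to_id(t) for t in ["hello", "world", "foo"]])  # [1, 2, 0]
```

Tokens absent from the vocabulary map to the unknown-token id, which is the usual convention for word-level models.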
In this release, we fixed some inconsistencies between the `BPETokenizer` and the original Python version of this tokenizer. If you created your own vocabulary using this tokenizer, you will need to either train a new one, or use a modified version where you set the `PreTokenizer` back to `Whitespace` (instead of `WhitespaceSplit`).
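The distinction matters because the two pre-tokenizers produce different tokens, so a vocabulary trained with one is incompatible with the other: `WhitespaceSplit` only splits on whitespace, while `Whitespace` also separates punctuation from words. A rough pure-Python sketch of the difference (the regex is an approximation for illustration, not the library's exact rule):

```python
import re


def whitespace_split_style(text: str) -> list:
    # WhitespaceSplit-like: split on runs of whitespace only.
    return text.split()


def whitespace_style(text: str) -> list:
    # Whitespace-like: word runs and punctuation runs become separate
    # tokens (approximated here with the pattern \w+|[^\w\s]+).
    return re.findall(r"\w+|[^\w\s]+", text)


text = "Hello, world!"
print(whitespace_split_style(text))  # ['Hello,', 'world!']
print(whitespace_style(text))        # ['Hello', ',', 'world', '!']
```

With `WhitespaceSplit`, `Hello,` and `world!` would be vocabulary entries; with `Whitespace`, the punctuation is split off, so the learned merges and vocabulary differ.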
Fixes the sdist build for Python