- `BertWordPieceTokenizer` now cleans up some tokenization artifacts while decoding (cf #145)
- `ByteLevelBPETokenizer` now has dropout (thanks @colinclement with #149)
- New `Strip` normalizer
- `do_lowercase` has been changed to `lowercase` for consistency between the different tokenizers (especially `ByteLevelBPETokenizer` and `CharBPETokenizer`)
- `__len__` on `Encoding` (cf #139)
- `BertWordPieceTokenizer`
- `BPETokenizer`
- Fixed a bug in the `WordPieceTrainer` that prevented `BertWordPieceTokenizer` from being trained (cf #137)
- Replaced the `.new()` class methods by a proper `__new__` implementation (huge thanks to @ljos with #131)
- `CharDelimiterSplit`: a new `PreTokenizer` that allows splitting sequences on the given delimiter (works like `.split(delimiter)`)
- `WordLevel`: a new model that simply maps tokens to their ids
- Overflowing parts are returned as `Encoding` that are ready to be processed by a language model, just as the main `Encoding`:

```python
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))
```
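The new `WordLevel` model is described as a simple token-to-id mapping. A minimal pure-Python sketch of that lookup behavior (the class and parameter names here are illustrative, not the library's actual API):

```python
class WordLevelSketch:
    """Toy stand-in for a WordLevel-style model: a plain token -> id lookup."""

    def __init__(self, vocab: dict, unk_token: str = "[UNK]"):
        # vocab maps token strings to integer ids; unk_token must be in vocab.
        self.vocab = vocab
        self.unk_id = vocab[unk_token]

    def token_to_id(self, token: str) -> int:
        # Unknown tokens fall back to the [UNK] id.
        return self.vocab.get(token, self.unk_id)


vocab = {"[UNK]": 0, "hello": 1, "world": 2}
model = WordLevelSketch(vocab)
print([model.token_to_id(t) for t in ["hello", "world", "foo"]])  # [1, 2, 0]
```

Tokens absent from the vocabulary map to the unknown-token id, which is the usual convention for word-level models.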
In this release, we fixed some inconsistencies between the `BPETokenizer` and the original Python version of this tokenizer. If you created your own vocabulary using this tokenizer, you will need to either train a new one, or use a modified version where you set the `PreTokenizer` back to `Whitespace` (instead of `WhitespaceSplit`).
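The distinction matters because the two pre-tokenizers produce different tokens, so a vocabulary trained with one is incompatible with the other: `WhitespaceSplit` only splits on whitespace, while `Whitespace` also separates punctuation from words. A rough pure-Python sketch of the difference (the regex is an approximation for illustration, not the library's exact rule):

```python
import re


def whitespace_split_style(text: str) -> list:
    # WhitespaceSplit-like: split on runs of whitespace only.
    return text.split()


def whitespace_style(text: str) -> list:
    # Whitespace-like: word runs and punctuation runs become separate
    # tokens (approximated here with the pattern \w+|[^\w\s]+).
    return re.findall(r"\w+|[^\w\s]+", text)


text = "Hello, world!"
print(whitespace_split_style(text))  # ['Hello,', 'world!']
print(whitespace_style(text))        # ['Hello', ',', 'world', '!']
```

With `WhitespaceSplit`, `Hello,` and `world!` would be vocabulary entries; with `Whitespace`, the punctuation is split off, so the learned merges and vocabulary differ.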
Fixes the sdist build for Python