Hugging Face / Tokenizers

Feb 18, 2020
Python v0.5.0

Changes:

  • BertWordPieceTokenizer now cleans up some tokenization artifacts while decoding (cf #145)
  • ByteLevelBPETokenizer now has dropout (thanks @colinclement with #149)
  • Added a new Strip normalizer
  • do_lowercase has been renamed to lowercase for consistency across the different tokenizers (especially ByteLevelBPETokenizer and CharBPETokenizer)
  • Expose __len__ on Encoding (cf #139)
  • Improved padding performance.

Fixes:

  • #145: Decoding was buggy on BertWordPieceTokenizer.
  • #152: Some documentation and examples were still using the old BPETokenizer
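Two of the changes above lend themselves to a quick illustration. The following is a toy sketch in plain Python, not the tokenizers API: ToyTokenizer and ToyEncoding are invented names, used only to show the spirit of the renamed lowercase parameter and of exposing __len__ on the encoding result.

```python
# Illustrative sketch only -- not the tokenizers library.
class ToyEncoding:
    def __init__(self, ids, tokens):
        self.ids = ids
        self.tokens = tokens

    def __len__(self):
        # Exposing __len__ lets callers write len(encoding) directly.
        return len(self.ids)


class ToyTokenizer:
    # Note the parameter is `lowercase`, not `do_lowercase`.
    def __init__(self, vocab, lowercase=False):
        self.vocab = vocab
        self.lowercase = lowercase

    def encode(self, text):
        if self.lowercase:
            text = text.lower()
        tokens = text.split()
        ids = [self.vocab.get(t, 0) for t in tokens]
        return ToyEncoding(ids, tokens)


tok = ToyTokenizer({"hello": 1, "world": 2}, lowercase=True)
enc = tok.encode("Hello World")
print(len(enc))    # 2
print(enc.tokens)  # ['hello', 'world']
```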
Feb 11, 2020
Python v0.4.2

Fixes:

  • Fix a bug in the class WordPieceTrainer that prevented BertWordPieceTokenizer from being trained. (cf #137)
Python v0.4.1

Fixes:

  • Fix a bug related to the punctuation in BertWordPieceTokenizer (Thanks to @Mansterteddy with #134)
Feb 10, 2020
Python v0.4.0

Changes:

  • Replaced all .new() class methods with a proper __new__ implementation. (Huge thanks to @ljos with #131)
  • Improved typings
Feb 5, 2020
Python v0.3.0

Changes:

  • BPETokenizer has been renamed to CharBPETokenizer for clarity.
  • Added CharDelimiterSplit: a new PreTokenizer that allows splitting sequences on the given delimiter (Works like .split(delimiter))
  • Added WordLevel: a new model that simply maps tokens to their ids.
  • Improve truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now provide a list of overflowing Encodings that are ready to be processed by a language model, just like the main Encoding.
  • Provide a mapping back to the original string offsets:

    output = tokenizer.encode(...)
    print(output.original_str.offsets(output.offsets[3]))
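The truncation-with-overflow behavior described above can be sketched in plain Python. This is an illustrative toy, not the library's implementation: truncate_with_overflow is an invented helper, and the stride parameter (overlap between chunks) is an assumption about how such chunking is typically configured.

```python
def truncate_with_overflow(ids, max_length, stride=0):
    """Toy sketch: truncate `ids` to `max_length`, returning the cut-off
    remainder as extra chunks shaped like the main one."""
    main = ids[:max_length]
    overflowing = []
    step = max_length - stride  # with stride > 0, chunks overlap
    for start in range(step, len(ids), step):
        overflowing.append(ids[start:start + max_length])
    return main, overflowing


main, overflow = truncate_with_overflow(list(range(10)), max_length=4)
print(main)      # [0, 1, 2, 3]
print(overflow)  # [[4, 5, 6, 7], [8, 9]]
```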

Bug fixes:

  • Fix a bug with IndexableString
  • Fix a bug with truncation
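The two new building blocks in this release (WordLevel and CharDelimiterSplit) are simple enough to sketch in a few lines of plain Python. This is a toy illustration, not the library's code; ToyWordLevel and char_delimiter_split are invented names.

```python
def char_delimiter_split(text, delimiter):
    # Behaves like text.split(delimiter), as the release notes describe.
    return text.split(delimiter)


class ToyWordLevel:
    """Toy sketch of a WordLevel-style model: simply map tokens to ids."""

    def __init__(self, vocab, unk_id=0):
        self.vocab = vocab
        self.unk_id = unk_id

    def tokens_to_ids(self, tokens):
        # Unknown tokens fall back to unk_id.
        return [self.vocab.get(t, self.unk_id) for t in tokens]


model = ToyWordLevel({"a": 1, "b": 2, "c": 3})
tokens = char_delimiter_split("a|b|x|c", "|")
print(tokens)                       # ['a', 'b', 'x', 'c']
print(model.tokens_to_ids(tokens))  # [1, 2, 0, 3]
```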
Jan 22, 2020
Python v0.2.1
  • Fix a bug with the IDs associated with added tokens.
  • Fix a bug that was causing crashes in Python 3.5
Jan 20, 2020
Python v0.2.0

In this release, we fixed some inconsistencies between BPETokenizer and the original Python implementation of this tokenizer. If you created your own vocabulary with this tokenizer, you will need to either train a new one or use a modified version where you set the PreTokenizer back to Whitespace (instead of WhitespaceSplit).
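The difference between the two pre-tokenizers mentioned above can be sketched with two small functions. This is an assumption-laden illustration, not the library's code: it models WhitespaceSplit as splitting on whitespace only, and Whitespace as splitting on the regex \w+|[^\w\s]+ (the pattern the tokenizers documentation describes), which separates punctuation into its own tokens.

```python
import re


def whitespace_split(text):
    # WhitespaceSplit-style: split on whitespace only, like str.split().
    return text.split()


def whitespace(text):
    # Whitespace-style: runs of word characters, or runs of
    # non-word non-space characters (punctuation), become tokens.
    return re.findall(r"\w+|[^\w\s]+", text)


text = "Hello, world!"
print(whitespace_split(text))  # ['Hello,', 'world!']
print(whitespace(text))        # ['Hello', ',', 'world', '!']
```

The practical upshot, as the note above says, is that a vocabulary built under one splitting scheme will not line up with tokens produced under the other.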

Jan 12, 2020
Python v0.1.1
  • Fix a bug where special tokens get split while encoding
Jan 10, 2020
Python v0.1.0
Dec 27, 2019

Fixes the sdist build for Python

Latest: v0.22.2
Tracking since: Dec 3, 2019
Last checked: Apr 19, 2026