releases.shpreview
Hugging Face/Tokenizers/python-v0.3.0

python-v0.3.0

Python v0.3.0

$npx -y @buildinternet/releases show rel_7QwF6oqTFZ3o0cMrK-qW9

Changes:

  • BPETokenizer has been renamed to CharBPETokenizer for clarity.
  • Added CharDelimiterSplit: a new PreTokenizer that allows splitting sequences on the given delimiter (Works like .split(delimiter))
  • Added WordLevel: a new model that simply maps tokens to their ids.
  • Improve truncation/padding and the handling of overflowing tokens. Now when a sequence gets truncated, we provide a list of overflowing Encoding that are ready to be processed by a language model, just as the main Encoding.
  • Provide mapping to the original string offsets using:
output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))

Bug fixes:

  • Fix a bug with IndexableString
  • Fix a bug with truncation

Fetched April 7, 2026