Python v0.8.0

Highlights of this release

  • We can now encode both pre-tokenized inputs and raw strings. This is especially useful when processing datasets that are already pre-tokenized, as for NER (Named Entity Recognition), and helps when applying labels to each word.
  • Full tokenizer serialization. It is now easy to save a tokenizer to a single JSON file and load it back later with just one line of code. That's what sharing a Tokenizer means now: 1 line of code.
  • With serialization comes compatibility with Pickle! The Tokenizer, all of its components, Encodings, everything can be pickled!
  • Training a tokenizer is now even faster (up to 5-10x) than before!
  • Compatibility with multiprocessing, even when using the fork start method. Since this library makes heavy use of multithreading for very fast tokenization, this could lead to problems (deadlocks) when used with multiprocessing. This version now allows disabling the parallelism, and will warn you when this is necessary.
  • And a lot of other improvements and fixes.
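As a sketch of the new serialization and pickling support (using a toy WordLevel tokenizer and vocabulary purely for illustration; any model serializes the same way):

```python
import pickle

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Build a tiny tokenizer (WordLevel keeps the example small)
vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Full serialization: the whole pipeline round-trips through one JSON string
json_str = tokenizer.to_str()
restored = Tokenizer.from_str(json_str)
assert restored.encode("hello world").ids == [1, 2]

# Pickle now works too, for the Tokenizer and all of its components
cloned = pickle.loads(pickle.dumps(tokenizer))
assert cloned.encode("hello world").ids == [1, 2]
```

For files rather than strings, `tokenizer.save(path)` and `Tokenizer.from_file(path)` give the same one-line round trip.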

Fixed

  • [#286]: Fix various crashes when training a BPE model
  • [#309]: Fixed a few bugs related to additional vocabulary/tokens

Added

  • [#272]: Serialization of the Tokenizer and all its parts (PreTokenizer, Normalizer, ...). This adds some methods to easily save/load an entire tokenizer (from_str, from_file).
  • [#273]: Tokenizer and its parts are now picklable
  • [#289]: Ability to pad to a multiple of a specified value. This is especially useful to ensure activation of the Tensor Cores, by padding to a multiple of 8. Use, for example, enable_padding(pad_to_multiple_of=8).
  • [#298]: Ability to get the currently set truncation/padding params
  • [#311]: Ability to enable/disable the parallelism using the TOKENIZERS_PARALLELISM environment variable. This is especially useful when using multiprocessing with the fork start method, which happens to be the default on Linux systems. Without disabling the parallelism, the process deadlocks while encoding. (Cf [#187] for more information)
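A minimal sketch combining three of the additions above, with a toy WordLevel tokenizer (the vocabulary is illustrative, not from the release):

```python
import os

# Disable Rust-side parallelism before any encoding when combining the
# library with multiprocessing's fork start method (the library also
# warns you when this is necessary)
os.environ["TOKENIZERS_PARALLELISM"] = "false"

from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

vocab = {"[PAD]": 0, "[UNK]": 1, "a": 2, "b": 3, "c": 4}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Pad every encoding up to the next multiple of 8 (Tensor Core friendly)
tok.enable_padding(pad_id=0, pad_token="[PAD]", pad_to_multiple_of=8)
encoding = tok.encode("a b c")
assert len(encoding.ids) == 8  # 3 real tokens + 5 pad tokens

# The currently set padding params can now be read back
params = tok.padding
assert params["pad_to_multiple_of"] == 8
```

The truncation counterpart is read back the same way, via `tok.truncation`.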

Changed

  • Improved errors generated during truncation: cases where the provided max length is too low are now handled properly.
  • [#249] encode and encode_batch now accept pre-tokenized inputs. When the input is pre-tokenized, the argument is_pretokenized=True must be specified.
  • [#276]: Improved BPE training speed by reading files sequentially, but parallelizing the processing of each file
  • [#280]: Use onig for byte-level pre-tokenization to remove all differences from the original GPT-2 implementation
  • [#309]: Improved the management of the additional vocabulary. This introduces an option normalized, controlling whether a token should be extracted from the normalized version of the input text.
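To illustrate the pre-tokenized input ([#249]) and additional-vocabulary ([#309]) changes, here is a minimal sketch; the toy WordLevel tokenizer and the <ent> token are illustrative assumptions, not from the release:

```python
from tokenizers import AddedToken, Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

vocab = {"[UNK]": 0, "hello": 1, "world": 2}
tok = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# Pre-tokenized input: one string per word (e.g. from a NER dataset),
# flagged with is_pretokenized=True
encoding = tok.encode(["hello", "world"], is_pretokenized=True)
assert encoding.tokens == ["hello", "world"]

# Additional vocabulary: normalized=False matches the added token
# against the raw (non-normalized) input text
tok.add_tokens([AddedToken("<ent>", normalized=False)])
assert tok.encode("hello <ent>").tokens == ["hello", "<ent>"]
```

Because the added token is extracted before pre-tokenization, <ent> survives intact even though the Whitespace pre-tokenizer would otherwise split it apart.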
