Rust v0.9.0
- encode and encode_batch now take a new argument, specifying whether the special tokens should be added.
- NormalizedString has been removed from the Encoding. It can now be retrieved by calling normalize on the Tokenizer. This reduces the memory footprint by 70%.
- The NormalizedString API has been improved. It is now possible to retrieve parts of both strings using either "normalized" or "original" offsets.
- The offsets on Encoding are now relative to the original string, and no longer to the normalized one.
- AddedToken is now used for both add_special_tokens and add_tokens. AddedToken also has more options to allow various behaviors.
- impl PostProcessor for ByteLevel: handles trimming the offsets if activated. This avoids the unintuitive inclusion of whitespace in the produced offsets, even when that whitespace is part of the actual token.
- More alignment mappings on Encoding.
- post_process can be called on the Tokenizer.
- Fixed offsets being wrong with the ByteLevel BPE when add_prefix_space is activated.
- Add the ByteLevel PostProcessor to your byte-level BPE tokenizers if relevant.
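
To make the offset trimming described above concrete, here is a minimal, standard-library-only sketch. It is not the crate's implementation: trim_offsets is a hypothetical helper, and it assumes offsets are byte positions that fall on character boundaries.

```rust
/// Hypothetical helper (not part of the tokenizers crate): shrink a
/// (start, end) byte span so that it no longer covers the leading or
/// trailing whitespace of the token it points at.
/// Assumes `span` lies on char boundaries of `text`; slicing panics otherwise.
fn trim_offsets(text: &str, span: (usize, usize)) -> (usize, usize) {
    let (start, end) = span;
    let slice = &text[start..end];

    // Number of whitespace bytes at each end of the slice.
    let leading = slice.len() - slice.trim_start().len();
    let trailing = slice.len() - slice.trim_end().len();

    // A span made only of whitespace collapses to an empty span at `start`.
    if leading == slice.len() {
        return (start, start);
    }
    (start + leading, end - trailing)
}
```

With this sketch, a byte-level token covering " world" in "Hello world" has the span (5, 11) before trimming and (6, 11) after, so the reported offsets point at "world" only.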
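
The change to original-relative offsets can be illustrated with a small alignment table. The sketch below is an assumption-laden illustration, not the NormalizedString API: Normalized, normalize, and to_original are hypothetical names, and the normalizer shown (drop leading whitespace, ASCII-lowercase) is chosen only because it is simple.

```rust
/// Hypothetical stand-in for a normalized string that remembers where each
/// normalized char came from in the original text.
struct Normalized {
    normalized: String,
    /// For the i-th char of `normalized`, the byte range it occupied in the original.
    alignments: Vec<(usize, usize)>,
}

/// Toy normalizer (an assumption for illustration): drops leading whitespace
/// and ASCII-lowercases the rest, recording one alignment per kept char.
fn normalize(original: &str) -> Normalized {
    let skipped = original.len() - original.trim_start().len();
    let mut normalized = String::new();
    let mut alignments = Vec::new();
    for (idx, ch) in original.char_indices() {
        if idx < skipped {
            continue; // leading whitespace is removed by the normalizer
        }
        normalized.push(ch.to_ascii_lowercase());
        alignments.push((idx, idx + ch.len_utf8()));
    }
    Normalized { normalized, alignments }
}

/// Convert a char range [start, end) over the normalized string into a byte
/// range over the original string. Returns None for empty or out-of-range input.
fn to_original(n: &Normalized, start: usize, end: usize) -> Option<(usize, usize)> {
    if start >= end || end > n.alignments.len() {
        return None;
    }
    Some((n.alignments[start].0, n.alignments[end - 1].1))
}
```

Keeping the alignment table next to the normalized text is what lets a caller report positions in the text the user actually supplied, whatever the normalizer removed or rewrote.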