# Python v0.7.0
## Changed

- `encode` and `encode_batch` now take a new optional argument, specifying whether we
  should add the special tokens. This is activated by default.
- `original_str` and `normalized_str` have been removed from the `Encoding` returned by
  `encode` and `encode_batch`. This brings a reduction of 70% of the memory footprint.
- The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore.
- The tokens given to `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using
  `train(special_tokens=...)`, can now be instances of `AddedToken` to provide more control over these
  tokens.
- `Model.from_files` and `Model.empty` are removed in favor of using
  constructors.
- `CharBPETokenizer` now corresponds to the OpenAI GPT BPE implementation by default.

## Added

- `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated.
  This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these
  whitespaces are part of the actual token.
  It has been added to `ByteLevelBPETokenizer`, but it is off by default (`trim_offsets=False`).
- `RobertaProcessing` also handles trimming the offsets.
- New alignment mappings on the `Encoding`. Provide methods to easily convert between char
  or word (input space) and token (output space).
- `post_process` can be called on the `Tokenizer`.
- Ability to retrieve the vocabulary from the `Tokenizer` with
  `get_vocab(with_added_tokens: bool)`.

## Fixed

- Fix some issues with the offsets being wrong with the `ByteLevel` BPE:
  - when `add_prefix_space=True`
- Trim the decoded string in the `BPEDecoder` used by `CharBPETokenizer`.

## How to migrate

- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
  using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- The `BertWordPieceTokenizer` option `add_special_tokens` must now be given to `encode` or
  `encode_batch`.
- Access to the `original_str` on the `Encoding` has been removed. The original string is the input
  of `encode`, so it didn't make sense to keep it here.
- No need to call `original_str.offsets(offsets[N])` to convert offsets to the original string. They
  are now relative to the original string by default.
- Access to the `normalized_str` on the `Encoding` has been removed. It can be retrieved by calling
  `normalize(sequence)` on the `Tokenizer`.
- Change `Model.from_files` and `Model.empty` to use the constructor. The model constructor should take
  the same arguments as the old methods (i.e. `BPE(vocab, merges)` or `BPE()`).
- If you were using `CharBPETokenizer` and want to keep the same behavior as before, set
  `bert_normalizer=False` and `split_on_whitespace_only=True`.
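The new `add_special_tokens` argument on `encode`/`encode_batch` can be sketched as follows. This uses a recent `tokenizers` release with an illustrative word-level vocab and template post-processor; the token names and ids are made up for the example, not part of v0.7.0 itself:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# Tiny illustrative vocab; ids are arbitrary
vocab = {"[UNK]": 0, "[CLS]": 1, "[SEP]": 2, "hello": 3, "world": 4}
tokenizer = Tokenizer(WordLevel(vocab, unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()
# The post-processor is what inserts the special tokens
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    special_tokens=[("[CLS]", 1), ("[SEP]", 2)],
)

with_special = tokenizer.encode("hello world")  # add_special_tokens=True by default
without = tokenizer.encode("hello world", add_special_tokens=False)
print(with_special.tokens)  # ['[CLS]', 'hello', 'world', '[SEP]']
print(without.tokens)       # ['hello', 'world']
```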
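Because offsets are now relative to the original string, recovering a token's surface form is plain slicing, with no `original_str.offsets(...)` conversion step. A minimal sketch, using hypothetical offsets as `encode` might produce them:

```python
# Offsets returned by encode() are now indices into the ORIGINAL string,
# so each token's surface form can be recovered by direct slicing.
original = "Héllo  there"
offsets = [(0, 5), (7, 12)]  # hypothetical (start, end) pairs, one per token
surface_forms = [original[start:end] for start, end in offsets]
print(surface_forms)  # ['Héllo', 'there']
```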
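A sketch of passing `AddedToken` instances instead of plain strings, together with the new `get_vocab(with_added_tokens: bool)`. The model, tokens, and flag values are illustrative, and this targets a recent `tokenizers` release:

```python
from tokenizers import Tokenizer, AddedToken
from tokenizers.models import WordLevel

tokenizer = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1}, unk_token="[UNK]"))
# AddedToken exposes per-token controls a plain string cannot express
tokenizer.add_tokens([AddedToken("<ent>", single_word=True, lstrip=True)])
tokenizer.add_special_tokens([AddedToken("[MASK]", lstrip=True)])

full_vocab = tokenizer.get_vocab(with_added_tokens=True)
base_vocab = tokenizer.get_vocab(with_added_tokens=False)
print(sorted(set(full_vocab) - set(base_vocab)))  # the added tokens only
```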
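Wiring the `ByteLevel` post-processor onto a byte-level BPE tokenizer, as the migration note suggests, might look like the following sketch (recent `tokenizers` release; the empty `BPE` model is a placeholder for a trained one):

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel as ByteLevelPreTokenizer
from tokenizers.processors import ByteLevel as ByteLevelProcessor

tokenizer = Tokenizer(BPE())  # placeholder: use your trained BPE model here
tokenizer.pre_tokenizer = ByteLevelPreTokenizer(add_prefix_space=True)
# Opt in to offset trimming, so the leading whitespace that is part of a
# byte-level token is excluded from the reported offsets:
tokenizer.post_processor = ByteLevelProcessor(trim_offsets=True)
```

With `ByteLevelBPETokenizer` the equivalent switch is its `trim_offsets` constructor argument, which defaults to `False`.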
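The `Model.from_files`/`Model.empty` migration can be sketched as below; the file names are placeholders, and only the no-argument constructor is actually exercised here:

```python
from tokenizers.models import BPE

# Before v0.7.0 (old static methods):
#   model = BPE.from_files("vocab.json", "merges.txt")
#   empty = BPE.empty()
# From v0.7.0 on, the constructor takes the same arguments:
#   model = BPE("vocab.json", "merges.txt")  # paths are placeholders
empty = BPE()  # replaces BPE.empty()
```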