Fixed:
- A `Precompiled` normalizer corner case.
- `continuing_subword_prefix` handling.
- `Metaspace` serialization problems.
- An issue with `PyNormalizedStringRefMut`.
- `ByteLevel` instantiation from a previously saved state (using `__getstate__()`).

Added:
- A `WordLevelTrainer` used to train a `WordLevel` model.
- Ability to train from memory, which also improves the integration with `datasets`.
- A `fuse_unk` option to `SentencePieceBPETokenizer`.

Changed:
- Automatic stubbing of the `.pyi` files.
- Each `Model` can return its associated `Trainer` with `get_trainer()`.
- The various attributes of each component can now be get and set (e.g. `tokenizer.model.dropout = 0.1`).

Fixed:
- The `Model` is now trained in-place. This fixes several bugs that were forcing a reload of the `Model` after training.
- The `BaseTokenizer` `enable_truncation` docstring.
- `from_file` on `BertWordPieceTokenizer`.
- An issue with `sentencepiece_model_pb2.py`.
- Training now supports `initial_alphabet` and handles `special_tokens` correctly.
- A crash when calling `.train` with some non-existent files.
- `encode`/`encode_batch` when used with numpy arrays.
- The `TemplateProcessing` PostProcessor.
- A bug with `AddedToken`, where the content was not restored properly.
- The default behavior when `strip_accents` is not specified.

Highlights:
- Pickle! The `Tokenizer`, all of its components,
  `Encodings`, everything can be pickled!
- It is now possible to use `multiprocessing`, even when using the `fork` start method. Since this library
  makes heavy use of the multithreading capabilities of our computers to allow very fast tokenization,
  this led to problems (deadlocks) when used with `multiprocessing`. This version now allows disabling
  the parallelism, and will warn you if this is necessary.
- Serialization of the `Tokenizer` and all its parts (`PreTokenizer`, `Normalizer`, ...).
  This adds some methods to easily save/load an entire tokenizer (`from_str`, `from_file`).

Added:
- The `Tokenizer` and its parts are now picklable.
- Ability to pad to a multiple of a specified value, with `enable_padding(pad_to_multiple_of=8)` for example.
- Ability to disable the parallelism using the `TOKENIZERS_PARALLELISM` environment
  variable. This is especially useful when using `multiprocessing` capabilities, with the `fork`
  start method, which happens to be the default on Linux systems. Without disabling the parallelism,
  the process deadlocks while encoding. (Cf [#187] for more information.)
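The parallelism switch and the pickle support work together when dispatching work to processes. A minimal sketch, assuming nothing beyond the standard library: `TOKENIZERS_PARALLELISM` is the library's actual environment variable, while `ToyTokenizer` is a stand-in class invented for this example.

```python
import os
import pickle

# TOKENIZERS_PARALLELISM is read by the tokenizers library at runtime;
# "false" disables the Rust-side thread pool so that fork-based
# multiprocessing does not deadlock. Setting it before workers are
# spawned is the important part.
os.environ["TOKENIZERS_PARALLELISM"] = "false"

# Toy stand-in for a Tokenizer: once a class is picklable, instances can
# be shipped to worker processes instead of being rebuilt in each one.
class ToyTokenizer:
    def __init__(self, vocab):
        self.vocab = vocab

    def encode(self, text):
        return [self.vocab.get(word, 0) for word in text.split()]

tok = ToyTokenizer({"hello": 1, "world": 2})
restored = pickle.loads(pickle.dumps(tok))  # full round-trip
print(restored.encode("hello world"))  # [1, 2]
```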
- `encode` and `encode_batch` now accept pre-tokenized inputs. When the input is pre-tokenized,
  the argument `is_pretokenized=True` must be specified.
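What this means for callers can be sketched in plain Python; `is_pretokenized` is the real argument name, but the `encode` function below is a toy, not the library's implementation.

```python
# Toy sketch of the pre-tokenized input path: with is_pretokenized=True
# the caller has already split the text into words, so the whitespace
# splitting step is skipped and the words are looked up directly.
def encode(inputs, vocab, is_pretokenized=False):
    words = inputs if is_pretokenized else inputs.split()
    return [vocab.get(word, 0) for word in words]

vocab = {"New": 7, "York": 8}
# Same ids whether we pass the raw string or our own word split:
assert encode("New York", vocab) == [7, 8]
assert encode(["New", "York"], vocab, is_pretokenized=True) == [7, 8]
```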
- Use `onig` for byte-level pre-tokenization to remove all the differences with the original
  implementation from GPT-2.
- New `normalized` option on `AddedToken`, controlling whether a token should be extracted from the
  normalized version of the input text.
- `encode` and `encode_batch` now take a new optional argument, specifying whether we
  should add the special tokens. This is activated by default.

Changed:
- `original_str` and `normalized_str` have been removed from the `Encoding` returned by
  `encode` and `encode_batch`. This brings a 70% reduction in the memory footprint.
- The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore.
- Tokens provided via `add_special_tokens` or `add_tokens` on a `Tokenizer`, or while using
  `train(special_tokens=...)`, can now be instances of `AddedToken` to provide more control over
  these tokens.
- `Model.from_files` and `Model.empty` are removed in favor of using constructors.
- `CharBPETokenizer` now corresponds to the OpenAI GPT BPE implementation by default.
- `ByteLevel` is also a `PostProcessor` now and handles trimming the offsets if activated.
  This avoids the unintuitive inclusion of the whitespace in the produced offsets, even when this
  whitespace is part of the actual token.
  It has been added to `ByteLevelBPETokenizer` but is off by default (`trim_offsets=False`).
- `RobertaProcessing` also handles trimming the offsets.
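The offset-trimming behavior can be illustrated with a toy helper. Here `trim_offsets` is the real option name, but `trim_offset` below is invented for the sketch and is not the library's implementation.

```python
# Toy illustration of what trim_offsets=True does conceptually:
# byte-level tokens carry their leading space, so the raw offset span
# includes the whitespace; trimming moves the start past it.
def trim_offset(text, start, end):
    while start < end and text[start].isspace():
        start += 1
    return (start, end)

text = "Hello world"
raw = (5, 11)                      # span of " world", whitespace included
assert text[slice(*raw)] == " world"
assert trim_offset(text, *raw) == (6, 11)  # now points at "world" only
```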
- New alignment mappings on the `Encoding`, providing methods to easily convert between char
  or word (input space) and token (output space).
- `post_process` can be called on the `Tokenizer`.
- The vocabulary of the `Tokenizer` can be retrieved with `get_vocab(with_added_tokens: bool)`.

Fixed:
- Offsets produced by the `ByteLevel` BPE with `add_prefix_space=True`.
- The `BPEDecoder` used by `CharBPETokenizer`.
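The alignment mappings above (char or word to token and back) can be sketched in plain Python. `char_to_token` matches the spirit of the real `Encoding` helpers, but this lookup is a toy reimplementation over explicit offsets.

```python
# Toy version of an Encoding alignment helper (char_to_token-style
# lookup): given per-token character offsets, map a character position
# in the input back to the index of the token that produced it.
def char_to_token(offsets, char_pos):
    for token_index, (start, end) in enumerate(offsets):
        if start <= char_pos < end:
            return token_index
    return None  # position not covered by any token (e.g. whitespace)

offsets = [(0, 5), (6, 11)]        # "Hello world" split into two tokens
assert char_to_token(offsets, 2) == 0     # inside "Hello"
assert char_to_token(offsets, 8) == 1     # inside "world"
assert char_to_token(offsets, 5) is None  # the space belongs to no token
```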
How to migrate:
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant. If you are
  using `ByteLevelBPETokenizer`, this option is disabled by default (`trim_offsets=False`).
- The `BertWordPieceTokenizer` option `add_special_tokens` must now be given to `encode` or
  `encode_batch`.
- `original_str` on the `Encoding` has been removed. The original string is the input
  of `encode`, so it didn't make sense to keep it here.
- If you were using `original_str.offsets(offsets[N])` to convert offsets to the original string:
  offsets are now relative to the original string by default.
- `normalized_str` on the `Encoding` has been removed. It can be retrieved by calling
  `normalize(sequence)` on the `Tokenizer`.
- Change any use of `Model.from_files` and `Model.empty` to the constructor, which takes
  the same arguments as the old methods (i.e. `BPE(vocab, merges)` or `BPE()`).
- If you were using `CharBPETokenizer` and want to keep the same behavior as before, set
  `bert_normalizer=False` and `split_on_whitespace_only=True`.

Changed:
- `Tokenizer` & `Model` are now `Send + Sync`.
- `BPEDecoder` fixes.
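Since the normalized string is no longer stored on each `Encoding`, it can simply be recomputed on demand. A toy stand-in for that recomputation, with BERT-style lowercasing plus accent stripping chosen arbitrarily for the example; `normalize` here is plain Python, not the library call.

```python
import unicodedata

# Toy normalizer: rather than keeping normalized_str on every Encoding,
# recompute it when needed, mirroring Tokenizer.normalize(sequence).
def normalize(sequence):
    stripped = "".join(
        ch for ch in unicodedata.normalize("NFD", sequence)
        if unicodedata.category(ch) != "Mn"  # drop combining accents
    )
    return stripped.lower()

assert normalize("Héllo Wörld") == "hello world"
```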
- `encode` and `encode_batch` now take a new argument, specifying whether we should add the
  special tokens.
- The `NormalizedString` has been removed from the `Encoding`. It is now possible to
  retrieve it by calling `normalize` on the `Tokenizer`. This brings a 70% reduction in the memory
  footprint.
- The `NormalizedString` API has been improved. It is now possible to retrieve parts of both
  strings using both "normalized" and "original" offsets.
- The offsets provided on `Encoding` are now relative to the original string, and not the
  normalized one anymore.
- `AddedToken` are now used for both `add_special_tokens` and `add_tokens`. Also, these `AddedToken`
  have more options to allow various behaviors.
- `impl PostProcessor for ByteLevel`: handles trimming the offsets if activated. This avoids
  the unintuitive inclusion of the whitespace in the produced offsets, even when this whitespace is
  part of the actual token.
- New alignment mappings on the `Encoding`.
- `post_process` can be called on the `Tokenizer`.

Fixed:
- Offsets produced by the `ByteLevel` BPE when `add_prefix_space` is activated.

How to migrate:
- Add the `ByteLevel` `PostProcessor` to your byte-level BPE tokenizers if relevant.
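One of the extra `AddedToken` behaviors mentioned above can be sketched as follows. `single_word` is a real option name, but the regex matcher below is a simplified illustration, not the library's implementation.

```python
import re

# Toy illustration of the single_word=True behavior of an added token:
# the token only matches when it is not glued inside a larger word.
def find_token(text, content, single_word=False):
    pattern = re.escape(content)
    if single_word:
        # forbid word characters immediately before/after the match
        pattern = r"(?<!\w)" + pattern + r"(?!\w)"
    return [m.span() for m in re.finditer(pattern, text)]

text = "tokenizer token"
assert find_token(text, "token") == [(0, 5), (10, 15)]
assert find_token(text, "token", single_word=True) == [(10, 15)]
```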
Fixed:
- The `vocab.txt` file was named `vocab.json`. This is now fixed.
- The `WordLevel` model was also saving its vocabulary in the wrong format.

Changed:
- The `name` argument is now optional when saving a `Model`'s vocabulary. When the name is not
  specified, the files get a more generic naming, like `vocab.json` or `merges.txt`.
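A sketch of the naming rule described above; the `vocab_filenames` helper and the `name-` prefix scheme are illustrative assumptions for the example, not the library's exact code.

```python
# Toy illustration of the save-naming rule: with a name, saved files get
# a prefix (e.g. name-vocab.json); without one, generic names are used.
def vocab_filenames(name=None):
    files = ["vocab.json", "merges.txt"]
    if name is None:
        return files
    return [f"{name}-{f}" for f in files]

assert vocab_filenames() == ["vocab.json", "merges.txt"]
assert vocab_filenames("my-bpe") == ["my-bpe-vocab.json", "my-bpe-merges.txt"]
```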