Decoder is now a composable trait, but without being backward incompatible
Processor is now a composable trait, but without being backward incompatible
unstable_wasm feature to support building on Wasm (it's unstable!)
Both trait changes warrant a "major" number since, despite best efforts to not break backward compatibility, the code is different enough that we cannot be exactly sure.
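The "composable trait" idea can be pictured with a small sketch. This is plain Python and purely illustrative, not the actual tokenizers API: the names Decoder, StripPrefix, and Sequence here are assumptions made for the example. The point is that each decoder transforms a list of token strings, so decoders can be chained, while a single decoder still works on its own (the backward-compatible path).

```python
# Illustrative sketch of a "composable" decoder interface (NOT the real
# tokenizers API): each decoder transforms a list of token strings, so
# decoders can be chained, while one decoder still works standalone.
from typing import List


class Decoder:
    def decode_chain(self, tokens: List[str]) -> List[str]:
        raise NotImplementedError

    def decode(self, tokens: List[str]) -> str:
        # Backward-compatible entry point: one decoder used on its own.
        return "".join(self.decode_chain(tokens))


class StripPrefix(Decoder):
    """Removes a sub-word continuation prefix such as '##'."""

    def __init__(self, prefix: str) -> None:
        self.prefix = prefix

    def decode_chain(self, tokens: List[str]) -> List[str]:
        return [t[len(self.prefix):] if t.startswith(self.prefix) else t
                for t in tokens]


class Sequence(Decoder):
    """Composes several decoders by piping tokens through each in turn."""

    def __init__(self, decoders: List[Decoder]) -> None:
        self.decoders = decoders

    def decode_chain(self, tokens: List[str]) -> List[str]:
        for d in self.decoders:
            tokens = d.decode_chain(tokens)
        return tokens


print(Sequence([StripPrefix("##")]).decode(["play", "##ing"]))  # playing
```

Because Sequence is itself a Decoder, composition nests arbitrarily, which is what makes the trait "composable" without breaking single-decoder callers.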
The breaking change caused more issues upstream in transformers than anticipated:
https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657
The decision was to roll back that breaking change and figure out a different way to make this modification later.
Bump minor version because of a breaking change.
Using 0.12 to match other bindings.
[#938] Breaking change. The Decoder trait is modified to be composable. This only breaks code that uses decoders on their own; tokenizers itself should be error free.
[#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)
[#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)
[#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Warnings are now emitted instead of panicking.
[#961] Added link for Ruby port of tokenizers
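To illustrate what making the ByteLevel regex optional means in practice, here is a rough pure-Python sketch of a byte-level pre-tokenization step with an optional splitting regex. It is illustrative only: the function name, the simplified pattern, and the use_regex flag are assumptions for the example, not the actual implementation. With the regex, text is first split into word-like pieces (the GPT-2 behavior); without it, the whole input is treated as one piece, which is what some pipelines (e.g. BigScience) needed.

```python
# Sketch: byte-level pre-tokenization with an OPTIONAL splitting regex
# (illustrative only, not the real tokenizers implementation).
import re
from typing import List

# Simplified stand-in for the GPT-2 splitting pattern.
SPLIT_RE = re.compile(r"\s*\S+")


def byte_level_pre_tokenize(text: str, use_regex: bool = True) -> List[str]:
    # With the regex: split into word-like pieces first.
    # Without it: treat the whole input as a single piece.
    pieces = SPLIT_RE.findall(text) if use_regex else [text]
    # Represent each piece by its UTF-8 byte values.
    return [" ".join(str(b) for b in piece.encode("utf-8")) for piece in pieces]


print(byte_level_pre_tokenize("hi there", use_regex=True))   # two pieces
print(byte_level_pre_tokenize("hi there", use_regex=False))  # one piece
```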
[#962] Fix tests for Python 3.10
[#960] Feature gate for cli and its clap dependency
added_tokens by loading them in batch.
[#895] Add wheel support for Python 3.10
[#884] Fixing bad deserialization following inclusion of a default for Punctuation
Fixing various backward compatibility bugs (old serialized files couldn't be deserialized anymore).
[#860] Adding TruncationSide to TruncationParams.
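A truncation side controls which end of an over-long sequence gets dropped. A minimal sketch of the idea, assuming an illustrative truncate helper (the function name and signature are not the actual TruncationParams API): "right" keeps the beginning of the sequence, "left" keeps the end.

```python
# Sketch of left vs. right truncation (illustrative helper, not the
# real TruncationParams API): "right" drops the tail, "left" the head.
from typing import List


def truncate(ids: List[int], max_length: int, side: str = "right") -> List[int]:
    if len(ids) <= max_length:
        return ids
    if side == "right":
        return ids[:max_length]   # keep the beginning
    if side == "left":
        return ids[-max_length:]  # keep the end
    raise ValueError(f"unknown truncation side: {side}")


print(truncate([1, 2, 3, 4, 5], 3, side="right"))  # [1, 2, 3]
print(truncate([1, 2, 3, 4, 5], 3, side="left"))   # [3, 4, 5]
```

Left truncation matters for tasks like question answering or chat where the most recent tokens carry the signal.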
is_pretokenized and trim_offsets.
Decoders.
http features
WordLevel tokenizer determinism during training
SentencePieceUnigramTokenizer
UnigramTrainer
Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub
tokenizer.train(files, trainer).
TemplateProcessing
WordLevel and Unigram models (#490)
nmtNormalizer and precompiledNormalizer normalizers (#490)
templateProcessing post-processor (#490)
digitsPreTokenizer pre-tokenizer (#490)
splitPreTokenizer pre-tokenizer (#542)
behavior option to the punctuationPreTokenizer (#657)
fromPretrained (#780)
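As an illustration of what a template post-processor such as TemplateProcessing does, here is a toy sketch in pure Python. It is not the real implementation: the apply_template function and the "$A" placeholder handling are assumptions for the example, loosely mirroring templates like "[CLS] $A [SEP]".

```python
# Toy sketch of template post-processing (illustrative, not the real
# TemplateProcessing): special tokens are spliced around the sequence
# according to a template such as "[CLS] $A [SEP]".
from typing import Dict, List


def apply_template(template: str, seq_a: List[str],
                   specials: Dict[str, str]) -> List[str]:
    out: List[str] = []
    for piece in template.split():
        if piece == "$A":
            out.extend(seq_a)        # splice in the input sequence
        else:
            out.append(specials[piece])  # look up the special token
    return out


SPECIALS = {"[CLS]": "[CLS]", "[SEP]": "[SEP]"}
print(apply_template("[CLS] $A [SEP]", ["hello", "world"], SPECIALS))
# ['[CLS]', 'hello', 'world', '[SEP]']
```

The same pattern extends to sentence pairs by adding a "$B" placeholder and per-segment type ids, which is what the real post-processor supports.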