Okay, this release mostly covers these PRs:
Solid typing (validated at least with ty), much faster (4x to 8x) loading of vocabs with lots of added tokens, and GIL-free support!
ci: add support for building Win-ARM64 wheels by @MugundanMCW in https://github.com/huggingface/tokenizers/pull/1869
Add cargo-semver-checks to Rust CI workflow by @haixuanTao in https://github.com/huggingface/tokenizers/pull/1875
Update indicatif dependency by @gordonmessmer in https://github.com/huggingface/tokenizers/pull/1867
Bump node-forge from 1.3.1 to 1.3.2 in /tokenizers/examples/unstable_wasm/www by @dependabot[bot] in https://github.com/huggingface/tokenizers/pull/1889
Bump js-yaml from 3.14.1 to 3.14.2 in /bindings/node by @dependabot[bot] in https://github.com/huggingface/tokenizers/pull/1892
fix: used normalize_str in BaseTokenizer.normalize by @ishitab02 in https://github.com/huggingface/tokenizers/pull/1884
[MINOR:TYPO] Update mod.rs by @cakiki in https://github.com/huggingface/tokenizers/pull/1883
Remove runtime stderr warning from Python bindings by @Copilot in https://github.com/huggingface/tokenizers/pull/1898
Mark immutable pyclasses as frozen by @ngoldbaum in https://github.com/huggingface/tokenizers/pull/1861
DOCS: add add_prefix_space to processors.ByteLevel by @CloseChoice in https://github.com/huggingface/tokenizers/pull/1878
Bump express from 4.21.2 to 4.22.1 in /tokenizers/examples/unstable_wasm/www by @dependabot[bot] in https://github.com/huggingface/tokenizers/pull/1903
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.22.1...v0.22.2
Main change:
from_bytes and read_bytes Methods in WordPiece Tokenizer for WebAssembly Compatibility by @sondalex in https://github.com/huggingface/tokenizers/pull/1758
EncodingVisualizer.calculate_label_colors by @Liam-DeVoe in https://github.com/huggingface/tokenizers/pull/1853
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.3...v0.22.0rc0
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.3...v0.21.4
No change, the 0.21.3 release failed, this is just a re-release.
https://github.com/huggingface/tokenizers/releases/tag/v0.21.3
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.2...v0.21.3
This release is focused on some performance optimizations, enabling broader Python no-GIL support, and fixing some onig issues!
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.1...v0.21.2rc0
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.0...v0.21.1
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.0...v0.21.1rc0
We also no longer support python 3.7 or 3.8 (similar to transformers) as they are deprecated.
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.3...v0.21.0
There was a breaking change in 0.20.3 for tuple inputs of encode_batch!
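As a quick sketch of the input shapes involved (the tiny WordLevel vocab below is hypothetical, built inline so the example runs offline): `encode_batch` treats a list of plain strings as single sequences, while a list of `(str, str)` tuples is treated as sentence pairs.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

# Hypothetical minimal vocab, just for illustration.
tok = Tokenizer(WordLevel({"hello": 0, "world": 1, "[UNK]": 2}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# A list of strings: each item is one sequence.
single = tok.encode_batch(["hello world"])
# A list of (str, str) tuples: each item is a sentence pair.
pairs = tok.encode_batch([("hello", "world")])

print(single[0].tokens)    # ['hello', 'world']
print(pairs[0].tokens)     # ['hello', 'world']
print(pairs[0].type_ids)   # the second member of the pair gets a different type id
```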
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.2...v0.20.3
Thanks a MILE to @diliop we now have support for python 3.13! 🥳
set_var by @sftse in https://github.com/huggingface/tokenizers/pull/1664
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.1...v0.20.2
The most awaited offset issue with Llama is fixed 🥳
[ignore_merges] Fix offsets by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1640
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.0...v0.20.1
This release is focused on performance and user experience.
First off, we did a bit of benchmarking and found some room for improvement!
With a few minor changes (mostly #1587), here is what we get on Llama3 running on a g6 instance on AWS https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py :
We shipped better deserialization errors in general, and support for __str__ and __repr__ on all the objects. This makes debugging a lot easier:
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
The pre_tokenizers.Sequence and normalizers.Sequence are also more accessible now:
from tokenizers import normalizers

norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]                    # index into the sequence to get the Strip normalizer
norm[1].lowercase = False  # tweak a sub-normalizer's attribute in place
USED_PARALLELISM atomic by @nathaniel-daniel in https://github.com/huggingface/tokenizers/pull/1532
cached_download to hf_hub_download in tests by @Wauplin in https://github.com/huggingface/tokenizers/pull/1547
dropout = 0.0 as an equivalent to none in BPE by @mcognetta in https://github.com/huggingface/tokenizers/pull/1550
None to reset pre_tokenizers and normalizers, and index sequences by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1590
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.1...v0.20.0rc1
ignore_merges by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1504
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.0...v0.19.1
[remove black] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
AddedVocabulary. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0
Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177
[remove black] And use ruff by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1436
AddedVocabulary. by @eaplatanios in https://github.com/huggingface/tokenizers/pull/1443
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0rc0
Big shoutout to @rlrs for the fast Replace normalizer PR. This boosts the tokenizers' performance:
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.1...v0.15.2rc1
Clone on Tokenizer, add Encoding.into_tokens() method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.0...v0.15.1
expect() for disabling truncation by @boyleconnor in https://github.com/huggingface/tokenizers/pull/1316
safetensors. + Rewritten node bindings. by @Narsil in https://github.com/huggingface/tokenizers/pull/1331
huggingface_hub<1.0 by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
[pre_tokenizers] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357
Clone on Tokenizer, add Encoding.into_tokens() method by @epwalsh in https://github.com/huggingface/tokenizers/pull/1381
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.15.1.rc0
huggingface_hub<1.0 by @Wauplin in https://github.com/huggingface/tokenizers/pull/1385
[pre_tokenizers] Fix sentencepiece based Metaspace by @ArthurZucker in https://github.com/huggingface/tokenizers/pull/1357
Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.14.1...v0.15.0