Hugging Face / Tokenizers
Dec 2, 2025
Release v0.22.2

What's Changed

Okay, mostly doing this release for these PRs:

<img width="2400" height="1200" alt="image" src="https://github.com/user-attachments/assets/0b974453-1fc6-4393-84ea-da99269e2b34" />

Basically: good typing (checked with at least `ty`), much faster (4 to 8x) loading of vocabs with a lot of added tokens, and GIL-free support!
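As a rough sketch of the code path that got faster (an illustrative in-memory tokenizer, not a released model; the token names below are made up), adding a large batch of tokens exercises the added-vocabulary machinery:

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel

# Build a tiny tokenizer, then add many tokens: loading a saved
# tokenizer with a large added vocabulary is the path sped up here.
tok = Tokenizer(WordLevel({"[UNK]": 0}, unk_token="[UNK]"))
added = tok.add_tokens([f"<extra_{i}>" for i in range(1000)])
print(added, tok.get_vocab_size())  # 1000 added tokens, vocab size 1001
```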

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.22.1...v0.22.2

Sep 19, 2025

Release v0.22.1

Main change:

  • Bump huggingface_hub upper version (#1866) from @Wauplin
  • chore(trainer): add and improve trainer signature (#1838) from @shenxiangzhuang
  • Some doc updates: c91d76ae558ca2dc1aa725959e65dc21bf1fed7e, 7b0217894c1e2baed7354ab41503841b47af7cf9, 57eb8d7d9564621221784f7949b9efdeb7a49ac1
Aug 29, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.3...v0.22.0rc0

Jul 28, 2025
Jul 4, 2025

What's Changed

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.2...v0.21.3

Jun 24, 2025

What's Changed

This release is focused on performance optimizations, broader Python no-GIL support, and fixing some onig (Oniguruma regex) issues!

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.1...v0.21.2rc0

Mar 13, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.0...v0.21.1

Mar 12, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.21.0...v0.21.1rc0

Nov 15, 2024
Release v0.21.0


We also no longer support Python 3.7 or 3.8 (similar to transformers), as they are deprecated.

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.3...v0.21.0

Nov 5, 2024

What's Changed

There was a breaking change in 0.20.3 for tuple inputs of encode_batch!
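The notes don't restate the details here, but as a general sketch of the two list/tuple input forms `encode_batch` accepts (using a minimal in-memory tokenizer): a tuple is treated as a sentence pair, while a list of words with `is_pretokenized=True` is one pre-split sequence.

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tok = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

# A tuple is interpreted as a (sequence_a, sequence_b) pair...
pair_enc = tok.encode_batch([("hello", "world")])[0]

# ...whereas a list with is_pretokenized=True is one pre-split sequence.
pre_enc = tok.encode_batch([["hello", "world"]], is_pretokenized=True)[0]

print(pair_enc.tokens, pre_enc.tokens)
```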

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.2...v0.20.3

Nov 4, 2024

Release v0.20.2

Thanks a MILE to @diliop we now have support for python 3.13! 🥳

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.1...v0.20.2

Oct 10, 2024
Release v0.20.1

What's Changed

The most awaited offset issue with Llama is fixed 🥳
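Offsets map each token back to its character span in the original text, which is what that fix concerns; a minimal illustration with an in-memory tokenizer (not Llama itself, which would need a model download):

```python
from tokenizers import Tokenizer
from tokenizers.models import WordLevel
from tokenizers.pre_tokenizers import Whitespace

tok = Tokenizer(WordLevel({"[UNK]": 0, "hello": 1, "world": 2}, unk_token="[UNK]"))
tok.pre_tokenizer = Whitespace()

enc = tok.encode("hello world")
# Each offset is the (start, end) character span of a token.
print(enc.offsets)  # [(0, 5), (6, 11)]
```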

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.20.0...v0.20.1

Aug 8, 2024
Release v0.20.0: faster encode, better python support

Release v0.20.0

This release is focused on performances and user experience.

Performances:

First off, we did a bit of benchmarking and found some room for improvement! With a few minor changes (mostly #1587), here is what we get on Llama3 running on a g6 instance on AWS (benchmark script: https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py):

Python API

We shipped better deserialization errors in general, and support for __str__ and __repr__ for all objects. This makes debugging a lot easier, see this:

>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))

>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))

The pre_tokenizer.Sequence and normalizer.Sequence are also more accessible now, supporting indexing and in-place attribute updates:

from tokenizers import normalizers
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]                    # access the inner Strip normalizer
norm[1].lowercase = False  # modify the inner BertNormalizer in place

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.1...v0.20.0rc1

Apr 17, 2024

What's Changed

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.19.0...v0.19.1

What's Changed

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0

Apr 16, 2024

Bumping 3 versions because of this: https://github.com/huggingface/transformers/blob/60dea593edd0b94ee15dc3917900b26e3acfbbee/setup.py#L177

What's Changed

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.2...v0.19.0rc0

Feb 12, 2024

What's Changed

Big shoutout to @rlrs for the fast replace-normalizers PR. This boosts the performance of the tokenizers:
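For context, the `Replace` normalizer rewrites every occurrence of a pattern before tokenization; a minimal sketch of the code path that got faster:

```python
from tokenizers import normalizers

# Replace scans the whole input string for the pattern; this scan is
# what the fast-replace PR speeds up.
rep = normalizers.Replace("``", '"')
print(rep.normalize_str("``fast`` replace"))  # "fast" replace
```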

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.1...v0.15.2rc1

Jan 22, 2024

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.15.0...v0.15.1

Jan 18, 2024

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.13.4.rc2...v0.15.1.rc0

Nov 14, 2023

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/tokenizers/compare/v0.14.1...v0.15.0
