
Tokenizers

Sep 19, 2022
Node v0.13.0

[0.13.0]

  • [#1008] Decoder is now a composable trait, without breaking backward compatibility
  • [#1047, #1051, #1052] Processor is now a composable trait, without breaking backward compatibility
Rust v0.13.0

[0.13.0]

  • [#1009] unstable_wasm feature to support building on Wasm (it's unstable!)
  • [#1008] Decoder is now a composable trait, without breaking backward compatibility
  • [#1047, #1051, #1052] Processor is now a composable trait, without breaking backward compatibility

Both trait changes warrant a "major" version bump: despite best efforts to preserve backward compatibility, the code is different enough that we cannot be entirely sure.
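To illustrate what a composable decoder enables, here is a minimal pure-Python sketch of the idea behind #1008. The class and method names (StripPrefix, Fuse, Sequence, decode_chain) are hypothetical, not the actual tokenizers API; the real implementation lives in the Rust core.

```python
# Hypothetical sketch: small decoding steps chained in sequence.
# Each step transforms a list of token strings.

class StripPrefix:
    """Remove a sub-word continuation prefix such as '##'."""
    def __init__(self, prefix="##"):
        self.prefix = prefix

    def decode_chain(self, tokens):
        return [t[len(self.prefix):] if t.startswith(self.prefix) else " " + t
                for t in tokens]

class Fuse:
    """Concatenate all pieces into a single string."""
    def decode_chain(self, tokens):
        return ["".join(tokens)]

class Sequence:
    """Run several decoder steps one after another."""
    def __init__(self, decoders):
        self.decoders = decoders

    def decode(self, tokens):
        for d in self.decoders:
            tokens = d.decode_chain(tokens)
        return "".join(tokens).strip()

decoder = Sequence([StripPrefix(), Fuse()])
print(decoder.decode(["To", "##ken", "##izers", "rocks"]))  # prints "Tokenizers rocks"
```

Because each step exposes the same interface, new decoders can be built by composing existing pieces rather than writing monolithic ones.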

Apr 13, 2022
Python v0.12.1

[0.12.1]

Mar 31, 2022
[YANKED] Node v0.12.0

[0.12.0]

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to roll back that breaking change and find a different way to make this modification later.

Bump minor version because of a breaking change. Using 0.12 to match other bindings.

  • [#938] Breaking change. The Decoder trait is modified to be composable. This only breaks code that uses decoders on their own; regular tokenizers usage should be unaffected.

  • [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

  • [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

  • [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Warnings are now emitted instead of panicking.

  • [#961] Added link for Ruby port of tokenizers

[YANKED] Python v0.12.0

[0.12.0]

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to roll back that breaking change and find a different way to make this modification later.

Bump minor version because of a breaking change.

  • [#938] Breaking change. The Decoder trait is modified to be composable. This only breaks code that uses decoders on their own; regular tokenizers usage should be unaffected.

  • [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

  • [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

  • [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Warnings are now emitted instead of panicking.

  • [#962] Fix tests for python 3.10

  • [#961] Added link for Ruby port of tokenizers

[YANKED] Rust v0.12.0

[0.12.0]

Bump minor version because of a breaking change.

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to roll back that breaking change and find a different way to make this modification later.

  • [#938] Breaking change. The Decoder trait is modified to be composable. This only breaks code that uses decoders on their own; regular tokenizers usage should be unaffected.

  • [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

  • [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

  • [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Warnings are now emitted instead of panicking.

  • [#961] Added link for Ruby port of tokenizers

  • [#960] Feature gate for cli and its clap dependency
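A rough illustration of the vocabulary-hole fix from #954, as a pure-Python sketch (the function name and placeholder format are hypothetical; the real fix is in the Rust core): when token ids are not contiguous, the gaps are filled with placeholder tokens and a warning is emitted rather than aborting the save.

```python
import warnings

def vocab_to_list(vocab):
    """Turn a token->id mapping into an id-ordered list,
    filling holes with placeholder tokens instead of failing."""
    size = max(vocab.values()) + 1
    by_id = {i: tok for tok, i in vocab.items()}
    out = []
    for i in range(size):
        if i in by_id:
            out.append(by_id[i])
        else:
            warnings.warn(f"Vocabulary hole at id {i}; inserting placeholder.")
            out.append(f"[unused{i}]")
    return out

# ConvBert-style vocab with a missing id (2)
print(vocab_to_list({"[PAD]": 0, "[UNK]": 1, "hello": 3}))
# prints ['[PAD]', '[UNK]', '[unused2]', 'hello']
```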

Feb 28, 2022
Rust v0.11.2
  • [#919] Fixing single_word AddedToken. (regression from 0.11.2)
  • [#916] Faster added_tokens deserialization by loading them in batch.
Node v0.8.3
Python v0.11.6
  • [#919] Fixing single_word AddedToken. (regression from 0.11.2)
  • [#916] Faster added_tokens deserialization by loading them in batch.
Feb 16, 2022
Python v0.11.5

[#895] Add wheel support for Python 3.10

Jan 17, 2022
Node v0.8.2

[#884] Fixing bad deserialization following inclusion of a default for Punctuation

Python v0.11.4

[#884] Fixing bad deserialization following inclusion of a default for Punctuation

Python v0.11.3
  • [#882] Fixing Punctuation deserialize without argument.
  • [#868] Fixing missing direction in TruncationParams
  • [#860] Adding TruncationSide to TruncationParams
Rust v0.11.1
  • [#882] Fixing Punctuation deserialize without argument.
  • [#868] Fixing missing direction in TruncationParams
  • [#860] Adding TruncationSide to TruncationParams
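The truncation direction added in #860 can be sketched in pure Python (a hypothetical helper, not the library API): depending on the side, ids are dropped from the beginning or the end of the sequence.

```python
def truncate(ids, max_length, direction="right"):
    """Keep at most max_length ids, dropping from the chosen side."""
    if len(ids) <= max_length:
        return ids
    if direction == "left":
        return ids[len(ids) - max_length:]  # drop from the front
    return ids[:max_length]                 # drop from the back

print(truncate([1, 2, 3, 4, 5], 3))                    # [1, 2, 3]
print(truncate([1, 2, 3, 4, 5], 3, direction="left"))  # [3, 4, 5]
```

Left truncation matters for tasks where the end of a sequence carries the signal, e.g. keeping the most recent turns of a dialogue.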
Node v0.8.1

Fixing various backward-compatibility bugs (old serialized files could no longer be deserialized).

Dec 28, 2021
Python v0.11.1

[#860] Adding TruncationSide to TruncationParams.

Dec 24, 2021
Python v0.11.0

Fixed

  • [#585] Conda version should now work on old CentOS
  • [#844] Fixing interaction between is_pretokenized and trim_offsets.
  • [#851] Doc links

Added

  • [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
  • [#845]: Documentation for Decoders.

Changed

  • [#850]: Added a feature gate to enable disabling http features
  • [#718]: Fix WordLevel tokenizer determinism during training
  • [#762]: Add a way to specify the unknown token in SentencePieceUnigramTokenizer
  • [#770]: Improved documentation for UnigramTrainer
  • [#780]: Add Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub
  • [#793]: Saving a pretty JSON file by default when saving a tokenizer
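The pretty-JSON default from #793 amounts to serializing the tokenizer file with indentation; a minimal sketch of the idea in Python (the actual change is in the Rust serializer, and this config dict is made up):

```python
import json

config = {"version": "1.0", "model": {"type": "BPE"}}

compact = json.dumps(config)           # single line, hard to diff
pretty = json.dumps(config, indent=2)  # multi-line, human-readable

print(pretty)
```

A pretty-printed tokenizer.json produces readable diffs when the file is versioned alongside a model.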
Sep 2, 2021
Node v0.8.0

BREAKING CHANGES

  • Many improvements on the Trainer (#519). The files must now be provided first when calling tokenizer.train(files, trainer).

Features

  • Adding the TemplateProcessing
  • Add WordLevel and Unigram models (#490)
  • Add nmtNormalizer and precompiledNormalizer normalizers (#490)
  • Add templateProcessing post-processor (#490)
  • Add digitsPreTokenizer pre-tokenizer (#490)
  • Add support for mapping to sequences (#506)
  • Add splitPreTokenizer pre-tokenizer (#542)
  • Add behavior option to the punctuationPreTokenizer (#657)
  • Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)

Fixes

  • Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
  • Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)
May 24, 2021
Python v0.10.3

Fixed

  • [#686]: Fix SPM conversion process for whitespace deduplication
  • [#707]: Fix stripping strings containing Unicode characters

Added

  • [#693]: Add a CTC Decoder for Wav2Vec2 models

Removed

  • [#714]: Removed support for Python 3.5
Latest: v0.22.2
Tracking since: Dec 3, 2019
Last fetched: Apr 19, 2026