
Tokenizers

Sep 19, 2022
Node v0.13.0

[0.13.0]

  • [#1008] Decoder is now a composable trait, without breaking backward compatibility
  • [#1047, #1051, #1052] Processor is now a composable trait, without breaking backward compatibility
Rust v0.13.0

[0.13.0]

  • [#1009] unstable_wasm feature to support building on Wasm (it's unstable!)
  • [#1008] Decoder is now a composable trait, without breaking backward compatibility
  • [#1047, #1051, #1052] Processor is now a composable trait, without breaking backward compatibility

Both trait changes warrant a "major" version bump: despite best efforts to preserve backward compatibility, the code is different enough that we cannot be entirely sure.
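To illustrate what a composable decoder enables, here is a minimal pure-Python sketch of the idea behind #1008. The class and method names (StripPrefix, Fuse, Sequence, decode_chain) are hypothetical, not the actual tokenizers API; the real implementation lives in the Rust core.

```python
# Hypothetical sketch: small decoding steps chained in sequence.
# Each step transforms a list of token strings.

class StripPrefix:
    """Remove a sub-word continuation prefix such as '##'."""
    def __init__(self, prefix="##"):
        self.prefix = prefix

    def decode_chain(self, tokens):
        return [t[len(self.prefix):] if t.startswith(self.prefix) else " " + t
                for t in tokens]

class Fuse:
    """Concatenate all pieces into a single string."""
    def decode_chain(self, tokens):
        return ["".join(tokens)]

class Sequence:
    """Run several decoder steps one after another."""
    def __init__(self, decoders):
        self.decoders = decoders

    def decode(self, tokens):
        for d in self.decoders:
            tokens = d.decode_chain(tokens)
        return "".join(tokens).strip()

decoder = Sequence([StripPrefix(), Fuse()])
print(decoder.decode(["To", "##ken", "##izers", "rocks"]))  # prints "Tokenizers rocks"
```

Because each step exposes the same interface, new decoders can be built by composing existing pieces rather than writing monolithic ones.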

Apr 13, 2022
Python v0.12.1

[0.12.1]

Mar 31, 2022
[YANKED] Node v0.12.0

[0.12.0]

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to roll back that breaking change and find a different way to make this modification later.

Bump minor version because of a breaking change. Using 0.12 to match other bindings.

  • [#938] Breaking change. The Decoder trait is modified to be composable. This only breaks code that uses decoders on their own; regular tokenizers usage should be unaffected.

  • [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

  • [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

  • [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Warnings are now emitted instead of panicking.

  • [#961] Added link for Ruby port of tokenizers

[YANKED] Python v0.12.0

[0.12.0]

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to roll back that breaking change and find a different way to make this modification later.

Bump minor version because of a breaking change.

  • [#938] Breaking change. The Decoder trait is modified to be composable. This only breaks code that uses decoders on their own; regular tokenizers usage should be unaffected.

  • [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

  • [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

  • [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Warnings are now emitted instead of panicking.

  • [#962] Fix tests for python 3.10

  • [#961] Added link for Ruby port of tokenizers

[YANKED] Rust v0.12.0

[0.12.0]

Bump minor version because of a breaking change.

The breaking change was causing more issues upstream in transformers than anticipated: https://github.com/huggingface/transformers/pull/16537#issuecomment-1085682657

The decision was to roll back that breaking change and find a different way to make this modification later.

  • [#938] Breaking change. The Decoder trait is modified to be composable. This only breaks code that uses decoders on their own; regular tokenizers usage should be unaffected.

  • [#939] Making the regex in ByteLevel pre_tokenizer optional (necessary for BigScience)

  • [#952] Fixed the vocabulary size of UnigramTrainer output (to respect added tokens)

  • [#954] Fixed not being able to save vocabularies with holes in the vocab (ConvBert). Warnings are now emitted instead of panicking.

  • [#961] Added link for Ruby port of tokenizers

  • [#960] Feature gate for cli and its clap dependency
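A rough illustration of the vocabulary-hole fix from #954, as a pure-Python sketch (the function name and placeholder format are hypothetical; the real fix is in the Rust core): when token ids are not contiguous, the gaps are filled with placeholder tokens and a warning is emitted rather than aborting the save.

```python
import warnings

def vocab_to_list(vocab):
    """Turn a token->id mapping into an id-ordered list,
    filling holes with placeholder tokens instead of failing."""
    size = max(vocab.values()) + 1
    by_id = {i: tok for tok, i in vocab.items()}
    out = []
    for i in range(size):
        if i in by_id:
            out.append(by_id[i])
        else:
            warnings.warn(f"Vocabulary hole at id {i}; inserting placeholder.")
            out.append(f"[unused{i}]")
    return out

# ConvBert-style vocab with a missing id (2)
print(vocab_to_list({"[PAD]": 0, "[UNK]": 1, "hello": 3}))
# prints ['[PAD]', '[UNK]', '[unused2]', 'hello']
```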

Feb 28, 2022
Rust v0.11.2
  • [#919] Fixing single_word AddedToken. (regression from 0.11.2)
  • [#916] Faster added_tokens deserialization by loading them in batch.
Node v0.8.3
Python v0.11.6
  • [#919] Fixing single_word AddedToken. (regression from 0.11.2)
  • [#916] Faster added_tokens deserialization by loading them in batch.
Feb 16, 2022
Python v0.11.5

[#895] Add wheel support for Python 3.10

Jan 17, 2022
Node v0.8.2

[#884] Fixing bad deserialization following inclusion of a default for Punctuation

Python v0.11.4

[#884] Fixing bad deserialization following inclusion of a default for Punctuation

Python v0.11.3
  • [#882] Fixing Punctuation deserialize without argument.
  • [#868] Fixing missing direction in TruncationParams
  • [#860] Adding TruncationSide to TruncationParams
Rust v0.11.1
  • [#882] Fixing Punctuation deserialize without argument.
  • [#868] Fixing missing direction in TruncationParams
  • [#860] Adding TruncationSide to TruncationParams
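The truncation direction added in #860 can be sketched in pure Python (a hypothetical helper, not the library API): depending on the side, ids are dropped from the beginning or the end of the sequence.

```python
def truncate(ids, max_length, direction="right"):
    """Keep at most max_length ids, dropping from the chosen side."""
    if len(ids) <= max_length:
        return ids
    if direction == "left":
        return ids[len(ids) - max_length:]  # drop from the front
    return ids[:max_length]                 # drop from the back

print(truncate([1, 2, 3, 4, 5], 3))                    # [1, 2, 3]
print(truncate([1, 2, 3, 4, 5], 3, direction="left"))  # [3, 4, 5]
```

Left truncation matters for tasks where the end of a sequence carries the signal, e.g. keeping the most recent turns of a dialogue.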
Node v0.8.1

Fixing various backward-compatibility bugs (old serialized files could no longer be deserialized).

Dec 28, 2021
Python v0.11.1

[#860] Adding TruncationSide to TruncationParams.

Dec 24, 2021
Python v0.11.0

Fixed

  • [#585] Conda version should now work on old CentOS
  • [#844] Fixing interaction between is_pretokenized and trim_offsets.
  • [#851] Doc links

Added

  • [#657]: Add SplitDelimiterBehavior customization to Punctuation constructor
  • [#845]: Documentation for Decoders.

Changed

  • [#850]: Added a feature gate to enable disabling http features
  • [#718]: Fix WordLevel tokenizer determinism during training
  • [#762]: Add a way to specify the unknown token in SentencePieceUnigramTokenizer
  • [#770]: Improved documentation for UnigramTrainer
  • [#780]: Add Tokenizer.from_pretrained to load tokenizers from the Hugging Face Hub
  • [#793]: Saving a pretty JSON file by default when saving a tokenizer
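The pretty-JSON default from #793 amounts to serializing the tokenizer file with indentation; a minimal sketch of the idea in Python (the actual change is in the Rust serializer, and this config dict is made up):

```python
import json

config = {"version": "1.0", "model": {"type": "BPE"}}

compact = json.dumps(config)           # single line, hard to diff
pretty = json.dumps(config, indent=2)  # multi-line, human-readable

print(pretty)
```

A pretty-printed tokenizer.json produces readable diffs when the file is versioned alongside a model.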
Sep 2, 2021
Node v0.8.0

BREAKING CHANGES

  • Many improvements on the Trainer (#519). The files must now be provided first when calling tokenizer.train(files, trainer).

Features

  • Adding the TemplateProcessing
  • Add WordLevel and Unigram models (#490)
  • Add nmtNormalizer and precompiledNormalizer normalizers (#490)
  • Add templateProcessing post-processor (#490)
  • Add digitsPreTokenizer pre-tokenizer (#490)
  • Add support for mapping to sequences (#506)
  • Add splitPreTokenizer pre-tokenizer (#542)
  • Add behavior option to the punctuationPreTokenizer (#657)
  • Add the ability to load tokenizers from the Hugging Face Hub using fromPretrained (#780)

Fixes

  • Fix a bug where long tokenizer.json files would be incorrectly deserialized (#459)
  • Fix RobertaProcessing deserialization in PostProcessorWrapper (#464)
May 24, 2021
Python v0.10.3

Fixed

  • [#686]: Fix SPM conversion process for whitespace deduplication
  • [#707]: Fix stripping strings containing Unicode characters

Added

  • [#693]: Add a CTC Decoder for Wav2Vec2 models

Removed

  • [#714]: Removed support for Python 3.5
Latest: v0.22.2
Tracking since: Dec 3, 2019
Last fetched: Apr 19, 2026