rust-v0.9.0 — Tokenizers

Only one progress bar is now shown while reading files during training. This is better for use cases with a large number of files, as it avoids cluttering the screen with progress bars. It also avoids reading the size of each file before actually reading the files, which could take a very long time.
[#190]: Improved BPE and WordPiece builders
[#193]: encode and encode_batch now take a new argument, specifying whether we should add the special tokens
[#197]: The NormalizedString has been removed from the Encoding. It is now possible to retrieve it by calling normalize on the Tokenizer. This reduces the memory footprint by 70%.
[#197]: The NormalizedString API has been improved. It is now possible to retrieve parts of both strings using either "normalized" or "original" offsets
[#197]: The offsets provided on Encoding are now relative to the original string, and not the normalized one anymore
AddedToken is now used for both add_special_tokens and add_tokens. AddedToken also has more options to allow various behaviors.
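
To illustrate what "offsets relative to the original string" means in practice, here is a minimal, self-contained sketch. It does not use the crate's NormalizedString API: `normalize_with_alignment` and `to_original` are hypothetical helpers, the normalization (trim plus lowercase) is an arbitrary choice, and offsets are assumed to count chars rather than bytes.

```rust
/// Hypothetical normalizer (not the crate's API): trims surrounding
/// whitespace and lowercases. Returns the normalized string plus, for every
/// normalized char, the (start, end) char range it came from in the original.
fn normalize_with_alignment(original: &str) -> (String, Vec<(usize, usize)>) {
    let chars: Vec<char> = original.chars().collect();
    // Span that remains once whitespace is trimmed on both sides.
    let start = chars
        .iter()
        .position(|c| !c.is_whitespace())
        .unwrap_or(chars.len());
    let end = chars
        .iter()
        .rposition(|c| !c.is_whitespace())
        .map_or(start, |i| i + 1);

    let mut normalized = String::new();
    let mut alignment = Vec::new();
    for i in start..end {
        // Lowercasing may yield several chars; each maps back to char `i`.
        for lower in chars[i].to_lowercase() {
            normalized.push(lower);
            alignment.push((i, i + 1));
        }
    }
    (normalized, alignment)
}

/// Converts a half-open char range over the normalized string into the
/// corresponding char range over the original string.
fn to_original(
    alignment: &[(usize, usize)],
    start: usize,
    end: usize,
) -> Option<(usize, usize)> {
    if start >= end || end > alignment.len() {
        return None;
    }
    Some((alignment[start].0, alignment[end - 1].1))
}
```

With the input `"  Hello World "`, the normalized string is `"hello world"`, and the normalized range 6..11 ("world") maps back to 8..13 in the original, which is where "World" actually sits.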

[#188]: impl PostProcessor for ByteLevel: Handles trimming the offsets if activated. This avoids the unintuitive inclusion of the whitespaces in the produced offsets, even if these whitespaces are part of the actual token
More alignment mappings on the Encoding.
post_process can be called on the Tokenizer
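
The offset trimming mentioned for the ByteLevel PostProcessor can be pictured with the sketch below. This is an assumption-laden illustration, not the crate's implementation: `trim_offsets` is a hypothetical helper, and offsets are taken to be byte offsets into the original text.

```rust
/// Hypothetical helper (not the crate's API): shrinks a (start, end) byte
/// range so that it no longer covers leading or trailing whitespace.
/// Returns None if the range is out of bounds or not on char boundaries.
fn trim_offsets(text: &str, (start, end): (usize, usize)) -> Option<(usize, usize)> {
    let span = text.get(start..end)?;
    // Number of whitespace bytes at the front of the span.
    let leading = span.len() - span.trim_start().len();
    let trimmed = span.trim();
    Some((start + leading, start + leading + trimmed.len()))
}
```

For the text `"Hello there"`, a byte-level token covering `" there"` has offsets (5, 11); after trimming they become (6, 11), so the offsets point at `"there"` only, even though the whitespace remains part of the token itself.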

[#193]: Fix some issues with the offsets being wrong with the ByteLevel BPE:
- when add_prefix_space is activated
- [#156]: when a Unicode character gets split up into multiple byte-level characters
Fix a bug where offsets were wrong when there were any added tokens in the sequence being encoded.
[#175]: Fix a bug that prevented the addition of more than a certain number of tokens (even if not advised, but that's not the question)
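
As background for the [#156] fix, the sketch below shows why offsets need care when one Unicode character spans several bytes. It is only an illustration under assumptions, not the crate's code: both helpers are hypothetical, and offsets are assumed to be expressed in chars.

```rust
/// Hypothetical helper: for every byte of the UTF-8 encoding of `text`,
/// records the index of the char that byte belongs to.
fn byte_to_char_index(text: &str) -> Vec<usize> {
    let mut map = Vec::with_capacity(text.len());
    for (char_idx, ch) in text.chars().enumerate() {
        // A char occupies between 1 and 4 bytes in UTF-8.
        for _ in 0..ch.len_utf8() {
            map.push(char_idx);
        }
    }
    map
}

/// Hypothetical helper: converts a half-open byte range into the half-open
/// char range that fully covers it.
fn byte_span_to_char_offsets(
    map: &[usize],
    byte_start: usize,
    byte_end: usize,
) -> Option<(usize, usize)> {
    if byte_start >= byte_end || byte_end > map.len() {
        return None;
    }
    Some((map[byte_start], map[byte_end - 1] + 1))
}
```

In `"a\u{e9}"` the character `é` takes two bytes, so a byte-level model may emit two tokens for it. With the map above, both byte ranges (1, 2) and (2, 3) resolve to the same char offsets (1, 2).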

Add the ByteLevel PostProcessor to your byte-level BPE tokenizers if relevant.