python-v0.3.0 — Tokenizers

Changes:

BPETokenizer has been renamed to CharBPETokenizer for clarity.
Added CharDelimiterSplit: a new PreTokenizer that splits sequences on a given delimiter (works like .split(delimiter)).
Added WordLevel: a new model that simply maps tokens to their ids.
Improved truncation/padding and the handling of overflowing tokens. When a sequence gets truncated, we now provide a list of overflowing Encodings that are ready to be processed by a language model, just like the main Encoding.
Offsets can now be mapped back to the original string using:

output = tokenizer.encode(...)
print(output.original_str.offsets(output.offsets[3]))

Exposed the vocabulary size on all tokenizers: https://github.com/huggingface/tokenizers/pull/99 by @kdexd
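
To make the delimiter splitting, word-level lookup, and overflow handling described above more concrete, here is a minimal pure-Python sketch of the ideas. It does not use the tokenizers library and is not its implementation: the function names (split_on_delimiter, word_level_ids, truncate_with_overflow), the "<unk>" fallback, the skipping of empty pieces, and the non-overlapping overflow chunks are all assumptions made for illustration.

```python
# Illustrative stand-ins only; none of these names are part of the tokenizers API.

def split_on_delimiter(text, delimiter):
    """Split like text.split(delimiter), keeping (piece, (start, end)) offsets."""
    pieces, start = [], 0
    for piece in text.split(delimiter):
        end = start + len(piece)
        if piece:  # assumption of this sketch: drop empty pieces
            pieces.append((piece, (start, end)))
        start = end + len(delimiter)
    return pieces


def word_level_ids(pieces, vocab, unk_token="<unk>"):
    """Map each token to its id by direct lookup, falling back to unk_token."""
    return [vocab.get(token, vocab[unk_token]) for token, _ in pieces]


def truncate_with_overflow(ids, max_length):
    """Keep the first max_length ids; return the rest as same-sized chunks."""
    main = ids[:max_length]
    overflowing = [
        ids[i:i + max_length] for i in range(max_length, len(ids), max_length)
    ]
    return main, overflowing


vocab = {"<unk>": 0, "the": 1, "quick": 2, "fox": 3}
text = "the quick brown fox jumps"

pieces = split_on_delimiter(text, " ")
ids = word_level_ids(pieces, vocab)
main, overflowing = truncate_with_overflow(ids, max_length=2)

print(pieces[2])     # ('brown', (10, 15))
print(ids)           # [1, 2, 0, 3, 0]
print(main)          # [1, 2]
print(overflowing)   # [[0, 3], [0]]
```

Each overflow chunk has the same shape as the main sequence, which is the property that lets it be fed to a model in the same way. The sketch uses no overlap between chunks to keep the example small.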