Try out Command R+ with Medusa heads on 4xA100s with:

```shell
model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model --speculate 3 --num-shard 4 --trust-remote-code
```

by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1704

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v1.4.5...v2.0.0
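Once the container is serving, you can query TGI's `/generate` REST endpoint on the port published by `-p 8080:80`. A minimal Python sketch of such a request, assuming the server above is running locally (the prompt text and `max_new_tokens` value are illustrative):

```python
import json
import urllib.request

# Request body for TGI's /generate endpoint: "inputs" holds the prompt,
# "parameters" holds generation options such as max_new_tokens.
payload = {
    "inputs": "What is speculative decoding?",  # illustrative prompt
    "parameters": {"max_new_tokens": 64},
}

req = urllib.request.Request(
    "http://127.0.0.1:8080/generate",  # port mapped by `docker run -p 8080:80`
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)

# Uncomment once the server is up; the response JSON contains "generated_text":
# with urllib.request.urlopen(req) as resp:
#     print(json.loads(resp.read())["generated_text"])
```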