v3.0.2 — Text Generation Inference

Tl;dr

New transformers backend supporting flashattention at roughly same performance as pure TGI for all non officially supported models directly in TGI. Congrats @Cyrilvallez

New models unlocked: Cohere2, olmo, olmo2, helium.

What's Changed

docs(README): supported hardware links TGI AMD GPUs by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2814
Fixing latest flavor by disabling it. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2831
fix facebook/opt-125m not working issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2824
Fixup opt to reduce the amount of odd if statements. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2833
TensorRT-LLM backend bump to latest version + misc fixes by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2791
Feat/trtllm cancellation dev container by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2795
New arg. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2845
Fixing CI. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2846
fix: lint backend and doc files by @drbh in https://github.com/huggingface/text-generation-inference/pull/2850
Qwen2-VL runtime error fix when prompted with multiple images by @janne-alatalo in https://github.com/huggingface/text-generation-inference/pull/2840
Update vllm kernels for ROCM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2826
change xpu lib download link by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2852
fix: include add_special_tokens in kserve request by @drbh in https://github.com/huggingface/text-generation-inference/pull/2859
chore: fixed some typos and attribute issues in README by @ruidazeng in https://github.com/huggingface/text-generation-inference/pull/2891
update ipex xpu to fix issue in ARC770 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2884
Basic flashinfer 0.2 support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2862
Improve vlm support (add idefics3 support) by @drbh in https://github.com/huggingface/text-generation-inference/pull/2437
Update to marlin-kernels 0.3.7 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2882
chore: Update jsonschema to 0.28.0 by @Stranger6667 in https://github.com/huggingface/text-generation-inference/pull/2870
Add possible variants for A100 and H100 GPUs for auto-detecting flops by @lazariv in https://github.com/huggingface/text-generation-inference/pull/2837
Update using_guidance.md by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2901
fix crash in torch2.6 if TP=1 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2885
Add Flash decoding kernel ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2855
Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2825
Baichuan2-13B does not have max_position_embeddings in config by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2903
docs(conceptual/speculation): available links Train Medusa by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2863
Fix docker run in README.md by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2861
:memo: add guide on using TPU with TGI in the docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2907
Upgrading our rustc version. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2908
Fix typo in TPU docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2911
Removing the github runner. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2912
Upgrading bitsandbytes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2910
Do not convert weight scale to e4m3fnuz on CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2917
feat: improve star coder to support multi lora layers by @drbh in https://github.com/huggingface/text-generation-inference/pull/2883
Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2815
nix: update to PyTorch 2.5.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2921
Moving to uv instead of poetry. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2919
Add fp8 kv cache for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2856
fix the crash of meta-llama/Llama-3.2-1B by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2918
feat: improve qwen2-vl startup by @drbh in https://github.com/huggingface/text-generation-inference/pull/2802
Revert "feat: improve qwen2-vl startup " by @drbh in https://github.com/huggingface/text-generation-inference/pull/2924
flashinfer: switch to plan API by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2904
Fixing TRTLLM dockerfile. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2922
Flash Transformers modeling backend support by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2913
Give TensorRT-LLMa proper CI/CD 😍 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2886
Trying to avoid the random timeout. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2929
Run pre-commit run --all-files to fix CI by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2933
Upgrading the deps to have transformers==4.48.0 necessary by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2937
fix moe in quantization path by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2935
Clarify FP8-Marlin use on capability 8.9 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2940
Bump TensorRT-LLM backend dependency to v0.16.0 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2931
Set alias for max_completion_tokens in ChatRequest by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2932
Add NVIDIA A40 to known cards by @kldzj in https://github.com/huggingface/text-generation-inference/pull/2941
[TRTLLM] Expose finish reason by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2841
Tmp tp transformers by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2942
Transformers backend TP fix by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2945
Trying to put back the archlist (to fix the oom). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2947

New Contributors

@janne-alatalo made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2840
@ruidazeng made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2891
@Stranger6667 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2870
@lazariv made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2837
@baptistecolle made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2907
@Cyrilvallez made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2913
@kldzj made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2941

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.0.1...v3.0.2