v3.0.2
Tl;dr
New transformers backend with flash attention support: models that are not officially supported now run directly in TGI at roughly the same performance as native TGI implementations. Congrats @Cyrilvallez
New models unlocked: Cohere2, olmo, olmo2, helium.
What's Changed
- docs(README): supported hardware links TGI AMD GPUs by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2814
- Fixing latest flavor by disabling it. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2831
- fix facebook/opt-125m not working issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2824
- Fixup opt to reduce the amount of odd if statements. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2833
- TensorRT-LLM backend bump to latest version + misc fixes by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2791
- Feat/trtllm cancellation dev container by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2795
- New arg. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2845
- Fixing CI. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2846
- fix: lint backend and doc files by @drbh in https://github.com/huggingface/text-generation-inference/pull/2850
- Qwen2-VL runtime error fix when prompted with multiple images by @janne-alatalo in https://github.com/huggingface/text-generation-inference/pull/2840
- Update vllm kernels for ROCM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2826
- change xpu lib download link by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2852
- fix: include add_special_tokens in kserve request by @drbh in https://github.com/huggingface/text-generation-inference/pull/2859
- chore: fixed some typos and attribute issues in README by @ruidazeng in https://github.com/huggingface/text-generation-inference/pull/2891
- update ipex xpu to fix issue in ARC770 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2884
- Basic flashinfer 0.2 support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2862
- Improve vlm support (add idefics3 support) by @drbh in https://github.com/huggingface/text-generation-inference/pull/2437
- Update to marlin-kernels 0.3.7 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2882
- chore: Update jsonschema to 0.28.0 by @Stranger6667 in https://github.com/huggingface/text-generation-inference/pull/2870
- Add possible variants for A100 and H100 GPUs for auto-detecting flops by @lazariv in https://github.com/huggingface/text-generation-inference/pull/2837
- Update using_guidance.md by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2901
- fix crash in torch2.6 if TP=1 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2885
- Add Flash decoding kernel ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2855
- Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2825
- Baichuan2-13B does not have max_position_embeddings in config by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2903
- docs(conceptual/speculation): available links Train Medusa by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2863
- Fix `docker run` in `README.md` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2861
- 📝 add guide on using TPU with TGI in the docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2907
- Upgrading our rustc version. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2908
- Fix typo in TPU docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2911
- Removing the github runner. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2912
- Upgrading bitsandbytes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2910
- Do not convert weight scale to e4m3fnuz on CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2917
- feat: improve star coder to support multi lora layers by @drbh in https://github.com/huggingface/text-generation-inference/pull/2883
- Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2815
- nix: update to PyTorch 2.5.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2921
- Moving to `uv` instead of `poetry`. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2919
- Add fp8 kv cache for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2856
- fix the crash of meta-llama/Llama-3.2-1B by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2918
- feat: improve qwen2-vl startup by @drbh in https://github.com/huggingface/text-generation-inference/pull/2802
- Revert "feat: improve qwen2-vl startup " by @drbh in https://github.com/huggingface/text-generation-inference/pull/2924
- flashinfer: switch to plan API by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2904
- Fixing TRTLLM dockerfile. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2922
- Flash Transformers modeling backend support by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2913
- Give TensorRT-LLM a proper CI/CD 😍 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2886
- Trying to avoid the random timeout. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2929
- Run `pre-commit run --all-files` to fix CI by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2933
- Upgrading the deps to have transformers==4.48.0 necessary by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2937
- fix moe in quantization path by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2935
- Clarify FP8-Marlin use on capability 8.9 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2940
- Bump TensorRT-LLM backend dependency to v0.16.0 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2931
- Set `alias` for `max_completion_tokens` in `ChatRequest` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2932
- Add NVIDIA A40 to known cards by @kldzj in https://github.com/huggingface/text-generation-inference/pull/2941
- [TRTLLM] Expose finish reason by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2841
- Tmp tp transformers by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2942
- Transformers backend TP fix by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2945
- Trying to put back the archlist (to fix the oom). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2947
New Contributors
- @janne-alatalo made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2840
- @ruidazeng made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2891
- @Stranger6667 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2870
- @lazariv made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2837
- @baptistecolle made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2907
- @Cyrilvallez made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2913
- @kldzj made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2941
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.0.1...v3.0.2


