v3.3.0
Notable changes
- Prefill chunking for VLMs.
What's Changed
- Fixing Qwen 2.5 VL (32B). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3157
- Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3156
- Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3113
- L4 fixes by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3161
- setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3171
- transformers flash llm/vlm enabling in ipex by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3152
- Upgrading the dependencies in Gaudi backend. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3170
- Hotfixing gaudi deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3174
- Hotfix gaudi2 with newer transformers. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3176
- Support flashinfer for Gemma3 prefill by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3167
- Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in https://github.com/huggingface/text-generation-inference/pull/2648
- Bump sccache to 0.10.0 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3179
- Fixing CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3184
- Add option to configure prometheus port by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3187
- Warmup gaudi backend by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3172
- Put more wiggle room. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3189
- Fixing the router + template for Qwen3. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3200
- Skip {% generation %} and {% endgeneration %} template handling by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3204
- doc typo by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3206
- Pr 2982 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/3046
- fix: bump snaps for mllama by @drbh in https://github.com/huggingface/text-generation-inference/pull/3202
- Update client SDK snippets by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3207
- Fix HF_HUB_OFFLINE=1 for Gaudi backend by @regisss in https://github.com/huggingface/text-generation-inference/pull/3193
- IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3144
- forward and tokenize chooser use the same shape by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3196
- Chunked Prefill VLM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3188
- Prepare for 3.3.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3220
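With pull 2648, the server reuses an OpenTelemetry trace id that the caller sends in its request headers instead of starting a new trace. These notes do not name the header. The sketch below assumes it is the W3C Trace Context traceparent header, which is the format used by OpenTelemetry's default propagator, and shows how a client might build and check one. The helpers make_traceparent and trace_id_from are hypothetical and are not part of any TGI client SDK.

```python
import re
import secrets
from typing import Optional

# Assumed format (W3C Trace Context, version 00):
#   version "-" trace-id (32 hex) "-" parent-id (16 hex) "-" trace-flags (2 hex)
TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def make_traceparent(trace_id: Optional[str] = None, sampled: bool = True) -> str:
    """Hypothetical client helper: build a traceparent header value."""
    trace_id = trace_id or secrets.token_hex(16)  # 16 random bytes -> 32 hex chars
    parent_id = secrets.token_hex(8)              # 8 random bytes -> 16 hex chars
    flags = "01" if sampled else "00"             # 01 = sampled flag set
    return f"00-{trace_id}-{parent_id}-{flags}"


def trace_id_from(header: str) -> Optional[str]:
    """Hypothetical helper: return the trace id, or None if the header is invalid."""
    match = TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    trace_id = match.group(2)
    # The spec treats an all-zero trace id as invalid.
    if trace_id == "0" * 32:
        return None
    return trace_id
```

A caller would attach {"traceparent": make_traceparent(existing_trace_id)} to its request headers so that the server's spans join that existing trace.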
New Contributors
- @kozistr made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2648
- @julien-c made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3206
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.2.3...v3.3.0
Fetched April 7, 2026
