Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.6...v3.3.7
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.5...v3.3.6
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.4...v3.3.5
Fix for Neuron models exported with batch_size 1.
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.3...v3.3.4
Neuron backend update.
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.2...v3.3.3
Gaudi improvements.
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.1...v3.3.2
This release updates TGI to Torch 2.7 and CUDA 12.8.
* round_up_seq logic to align with prefill warmup phase on… by @kaixuanliu in https://github.com/huggingface/text-generation-inference/pull/3224

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.0...v3.3.1
* sccache to 0.10.0 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3179
* {% generation %} and {% endgeneration %} template handling by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3204
* HF_HUB_OFFLINE=1 for Gaudi backend by @regisss in https://github.com/huggingface/text-generation-inference/pull/3193

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.2.3...v3.3.0
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.2.2...v3.2.3
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.2.1...v3.2.2
* kernels 0.2.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3084
* gemma3-text model type by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3107

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.2.0...v3.2.1
BREAKING CHANGE: Many modifications around tool calling. Tool calling now fully follows the OpenAI return format (the arguments field is returned as a JSON-encoded string instead of a real JSON object). Numerous tool-calling improvements and side-effect fixes.
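Because arguments now arrives as a JSON-encoded string, matching OpenAI's schema, clients have to decode it themselves before dispatching the tool. A minimal sketch, assuming a response shaped like an OpenAI-style chat-completions payload (the tool name and fields here are illustrative):

```python
import json

# A tool call as it might appear in a /v1/chat/completions response after
# this change: "arguments" is a JSON-encoded string, not a nested object.
tool_call = {
    "id": "0",
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "arguments": "{\"city\": \"Paris\", \"unit\": \"celsius\"}",
    },
}

# Decode the string into a real dict before calling the tool.
args = json.loads(tool_call["function"]["arguments"])
print(args["city"])  # -> Paris
```

Clients that previously indexed into arguments as an object will break until they add this decoding step.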
Added Gemma 3 support.
* tool_calls a vector. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3075
* openai to impure shell for integration tests by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3081
* --max-batch-total-tokens description by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3083
* /v1/chat/completions endpoint by @aW3st in https://github.com/huggingface/text-generation-inference/pull/3000

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.1.1...v3.2.0
* strftime_now callable function for minijinja chat templates by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2983
* loop_controls feature to minijinja to handle {% break %} by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2998
* rotary kernel from the Hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3041
* RadixTrie::find by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3067
* RadixAllocator by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3068
* monitoring.md tutorial by @sadra-barikbin in https://github.com/huggingface/text-generation-inference/pull/3056

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.1.0...v3.1.1
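The strftime_now callable lets a chat template stamp the current date into a prompt. As a rough Python analogue of what the template-side callable does (the template line in the comment is illustrative, not taken from any model's actual chat template):

```python
from datetime import datetime

def strftime_now(fmt: str) -> str:
    """Sketch of the strftime_now callable exposed to chat templates:
    formats the current local time with an strftime pattern."""
    return datetime.now().strftime(fmt)

# A chat template could use it roughly like:
#   {{ "Today is " + strftime_now("%Y-%m-%d") }}
today = strftime_now("%Y-%m-%d")
print(today)
```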
DeepSeek R1 is fully supported on both AMD and NVIDIA!
```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1
```
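The command above exposes TGI's OpenAI-compatible chat endpoint on port 8080. A minimal sketch of building a request payload for it; actually sending it requires the running container, so the HTTP call is left as a comment:

```python
import json

# Payload for TGI's OpenAI-compatible /v1/chat/completions endpoint.
payload = {
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 128,
    "stream": False,
}
body = json.dumps(payload)
print(body)

# With the container running, this payload could be POSTed with stdlib urllib:
#   import urllib.request
#   req = urllib.request.Request(
#       "http://localhost:8080/v1/chat/completions",
#       data=body.encode(),
#       headers={"Content-Type": "application/json"},
#   )
#   print(urllib.request.urlopen(req).read())
```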
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.0.2...v3.1.0
TL;DR
New transformers backend supporting FlashAttention, at roughly the same performance as native TGI, for all models that are not officially supported by TGI directly. Congrats @Cyrilvallez!
New models unlocked: Cohere2, OLMo, OLMo2, Helium.
* docker run in README.md by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2861
* uv instead of poetry. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2919
* pre-commit run --all-files to fix CI by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2933
* alias for max_completion_tokens in ChatRequest by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2932

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.0.1...v3.0.2
Patch release to handle a few older models and corner cases.
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.0.0...v3.0.1
Big new release
Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.4.1...v3.0.0
* main by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2718

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4.1
Experimental prefill chunking (PREFILL_CHUNKING=1).

* _server.nix file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538
* RUN in Dockerfile by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547
* DenseMoELayer and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537
* --features google by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566
* main by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586
* desc_act for groupsize != -1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590
* ChatRequest for Vertex AI Chat instead of VertexChat by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2651
* e4m3fn KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2655
* attention function by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2609
* desc_act=true by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2622
* impureWithCuda dev shell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2677

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4
notify_error with the content error; it will instead output regular generation.

* _server.nix file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538
* RUN in Dockerfile by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547
* DenseMoELayer and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537
* --features google by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566
* main by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586
* desc_act for groupsize != -1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.3.1