---
name: Text Generation Inference
slug: text-generation-inference
type: github
source_url: https://github.com/huggingface/text-generation-inference
organization: Hugging Face
organization_slug: hugging-face
total_releases: 67
latest_version: v3.3.7
latest_date: 2025-12-19
last_updated: 2026-04-19
tracking_since: 2023-02-03
canonical: https://releases.sh/hugging-face/text-generation-inference
organization_url: https://releases.sh/hugging-face
---

<Release version="v3.3.7" date="December 19, 2025" published="2025-12-19T14:35:25.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.7">
## What's Changed
* misc(gha): expose action cache url and runtime as secrets by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2964
* feat: support max_image_fetch_size to limit by @drbh in https://github.com/huggingface/text-generation-inference/pull/3339
* Maintenance mode by @LysandreJik in https://github.com/huggingface/text-generation-inference/pull/3344
* Maintenance mode by @LysandreJik in https://github.com/huggingface/text-generation-inference/pull/3345
* fix(num_devices): fix num_shard/num device auto compute when NVIDIA_VISIBLE_DEVICES == "all" or "void" by @oOraph in https://github.com/huggingface/text-generation-inference/pull/3346


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.6...v3.3.7
</Release>

<Release version="v3.3.6" date="September 17, 2025" published="2025-09-17T00:48:54.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.6">
## What's Changed
* Add missing backslash by @philsupertramp in https://github.com/huggingface/text-generation-inference/pull/3311
* Revert "feat: bump flake including transformers and huggingface_hub versions" by @drbh in https://github.com/huggingface/text-generation-inference/pull/3323
* fix: remove azure by @drbh in https://github.com/huggingface/text-generation-inference/pull/3325
* Fix mask passed to flashinfer by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3324
* Update iframe sources for streaming demo by @coyotte508 in https://github.com/huggingface/text-generation-inference/pull/3327
* Revert "Revert "feat: bump flake including transformers and huggingfa… by @drbh in https://github.com/huggingface/text-generation-inference/pull/3326
* Revert "feat: bump flake including transformers and huggingface_hub versions" by @drbh in https://github.com/huggingface/text-generation-inference/pull/3330
* Patch version 3.3.6 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3329

## New Contributors
* @philsupertramp made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3311
* @coyotte508 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3327

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.5...v3.3.6
</Release>

<Release version="v3.3.5" date="September 2, 2025" published="2025-09-02T15:02:33.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.5">
## What's Changed
* [gaudi] Refine rope memory, do not need to keep sin/cos cache per layer by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3274
* Gaudi: add CI by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3160
* [gaudi] Gemma3 sliding window support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3280
* xpu lora support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3232
* Optimum neuron 0.2.2 by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3281
* [gaudi] Remove unnecessary reinitialize to HeterogeneousNextTokenChooser to m… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3284
* [gaudi] Deepseek v2 mla and add ep to unquantized moe by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3287
* [gaudi] Fix the CI test errors by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3286
* Hpu gptq gidx support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3297
* Migrate to V2 Pydantic interface by @emmanuel-ferdman in https://github.com/huggingface/text-generation-inference/pull/3262
* Xccl by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3252
* Multi modality fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3283
* some gptq case could not be handled by ipex. but could be handle by t… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3298
* fix outline import issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3282
* HuggingFaceM4/Idefics3-8B-Llama3 crash fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3267
* Optimum neuron 0.3.0 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3308
* Disable Cachix pushes by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3312
* chore: prepare version 3.3.5 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3314
* feat: bump flake including transformers and huggingface_hub versions by @drbh in https://github.com/huggingface/text-generation-inference/pull/3313


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.4...v3.3.5
</Release>

<Release version="v3.3.4" date="June 19, 2025" published="2025-06-19T10:00:28.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.4">
Fix for Neuron models exported with batch_size 1.

## What's Changed
* [gaudi] gemma3 text and vlm model intial support. need to add sliding window … by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3270
* Neuron backend fix by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3273


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.3...v3.3.4
</Release>

<Release version="v3.3.3" date="June 18, 2025" published="2025-06-18T13:11:39.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.3">
Neuron backend update.

## What's Changed
* Remove useless packages by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3253
* Bump neuron SDK version by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3260
* Perf opt by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3256
* [gaudi] Vlm rebase and issue fix in benchmark test by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3263
* Move the _update_cos_sin_cache into get_cos_sin by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3254
* [Gaudi] Remove optimum-habana by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3261
* [gaudi] HuggingFaceM4/idefics2-8b issue fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3264
* [Gaudi] Enable Qwen3_moe model by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3244
* [Gaudi]Fix the integration-test issues by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3265
* [Gaudi] use pad_token_id to pad input id by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3268
* chore: prepare release 3.3.3 by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3269
* [gaudi] Refine logging for Gaudi warmup by @regisss in https://github.com/huggingface/text-generation-inference/pull/3222
* doc: fix README by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3271


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.2...v3.3.3
</Release>

<Release version="v3.3.2" date="May 30, 2025" published="2025-05-30T14:20:39.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.2">
Gaudi improvements.

## What's Changed
* upgrade to new vllm extension ops(fix issue in exponential bucketing) by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3239
* Nix: switch to hf-nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3240
* Add Qwen3 by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3229
* fp8 compressed_tensors w8a8 support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3242
* [Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3245
* Fix the Llama-4-Maverick-17B-128E crash issue by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3246
* Prepare for 3.3.2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3249


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.1...v3.3.2
</Release>

<Release version="v3.3.1" date="May 22, 2025" published="2025-05-22T07:49:07.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.1">
This release updates TGI to Torch 2.7 and CUDA 12.8.

## What's Changed
* change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in https://github.com/huggingface/text-generation-inference/pull/3217
* adjust the `round_up_seq` logic to align with prefill warmup phase on… by @kaixuanliu in https://github.com/huggingface/text-generation-inference/pull/3224
* Update to Torch 2.7.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3221
* Enable Llama4 for gaudi backend by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3223
* fix: count gpu uuids if NVIDIA_VISIBLE_DEVICES env set to all by @drbh in https://github.com/huggingface/text-generation-inference/pull/3230
* Deepseek r1 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3211
* Refine warmup and upgrade to synapse AI 1.21.0 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3234
* fix the crash in default ATTENTION path by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3235
* Switch to punica-sgmv kernel from the Hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3236
* move input_ids to hpu and remove disposal of adapter_meta by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3237
* Prepare for 3.3.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3238

## New Contributors
* @kaixuanliu made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3217

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.0...v3.3.1
</Release>

<Release version="v3.3.0" date="May 9, 2025" published="2025-05-09T13:57:39.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.0">
## Notable changes

* Prefill chunking for VLMs.
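
Prefill chunking splits a long prompt into fixed-size pieces that are prefilled sequentially rather than in one large forward pass, bounding peak memory; this release extends that mechanism to vision-language models. A toy sketch of the idea (illustrative only, not TGI's implementation):

```python
def chunked_prefill(prompt_tokens, chunk_size):
    """Process a prompt in fixed-size chunks, as prefill chunking does,
    instead of a single large forward pass. Toy model: 'processing' a
    chunk just records its length; a real engine would run the model
    forward on the chunk and append its keys/values to the KV cache."""
    kv_cache_len = 0
    n_passes = 0
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        kv_cache_len += len(chunk)  # KV cache grows chunk by chunk
        n_passes += 1
    return kv_cache_len, n_passes

# A 10-token "prompt" prefilled in chunks of 4 takes 3 forward passes
# instead of one pass over all 10 tokens at once.
cache_len, n_passes = chunked_prefill(list(range(10)), 4)
print(cache_len, n_passes)  # 10 3
```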

## What's Changed
* Fixing Qwen 2.5 VL (32B). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3157
* Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3156
* Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3113
* L4 fixes by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3161
* setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3171
* transformers flash llm/vlm enabling in ipex by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3152
* Upgrading the dependencies in Gaudi backend. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3170
* Hotfixing gaudi deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3174
* Hotfix gaudi2 with newer transformers. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3176
* Support flashinfer for Gemma3 prefill by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3167
* Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in https://github.com/huggingface/text-generation-inference/pull/2648
* Bump `sccache` to 0.10.0 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3179
* Fixing CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3184
* Add option to configure prometheus port by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3187
* Warmup gaudi backend by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3172
* Put more wiggle room. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3189
* Fixing the router + template for Qwen3. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3200
* Skip `{% generation %}` and `{% endgeneration %}` template handling by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3204
* doc typo by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3206
* Pr 2982 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/3046
* fix: bump snaps for mllama by @drbh in https://github.com/huggingface/text-generation-inference/pull/3202
* Update client SDK snippets by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3207
* Fix `HF_HUB_OFFLINE=1` for Gaudi backend by @regisss in https://github.com/huggingface/text-generation-inference/pull/3193
* IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3144
* forward and tokenize chooser use the same shape by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3196
* Chunked Prefill VLM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3188
* Prepare for 3.3.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3220

## New Contributors
* @kozistr made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2648
* @julien-c made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3206

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.3...v3.3.0
</Release>

<Release version="v3.2.3" date="April 8, 2025" published="2025-04-08T08:18:36.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.3">
## Main changes

- Patching Llama 4

## What's Changed
* Use ROCM 6.3.1 by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3141
* Update transformers to 4.51 by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3148
* Gaudi: Add Integration Test for Gaudi Backend by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3142
* fix: compute type typo by @oOraph in https://github.com/huggingface/text-generation-inference/pull/3150
* 3.2.3 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3151


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.2...v3.2.3
</Release>

<Release version="v3.2.2" date="April 6, 2025" published="2025-04-06T09:41:33.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.2">
## What's Changed
* Minor fixes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3125
* configurable termination timeout by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/3126
* CI: enable server tests for backends by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3128
* Torch 2.6 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3134
* Gaudi: Fix llava-next and mllama crash issue by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3127
* nix-v3.2.1 -> v3.2.1-nix by @co42 in https://github.com/huggingface/text-generation-inference/pull/3129
* Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3131
* Add llama4 by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3145
* Preparing for release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3147

## New Contributors
* @co42 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3129

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.1...v3.2.2
</Release>

<Release version="v3.2.1" date="March 18, 2025" published="2025-03-18T14:28:12.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.1">
## What's Changed
* Update to `kernels` 0.2.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3084
* Router: add `gemma3-text` model type by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3107
* We need gcc during runtime to enable triton to compile kernels. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3103
* Release of Gaudi Backend for TGI by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3091
* Fixing the docker build. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3108
* Make the Nix-based Docker container work on non-NixOS by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3109
* xpu 2.6 update by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3051
* launcher: correctly get the head dimension for VLMs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3116
* Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3117
* Bug Fix: Sliding Window Attention  by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3112
* Publish nix docker image. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3122
* Prepare for patch release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3124
* Intel docker. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3121


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.0...v3.2.1
</Release>

<Release version="v3.2.0" date="March 12, 2025" published="2025-03-12T10:17:46.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.0">
## Important changes

- BREAKING CHANGE: Tool calling has been reworked extensively. It now fully matches OpenAI's return format: the `arguments` field of a tool call is returned as a JSON-encoded string rather than a parsed JSON object. Numerous other tool-calling improvements and side-effect fixes are included.

- Added Gemma 3 support.
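
Because tool-call `arguments` now arrive as a JSON-encoded string (matching OpenAI's format), clients must decode that string themselves. A minimal sketch of the client-side change; the response fragment below is illustrative, not captured from TGI:

```python
import json

# Illustrative tool-call message in the OpenAI-compatible shape;
# field values here are made up for the example.
message = {
    "tool_calls": [
        {
            "id": "0",
            "type": "function",
            "function": {
                "name": "get_weather",
                # Since v3.2.0, `arguments` is a JSON-encoded string,
                # not a parsed object, so it must be decoded explicitly.
                "arguments": "{\"city\": \"Paris\", \"unit\": \"celsius\"}",
            },
        }
    ]
}

for call in message["tool_calls"]:
    args = json.loads(call["function"]["arguments"])  # decode the string
    print(call["function"]["name"], args["city"])  # get_weather Paris
```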

## What's Changed
* fix(neuron): explicitly install toolchain by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3072
* Only add token when it is defined. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3073
* Making sure Olmo (transformers backend) works. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3074
* Making `tool_calls` a vector. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3075
* Nix: add `openai` to impure shell for integration tests by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3081
* Update `--max-batch-total-tokens` description by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3083
* Fix tool call2 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3076
* Nix: the launcher needs a Python env with Torch for GPU detection by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3085
* Add request parameters to OTel span for `/v1/chat/completions` endpoint by @aW3st in https://github.com/huggingface/text-generation-inference/pull/3000
* Add qwen2 multi lora layers support by @EachSheep in https://github.com/huggingface/text-generation-inference/pull/3089
* Add modules_to_not_convert in quantized model by @jiqing-feng in https://github.com/huggingface/text-generation-inference/pull/3053
* Small test and typing fixes by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3078
* hotfix: qwen2 formatting by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3093
* Pr 3003 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/3007
* Update the llamacpp backend by @angt in https://github.com/huggingface/text-generation-inference/pull/3022
* Fix qwen vl by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3096
* Update README.md by @celsowm in https://github.com/huggingface/text-generation-inference/pull/3095
* Fix tool call3 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3086
* Add gemma3 model by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3099
* Fix tool call4 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3094
* Update neuron backend by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3098
* Preparing relase 3.2.0 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3100
* Try to fix on main CI color. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3101

## New Contributors
* @EachSheep made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3089
* @jiqing-feng made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3053

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.1.1...v3.2.0
</Release>

<Release version="v3.1.1" date="March 4, 2025" published="2025-03-04T17:15:23.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.1.1">
## What's Changed
* Back on nix main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2979
* hotfix: fix trtllm CI build on release by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2981
* Add `strftime_now` callable function for `minijinja` chat templates by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2983
* impureWithCuda: fix gcc version by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2990
* Improve qwen vl impl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2943
* Using the "lockfile". by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2992
* Triton fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2995
* [Backend] Bump TRTLLM to v.0.17.0 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2991
* Updating mllama after strftime. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2993
* Use kernels from the kernel hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2988
* fix Qwen VL break in intel platform by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3002
* Update the flaky mllama test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3015
* Preventing single user hugging the server to death by asking by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3016
* Putting back the NCCL forced upgrade. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2999
* Support sigmoid scoring function in GPTQ-MoE by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3017
* [Backend] Add Llamacpp backend by @angt in https://github.com/huggingface/text-generation-inference/pull/2975
* Use eetq kernel from the hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3029
* Update README.md by @celsowm in https://github.com/huggingface/text-generation-inference/pull/3024
* Add `loop_controls` feature to `minijinja` to handle `{% break %}` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2998
* Pinning trufflehog. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3032
* It's find in some machine. using hf_hub::api::sync::Api to download c… by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3030
* Improve Transformers support by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2970
* feat: add initial qwen2.5-vl model and test by @drbh in https://github.com/huggingface/text-generation-inference/pull/2971
* Using public external registry (to use external runners for CI). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3031
* Having less logs in case of failure for checking CI more easily. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3037
* feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable for telemetry by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/3027
* update ipex and torch to 2.6 for cpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3039
* flashinfer 0.2.0.post1 -> post2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3040
* fix qwen2 vl crash in continous batching by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3004
* Simplify logs2. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3045
* Update Gradio ChatInterface configuration in consuming_tgi.md by @angt in https://github.com/huggingface/text-generation-inference/pull/3042
* Improve tool call message processing by @drbh in https://github.com/huggingface/text-generation-inference/pull/3036
* Use `rotary` kernel from the Hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3041
* Add Neuron backend by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3033
* You need to seek apparently. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3049
* some minor fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3048
* fix: run linters and fix formatting by @drbh in https://github.com/huggingface/text-generation-inference/pull/3057
* Avoid running neuron integration tests twice by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3054
* Add Gaudi Backend by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3055
* Fix two edge cases in `RadixTrie::find` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3067
* Add property-based testing for `RadixAllocator` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3068
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/3061
* Preparing for release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3060
* Fix a tiny typo in `monitoring.md` tutorial by @sadra-barikbin in https://github.com/huggingface/text-generation-inference/pull/3056
* Patch rust release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3069

## New Contributors
* @angt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2975
* @celsowm made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3024
* @dacorvo made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3033
* @sadra-barikbin made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3056

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.1.0...v3.1.1
</Release>

<Release version="v3.1.0" date="January 31, 2025" published="2025-01-31T13:26:50.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.1.0">
## Important changes

DeepSeek R1 is fully supported on both AMD and NVIDIA GPUs!

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1
```
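
Once the container is up, the server exposes an OpenAI-compatible API on port 8080. A hedged sketch of a request body you might POST to `http://localhost:8080/v1/chat/completions` (the prompt and generation parameters are placeholders):

```python
import json

# Build a chat completion request for TGI's OpenAI-compatible endpoint.
# The payload shape follows the OpenAI chat completions API; the model
# id is whatever was passed via --model-id at startup.
payload = {
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 256,
    "stream": False,
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the server started by the docker
# command above):
#   curl http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" -d @- <<< "$body"
```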

## What's Changed
* Attempt to remove AWS S3 flaky cache for sccache by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2953
* Update to attention-kernels 0.2.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2950
* fix: Telemetry by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2957
* Fixing the oom maybe with 2.5.1 change. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2958
* Add backend name to telemetry by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2962
* Add fp8 support moe models by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2928
* Update to moe-kernels 0.8.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2966
* Hotfixing intel-cpu (not sure how it was working before). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2967
* Add deepseekv3 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2968
* doc: Update TRTLLM deployment doc.  by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2960
* Update moe-kernel to 0.8.2 for rocm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2977
* Prepare for release 3.1.0 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2972


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.0.2...v3.1.0
</Release>

<Release version="v3.0.2" date="January 24, 2025" published="2025-01-24T11:16:11.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.0.2">
TL;DR

**The new transformers backend supports Flash Attention at roughly the same performance as native TGI, allowing any model not officially supported by TGI to run in it directly. Congrats @Cyrilvallez!**

**New models unlocked**: Cohere2, OLMo, OLMo2, Helium.

## What's Changed
* docs(README): supported hardware links TGI AMD GPUs by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2814
* Fixing latest flavor by disabling it. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2831
* fix facebook/opt-125m not working issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2824
* Fixup opt to reduce the amount of odd if statements. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2833
* TensorRT-LLM backend bump to latest version + misc fixes by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2791
* Feat/trtllm cancellation dev container by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2795
* New arg. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2845
* Fixing CI. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2846
* fix: lint backend and doc files by @drbh in https://github.com/huggingface/text-generation-inference/pull/2850
* Qwen2-VL runtime error fix when prompted with multiple images by @janne-alatalo in https://github.com/huggingface/text-generation-inference/pull/2840
* Update vllm kernels for ROCM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2826
* change xpu lib download link by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2852
* fix: include add_special_tokens in kserve request by @drbh in https://github.com/huggingface/text-generation-inference/pull/2859
* chore: fixed some typos and attribute issues in README by @ruidazeng in https://github.com/huggingface/text-generation-inference/pull/2891
* update ipex xpu to fix issue in ARC770 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2884
* Basic flashinfer 0.2 support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2862
* Improve vlm support (add idefics3 support) by @drbh in https://github.com/huggingface/text-generation-inference/pull/2437
* Update to marlin-kernels 0.3.7 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2882
* chore: Update jsonschema to 0.28.0 by @Stranger6667 in https://github.com/huggingface/text-generation-inference/pull/2870
* Add possible variants for A100 and H100 GPUs for auto-detecting flops by @lazariv in https://github.com/huggingface/text-generation-inference/pull/2837
* Update using_guidance.md by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2901
* fix crash in torch2.6 if TP=1 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2885
* Add Flash decoding kernel ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2855
* Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2825
* Baichuan2-13B does not have max_position_embeddings in config by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2903
* docs(conceptual/speculation): available links Train Medusa by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2863
* Fix `docker run` in `README.md` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2861
* :memo: add guide on using TPU with TGI in the docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2907
* Upgrading our rustc version. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2908
* Fix typo in TPU docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2911
* Removing the github runner. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2912
* Upgrading bitsandbytes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2910
* Do not convert weight scale to e4m3fnuz on CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2917
* feat: improve star coder to support multi lora layers by @drbh in https://github.com/huggingface/text-generation-inference/pull/2883
* Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2815
* nix: update to PyTorch 2.5.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2921
* Moving to `uv` instead of `poetry`. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2919
* Add fp8 kv cache for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2856
* fix the crash of meta-llama/Llama-3.2-1B by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2918
* feat: improve qwen2-vl startup  by @drbh in https://github.com/huggingface/text-generation-inference/pull/2802
* Revert "feat: improve qwen2-vl startup " by @drbh in https://github.com/huggingface/text-generation-inference/pull/2924
* flashinfer: switch to plan API by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2904
* Fixing TRTLLM dockerfile. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2922
* Flash Transformers modeling backend support by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2913
* Give TensorRT-LLMa proper CI/CD 😍 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2886
* Trying to avoid the random timeout. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2929
* Run `pre-commit run --all-files` to fix CI by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2933
* Upgrading the deps to have transformers==4.48.0 necessary by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2937
* fix moe in quantization path by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2935
* Clarify FP8-Marlin use on capability 8.9 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2940
* Bump TensorRT-LLM backend dependency to v0.16.0 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2931
* Set `alias` for `max_completion_tokens` in `ChatRequest` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2932
* Add NVIDIA A40 to known cards by @kldzj in https://github.com/huggingface/text-generation-inference/pull/2941
* [TRTLLM] Expose finish reason by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2841
* Tmp tp transformers by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2942
* Transformers backend TP fix by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2945
* Trying to put back the archlist (to fix the oom). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2947

## New Contributors
* @janne-alatalo made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2840
* @ruidazeng made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2891
* @Stranger6667 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2870
* @lazariv made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2837
* @baptistecolle made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2907
* @Cyrilvallez made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2913
* @kldzj made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2941

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.0.1...v3.0.2
</Release>

<Release version="v3.0.1" date="December 11, 2024" published="2024-12-11T20:13:58.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.0.1">
## Summary

Patch release to handle a few older models and corner cases.

## What's Changed
* Hotfix link2 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2812
* Small update to docs by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2816
* Using both value from config as they might not be correct. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2817
* Update README.md by @RodriMora in https://github.com/huggingface/text-generation-inference/pull/2827
* Prepare patch release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2829

## New Contributors
* @RodriMora made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2827

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.0.0...v3.0.1
</Release>

<Release version="v3.0.0" date="December 9, 2024" published="2024-12-09T20:22:42.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.0.0">
## TL;DR

Major new release centered on chunked prefill, delivering large throughput gains, especially on long prompts (benchmarks below).


![benchmarks_v3](https://github.com/huggingface/text-generation-inference/blob/042791fbd5742b1644d42c493db6bec669df6537/assets/v3_benchmarks.png)

Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
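The chunking idea linked above can be sketched in a few lines. This is an illustrative toy, not TGI's implementation: prefill chunking processes a long prompt's tokens in fixed-size chunks instead of one huge forward pass, which bounds peak memory and lets the scheduler interleave decode steps between chunks (the chunk size and prompt length below are made up):

```python
def chunked_prefill(prompt_tokens, chunk_size=256):
    """Yield successive fixed-size chunks of the prompt for incremental prefill."""
    for start in range(0, len(prompt_tokens), chunk_size):
        yield prompt_tokens[start:start + chunk_size]

# Stand-in for 1000 prompt token ids.
prompt = list(range(1000))
chunks = list(chunked_prefill(prompt, chunk_size=256))
assert len(chunks) == 4                          # 256 + 256 + 256 + 232
assert sum(len(c) for c in chunks) == len(prompt)  # nothing lost
```

Because every chunk is the same bounded size, the memory needed for a prefill step no longer scales with the longest prompt in the batch.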

## What's Changed
* feat: concat the adapter id to the model id in chat response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2779
* Move JSON grammar -> regex grammar conversion to the router by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2772
* Use FP8 KV cache when specified by compressed-tensors by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2761
* upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2778
* Fix: docs typo by @jp1924 in https://github.com/huggingface/text-generation-inference/pull/2777
* Support continue final message by @drbh in https://github.com/huggingface/text-generation-inference/pull/2733
* Fix doc. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2792
* Removing ../ that broke the link by @Getty in https://github.com/huggingface/text-generation-inference/pull/2789
* fix: add merge-lora arg for model id by @drbh in https://github.com/huggingface/text-generation-inference/pull/2788
* fix: only use eos_token_id as pad_token_id if int by @dvrogozh in https://github.com/huggingface/text-generation-inference/pull/2774
* Sync (most) server dependencies with Nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2782
* Saving some VRAM. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2790
* fix: avoid setting use_sgmv if no kernels present by @drbh in https://github.com/huggingface/text-generation-inference/pull/2796
* use oneapi 2024 docker image directly for xpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2793
* feat: auto max_new_tokens by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2803
* Auto max prefill by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2797
* Adding A100 compute. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2806
* Enable paligemma2 by @drbh in https://github.com/huggingface/text-generation-inference/pull/2807
* Attempt for cleverer auto batch_prefill values (some simplifications). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2808
* V3 doc by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2809
* Prep new version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2810
* Hotfixing the link. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2811

## New Contributors
* @jp1924 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2777
* @Getty made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2789

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.4.1...v3.0.0
</Release>

<Release version="v2.4.1" date="November 22, 2024" published="2024-11-22T17:35:00.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v2.4.1">
## Notable changes

* Choose input/total tokens automatically based on available VRAM
* Support Qwen2 VL
* Decrease latency of very large batches (> 128)
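The automatic input/total token selection above boils down to budgeting the VRAM left after model weights against the per-token KV-cache cost. The following is a back-of-envelope sketch with hypothetical model dimensions, not TGI's actual heuristic:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV-cache size: 2x (keys and values) across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_total_tokens(free_vram_bytes, layers, kv_heads, head_dim):
    """How many tokens of KV cache fit in the remaining VRAM."""
    return free_vram_bytes // kv_bytes_per_token(layers, kv_heads, head_dim)

# Example with a Llama-8B-like shape: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128)          # 131072 bytes = 128 KiB/token
budget = max_total_tokens(10 * 1024**3, 32, 8, 128)  # 10 GiB free after weights
assert per_token == 131072
assert budget == 81920
```

With 10 GiB to spare, this shape supports roughly 80k total tokens of KV cache, which is the kind of number the launcher can now derive instead of requiring manual `--max-total-tokens` tuning.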


## What's Changed

* feat: add triton kernels to decrease latency of large batches by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2687
* Avoiding timeout for bloom tests. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2693
* Green main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2697
* Choosing input/total tokens automatically based on available VRAM? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2673
* We can have a tokenizer anywhere. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2527
* Update poetry lock. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2698
* Fixing auto bloom test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2699
* More timeout on docker start ? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2701
* Monkey patching as a desperate measure. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2704
* add xpu triton in dockerfile, or will show "Could not import Flash At… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2702
* Support qwen2 vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2689
* fix cuda graphs for qwen2-vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2708
* fix: create position ids for text only input by @drbh in https://github.com/huggingface/text-generation-inference/pull/2714
* fix: add chat_tokenize endpoint to api docs by @drbh in https://github.com/huggingface/text-generation-inference/pull/2710
* Hotfixing auto length (warmup max_s was wrong). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2716
* Fix prefix caching + speculative decoding by @tgaddair in https://github.com/huggingface/text-generation-inference/pull/2711
* Fixing linting on main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2719
* nix: move to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2718
* fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2717
* add trust_remote_code in tokenizer to fix baichuan issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2725
* Add initial support for compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2732
* nix: update nixpkgs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2746
* benchmark: fix prefill throughput by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2741
* Fix: Change model_type from ssm to mamba by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2740
* Fix: Change embeddings to embedding by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2738
* fix response type of document for Text Generation Inference by @jitokim in https://github.com/huggingface/text-generation-inference/pull/2743
* Upgrade outlines to 0.1.1 by @aW3st in https://github.com/huggingface/text-generation-inference/pull/2742
* Upgrading our deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2750
* feat: return streaming errors as an event formatted for openai's client by @drbh in https://github.com/huggingface/text-generation-inference/pull/2668
* Remove vLLM dependency for CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2751
* fix: improve find_segments via numpy diff by @drbh in https://github.com/huggingface/text-generation-inference/pull/2686
* add ipex moe implementation to support Mixtral and PhiMoe by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2707
* Add support for compressed-tensors w8a8 int checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2745
* feat: support flash attention 2 in qwen2 vl vision blocks by @drbh in https://github.com/huggingface/text-generation-inference/pull/2721
* Simplify two ipex conditions by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2755
* Update to moe-kernels 0.7.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2720
* PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme by @drbh in https://github.com/huggingface/text-generation-inference/pull/2645
* fix: adjust llama MLP name from dense to mlp to correctly apply lora by @drbh in https://github.com/huggingface/text-generation-inference/pull/2760
* nix: update for outlines 0.1.4 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2764
* Add support for wNa16 int 2:4 compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2758
* nix: build and cache impure devshells by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2765
* fix: set outlines version to 0.1.3 to avoid caching serialization issue by @drbh in https://github.com/huggingface/text-generation-inference/pull/2766
* nix: downgrade to outlines 0.1.3 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2768
* fix: incomplete generations w/ single tokens generations and models that did not support chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2770
* fix: tweak grammar test response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2769
* Add a README section about using Nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2767
* Remove guideline from API by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/2762
* feat: Add automatic nightly benchmarks by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2591
* feat: add payload limit by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2726
* Update to marlin-kernels 0.3.6 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2771
* chore: prepare 2.4.1 release by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2773

## New Contributors
* @tgaddair made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2711
* @mokeddembillel made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2740
* @jitokim made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2743

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4.1
</Release>

<Release version="v2.4.0" date="October 25, 2024" published="2024-10-25T21:14:13.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v2.4.0">
## Notable changes

* Experimental prefill chunking (`PREFILL_CHUNKING=1`)
* Experimental FP8 KV cache support
* Greatly decrease latency for large batches (> 128 requests)
* Faster MoE kernels and support for GPTQ-quantized MoE
* Faster Mllama implementation
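The FP8 KV cache item above is, at its core, a storage-width change. Illustrative arithmetic only (dimensions are made up): FP8 stores one byte per element instead of two for fp16/bf16, halving KV-cache memory and thus roughly doubling token capacity for the same VRAM budget:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes):
    """Total KV-cache size: 2x (keys and values) per token, per layer."""
    return tokens * 2 * layers * kv_heads * head_dim * dtype_bytes

# Same hypothetical shape, 4096 cached tokens, fp16 vs. fp8 storage.
fp16_cache = kv_cache_bytes(4096, 32, 8, 128, dtype_bytes=2)
fp8_cache = kv_cache_bytes(4096, 32, 8, 128, dtype_bytes=1)
assert fp8_cache * 2 == fp16_cache  # FP8 halves the cache footprint
```

The trade-off is quantization error in the cached keys/values, which is why the feature is flagged experimental in this release.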

## What's Changed
* nix: remove unused `_server.nix` file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538
* chore: Add old V2 backend by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2551
* Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547
* Micro cleanup. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2555
* Hotfixing main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2556
* Add support for scalar FP8 weight scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2550
* Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537
* Update the link to the Ratatui organization by @orhun in https://github.com/huggingface/text-generation-inference/pull/2546
* Simplify crossterm imports by @orhun in https://github.com/huggingface/text-generation-inference/pull/2545
* Adding note for private models in quick-tour document by @ariG23498 in https://github.com/huggingface/text-generation-inference/pull/2548
* Hotfixing main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2562
* Cleanup Vertex + Chat by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2553
* More tensor cores. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2558
* remove LORA_ADAPTERS_PATH by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2563
* Add LoRA adapters support for Gemma2 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2567
* Fix build with `--features google` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566
* Improve support for GPUs with capability < 8 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2575
* flashinfer: pass window size and dtype by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2574
* Remove compute capability lazy cell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2580
* Update architecture.md by @ulhaqi12 in https://github.com/huggingface/text-generation-inference/pull/2577
* Update ROCM libs and improvements  by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2579
* Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2557
* feat: support phi3.5 moe by @drbh in https://github.com/huggingface/text-generation-inference/pull/2479
* Move flake back to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586
* MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590
* nix: experimental support for building a Docker container by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2470
* Mllama flash version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2585
* Max token capacity metric by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2595
* CI (2592): Allow LoRA adapter revision in server launcher by @drbh in https://github.com/huggingface/text-generation-inference/pull/2602
* Unroll notify error into generate response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2597
* New release 2.3.1 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2604
* Revert "Unroll notify error into generate response" by @drbh in https://github.com/huggingface/text-generation-inference/pull/2605
* nix: example of local package overrides during development by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2607
* Add basic FP8 KV cache support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2603
* Fp8 Cache condition by @flozi00 in https://github.com/huggingface/text-generation-inference/pull/2611
* enable mllama in intel platform by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2610
* Upgrade minor rust version (Fixes rust build compilation cache) by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2617
* Add support for fused MoE Marlin for AWQ by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2616
* nix: move back to the tgi-nix main branch by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2620
* CI (2599): Update ToolType input schema by @drbh in https://github.com/huggingface/text-generation-inference/pull/2601
* nix: add black and isort to the closure by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2619
* AMD CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2589
* feat: allow tool calling to respond without a tool by @drbh in https://github.com/huggingface/text-generation-inference/pull/2614
* Update documentation to most recent stable version of TGI. by @Vaibhavs10 in https://github.com/huggingface/text-generation-inference/pull/2625
* Intel ci by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2630
* Fixing intel Supports windowing. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2637
* Small fixes for supported models by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2471
* Cpu perf by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2596
* Clarify gated description and quicktour by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2631
* update ipex to fix incorrect output of mllama in cpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2640
* feat: enable pytorch xpu support for non-attention models by @dvrogozh in https://github.com/huggingface/text-generation-inference/pull/2561
* Fixing linters. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2650
* Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2651
* Fp8 e4m3_fnuz support for rocm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2588
* feat: prefill chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2600
* Support `e4m3fn` KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2655
* Simplify the `attention` function by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2609
* fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process by @oOraph in https://github.com/huggingface/text-generation-inference/pull/2663
* fix: prefer inplace softmax to avoid copy by @drbh in https://github.com/huggingface/text-generation-inference/pull/2661
* Break cycle between the attention implementations and KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2627
* CI job. Gpt awq 4 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2665
* Make handling of FP8 scales more consisent by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2666
* Test Marlin MoE with `desc_act=true` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2622
* break when there's nothing to read by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2582
* Add `impureWithCuda` dev shell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2677
* Make moe-kernels and marlin-kernels mandatory in CUDA installs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2632
* feat: natively support Granite models by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2682
* feat: allow any supported payload on /invocations by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2683
* flashinfer: reminder to remove contiguous call in the future by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2685
* Fix Phi 3.5 MoE tests by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2684
* Add support for FP8 KV cache scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2628
* Fixing "deadlock" when python prompts for trust_remote_code by always by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2664
* [TENSORRT-LLM] - Implement new looper thread based backend by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2357
* Fixing rocm gptq by using triton code too (renamed cuda into triton). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2691
* Fixing mt0 test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2692
* Add support for stop words in TRTLLM  by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2678
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2688

## New Contributors
* @alvarobartt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2547
* @orhun made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2546
* @ariG23498 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2548
* @ulhaqi12 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2577
* @mht-sharma made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2579
* @dvrogozh made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2561

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4.0
</Release>

<Release version="v2.3.1" date="October 3, 2024" published="2024-10-03T13:01:49.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v2.3.1">
## Important changes

* Added support for Mllama (Llama 3.2 vision models), with flash attention and unpadded inputs
* FP8 performance improvements
* MoE performance improvements
* BREAKING CHANGE: when using tools, models could previously answer with a `notify_error` tool call carrying the error content; they now return a regular generation instead

## What's Changed
* nix: remove unused `_server.nix` file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538
* chore: Add old V2 backend by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2551
* Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547
* Micro cleanup. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2555
* Hotfixing main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2556
* Add support for scalar FP8 weight scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2550
* Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537
* Update the link to the Ratatui organization by @orhun in https://github.com/huggingface/text-generation-inference/pull/2546
* Simplify crossterm imports by @orhun in https://github.com/huggingface/text-generation-inference/pull/2545
* Adding note for private models in quick-tour document by @ariG23498 in https://github.com/huggingface/text-generation-inference/pull/2548
* Hotfixing main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2562
* Cleanup Vertex + Chat by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2553
* More tensor cores. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2558
* remove LORA_ADAPTERS_PATH by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2563
* Add LoRA adapters support for Gemma2 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2567
* Fix build with `--features google` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566
* Improve support for GPUs with capability < 8 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2575
* flashinfer: pass window size and dtype by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2574
* Remove compute capability lazy cell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2580
* Update architecture.md by @ulhaqi12 in https://github.com/huggingface/text-generation-inference/pull/2577
* Update ROCM libs and improvements  by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2579
* Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2557
* feat: support phi3.5 moe by @drbh in https://github.com/huggingface/text-generation-inference/pull/2479
* Move flake back to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586
* MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590
* nix: experimental support for building a Docker container by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2470
* Mllama flash version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2585
* Max token capacity metric by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2595
* CI (2592): Allow LoRA adapter revision in server launcher by @drbh in https://github.com/huggingface/text-generation-inference/pull/2602
* Unroll notify error into generate response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2597
* New release 2.3.1 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2604

## New Contributors
* @alvarobartt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2547
* @orhun made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2546
* @ariG23498 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2548
* @ulhaqi12 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2577
* @mht-sharma made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2579

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.3.1
</Release>

<Pagination page="1" total-pages="4" total-items="67" next="https://releases.sh/hugging-face/text-generation-inference.md?page=2" />
