{"id":"src_8VF2j2OWHfhvBPnI2jCTO","slug":"text-generation-inference","name":"Text Generation Inference","type":"github","url":"https://github.com/huggingface/text-generation-inference","orgId":"org_GDdYeYynEgCEBNBwy-m6s","org":{"slug":"hugging-face","name":"Hugging Face"},"isPrimary":false,"metadata":"{\"evaluatedMethod\":\"github\",\"evaluatedAt\":\"2026-04-07T17:19:16.315Z\",\"changelogDetectedAt\":\"2026-04-07T17:28:40.142Z\"}","releaseCount":67,"releasesLast30Days":0,"avgReleasesPerWeek":0,"latestVersion":"v3.3.7","latestDate":"2025-12-19T14:35:25.000Z","changelogUrl":null,"hasChangelogFile":false,"lastFetchedAt":"2026-04-19T07:02:03.391Z","trackingSince":"2023-02-03T11:56:09.000Z","releases":[{"id":"rel_gWHs7FC2nAiWHjn56O08J","version":"v3.3.7","title":"v3.3.7","summary":"## What's Changed\r\n* misc(gha): expose action cache url and runtime as secrets by @mfuntowicz in https://github.com/huggingface/text-generation-infere...","content":"## What's Changed\r\n* misc(gha): expose action cache url and runtime as secrets by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2964\r\n* feat: support max_image_fetch_size to limit by @drbh in https://github.com/huggingface/text-generation-inference/pull/3339\r\n* Maintenance mode by @LysandreJik in https://github.com/huggingface/text-generation-inference/pull/3344\r\n* Maintenance mode by @LysandreJik in https://github.com/huggingface/text-generation-inference/pull/3345\r\n* fix(num_devices): fix num_shard/num device auto compute when NVIDIA_VISIBLE_DEVICES == \"all\" or \"void\" by @oOraph in https://github.com/huggingface/text-generation-inference/pull/3346\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.6...v3.3.7","publishedAt":"2025-12-19T14:35:25.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.7","media":[]},{"id":"rel_5Fl1DiTs67D6onh213me0","version":"v3.3.6","title":"v3.3.6","summary":"## What's Changed\r\n* Add missing backslash by @philsupertramp in https://github.com/huggingface/text-generation-inference/pull/3311\r\n* Revert \"feat: b...","content":"## What's Changed\r\n* Add missing backslash by @philsupertramp in https://github.com/huggingface/text-generation-inference/pull/3311\r\n* Revert \"feat: bump flake including transformers and huggingface_hub versions\" by @drbh in https://github.com/huggingface/text-generation-inference/pull/3323\r\n* fix: remove azure by @drbh in https://github.com/huggingface/text-generation-inference/pull/3325\r\n* Fix mask passed to flashinfer by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3324\r\n* Update iframe sources for streaming demo by @coyotte508 in https://github.com/huggingface/text-generation-inference/pull/3327\r\n* Revert \"Revert \"feat: bump flake including transformers and huggingfa… by @drbh in https://github.com/huggingface/text-generation-inference/pull/3326\r\n* Revert \"feat: bump flake including transformers and huggingface_hub versions\" by @drbh in https://github.com/huggingface/text-generation-inference/pull/3330\r\n* Patch version 3.3.6 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3329\r\n\r\n## New Contributors\r\n* @philsupertramp made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3311\r\n* @coyotte508 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3327\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.5...v3.3.6","publishedAt":"2025-09-17T00:48:54.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.6","media":[]},{"id":"rel_hjBniksH_PR3kY-c8fL3_","version":"v3.3.5","title":"v3.3.5","summary":"## What's Changed\r\n* [gaudi] Refine rope memory, do not need to keep sin/cos cache per layer by @sywangyi in https://github.com/huggingface/text-gener...","content":"## What's Changed\r\n* [gaudi] Refine rope memory, do not need to keep sin/cos cache per layer by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3274\r\n* Gaudi: add CI by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3160\r\n* [gaudi] Gemma3 sliding window support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3280\r\n* xpu lora support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3232\r\n* Optimum neuron 0.2.2 by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3281\r\n* [gaudi] Remove unnecessary reinitialize to HeterogeneousNextTokenChooser to m… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3284\r\n* [gaudi] Deepseek v2 mla and add ep to unquantized moe by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3287\r\n* [gaudi] Fix the CI test errors by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3286\r\n* Hpu gptq gidx support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3297\r\n* Migrate to V2 Pydantic interface by @emmanuel-ferdman in https://github.com/huggingface/text-generation-inference/pull/3262\r\n* Xccl by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3252\r\n* Multi modality fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3283\r\n* some gptq case could not be handled by ipex. but could be handle by t… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3298\r\n* fix outline import issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3282\r\n* HuggingFaceM4/Idefics3-8B-Llama3 crash fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3267\r\n* Optimum neuron 0.3.0 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3308\r\n* Disable Cachix pushes by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3312\r\n* chore: prepare version 3.3.5 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3314\r\n* feat: bump flake including transformers and huggingface_hub versions by @drbh in https://github.com/huggingface/text-generation-inference/pull/3313\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.4...git","publishedAt":"2025-09-02T15:02:33.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.5","media":[]},{"id":"rel_hXE5FkBe7ER-2mbJzPZnx","version":"v3.3.4","title":"v3.3.4","summary":"Fix for Neuron models exported with batch_size 1.\r\n\r\n## What's Changed\r\n* [gaudi] gemma3 text and vlm model intial support. need to add sliding window...","content":"Fix for Neuron models exported with batch_size 1.\r\n\r\n## What's Changed\r\n* [gaudi] gemma3 text and vlm model intial support. need to add sliding window … by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3270\r\n* Neuron backend fix by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3273\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.3...v3.3.4","publishedAt":"2025-06-19T10:00:28.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.4","media":[]},{"id":"rel_0YUG8Lu0T6s916Xqe6B2R","version":"v3.3.3","title":"v3.3.3","summary":"Neuron backend update.\r\n\r\n## What's Changed\r\n* Remove useless packages by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull...","content":"Neuron backend update.\r\n\r\n## What's Changed\r\n* Remove useless packages by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3253\r\n* Bump neuron SDK version by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3260\r\n* Perf opt by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3256\r\n* [gaudi] Vlm rebase and issue fix in benchmark test by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3263\r\n* Move the _update_cos_sin_cache into get_cos_sin by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3254\r\n* [Gaudi] Remove optimum-habana by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3261\r\n* [gaudi] HuggingFaceM4/idefics2-8b issue fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3264\r\n* [Gaudi] Enable Qwen3_moe model by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3244\r\n* [Gaudi]Fix the integration-test issues by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3265\r\n* [Gaudi] use pad_token_id to pad input id by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3268\r\n* chore: prepare release 3.3.3 by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3269\r\n* [gaudi] Refine logging for Gaudi warmup by @regisss in https://github.com/huggingface/text-generation-inference/pull/3222\r\n* doc: fix README by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3271\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.2...v3.3.3","publishedAt":"2025-06-18T13:11:39.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.3","media":[]},{"id":"rel_bcZ1pQOZ_vFVGxP7XXMAI","version":"v3.3.2","title":"v3.3.2","summary":"Gaudi improvements.\r\n\r\n## What's Changed\r\n* upgrade to new vllm extension ops(fix issue in exponential bucketing) by @sywangyi in https://github.com/h...","content":"Gaudi improvements.\r\n\r\n## What's Changed\r\n* upgrade to new vllm extension ops(fix issue in exponential bucketing) by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3239\r\n* Nix: switch to hf-nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3240\r\n* Add Qwen3 by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3229\r\n* fp8 compressed_tensors w8a8 support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3242\r\n* [Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3245\r\n* Fix the Llama-4-Maverick-17B-128E crash issue by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3246\r\n* Prepare for 3.3.2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3249\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.1...v3.3.2","publishedAt":"2025-05-30T14:20:39.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.2","media":[]},{"id":"rel_o-rKLY_QpJ4QF6UZJaPCq","version":"v3.3.1","title":"v3.3.1","summary":"This release updates TGI to Torch 2.7 and CUDA 12.8.\r\n\r\n## What's Changed\r\n* change HPU warmup logic: seq length should be with exponential growth by ...","content":"This release updates TGI to Torch 2.7 and CUDA 12.8.\r\n\r\n## What's Changed\r\n* change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in https://github.com/huggingface/text-generation-inference/pull/3217\r\n* adjust the `round_up_seq` logic to align with prefill warmup phase on… by @kaixuanliu in https://github.com/huggingface/text-generation-inference/pull/3224\r\n* Update to Torch 2.7.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3221\r\n* Enable Llama4 for gaudi backend by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3223\r\n* fix: count gpu uuids if NVIDIA_VISIBLE_DEVICES env set to all by @drbh in https://github.com/huggingface/text-generation-inference/pull/3230\r\n* Deepseek r1 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3211\r\n* Refine warmup and upgrade to synapse AI 1.21.0 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3234\r\n* fix the crash in default ATTENTION path by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3235\r\n* Switch to punica-sgmv kernel from the Hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3236\r\n* move input_ids to hpu and remove disposal of adapter_meta by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3237\r\n* Prepare for 3.3.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3238\r\n\r\n## New Contributors\r\n* @kaixuanliu made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3217\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.0...v3.3.1","publishedAt":"2025-05-22T07:49:07.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.1","media":[]},{"id":"rel_iew88pCm-clkg__gCCAnV","version":"v3.3.0","title":"v3.3.0","summary":"## Notable changes\r\n\r\n* Prefill chunking for VLMs.\r\n\r\n## What's Changed\r\n* Fixing Qwen 2.5 VL (32B). by @Narsil in https://github.com/huggingface/text...","content":"## Notable changes\r\n\r\n* Prefill chunking for VLMs.\r\n\r\n## What's Changed\r\n* Fixing Qwen 2.5 VL (32B). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3157\r\n* Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3156\r\n* Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3113\r\n* L4 fixes by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3161\r\n* setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3171\r\n* transformers flash llm/vlm enabling in ipex by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3152\r\n* Upgrading the dependencies in Gaudi backend. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3170\r\n* Hotfixing gaudi deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3174\r\n* Hotfix gaudi2 with newer transformers. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3176\r\n* Support flashinfer for Gemma3 prefill by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3167\r\n* Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in https://github.com/huggingface/text-generation-inference/pull/2648\r\n* Bump `sccache` to 0.10.0 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3179\r\n* Fixing CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3184\r\n* Add option to configure prometheus port by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3187\r\n* Warmup gaudi backend by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3172\r\n* Put more wiggle room. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3189\r\n* Fixing the router + template for Qwen3. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3200\r\n* Skip `{% generation %}` and `{% endgeneration %}` template handling by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3204\r\n* doc typo by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3206\r\n* Pr 2982 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/3046\r\n* fix: bump snaps for mllama by @drbh in https://github.com/huggingface/text-generation-inference/pull/3202\r\n* Update client SDK snippets by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3207\r\n* Fix `HF_HUB_OFFLINE=1` for Gaudi backend by @regisss in https://github.com/huggingface/text-generation-inference/pull/3193\r\n* IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3144\r\n* forward and tokenize chooser use the same shape by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3196\r\n* Chunked Prefill VLM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3188\r\n* Prepare for 3.3.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3220\r\n\r\n## New Contributors\r\n* @kozistr made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2648\r\n* @julien-c made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3206\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.3...v3.3.0","publishedAt":"2025-05-09T13:57:39.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.0","media":[]},{"id":"rel_tN266zhCy8gOeeC2A3Lbk","version":"v3.2.3","title":"v3.2.3","summary":"## Main changes\r\n\r\n- Patching Llama 4\r\n\r\n## What's Changed\r\n* Use ROCM 6.3.1 by @mht-sharma in https://github.com/huggingface/text-generation-inferenc...","content":"## Main changes\r\n\r\n- Patching Llama 4\r\n\r\n## What's Changed\r\n* Use ROCM 6.3.1 by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3141\r\n* Update transformers to 4.51 by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3148\r\n* Gaudi: Add Integration Test for Gaudi Backend by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3142\r\n* fix: compute type typo by @oOraph in https://github.com/huggingface/text-generation-inference/pull/3150\r\n* 3.2.3 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3151\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.2...v3.2.3","publishedAt":"2025-04-08T08:18:36.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.3","media":[]},{"id":"rel__CGJ5T-rqbHplSDUrJaSV","version":"v3.2.2","title":"v3.2.2","summary":"## What's Changed\r\n* Minor fixes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3125\r\n* configurable termination timeout...","content":"## What's Changed\r\n* Minor fixes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3125\r\n* configurable termination timeout by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/3126\r\n* CI: enable server tests for backends by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3128\r\n* Torch 2.6 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3134\r\n* Gaudi: Fix llava-next and mllama crash issue by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3127\r\n* nix-v3.2.1 -> v3.2.1-nix by @co42 in https://github.com/huggingface/text-generation-inference/pull/3129\r\n* Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3131\r\n* Add llama4 by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3145\r\n* Preparing for release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3147\r\n\r\n## New Contributors\r\n* @co42 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3129\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.1...v3.2.2","publishedAt":"2025-04-06T09:41:33.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.2","media":[]},{"id":"rel_vOpIq43Im1heLX6CGJBt4","version":"v3.2.1","title":"v3.2.1","summary":"## What's Changed\r\n* Update to `kernels` 0.2.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3084\r\n* Router: add `gemm...","content":"## What's Changed\r\n* Update to `kernels` 0.2.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3084\r\n* Router: add `gemma3-text` model type by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3107\r\n* We need gcc during runtime to enable triton to compile kernels. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3103\r\n* Release of Gaudi Backend for TGI by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3091\r\n* Fixing the docker build. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3108\r\n* Make the Nix-based Docker container work on non-NixOS by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3109\r\n* xpu 2.6 update by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3051\r\n* launcher: correctly get the head dimension for VLMs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3116\r\n* Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3117\r\n* Bug Fix: Sliding Window Attention  by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3112\r\n* Publish nix docker image. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3122\r\n* Prepare for patch release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3124\r\n* Intel docker. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3121\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.0...v3.2.1","publishedAt":"2025-03-18T14:28:12.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.1","media":[]},{"id":"rel_8_qb2Uj-H1HfKpHsaBTFg","version":"v3.2.0","title":"v3.2.0","summary":"## Important changes\r\n\r\n- BREAKING CHANGE: Lots of modifications around tool calling. Tool calling now respects fully OpenAI return results (arguments...","content":"## Important changes\r\n\r\n- BREAKING CHANGE: Lots of modifications around tool calling. Tool calling now respects fully OpenAI return results (arguments return type is a string instead of a real JSON object). Lots of improvements around the tool calling and side effects fixed.\r\n\r\n- Added Gemma 3 support.\r\n\r\n## What's Changed\r\n* fix(neuron): explicitly install toolchain by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3072\r\n* Only add token when it is defined. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3073\r\n* Making sure Olmo (transformers backend) works. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3074\r\n* Making `tool_calls` a vector. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3075\r\n* Nix: add `openai` to impure shell for integration tests by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3081\r\n* Update `--max-batch-total-tokens` description by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3083\r\n* Fix tool call2 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3076\r\n* Nix: the launcher needs a Python env with Torch for GPU detection by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3085\r\n* Add request parameters to OTel span for `/v1/chat/completions` endpoint by @aW3st in https://github.com/huggingface/text-generation-inference/pull/3000\r\n* Add qwen2 multi lora layers support by @EachSheep in https://github.com/huggingface/text-generation-inference/pull/3089\r\n* Add modules_to_not_convert in quantized model by @jiqing-feng in https://github.com/huggingface/text-generation-inference/pull/3053\r\n* Small test and typing fixes by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3078\r\n* hotfix: qwen2 formatting by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3093\r\n* Pr 3003 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/3007\r\n* Update the llamacpp backend by @angt in https://github.com/huggingface/text-generation-inference/pull/3022\r\n* Fix qwen vl by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3096\r\n* Update README.md by @celsowm in https://github.com/huggingface/text-generation-inference/pull/3095\r\n* Fix tool call3 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3086\r\n* Add gemma3 model by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3099\r\n* Fix tool call4 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3094\r\n* Update neuron backend by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3098\r\n* Preparing relase 3.2.0 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3100\r\n* Try to fix on main CI color. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3101\r\n\r\n## New Contributors\r\n* @EachSheep made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3089\r\n* @jiqing-feng made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3053\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.1.1...v3.2.0","publishedAt":"2025-03-12T10:17:46.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.0","media":[]},{"id":"rel__tAiyOnFlI9jCAYiz3zjU","version":"v3.1.1","title":"v3.1.1","summary":"## What's Changed\r\n* Back on nix main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2979\r\n* hotfix: fix trtllm CI build...","content":"## What's Changed\r\n* Back on nix main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2979\r\n* hotfix: fix trtllm CI build on release by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2981\r\n* Add `strftime_now` callable function for `minijinja` chat templates by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2983\r\n* impureWithCuda: fix gcc version by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2990\r\n* Improve qwen vl impl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2943\r\n* Using the \"lockfile\". by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2992\r\n* Triton fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2995\r\n* [Backend] Bump TRTLLM to v.0.17.0 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2991\r\n* Updating mllama after strftime. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2993\r\n* Use kernels from the kernel hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2988\r\n* fix Qwen VL break in intel platform by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3002\r\n* Update the flaky mllama test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3015\r\n* Preventing single user hugging the server to death by asking by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3016\r\n* Putting back the NCCL forced upgrade. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2999\r\n* Support sigmoid scoring function in GPTQ-MoE by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3017\r\n* [Backend] Add Llamacpp backend by @angt in https://github.com/huggingface/text-generation-inference/pull/2975\r\n* Use eetq kernel from the hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3029\r\n* Update README.md by @celsowm in https://github.com/huggingface/text-generation-inference/pull/3024\r\n* Add `loop_controls` feature to `minijinja` to handle `{% break %}` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2998\r\n* Pinning trufflehog. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3032\r\n* It's find in some machine. using hf_hub::api::sync::Api to download c… by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3030\r\n* Improve Transformers support by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2970\r\n* feat: add initial qwen2.5-vl model and test by @drbh in https://github.com/huggingface/text-generation-inference/pull/2971\r\n* Using public external registry (to use external runners for CI). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3031\r\n* Having less logs in case of failure for checking CI more easily. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3037\r\n* feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable for telemetry by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/3027\r\n* update ipex and torch to 2.6 for cpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3039\r\n* flashinfer 0.2.0.post1 -> post2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3040\r\n* fix qwen2 vl crash in continous batching by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3004\r\n* Simplify logs2. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3045\r\n* Update Gradio ChatInterface configuration in consuming_tgi.md by @angt in https://github.com/huggingface/text-generation-inference/pull/3042\r\n* Improve tool call message processing by @drbh in https://github.com/huggingface/text-generation-inference/pull/3036\r\n* Use `rotary` kernel from the Hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3041\r\n* Add Neuron backend by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3033\r\n* You need to seek apparently. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3049\r\n* some minor fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3048\r\n* fix: run linters and fix formatting by @drbh in https://github.com/huggingface/text-generation-inference/pull/3057\r\n* Avoid running neuron integration tests twice by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3054\r\n* Add Gaudi Backend by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3055\r\n* Fix two edge cases in `RadixTrie::find` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3067\r\n* Add property-based testing for `RadixAllocator` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3068\r\n* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/3061\r\n* Preparing for release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3060\r\n* Fix a tiny typo in `monitoring.md` tutorial by @sadra-barikbin in https://github.com/huggingface/text-generation-inference/pull/3056\r\n* Patch rust release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3069\r\n\r\n## New Contributors\r\n* @angt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2975\r\n* @celsowm made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3024\r\n* @dacorvo made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3033\r\n* @sadra-barikbin made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3056\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.1.0...v3.1.1","publishedAt":"2025-03-04T17:15:23.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.1.1","media":[]},{"id":"rel_BfVpy-58ipeS1jLeiB0ku","version":"v3.1.0","title":"v3.1.0","summary":"## Important changes\r\n\r\nDeepseek R1 is fully supported on both AMD and Nvidia !\r\n\r\n```\r\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/dat...","content":"## Important changes\r\n\r\nDeepseek R1 is fully supported on both AMD and Nvidia !\r\n\r\n```\r\ndocker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \\\r\n    ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1\r\n```\r\n\r\n## What's Changed\r\n* Attempt to remove AWS S3 flaky cache for sccache by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2953\r\n* Update to attention-kernels 0.2.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2950\r\n* fix: Telemetry by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2957\r\n* Fixing the oom maybe with 2.5.1 change. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2958\r\n* Add backend name to telemetry by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2962\r\n* Add fp8 support moe models by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2928\r\n* Update to moe-kernels 0.8.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2966\r\n* Hotfixing intel-cpu (not sure how it was working before). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2967\r\n* Add deepseekv3 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2968\r\n* doc: Update TRTLLM deployment doc.  by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2960\r\n* Update moe-kernel to 0.8.2 for rocm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2977\r\n* Prepare for release 3.1.0 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2972\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.0.2...v3.1.0","publishedAt":"2025-01-31T13:26:50.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.1.0","media":[]},{"id":"rel_aC2VldelTfmyZTztrOMRf","version":"v3.0.2","title":"v3.0.2","summary":"Tl;dr\r\n\r\n**New transformers backend supporting flashattention at roughly same performance as pure TGI for all non officially supported models directly...","content":"Tl;dr\r\n\r\n**New transformers backend supporting flashattention at roughly same performance as pure TGI for all non officially supported models directly in TGI. Congrats @Cyrilvallez**\r\n\r\n**New models unlocked**: Cohere2, olmo, olmo2, helium.\r\n\r\n## What's Changed\r\n* docs(README): supported hardware links TGI AMD GPUs by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2814\r\n* Fixing latest flavor by disabling it. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2831\r\n* fix facebook/opt-125m not working issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2824\r\n* Fixup opt to reduce the amount of odd if statements. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2833\r\n* TensorRT-LLM backend bump to latest version + misc fixes by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2791\r\n* Feat/trtllm cancellation dev container by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2795\r\n* New arg. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2845\r\n* Fixing CI. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2846\r\n* fix: lint backend and doc files by @drbh in https://github.com/huggingface/text-generation-inference/pull/2850\r\n* Qwen2-VL runtime error fix when prompted with multiple images by @janne-alatalo in https://github.com/huggingface/text-generation-inference/pull/2840\r\n* Update vllm kernels for ROCM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2826\r\n* change xpu lib download link by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2852\r\n* fix: include add_special_tokens in kserve request by @drbh in https://github.com/huggingface/text-generation-inference/pull/2859\r\n* chore: fixed some typos and attribute issues in README by @ruidazeng in https://github.com/huggingface/text-generation-inference/pull/2891\r\n* update ipex xpu to fix issue in ARC770 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2884\r\n* Basic flashinfer 0.2 support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2862\r\n* Improve vlm support (add idefics3 support) by @drbh in https://github.com/huggingface/text-generation-inference/pull/2437\r\n* Update to marlin-kernels 0.3.7 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2882\r\n* chore: Update jsonschema to 0.28.0 by @Stranger6667 in https://github.com/huggingface/text-generation-inference/pull/2870\r\n* Add possible variants for A100 and H100 GPUs for auto-detecting flops by @lazariv in https://github.com/huggingface/text-generation-inference/pull/2837\r\n* Update using_guidance.md by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2901\r\n* fix crash in torch2.6 if TP=1 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2885\r\n* Add Flash decoding kernel ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2855\r\n* Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2825\r\n* Baichuan2-13B does not have max_position_embeddings in config by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2903\r\n* docs(conceptual/speculation): available links Train Medusa by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2863\r\n* Fix `docker run` in `README.md` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2861\r\n* :memo: add guide on using TPU with TGI in the docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2907\r\n* Upgrading our rustc version. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2908\r\n* Fix typo in TPU docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2911\r\n* Removing the github runner. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2912\r\n* Upgrading bitsandbytes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2910\r\n* Do not convert weight scale to e4m3fnuz on CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2917\r\n* feat: improve star coder to support multi lora layers by @drbh in https://github.com/huggingface/text-generation-inference/pull/2883\r\n* Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2815\r\n* nix: update to PyTorch 2.5.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2921\r\n* Moving to `uv` instead of `poetry`. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2919\r\n* Add fp8 kv cache for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2856\r\n* fix the crash of meta-llama/Llama-3.2-1B by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2918\r\n* feat: improve qwen2-vl startup  by @drbh in https://github.com/huggingface/text-generation-inference/pull/2802\r\n* Revert \"feat: improve qwen2-vl startup \" by @drbh in https://github.com/huggingface/text-generation-inference/pull/2924\r\n* flashinfer: switch to plan API by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2904\r\n* Fixing TRTLLM dockerfile. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2922\r\n* Flash Transformers modeling backend support by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2913\r\n* Give TensorRT-LLMa proper CI/CD 😍 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2886\r\n* Trying to avoid the random timeout. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2929\r\n* Run `pre-commit run --all-files` to fix CI by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2933\r\n* Upgrading the deps to have transformers==4.48.0 necessary by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2937\r\n* fix moe in quantization path by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2935\r\n* Clarify FP8-Marlin use on capability 8.9 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2940\r\n* Bump TensorRT-LLM backend dependency to v0.16.0 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2931\r\n* Set `alias` for `max_completion_tokens` in `ChatRequest` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2932\r\n* Add NVIDIA A40 to known cards by @kldzj in https://github.com/huggingface/text-generation-inference/pull/2941\r\n* [TRTLLM] Expose finish reason by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2841\r\n* Tmp tp transformers by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2942\r\n* Transformers backend TP fix by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2945\r\n* Trying to put back the archlist (to fix the oom). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2947\r\n\r\n## New Contributors\r\n* @janne-alatalo made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2840\r\n* @ruidazeng made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2891\r\n* @Stranger6667 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2870\r\n* @lazariv made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2837\r\n* @baptistecolle made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2907\r\n* @Cyrilvallez made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2913\r\n* @kldzj made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2941\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.0.1...v3.0.2","publishedAt":"2025-01-24T11:16:11.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.0.2","media":[]},{"id":"rel_S-9kQH-QMrgwKooQWsK4G","version":"v3.0.1","title":"v3.0.1","summary":"## Summary\r\n\r\nPatch release to handle a few older models and corner cases.\r\n\r\n## What's Changed\r\n* Hotfix link2 by @Narsil in https://github.com/huggi...","content":"## Summary\r\n\r\nPatch release to handle a few older models and corner cases.\r\n\r\n## What's Changed\r\n* Hotfix link2 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2812\r\n* Small update to docs by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2816\r\n* Using both value from config as they might not be correct. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2817\r\n* Update README.md by @RodriMora in https://github.com/huggingface/text-generation-inference/pull/2827\r\n* Prepare patch release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2829\r\n\r\n## New Contributors\r\n* @RodriMora made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2827\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.0.0...v3.0.1","publishedAt":"2024-12-11T20:13:58.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.0.1","media":[]},{"id":"rel_dLQxNc9fj-7pgnvW9YjzC","version":"v3.0.0","title":"v3.0.0","summary":"## TL;DR\r\n\r\nBig new release\r\n\r\n\r\n![benchmarks_v3](https://github.com/huggingface/text-generation-inference/blob/042791fbd5742b1644d42c493db6bec669df65...","content":"## TL;DR\r\n\r\nBig new release\r\n\r\n\r\n![benchmarks_v3](https://github.com/huggingface/text-generation-inference/blob/042791fbd5742b1644d42c493db6bec669df6537/assets/v3_benchmarks.png)\r\n\r\nDetails: https://huggingface.co/docs/text-generation-inference/conceptual/chunking\r\n\r\n## What's Changed\r\n* feat: concat the adapter id to the model id in chat response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2779\r\n* Move JSON grammar -> regex grammar conversion to the router by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2772\r\n* Use FP8 KV cache when specified by compressed-tensors by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2761\r\n* upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2778\r\n* Fix: docs typo by @jp1924 in https://github.com/huggingface/text-generation-inference/pull/2777\r\n* Support continue final message by @drbh in https://github.com/huggingface/text-generation-inference/pull/2733\r\n* Fix doc. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2792\r\n* Removing ../ that broke the link by @Getty in https://github.com/huggingface/text-generation-inference/pull/2789\r\n* fix: add merge-lora arg for model id by @drbh in https://github.com/huggingface/text-generation-inference/pull/2788\r\n* fix: only use eos_token_id as pad_token_id if int by @dvrogozh in https://github.com/huggingface/text-generation-inference/pull/2774\r\n* Sync (most) server dependencies with Nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2782\r\n* Saving some VRAM. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2790\r\n* fix: avoid setting use_sgmv if no kernels present by @drbh in https://github.com/huggingface/text-generation-inference/pull/2796\r\n* use oneapi 2024 docker image directly for xpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2793\r\n* feat: auto max_new_tokens by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2803\r\n* Auto max prefill by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2797\r\n* Adding A100 compute. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2806\r\n* Enable paligemma2 by @drbh in https://github.com/huggingface/text-generation-inference/pull/2807\r\n* Attempt for cleverer auto batch_prefill values (some simplifications). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2808\r\n* V3 doc by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2809\r\n* Prep new version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2810\r\n* Hotfixing the link. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2811\r\n\r\n## New Contributors\r\n* @jp1924 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2777\r\n* @Getty made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2789\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.4.1...v3.0.0","publishedAt":"2024-12-09T20:22:42.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.0.0","media":[]},{"id":"rel_RfpZ_og2OR7Y_jUITQS9L","version":"v2.4.1","title":"v2.4.1","summary":"## Notable changes\r\n\r\n* Choose input/total tokens automatically based on available VRAM\r\n* Support Qwen2 VL\r\n* Decrease latency of very large batches ...","content":"## Notable changes\r\n\r\n* Choose input/total tokens automatically based on available VRAM\r\n* Support Qwen2 VL\r\n* Decrease latency of very large batches (> 128)\r\n\r\n\r\n## What's Changed\r\n\r\n* feat: add triton kernels to decrease latency of large batches by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2687\r\n* Avoiding timeout for bloom tests. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2693\r\n* Green main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2697\r\n* Choosing input/total tokens automatically based on available VRAM? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2673\r\n* We can have a tokenizer anywhere. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2527\r\n* Update poetry lock. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2698\r\n* Fixing auto bloom test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2699\r\n* More timeout on docker start ? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2701\r\n* Monkey patching as a desperate measure. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2704\r\n* add xpu triton in dockerfile, or will show \"Could not import Flash At… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2702\r\n* Support qwen2 vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2689\r\n* fix cuda graphs for qwen2-vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2708\r\n* fix: create position ids for text only input by @drbh in https://github.com/huggingface/text-generation-inference/pull/2714\r\n* fix: add chat_tokenize endpoint to api docs by @drbh in https://github.com/huggingface/text-generation-inference/pull/2710\r\n* Hotfixing auto length (warmup max_s was wrong). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2716\r\n* Fix prefix caching + speculative decoding by @tgaddair in https://github.com/huggingface/text-generation-inference/pull/2711\r\n* Fixing linting on main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2719\r\n* nix: move to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2718\r\n* fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2717\r\n* add trust_remote_code in tokenizer to fix baichuan issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2725\r\n* Add initial support for compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2732\r\n* nix: update nixpkgs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2746\r\n* benchmark: fix prefill throughput by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2741\r\n* Fix: Change model_type from ssm to mamba by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2740\r\n* Fix: Change embeddings to embedding by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2738\r\n* fix response type of document for Text Generation Inference by @jitokim in https://github.com/huggingface/text-generation-inference/pull/2743\r\n* Upgrade outlines to 0.1.1 by @aW3st in https://github.com/huggingface/text-generation-inference/pull/2742\r\n* Upgrading our deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2750\r\n* feat: return streaming errors as an event formatted for openai's client by @drbh in https://github.com/huggingface/text-generation-inference/pull/2668\r\n* Remove vLLM dependency for CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2751\r\n* fix: improve find_segments via numpy diff by @drbh in https://github.com/huggingface/text-generation-inference/pull/2686\r\n* add ipex moe implementation to support Mixtral and PhiMoe by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2707\r\n* Add support for compressed-tensors w8a8 int checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2745\r\n* feat: support flash attention 2 in qwen2 vl vision blocks by @drbh in https://github.com/huggingface/text-generation-inference/pull/2721\r\n* Simplify two ipex conditions by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2755\r\n* Update to moe-kernels 0.7.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2720\r\n* PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme by @drbh in https://github.com/huggingface/text-generation-inference/pull/2645\r\n* fix: adjust llama MLP name from dense to mlp to correctly apply lora by @drbh in https://github.com/huggingface/text-generation-inference/pull/2760\r\n* nix: update for outlines 0.1.4 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2764\r\n* Add support for wNa16 int 2:4 compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2758\r\n* nix: build and cache impure devshells by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2765\r\n* fix: set outlines version to 0.1.3 to avoid caching serialization issue by @drbh in https://github.com/huggingface/text-generation-inference/pull/2766\r\n* nix: downgrade to outlines 0.1.3 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2768\r\n* fix: incomplete generations w/ single tokens generations and models that did not support chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2770\r\n* fix: tweak grammar test response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2769\r\n* Add a README section about using Nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2767\r\n* Remove guideline from API by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/2762\r\n* feat: Add automatic nightly benchmarks by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2591\r\n* feat: add payload limit by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2726\r\n* Update to marlin-kernels 0.3.6 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2771\r\n* chore: prepare 2.4.1 release by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2773\r\n\r\n## New Contributors\r\n* @tgaddair made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2711\r\n* @mokeddembillel made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2740\r\n* @jitokim made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2743\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4.1","publishedAt":"2024-11-22T17:35:00.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v2.4.1","media":[]},{"id":"rel_3V4rTkMvZccspqXWYQlS-","version":"v2.4.0","title":"v2.4.0","summary":"## Notable changes\r\n\r\n* Experimental prefill chunking (`PREFILL_CHUNKING=1`)\r\n* Experimental FP8 KV cache support\r\n* Greatly decrease latency for larg...","content":"## Notable changes\r\n\r\n* Experimental prefill chunking (`PREFILL_CHUNKING=1`)\r\n* Experimental FP8 KV cache support\r\n* Greatly decrease latency for large batches (> 128 requests)\r\n* Faster MoE kernels and support for GPTQ-quantized MoE\r\n* Faster implementation of MLLama\r\n\r\n## What's Changed\r\n* nix: remove unused `_server.nix` file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538\r\n* chore: Add old V2 backend by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2551\r\n* Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547\r\n* Micro cleanup. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2555\r\n* Hotfixing main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2556\r\n* Add support for scalar FP8 weight scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2550\r\n* Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537\r\n* Update the link to the Ratatui organization by @orhun in https://github.com/huggingface/text-generation-inference/pull/2546\r\n* Simplify crossterm imports by @orhun in https://github.com/huggingface/text-generation-inference/pull/2545\r\n* Adding note for private models in quick-tour document by @ariG23498 in https://github.com/huggingface/text-generation-inference/pull/2548\r\n* Hotfixing main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2562\r\n* Cleanup Vertex + Chat by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2553\r\n* More tensor cores. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2558\r\n* remove LORA_ADAPTERS_PATH by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2563\r\n* Add LoRA adapters support for Gemma2 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2567\r\n* Fix build with `--features google` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566\r\n* Improve support for GPUs with capability < 8 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2575\r\n* flashinfer: pass window size and dtype by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2574\r\n* Remove compute capability lazy cell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2580\r\n* Update architecture.md by @ulhaqi12 in https://github.com/huggingface/text-generation-inference/pull/2577\r\n* Update ROCM libs and improvements  by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2579\r\n* Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2557\r\n* feat: support phi3.5 moe by @drbh in https://github.com/huggingface/text-generation-inference/pull/2479\r\n* Move flake back to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586\r\n* MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590\r\n* nix: experimental support for building a Docker container by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2470\r\n* Mllama flash version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2585\r\n* Max token capacity metric by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2595\r\n* CI (2592): Allow LoRA adapter revision in server launcher by @drbh in https://github.com/huggingface/text-generation-inference/pull/2602\r\n* Unroll notify error into generate response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2597\r\n* New release 2.3.1 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2604\r\n* Revert \"Unroll notify error into generate response\" by @drbh in https://github.com/huggingface/text-generation-inference/pull/2605\r\n* nix: example of local package overrides during development by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2607\r\n* Add basic FP8 KV cache support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2603\r\n* Fp8 Cache condition by @flozi00 in https://github.com/huggingface/text-generation-inference/pull/2611\r\n* enable mllama in intel platform by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2610\r\n* Upgrade minor rust version (Fixes rust build compilation cache) by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2617\r\n* Add support for fused MoE Marlin for AWQ by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2616\r\n* nix: move back to the tgi-nix main branch by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2620\r\n* CI (2599): Update ToolType input schema by @drbh in https://github.com/huggingface/text-generation-inference/pull/2601\r\n* nix: add black and isort to the closure by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2619\r\n* AMD CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2589\r\n* feat: allow tool calling to respond without a tool by @drbh in https://github.com/huggingface/text-generation-inference/pull/2614\r\n* Update documentation to most recent stable version of TGI. by @Vaibhavs10 in https://github.com/huggingface/text-generation-inference/pull/2625\r\n* Intel ci by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2630\r\n* Fixing intel Supports windowing. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2637\r\n* Small fixes for supported models by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2471\r\n* Cpu perf by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2596\r\n* Clarify gated description and quicktour by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2631\r\n* update ipex to fix incorrect output of mllama in cpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2640\r\n* feat: enable pytorch xpu support for non-attention models by @dvrogozh in https://github.com/huggingface/text-generation-inference/pull/2561\r\n* Fixing linters. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2650\r\n* Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2651\r\n* Fp8 e4m3_fnuz support for rocm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2588\r\n* feat: prefill chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2600\r\n* Support `e4m3fn` KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2655\r\n* Simplify the `attention` function by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2609\r\n* fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process by @oOraph in https://github.com/huggingface/text-generation-inference/pull/2663\r\n* fix: prefer inplace softmax to avoid copy by @drbh in https://github.com/huggingface/text-generation-inference/pull/2661\r\n* Break cycle between the attention implementations and KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2627\r\n* CI job. Gpt awq 4 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2665\r\n* Make handling of FP8 scales more consisent by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2666\r\n* Test Marlin MoE with `desc_act=true` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2622\r\n* break when there's nothing to read by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2582\r\n* Add `impureWithCuda` dev shell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2677\r\n* Make moe-kernels and marlin-kernels mandatory in CUDA installs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2632\r\n* feat: natively support Granite models by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2682\r\n* feat: allow any supported payload on /invocations by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2683\r\n* flashinfer: reminder to remove contiguous call in the future by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2685\r\n* Fix Phi 3.5 MoE tests by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2684\r\n* Add support for FP8 KV cache scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2628\r\n* Fixing \"deadlock\" when python prompts for trust_remote_code by always by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2664\r\n* [TENSORRT-LLM] - Implement new looper thread based backend by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2357\r\n* Fixing rocm gptq by using triton code too (renamed cuda into triton). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2691\r\n* Fixing mt0 test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2692\r\n* Add support for stop words in TRTLLM  by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2678\r\n* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2688\r\n\r\n## New Contributors\r\n* @alvarobartt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2547\r\n* @orhun made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2546\r\n* @ariG23498 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2548\r\n* @ulhaqi12 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2577\r\n* @mht-sharma made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2579\r\n* @dvrogozh made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2561\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4","publishedAt":"2024-10-25T21:14:13.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v2.4.0","media":[]},{"id":"rel_egkXtuVMd2ErXz2Bm0BYm","version":"v2.3.1","title":"v2.3.1","summary":"## Important changes\r\n\r\n* Added support for Mllama (3.2, vision models). Flashed, unpadded.\r\n* FP8 performance improvements\r\n* Moe performance improve...","content":"## Important changes\r\n\r\n* Added support for Mllama (3.2, vision models). Flashed, unpadded.\r\n* FP8 performance improvements\r\n* Moe performance improvements\r\n* BREAKING CHANGE - When using tools, models could answer with a tool call `notify_error` with the content error, it will instead output regular generation.\r\n\r\n## What's Changed\r\n* nix: remove unused `_server.nix` file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538\r\n* chore: Add old V2 backend by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2551\r\n* Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547\r\n* Micro cleanup. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2555\r\n* Hotfixing main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2556\r\n* Add support for scalar FP8 weight scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2550\r\n* Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537\r\n* Update the link to the Ratatui organization by @orhun in https://github.com/huggingface/text-generation-inference/pull/2546\r\n* Simplify crossterm imports by @orhun in https://github.com/huggingface/text-generation-inference/pull/2545\r\n* Adding note for private models in quick-tour document by @ariG23498 in https://github.com/huggingface/text-generation-inference/pull/2548\r\n* Hotfixing main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2562\r\n* Cleanup Vertex + Chat by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2553\r\n* More tensor cores. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2558\r\n* remove LORA_ADAPTERS_PATH by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2563\r\n* Add LoRA adapters support for Gemma2 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2567\r\n* Fix build with `--features google` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566\r\n* Improve support for GPUs with capability < 8 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2575\r\n* flashinfer: pass window size and dtype by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2574\r\n* Remove compute capability lazy cell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2580\r\n* Update architecture.md by @ulhaqi12 in https://github.com/huggingface/text-generation-inference/pull/2577\r\n* Update ROCM libs and improvements  by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2579\r\n* Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2557\r\n* feat: support phi3.5 moe by @drbh in https://github.com/huggingface/text-generation-inference/pull/2479\r\n* Move flake back to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586\r\n* MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590\r\n* nix: experimental support for building a Docker container by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2470\r\n* Mllama flash version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2585\r\n* Max token capacity metric by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2595\r\n* CI (2592): Allow LoRA adapter revision in server launcher by @drbh in https://github.com/huggingface/text-generation-inference/pull/2602\r\n* Unroll notify error into generate response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2597\r\n* New release 2.3.1 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2604\r\n\r\n## New Contributors\r\n* @alvarobartt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2547\r\n* @orhun made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2546\r\n* @ariG23498 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2548\r\n* @ulhaqi12 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2577\r\n* @mht-sharma made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2579\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.3.1","publishedAt":"2024-10-03T13:01:49.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v2.3.1","media":[]}],"pagination":{"page":1,"pageSize":20,"totalPages":4,"totalItems":67},"summaries":{"rolling":null,"monthly":[]}}