---
name: Text Generation Inference
slug: text-generation-inference
type: github
source_url: https://github.com/huggingface/text-generation-inference
organization: Hugging Face
organization_slug: hugging-face
total_releases: 67
latest_version: v3.3.7
latest_date: 2025-12-19
last_updated: 2026-04-19
tracking_since: 2023-02-03
canonical: https://releases.sh/hugging-face/text-generation-inference
organization_url: https://releases.sh/hugging-face
---

<Release version="v3.3.7" date="December 19, 2025" published="2025-12-19T14:35:25.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.7">
## What's Changed
* misc(gha): expose action cache url and runtime as secrets by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2964
* feat: support max_image_fetch_size to limit by @drbh in https://github.com/huggingface/text-generation-inference/pull/3339
* Maintenance mode by @LysandreJik in https://github.com/huggingface/text-generation-inference/pull/3344
* Maintenance mode by @LysandreJik in https://github.com/huggingface/text-generation-inference/pull/3345
* fix(num_devices): fix num_shard/num device auto compute when NVIDIA_VISIBLE_DEVICES == "all" or "void" by @oOraph in https://github.com/huggingface/text-generation-inference/pull/3346


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.6...v3.3.7
</Release>

<Release version="v3.3.6" date="September 17, 2025" published="2025-09-17T00:48:54.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.6">
## What's Changed
* Add missing backslash by @philsupertramp in https://github.com/huggingface/text-generation-inference/pull/3311
* Revert "feat: bump flake including transformers and huggingface_hub versions" by @drbh in https://github.com/huggingface/text-generation-inference/pull/3323
* fix: remove azure by @drbh in https://github.com/huggingface/text-generation-inference/pull/3325
* Fix mask passed to flashinfer by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3324
* Update iframe sources for streaming demo by @coyotte508 in https://github.com/huggingface/text-generation-inference/pull/3327
* Revert "Revert "feat: bump flake including transformers and huggingfa… by @drbh in https://github.com/huggingface/text-generation-inference/pull/3326
* Revert "feat: bump flake including transformers and huggingface_hub versions" by @drbh in https://github.com/huggingface/text-generation-inference/pull/3330
* Patch version 3.3.6 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3329

## New Contributors
* @philsupertramp made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3311
* @coyotte508 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3327

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.5...v3.3.6
</Release>

<Release version="v3.3.5" date="September 2, 2025" published="2025-09-02T15:02:33.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.5">
## What's Changed
* [gaudi] Refine rope memory, do not need to keep sin/cos cache per layer by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3274
* Gaudi: add CI by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3160
* [gaudi] Gemma3 sliding window support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3280
* xpu lora support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3232
* Optimum neuron 0.2.2 by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3281
* [gaudi] Remove unnecessary reinitialize to HeterogeneousNextTokenChooser to m… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3284
* [gaudi] Deepseek v2 mla and add ep to unquantized moe by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3287
* [gaudi] Fix the CI test errors by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3286
* Hpu gptq gidx support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3297
* Migrate to V2 Pydantic interface by @emmanuel-ferdman in https://github.com/huggingface/text-generation-inference/pull/3262
* Xccl by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3252
* Multi modality fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3283
* some gptq case could not be handled by ipex. but could be handle by t… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3298
* fix outline import issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3282
* HuggingFaceM4/Idefics3-8B-Llama3 crash fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3267
* Optimum neuron 0.3.0 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3308
* Disable Cachix pushes by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3312
* chore: prepare version 3.3.5 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3314
* feat: bump flake including transformers and huggingface_hub versions by @drbh in https://github.com/huggingface/text-generation-inference/pull/3313


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.4...v3.3.5
</Release>

<Release version="v3.3.4" date="June 19, 2025" published="2025-06-19T10:00:28.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.4">
Fix for Neuron models exported with batch_size 1.

## What's Changed
* [gaudi] gemma3 text and vlm model intial support. need to add sliding window … by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3270
* Neuron backend fix by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3273


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.3...v3.3.4
</Release>

<Release version="v3.3.3" date="June 18, 2025" published="2025-06-18T13:11:39.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.3">
Neuron backend update.

## What's Changed
* Remove useless packages by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3253
* Bump neuron SDK version by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3260
* Perf opt by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3256
* [gaudi] Vlm rebase and issue fix in benchmark test by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3263
* Move the _update_cos_sin_cache into get_cos_sin by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3254
* [Gaudi] Remove optimum-habana by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3261
* [gaudi] HuggingFaceM4/idefics2-8b issue fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3264
* [Gaudi] Enable Qwen3_moe model by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3244
* [Gaudi]Fix the integration-test issues by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3265
* [Gaudi] use pad_token_id to pad input id by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3268
* chore: prepare release 3.3.3 by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3269
* [gaudi] Refine logging for Gaudi warmup by @regisss in https://github.com/huggingface/text-generation-inference/pull/3222
* doc: fix README by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3271


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.2...v3.3.3
</Release>

<Release version="v3.3.2" date="May 30, 2025" published="2025-05-30T14:20:39.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.2">
Gaudi improvements.

## What's Changed
* upgrade to new vllm extension ops(fix issue in exponential bucketing) by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3239
* Nix: switch to hf-nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3240
* Add Qwen3 by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3229
* fp8 compressed_tensors w8a8 support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3242
* [Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3245
* Fix the Llama-4-Maverick-17B-128E crash issue by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3246
* Prepare for 3.3.2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3249


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.1...v3.3.2
</Release>

<Release version="v3.3.1" date="May 22, 2025" published="2025-05-22T07:49:07.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.1">
This release updates TGI to Torch 2.7 and CUDA 12.8.

## What's Changed
* change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in https://github.com/huggingface/text-generation-inference/pull/3217
* adjust the `round_up_seq` logic to align with prefill warmup phase on… by @kaixuanliu in https://github.com/huggingface/text-generation-inference/pull/3224
* Update to Torch 2.7.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3221
* Enable Llama4 for gaudi backend by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3223
* fix: count gpu uuids if NVIDIA_VISIBLE_DEVICES env set to all by @drbh in https://github.com/huggingface/text-generation-inference/pull/3230
* Deepseek r1 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3211
* Refine warmup and upgrade to synapse AI 1.21.0 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3234
* fix the crash in default ATTENTION path by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3235
* Switch to punica-sgmv kernel from the Hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3236
* move input_ids to hpu and remove disposal of adapter_meta by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3237
* Prepare for 3.3.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3238

## New Contributors
* @kaixuanliu made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3217

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.0...v3.3.1
</Release>

<Release version="v3.3.0" date="May 9, 2025" published="2025-05-09T13:57:39.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.0">
## Notable changes

* Prefill chunking for VLMs.
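
Prefill chunking splits a long prompt into fixed-size pieces that are prefilled sequentially rather than in one large forward pass, bounding peak memory; this release extends that mechanism to vision-language models. A toy sketch of the idea (illustrative only, not TGI's implementation):

```python
def chunked_prefill(prompt_tokens, chunk_size):
    """Process a prompt in fixed-size chunks, as prefill chunking does,
    instead of a single large forward pass. Toy model: 'processing' a
    chunk just records its length; a real engine would run the model
    forward on the chunk and append its keys/values to the KV cache."""
    kv_cache_len = 0
    n_passes = 0
    for start in range(0, len(prompt_tokens), chunk_size):
        chunk = prompt_tokens[start:start + chunk_size]
        kv_cache_len += len(chunk)  # KV cache grows chunk by chunk
        n_passes += 1
    return kv_cache_len, n_passes

# A 10-token "prompt" prefilled in chunks of 4 takes 3 forward passes
# instead of one pass over all 10 tokens at once.
cache_len, n_passes = chunked_prefill(list(range(10)), 4)
print(cache_len, n_passes)  # 10 3
```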

## What's Changed
* Fixing Qwen 2.5 VL (32B). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3157
* Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3156
* Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3113
* L4 fixes by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3161
* setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3171
* transformers flash llm/vlm enabling in ipex by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3152
* Upgrading the dependencies in Gaudi backend. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3170
* Hotfixing gaudi deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3174
* Hotfix gaudi2 with newer transformers. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3176
* Support flashinfer for Gemma3 prefill by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3167
* Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in https://github.com/huggingface/text-generation-inference/pull/2648
* Bump `sccache` to 0.10.0 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3179
* Fixing CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3184
* Add option to configure prometheus port by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3187
* Warmup gaudi backend by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3172
* Put more wiggle room. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3189
* Fixing the router + template for Qwen3. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3200
* Skip `{% generation %}` and `{% endgeneration %}` template handling by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3204
* doc typo by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3206
* Pr 2982 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/3046
* fix: bump snaps for mllama by @drbh in https://github.com/huggingface/text-generation-inference/pull/3202
* Update client SDK snippets by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3207
* Fix `HF_HUB_OFFLINE=1` for Gaudi backend by @regisss in https://github.com/huggingface/text-generation-inference/pull/3193
* IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3144
* forward and tokenize chooser use the same shape by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3196
* Chunked Prefill VLM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3188
* Prepare for 3.3.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3220

## New Contributors
* @kozistr made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2648
* @julien-c made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3206

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.3...v3.3.0
</Release>

<Release version="v3.2.3" date="April 8, 2025" published="2025-04-08T08:18:36.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.3">
## Main changes

- Patching Llama 4

## What's Changed
* Use ROCM 6.3.1 by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3141
* Update transformers to 4.51 by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3148
* Gaudi: Add Integration Test for Gaudi Backend by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3142
* fix: compute type typo by @oOraph in https://github.com/huggingface/text-generation-inference/pull/3150
* 3.2.3 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3151


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.2...v3.2.3
</Release>

<Release version="v3.2.2" date="April 6, 2025" published="2025-04-06T09:41:33.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.2">
## What's Changed
* Minor fixes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3125
* configurable termination timeout by @ErikKaum in https://github.com/huggingface/text-generation-inference/pull/3126
* CI: enable server tests for backends by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3128
* Torch 2.6 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3134
* Gaudi: Fix llava-next and mllama crash issue by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3127
* nix-v3.2.1 -> v3.2.1-nix by @co42 in https://github.com/huggingface/text-generation-inference/pull/3129
* Gaudi: Use exponential growth to replace BATCH_BUCKET_SIZE by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3131
* Add llama4 by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3145
* Preparing for release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3147

## New Contributors
* @co42 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3129

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.1...v3.2.2
</Release>

<Release version="v3.2.1" date="March 18, 2025" published="2025-03-18T14:28:12.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.1">
## What's Changed
* Update to `kernels` 0.2.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3084
* Router: add `gemma3-text` model type by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3107
* We need gcc during runtime to enable triton to compile kernels. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3103
* Release of Gaudi Backend for TGI by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3091
* Fixing the docker build. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3108
* Make the Nix-based Docker container work on non-NixOS by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3109
* xpu 2.6 update by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3051
* launcher: correctly get the head dimension for VLMs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3116
* Gaudi: Sync TGI with the latest changes from the TGI-Gaudi fork by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3117
* Bug Fix: Sliding Window Attention  by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3112
* Publish nix docker image. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3122
* Prepare for patch release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3124
* Intel docker. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3121


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.0...v3.2.1
</Release>

<Release version="v3.2.0" date="March 12, 2025" published="2025-03-12T10:17:46.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.2.0">
## Important changes

- BREAKING CHANGE: Tool calling has been reworked extensively. It now fully matches OpenAI's return format: the `arguments` field of a tool call is returned as a JSON-encoded string rather than a parsed JSON object. Numerous other tool-calling improvements and side-effect fixes are included.

- Added Gemma 3 support.
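
Because tool-call `arguments` now arrive as a JSON-encoded string (matching OpenAI's format), clients must decode that string themselves. A minimal sketch of the client-side change; the response fragment below is illustrative, not captured from TGI:

```python
import json

# Illustrative tool-call message in the OpenAI-compatible shape;
# field values here are made up for the example.
message = {
    "tool_calls": [
        {
            "id": "0",
            "type": "function",
            "function": {
                "name": "get_weather",
                # Since v3.2.0, `arguments` is a JSON-encoded string,
                # not a parsed object, so it must be decoded explicitly.
                "arguments": "{\"city\": \"Paris\", \"unit\": \"celsius\"}",
            },
        }
    ]
}

for call in message["tool_calls"]:
    args = json.loads(call["function"]["arguments"])  # decode the string
    print(call["function"]["name"], args["city"])  # get_weather Paris
```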

## What's Changed
* fix(neuron): explicitly install toolchain by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3072
* Only add token when it is defined. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3073
* Making sure Olmo (transformers backend) works. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3074
* Making `tool_calls` a vector. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3075
* Nix: add `openai` to impure shell for integration tests by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3081
* Update `--max-batch-total-tokens` description by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3083
* Fix tool call2 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3076
* Nix: the launcher needs a Python env with Torch for GPU detection by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3085
* Add request parameters to OTel span for `/v1/chat/completions` endpoint by @aW3st in https://github.com/huggingface/text-generation-inference/pull/3000
* Add qwen2 multi lora layers support by @EachSheep in https://github.com/huggingface/text-generation-inference/pull/3089
* Add modules_to_not_convert in quantized model by @jiqing-feng in https://github.com/huggingface/text-generation-inference/pull/3053
* Small test and typing fixes by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3078
* hotfix: qwen2 formatting by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3093
* Pr 3003 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/3007
* Update the llamacpp backend by @angt in https://github.com/huggingface/text-generation-inference/pull/3022
* Fix qwen vl by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3096
* Update README.md by @celsowm in https://github.com/huggingface/text-generation-inference/pull/3095
* Fix tool call3 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3086
* Add gemma3 model by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3099
* Fix tool call4 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3094
* Update neuron backend by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3098
* Preparing relase 3.2.0 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3100
* Try to fix on main CI color. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3101

## New Contributors
* @EachSheep made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3089
* @jiqing-feng made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3053

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.1.1...v3.2.0
</Release>

<Release version="v3.1.1" date="March 4, 2025" published="2025-03-04T17:15:23.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.1.1">
## What's Changed
* Back on nix main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2979
* hotfix: fix trtllm CI build on release by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2981
* Add `strftime_now` callable function for `minijinja` chat templates by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2983
* impureWithCuda: fix gcc version by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2990
* Improve qwen vl impl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2943
* Using the "lockfile". by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2992
* Triton fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2995
* [Backend] Bump TRTLLM to v.0.17.0 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2991
* Updating mllama after strftime. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2993
* Use kernels from the kernel hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2988
* fix Qwen VL break in intel platform by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3002
* Update the flaky mllama test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3015
* Preventing single user hugging the server to death by asking by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3016
* Putting back the NCCL forced upgrade. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2999
* Support sigmoid scoring function in GPTQ-MoE by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3017
* [Backend] Add Llamacpp backend by @angt in https://github.com/huggingface/text-generation-inference/pull/2975
* Use eetq kernel from the hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3029
* Update README.md by @celsowm in https://github.com/huggingface/text-generation-inference/pull/3024
* Add `loop_controls` feature to `minijinja` to handle `{% break %}` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2998
* Pinning trufflehog. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3032
* It's find in some machine. using hf_hub::api::sync::Api to download c… by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3030
* Improve Transformers support by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2970
* feat: add initial qwen2.5-vl model and test by @drbh in https://github.com/huggingface/text-generation-inference/pull/2971
* Using public external registry (to use external runners for CI). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3031
* Having less logs in case of failure for checking CI more easily. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3037
* feat: Add the parsing of HF_HUB_USER_AGENT_ORIGIN environment variable for telemetry by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/3027
* update ipex and torch to 2.6 for cpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3039
* flashinfer 0.2.0.post1 -> post2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3040
* fix qwen2 vl crash in continous batching by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3004
* Simplify logs2. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3045
* Update Gradio ChatInterface configuration in consuming_tgi.md by @angt in https://github.com/huggingface/text-generation-inference/pull/3042
* Improve tool call message processing by @drbh in https://github.com/huggingface/text-generation-inference/pull/3036
* Use `rotary` kernel from the Hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3041
* Add Neuron backend by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3033
* You need to seek apparently. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3049
* some minor fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3048
* fix: run linters and fix formatting by @drbh in https://github.com/huggingface/text-generation-inference/pull/3057
* Avoid running neuron integration tests twice by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3054
* Add Gaudi Backend by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3055
* Fix two edge cases in `RadixTrie::find` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3067
* Add property-based testing for `RadixAllocator` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3068
* feat: add support for HF_HUB_USER_AGENT_ORIGIN to add user-agent Origin field in Hub requests. by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/3061
* Preparing for release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3060
* Fix a tiny typo in `monitoring.md` tutorial by @sadra-barikbin in https://github.com/huggingface/text-generation-inference/pull/3056
* Patch rust release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3069

## New Contributors
* @angt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2975
* @celsowm made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3024
* @dacorvo made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3033
* @sadra-barikbin made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3056

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.1.0...v3.1.1
</Release>

<Release version="v3.1.0" date="January 31, 2025" published="2025-01-31T13:26:50.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.1.0">
## Important changes

DeepSeek R1 is fully supported on both AMD and NVIDIA GPUs!

```shell
docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1
```
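
Once the container is up, the server exposes an OpenAI-compatible API on port 8080. A hedged sketch of a request body you might POST to `http://localhost:8080/v1/chat/completions` (the prompt and generation parameters are placeholders):

```python
import json

# Build a chat completion request for TGI's OpenAI-compatible endpoint.
# The payload shape follows the OpenAI chat completions API; the model
# id is whatever was passed via --model-id at startup.
payload = {
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [{"role": "user", "content": "Why is the sky blue?"}],
    "max_tokens": 256,
    "stream": False,
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the server started by the docker
# command above):
#   curl http://localhost:8080/v1/chat/completions \
#     -H "Content-Type: application/json" -d @- <<< "$body"
```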

## What's Changed
* Attempt to remove AWS S3 flaky cache for sccache by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2953
* Update to attention-kernels 0.2.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2950
* fix: Telemetry by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2957
* Fixing the oom maybe with 2.5.1 change. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2958
* Add backend name to telemetry by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2962
* Add fp8 support moe models by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2928
* Update to moe-kernels 0.8.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2966
* Hotfixing intel-cpu (not sure how it was working before). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2967
* Add deepseekv3 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2968
* doc: Update TRTLLM deployment doc.  by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2960
* Update moe-kernel to 0.8.2 for rocm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2977
* Prepare for release 3.1.0 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2972


**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.0.2...v3.1.0
</Release>

<Release version="v3.0.2" date="January 24, 2025" published="2025-01-24T11:16:11.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.0.2">
TL;DR

**The new transformers backend supports Flash Attention at roughly the same performance as native TGI, allowing any model not officially supported by TGI to run in it directly. Congrats @Cyrilvallez!**

**New models unlocked**: Cohere2, OLMo, OLMo2, Helium.

## What's Changed
* docs(README): supported hardware links TGI AMD GPUs by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2814
* Fixing latest flavor by disabling it. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2831
* fix facebook/opt-125m not working issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2824
* Fixup opt to reduce the amount of odd if statements. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2833
* TensorRT-LLM backend bump to latest version + misc fixes by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2791
* Feat/trtllm cancellation dev container by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2795
* New arg. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2845
* Fixing CI. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2846
* fix: lint backend and doc files by @drbh in https://github.com/huggingface/text-generation-inference/pull/2850
* Qwen2-VL runtime error fix when prompted with multiple images by @janne-alatalo in https://github.com/huggingface/text-generation-inference/pull/2840
* Update vllm kernels for ROCM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2826
* change xpu lib download link by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2852
* fix: include add_special_tokens in kserve request by @drbh in https://github.com/huggingface/text-generation-inference/pull/2859
* chore: fixed some typos and attribute issues in README by @ruidazeng in https://github.com/huggingface/text-generation-inference/pull/2891
* update ipex xpu to fix issue in ARC770 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2884
* Basic flashinfer 0.2 support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2862
* Improve vlm support (add idefics3 support) by @drbh in https://github.com/huggingface/text-generation-inference/pull/2437
* Update to marlin-kernels 0.3.7 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2882
* chore: Update jsonschema to 0.28.0 by @Stranger6667 in https://github.com/huggingface/text-generation-inference/pull/2870
* Add possible variants for A100 and H100 GPUs for auto-detecting flops by @lazariv in https://github.com/huggingface/text-generation-inference/pull/2837
* Update using_guidance.md by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2901
* fix crash in torch2.6 if TP=1 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2885
* Add Flash decoding kernel ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2855
* Enable FP8 Per-Tensor Scales and Integrate Marlin/MoE Kernels Repo for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2825
* Baichuan2-13B does not have max_position_embeddings in config by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2903
* docs(conceptual/speculation): available links Train Medusa by @guspan-tanadi in https://github.com/huggingface/text-generation-inference/pull/2863
* Fix `docker run` in `README.md` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2861
* :memo: add guide on using TPU with TGI in the docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2907
* Upgrading our rustc version. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2908
* Fix typo in TPU docs by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/2911
* Removing the github runner. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2912
* Upgrading bitsandbytes. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2910
* Do not convert weight scale to e4m3fnuz on CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2917
* feat: improve star coder to support multi lora layers by @drbh in https://github.com/huggingface/text-generation-inference/pull/2883
* Flash decoding kernel adding and prefill-chunking and prefix caching enabling in intel cpu/xpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2815
* nix: update to PyTorch 2.5.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2921
* Moving to `uv` instead of `poetry`. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2919
* Add fp8 kv cache for ROCm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2856
* fix the crash of meta-llama/Llama-3.2-1B by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2918
* feat: improve qwen2-vl startup  by @drbh in https://github.com/huggingface/text-generation-inference/pull/2802
* Revert "feat: improve qwen2-vl startup " by @drbh in https://github.com/huggingface/text-generation-inference/pull/2924
* flashinfer: switch to plan API by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2904
* Fixing TRTLLM dockerfile. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2922
* Flash Transformers modeling backend support by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2913
* Give TensorRT-LLMa proper CI/CD 😍 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2886
* Trying to avoid the random timeout. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2929
* Run `pre-commit run --all-files` to fix CI by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2933
* Upgrading the deps to have transformers==4.48.0 necessary by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2937
* fix moe in quantization path by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2935
* Clarify FP8-Marlin use on capability 8.9 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2940
* Bump TensorRT-LLM backend dependency to v0.16.0 by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2931
* Set `alias` for `max_completion_tokens` in `ChatRequest` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2932
* Add NVIDIA A40 to known cards by @kldzj in https://github.com/huggingface/text-generation-inference/pull/2941
* [TRTLLM] Expose finish reason by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2841
* Tmp tp transformers by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2942
* Transformers backend TP fix by @Cyrilvallez in https://github.com/huggingface/text-generation-inference/pull/2945
* Trying to put back the archlist (to fix the oom). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2947

## New Contributors
* @janne-alatalo made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2840
* @ruidazeng made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2891
* @Stranger6667 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2870
* @lazariv made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2837
* @baptistecolle made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2907
* @Cyrilvallez made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2913
* @kldzj made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2941

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.0.1...v3.0.2
</Release>

<Release version="v3.0.1" date="December 11, 2024" published="2024-12-11T20:13:58.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.0.1">
## Summary

Patch release to handle a few older models and corner cases.

## What's Changed
* Hotfix link2 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2812
* Small update to docs by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2816
* Using both value from config as they might not be correct. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2817
* Update README.md by @RodriMora in https://github.com/huggingface/text-generation-inference/pull/2827
* Prepare patch release. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2829

## New Contributors
* @RodriMora made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2827

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.0.0...v3.0.1
</Release>

<Release version="v3.0.0" date="December 9, 2024" published="2024-12-09T20:22:42.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v3.0.0">
## TL;DR

Major new release centered on chunked prefill, delivering large throughput gains, especially on long prompts (benchmarks below).


![benchmarks_v3](https://github.com/huggingface/text-generation-inference/blob/042791fbd5742b1644d42c493db6bec669df6537/assets/v3_benchmarks.png)

Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking
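The chunking idea linked above can be sketched in a few lines. This is an illustrative toy, not TGI's implementation: prefill chunking processes a long prompt's tokens in fixed-size chunks instead of one huge forward pass, which bounds peak memory and lets the scheduler interleave decode steps between chunks (the chunk size and prompt length below are made up):

```python
def chunked_prefill(prompt_tokens, chunk_size=256):
    """Yield successive fixed-size chunks of the prompt for incremental prefill."""
    for start in range(0, len(prompt_tokens), chunk_size):
        yield prompt_tokens[start:start + chunk_size]

# Stand-in for 1000 prompt token ids.
prompt = list(range(1000))
chunks = list(chunked_prefill(prompt, chunk_size=256))
assert len(chunks) == 4                          # 256 + 256 + 256 + 232
assert sum(len(c) for c in chunks) == len(prompt)  # nothing lost
```

Because every chunk is the same bounded size, the memory needed for a prefill step no longer scales with the longest prompt in the batch.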

## What's Changed
* feat: concat the adapter id to the model id in chat response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2779
* Move JSON grammar -> regex grammar conversion to the router by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2772
* Use FP8 KV cache when specified by compressed-tensors by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2761
* upgrade ipex cpu to fix coredump in tiiuae/falcon-7b-instruct (pageat… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2778
* Fix: docs typo by @jp1924 in https://github.com/huggingface/text-generation-inference/pull/2777
* Support continue final message by @drbh in https://github.com/huggingface/text-generation-inference/pull/2733
* Fix doc. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2792
* Removing ../ that broke the link by @Getty in https://github.com/huggingface/text-generation-inference/pull/2789
* fix: add merge-lora arg for model id by @drbh in https://github.com/huggingface/text-generation-inference/pull/2788
* fix: only use eos_token_id as pad_token_id if int by @dvrogozh in https://github.com/huggingface/text-generation-inference/pull/2774
* Sync (most) server dependencies with Nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2782
* Saving some VRAM. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2790
* fix: avoid setting use_sgmv if no kernels present by @drbh in https://github.com/huggingface/text-generation-inference/pull/2796
* use oneapi 2024 docker image directly for xpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2793
* feat: auto max_new_tokens by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2803
* Auto max prefill by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2797
* Adding A100 compute. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2806
* Enable paligemma2 by @drbh in https://github.com/huggingface/text-generation-inference/pull/2807
* Attempt for cleverer auto batch_prefill values (some simplifications). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2808
* V3 doc by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2809
* Prep new version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2810
* Hotfixing the link. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2811

## New Contributors
* @jp1924 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2777
* @Getty made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2789

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.4.1...v3.0.0
</Release>

<Release version="v2.4.1" date="November 22, 2024" published="2024-11-22T17:35:00.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v2.4.1">
## Notable changes

* Choose input/total tokens automatically based on available VRAM
* Support Qwen2 VL
* Decrease latency of very large batches (> 128)
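The automatic input/total token selection above boils down to budgeting the VRAM left after model weights against the per-token KV-cache cost. The following is a back-of-envelope sketch with hypothetical model dimensions, not TGI's actual heuristic:

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, dtype_bytes=2):
    """Per-token KV-cache size: 2x (keys and values) across all layers."""
    return 2 * layers * kv_heads * head_dim * dtype_bytes

def max_total_tokens(free_vram_bytes, layers, kv_heads, head_dim):
    """How many tokens of KV cache fit in the remaining VRAM."""
    return free_vram_bytes // kv_bytes_per_token(layers, kv_heads, head_dim)

# Example with a Llama-8B-like shape: 32 layers, 8 KV heads, head_dim 128, fp16.
per_token = kv_bytes_per_token(32, 8, 128)          # 131072 bytes = 128 KiB/token
budget = max_total_tokens(10 * 1024**3, 32, 8, 128)  # 10 GiB free after weights
assert per_token == 131072
assert budget == 81920
```

With 10 GiB to spare, this shape supports roughly 80k total tokens of KV cache, which is the kind of number the launcher can now derive instead of requiring manual `--max-total-tokens` tuning.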


## What's Changed

* feat: add triton kernels to decrease latency of large batches by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2687
* Avoiding timeout for bloom tests. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2693
* Green main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2697
* Choosing input/total tokens automatically based on available VRAM? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2673
* We can have a tokenizer anywhere. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2527
* Update poetry lock. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2698
* Fixing auto bloom test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2699
* More timeout on docker start ? by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2701
* Monkey patching as a desperate measure. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2704
* add xpu triton in dockerfile, or will show "Could not import Flash At… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2702
* Support qwen2 vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2689
* fix cuda graphs for qwen2-vl by @drbh in https://github.com/huggingface/text-generation-inference/pull/2708
* fix: create position ids for text only input by @drbh in https://github.com/huggingface/text-generation-inference/pull/2714
* fix: add chat_tokenize endpoint to api docs by @drbh in https://github.com/huggingface/text-generation-inference/pull/2710
* Hotfixing auto length (warmup max_s was wrong). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2716
* Fix prefix caching + speculative decoding by @tgaddair in https://github.com/huggingface/text-generation-inference/pull/2711
* Fixing linting on main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2719
* nix: move to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2718
* fix incorrect output of Qwen2-7B-Instruct-GPTQ-Int4 and Qwen2-7B-Inst… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2717
* add trust_remote_code in tokenizer to fix baichuan issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2725
* Add initial support for compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2732
* nix: update nixpkgs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2746
* benchmark: fix prefill throughput by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2741
* Fix: Change model_type from ssm to mamba by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2740
* Fix: Change embeddings to embedding by @mokeddembillel in https://github.com/huggingface/text-generation-inference/pull/2738
* fix response type of document for Text Generation Inference by @jitokim in https://github.com/huggingface/text-generation-inference/pull/2743
* Upgrade outlines to 0.1.1 by @aW3st in https://github.com/huggingface/text-generation-inference/pull/2742
* Upgrading our deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2750
* feat: return streaming errors as an event formatted for openai's client by @drbh in https://github.com/huggingface/text-generation-inference/pull/2668
* Remove vLLM dependency for CUDA by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2751
* fix: improve find_segments via numpy diff by @drbh in https://github.com/huggingface/text-generation-inference/pull/2686
* add ipex moe implementation to support Mixtral and PhiMoe by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2707
* Add support for compressed-tensors w8a8 int checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2745
* feat: support flash attention 2 in qwen2 vl vision blocks by @drbh in https://github.com/huggingface/text-generation-inference/pull/2721
* Simplify two ipex conditions by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2755
* Update to moe-kernels 0.7.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2720
* PR 2634 CI - Fix the tool_choice format for named choice by adapting OpenAIs scheme by @drbh in https://github.com/huggingface/text-generation-inference/pull/2645
* fix: adjust llama MLP name from dense to mlp to correctly apply lora by @drbh in https://github.com/huggingface/text-generation-inference/pull/2760
* nix: update for outlines 0.1.4 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2764
* Add support for wNa16 int 2:4 compressed-tensors checkpoints by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2758
* nix: build and cache impure devshells by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2765
* fix: set outlines version to 0.1.3 to avoid caching serialization issue by @drbh in https://github.com/huggingface/text-generation-inference/pull/2766
* nix: downgrade to outlines 0.1.3 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2768
* fix: incomplete generations w/ single tokens generations and models that did not support chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2770
* fix: tweak grammar test response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2769
* Add a README section about using Nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2767
* Remove guideline from API by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/2762
* feat: Add automatic nightly benchmarks by @Hugoch in https://github.com/huggingface/text-generation-inference/pull/2591
* feat: add payload limit by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2726
* Update to marlin-kernels 0.3.6 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2771
* chore: prepare 2.4.1 release by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2773

## New Contributors
* @tgaddair made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2711
* @mokeddembillel made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2740
* @jitokim made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2743

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4.1
</Release>

<Release version="v2.4.0" date="October 25, 2024" published="2024-10-25T21:14:13.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v2.4.0">
## Notable changes

* Experimental prefill chunking (`PREFILL_CHUNKING=1`)
* Experimental FP8 KV cache support
* Greatly decrease latency for large batches (> 128 requests)
* Faster MoE kernels and support for GPTQ-quantized MoE
* Faster Mllama implementation
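The FP8 KV cache item above is, at its core, a storage-width change. Illustrative arithmetic only (dimensions are made up): FP8 stores one byte per element instead of two for fp16/bf16, halving KV-cache memory and thus roughly doubling token capacity for the same VRAM budget:

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, dtype_bytes):
    """Total KV-cache size: 2x (keys and values) per token, per layer."""
    return tokens * 2 * layers * kv_heads * head_dim * dtype_bytes

# Same hypothetical shape, 4096 cached tokens, fp16 vs. fp8 storage.
fp16_cache = kv_cache_bytes(4096, 32, 8, 128, dtype_bytes=2)
fp8_cache = kv_cache_bytes(4096, 32, 8, 128, dtype_bytes=1)
assert fp8_cache * 2 == fp16_cache  # FP8 halves the cache footprint
```

The trade-off is quantization error in the cached keys/values, which is why the feature is flagged experimental in this release.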

## What's Changed
* nix: remove unused `_server.nix` file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538
* chore: Add old V2 backend by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2551
* Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547
* Micro cleanup. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2555
* Hotfixing main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2556
* Add support for scalar FP8 weight scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2550
* Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537
* Update the link to the Ratatui organization by @orhun in https://github.com/huggingface/text-generation-inference/pull/2546
* Simplify crossterm imports by @orhun in https://github.com/huggingface/text-generation-inference/pull/2545
* Adding note for private models in quick-tour document by @ariG23498 in https://github.com/huggingface/text-generation-inference/pull/2548
* Hotfixing main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2562
* Cleanup Vertex + Chat by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2553
* More tensor cores. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2558
* remove LORA_ADAPTERS_PATH by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2563
* Add LoRA adapters support for Gemma2 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2567
* Fix build with `--features google` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566
* Improve support for GPUs with capability < 8 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2575
* flashinfer: pass window size and dtype by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2574
* Remove compute capability lazy cell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2580
* Update architecture.md by @ulhaqi12 in https://github.com/huggingface/text-generation-inference/pull/2577
* Update ROCM libs and improvements  by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2579
* Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2557
* feat: support phi3.5 moe by @drbh in https://github.com/huggingface/text-generation-inference/pull/2479
* Move flake back to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586
* MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590
* nix: experimental support for building a Docker container by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2470
* Mllama flash version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2585
* Max token capacity metric by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2595
* CI (2592): Allow LoRA adapter revision in server launcher by @drbh in https://github.com/huggingface/text-generation-inference/pull/2602
* Unroll notify error into generate response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2597
* New release 2.3.1 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2604
* Revert "Unroll notify error into generate response" by @drbh in https://github.com/huggingface/text-generation-inference/pull/2605
* nix: example of local package overrides during development by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2607
* Add basic FP8 KV cache support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2603
* Fp8 Cache condition by @flozi00 in https://github.com/huggingface/text-generation-inference/pull/2611
* enable mllama in intel platform by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2610
* Upgrade minor rust version (Fixes rust build compilation cache) by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2617
* Add support for fused MoE Marlin for AWQ by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2616
* nix: move back to the tgi-nix main branch by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2620
* CI (2599): Update ToolType input schema by @drbh in https://github.com/huggingface/text-generation-inference/pull/2601
* nix: add black and isort to the closure by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2619
* AMD CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2589
* feat: allow tool calling to respond without a tool by @drbh in https://github.com/huggingface/text-generation-inference/pull/2614
* Update documentation to most recent stable version of TGI. by @Vaibhavs10 in https://github.com/huggingface/text-generation-inference/pull/2625
* Intel ci by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2630
* Fixing intel Supports windowing. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2637
* Small fixes for supported models by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2471
* Cpu perf by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2596
* Clarify gated description and quicktour by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2631
* update ipex to fix incorrect output of mllama in cpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2640
* feat: enable pytorch xpu support for non-attention models by @dvrogozh in https://github.com/huggingface/text-generation-inference/pull/2561
* Fixing linters. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2650
* Rollback to `ChatRequest` for Vertex AI Chat instead of `VertexChat` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2651
* Fp8 e4m3_fnuz support for rocm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2588
* feat: prefill chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2600
* Support `e4m3fn` KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2655
* Simplify the `attention` function by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2609
* fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process by @oOraph in https://github.com/huggingface/text-generation-inference/pull/2663
* fix: prefer inplace softmax to avoid copy by @drbh in https://github.com/huggingface/text-generation-inference/pull/2661
* Break cycle between the attention implementations and KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2627
* CI job. Gpt awq 4 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2665
* Make handling of FP8 scales more consisent by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2666
* Test Marlin MoE with `desc_act=true` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2622
* break when there's nothing to read by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2582
* Add `impureWithCuda` dev shell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2677
* Make moe-kernels and marlin-kernels mandatory in CUDA installs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2632
* feat: natively support Granite models by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2682
* feat: allow any supported payload on /invocations by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2683
* flashinfer: reminder to remove contiguous call in the future by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2685
* Fix Phi 3.5 MoE tests by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2684
* Add support for FP8 KV cache scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2628
* Fixing "deadlock" when python prompts for trust_remote_code by always by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2664
* [TENSORRT-LLM] - Implement new looper thread based backend by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2357
* Fixing rocm gptq by using triton code too (renamed cuda into triton). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2691
* Fixing mt0 test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2692
* Add support for stop words in TRTLLM  by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2678
* Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2688

## New Contributors
* @alvarobartt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2547
* @orhun made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2546
* @ariG23498 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2548
* @ulhaqi12 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2577
* @mht-sharma made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2579
* @dvrogozh made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2561

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4.0
</Release>

<Release version="v2.3.1" date="October 3, 2024" published="2024-10-03T13:01:49.000Z" url="https://github.com/huggingface/text-generation-inference/releases/tag/v2.3.1">
## Important changes

* Added support for Mllama (Llama 3.2 vision models), with flash attention and unpadded inputs
* FP8 performance improvements
* MoE performance improvements
* BREAKING CHANGE: when using tools, models could previously answer with a `notify_error` tool call carrying the error content; they now return a regular generation instead

## What's Changed
* nix: remove unused `_server.nix` file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538
* chore: Add old V2 backend by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2551
* Remove duplicated `RUN` in `Dockerfile` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547
* Micro cleanup. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2555
* Hotfixing main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2556
* Add support for scalar FP8 weight scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2550
* Add `DenseMoELayer` and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537
* Update the link to the Ratatui organization by @orhun in https://github.com/huggingface/text-generation-inference/pull/2546
* Simplify crossterm imports by @orhun in https://github.com/huggingface/text-generation-inference/pull/2545
* Adding note for private models in quick-tour document by @ariG23498 in https://github.com/huggingface/text-generation-inference/pull/2548
* Hotfixing main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2562
* Cleanup Vertex + Chat by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2553
* More tensor cores. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2558
* remove LORA_ADAPTERS_PATH by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2563
* Add LoRA adapters support for Gemma2 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2567
* Fix build with `--features google` by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566
* Improve support for GPUs with capability < 8 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2575
* flashinfer: pass window size and dtype by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2574
* Remove compute capability lazy cell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2580
* Update architecture.md by @ulhaqi12 in https://github.com/huggingface/text-generation-inference/pull/2577
* Update ROCM libs and improvements  by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2579
* Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2557
* feat: support phi3.5 moe by @drbh in https://github.com/huggingface/text-generation-inference/pull/2479
* Move flake back to tgi-nix `main` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586
* MoE Marlin: support `desc_act` for `groupsize != -1` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590
* nix: experimental support for building a Docker container by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2470
* Mllama flash version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2585
* Max token capacity metric by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2595
* CI (2592): Allow LoRA adapter revision in server launcher by @drbh in https://github.com/huggingface/text-generation-inference/pull/2602
* Unroll notify error into generate response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2597
* New release 2.3.1 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2604

## New Contributors
* @alvarobartt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2547
* @orhun made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2546
* @ariG23498 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2548
* @ulhaqi12 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2577
* @mht-sharma made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2579

**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.3.1
</Release>

<Pagination page="1" total-pages="4" total-items="67" next="https://releases.sh/hugging-face/text-generation-inference.md?page=2" />
