v2.4.0

October 25, 2024 | Text Generation Inference

Notable changes

Experimental prefill chunking (PREFILL_CHUNKING=1)
Experimental FP8 KV cache support
Greatly decreased latency for large batches (> 128 requests)
Faster MoE kernels and support for GPTQ-quantized MoE
Faster implementation of MLLama
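
As a rough illustration of how the experimental prefill chunking flag from the list above might be switched on, the sketch below builds a launcher command and an environment with `PREFILL_CHUNKING=1` set. The variable name comes from these notes. The helper `build_launch`, the model id, and the port are hypothetical and chosen only for the example. Whether the launcher is invoked directly or through a container depends on your setup.

```python
import os


def build_launch(model_id: str, port: int = 8080) -> tuple[list[str], dict[str, str]]:
    """Return (argv, env) for starting TGI with prefill chunking enabled.

    Hypothetical helper for illustration; it only assembles the command
    and environment, it does not start anything.
    """
    env = dict(os.environ)
    # Variable name and value as given in the release notes (experimental).
    env["PREFILL_CHUNKING"] = "1"
    argv = [
        "text-generation-launcher",
        "--model-id", model_id,
        "--port", str(port),
    ]
    return argv, env


# "my-org/my-model" is a placeholder model id, not a real repository.
argv, env = build_launch("my-org/my-model")
# To actually start the server, assuming the launcher is on PATH:
#   import subprocess; subprocess.run(argv, env=env, check=True)
```

Keeping the command construction separate from the call that runs it makes it easy to check the configuration before starting a long-lived server process.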

What's Changed

nix: remove unused _server.nix file by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2538
chore: Add old V2 backend by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2551
Remove duplicated RUN in Dockerfile by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2547
Micro cleanup. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2555
Hotfixing main by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2556
Add support for scalar FP8 weight scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2550
Add DenseMoELayer and wire it up in Mixtral/Deepseek V2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2537
Update the link to the Ratatui organization by @orhun in https://github.com/huggingface/text-generation-inference/pull/2546
Simplify crossterm imports by @orhun in https://github.com/huggingface/text-generation-inference/pull/2545
Adding note for private models in quick-tour document by @ariG23498 in https://github.com/huggingface/text-generation-inference/pull/2548
Hotfixing main. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2562
Cleanup Vertex + Chat by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2553
More tensor cores. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2558
remove LORA_ADAPTERS_PATH by @nbroad1881 in https://github.com/huggingface/text-generation-inference/pull/2563
Add LoRA adapters support for Gemma2 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2567
Fix build with --features google by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2566
Improve support for GPUs with capability < 8 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2575
flashinfer: pass window size and dtype by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2574
Remove compute capability lazy cell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2580
Update architecture.md by @ulhaqi12 in https://github.com/huggingface/text-generation-inference/pull/2577
Update ROCM libs and improvements by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2579
Add support for GPTQ-quantized MoE models using MoE Marlin by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2557
feat: support phi3.5 moe by @drbh in https://github.com/huggingface/text-generation-inference/pull/2479
Move flake back to tgi-nix main by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2586
MoE Marlin: support desc_act for groupsize != -1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2590
nix: experimental support for building a Docker container by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2470
Mllama flash version by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2585
Max token capacity metric by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2595
CI (2592): Allow LoRA adapter revision in server launcher by @drbh in https://github.com/huggingface/text-generation-inference/pull/2602
Unroll notify error into generate response by @drbh in https://github.com/huggingface/text-generation-inference/pull/2597
New release 2.3.1 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2604
Revert "Unroll notify error into generate response" by @drbh in https://github.com/huggingface/text-generation-inference/pull/2605
nix: example of local package overrides during development by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2607
Add basic FP8 KV cache support by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2603
Fp8 Cache condition by @flozi00 in https://github.com/huggingface/text-generation-inference/pull/2611
enable mllama in intel platform by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2610
Upgrade minor rust version (Fixes rust build compilation cache) by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2617
Add support for fused MoE Marlin for AWQ by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2616
nix: move back to the tgi-nix main branch by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2620
CI (2599): Update ToolType input schema by @drbh in https://github.com/huggingface/text-generation-inference/pull/2601
nix: add black and isort to the closure by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2619
AMD CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2589
feat: allow tool calling to respond without a tool by @drbh in https://github.com/huggingface/text-generation-inference/pull/2614
Update documentation to most recent stable version of TGI. by @Vaibhavs10 in https://github.com/huggingface/text-generation-inference/pull/2625
Intel ci by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2630
Fixing intel Supports windowing. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2637
Small fixes for supported models by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2471
Cpu perf by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2596
Clarify gated description and quicktour by @osanseviero in https://github.com/huggingface/text-generation-inference/pull/2631
update ipex to fix incorrect output of mllama in cpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2640
feat: enable pytorch xpu support for non-attention models by @dvrogozh in https://github.com/huggingface/text-generation-inference/pull/2561
Fixing linters. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2650
Rollback to ChatRequest for Vertex AI Chat instead of VertexChat by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/2651
Fp8 e4m3_fnuz support for rocm by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/2588
feat: prefill chunking by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2600
Support e4m3fn KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2655
Simplify the attention function by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2609
fix tgi-entrypoint wrapper in docker file: exec instead of spawning a child process by @oOraph in https://github.com/huggingface/text-generation-inference/pull/2663
fix: prefer inplace softmax to avoid copy by @drbh in https://github.com/huggingface/text-generation-inference/pull/2661
Break cycle between the attention implementations and KV cache by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2627
CI job. Gpt awq 4 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2665
Make handling of FP8 scales more consistent by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2666
Test Marlin MoE with desc_act=true by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2622
break when there's nothing to read by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/2582
Add impureWithCuda dev shell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2677
Make moe-kernels and marlin-kernels mandatory in CUDA installs by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2632
feat: natively support Granite models by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2682
feat: allow any supported payload on /invocations by @OlivierDehaene in https://github.com/huggingface/text-generation-inference/pull/2683
flashinfer: reminder to remove contiguous call in the future by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2685
Fix Phi 3.5 MoE tests by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2684
Add support for FP8 KV cache scales by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2628
Fixing "deadlock" when python prompts for trust_remote_code by always by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2664
[TENSORRT-LLM] - Implement new looper thread based backend by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2357
Fixing rocm gptq by using triton code too (renamed cuda into triton). by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2691
Fixing mt0 test. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2692
Add support for stop words in TRTLLM by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2678
Switch from fbgemm-gpu w8a8 scaled matmul to vLLM/marlin-kernels by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2688

New Contributors

@alvarobartt made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2547
@orhun made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2546
@ariG23498 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2548
@ulhaqi12 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2577
@mht-sharma made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2579
@dvrogozh made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2561

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4
