Renamed HUGGINGFACE_HUB_CACHE to HF_HOME. This harmonizes environment variables across the Hugging Face ecosystem. As a result, data locations in the Docker image moved from /data/models-.... to /data/hub/models-.....
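For illustration, existing Docker volume mounts keep working after this change; only the layout inside /data shifts. A hedged sketch (the model id and image tag below are examples, not taken from these notes):

```shell
# Launch TGI as before; the mounted /data volume does not need to change.
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.3.0 \
    --model-id meta-llama/Llama-3.1-8B-Instruct

# Downloaded weights now land under the hub layout:
#   ./data/hub/models--meta-llama--Llama-3.1-8B-Instruct
# instead of the previous
#   ./data/models--meta-llama--Llama-3.1-8B-Instruct
```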
Prefix caching by default! To help with long-running queries, TGI now uses prefix caching to reuse pre-existing queries in the KV cache and speed up TTFT (time to first token). This should be completely transparent for most users; however, it required an intense rewrite of the internals, so bugs can potentially exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that flashinfer does not support).
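To make the idea concrete, here is a toy sketch (not TGI's actual implementation) of what prefix caching buys you: KV states are stored keyed by token prefixes, and a new request only needs prefill for the tokens past its longest cached prefix.

```python
# Toy prefix cache: maps token prefixes to a (simulated) KV state.
# A real server caches attention KV blocks; strings stand in here.
class PrefixCache:
    def __init__(self):
        self._store = {}  # tuple(token_ids) -> kv state

    def insert(self, tokens, state):
        self._store[tuple(tokens)] = state

    def longest_match(self, tokens):
        # Scan from the full sequence down to length 1,
        # returning the longest cached prefix and its state.
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._store:
                return key, self._store[key]
        return (), None

cache = PrefixCache()
cache.insert([1, 2, 3], "kv-for-[1,2,3]")

prefix, state = cache.longest_match([1, 2, 3, 4, 5])
# prefix == (1, 2, 3): only tokens [4, 5] still need prefill,
# which is what shortens TTFT for requests sharing a long prompt.
```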
Lots of performance improvements with Marlin and quantization.
- layers.marlin into several files by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2292
- GPTQMarlinWeightLoader by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2300
- text-generation-benchmark to pure devshell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2431
- syrupy and update in Poetry by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2497
- ratatui not (deprecated) tui by @strickvl in https://github.com/huggingface/text-generation-inference/pull/2521
- --quantize is not needed for pre-quantized models by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2536

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.2.0...v2.3.0