Renamed HUGGINGFACE_HUB_CACHE to HF_HOME. This harmonizes environment variables across the Hugging Face ecosystem. As a result, data locations in the Docker image moved from /data/models-.... to /data/hub/models-.....
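For illustration, existing Docker volume mounts keep working after this change; only the layout inside /data shifts. A hedged sketch (the model id and image tag below are examples, not taken from these notes):

```shell
# Launch TGI as before; the mounted /data volume does not need to change.
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:2.3.0 \
    --model-id meta-llama/Llama-3.1-8B-Instruct

# Downloaded weights now land under the hub layout:
#   ./data/hub/models--meta-llama--Llama-3.1-8B-Instruct
# instead of the previous
#   ./data/models--meta-llama--Llama-3.1-8B-Instruct
```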
Prefix caching by default! To help with long-running queries, TGI now uses prefix caching to reuse pre-existing queries in the KV cache and speed up TTFT (time to first token). This should be completely transparent for most users; however, it required an intense rewrite of the internals, so bugs can potentially exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that flashinfer does not support).
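To make the idea concrete, here is a toy sketch (not TGI's actual implementation) of what prefix caching buys you: KV states are stored keyed by token prefixes, and a new request only needs prefill for the tokens past its longest cached prefix.

```python
# Toy prefix cache: maps token prefixes to a (simulated) KV state.
# A real server caches attention KV blocks; strings stand in here.
class PrefixCache:
    def __init__(self):
        self._store = {}  # tuple(token_ids) -> kv state

    def insert(self, tokens, state):
        self._store[tuple(tokens)] = state

    def longest_match(self, tokens):
        # Scan from the full sequence down to length 1,
        # returning the longest cached prefix and its state.
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self._store:
                return key, self._store[key]
        return (), None

cache = PrefixCache()
cache.insert([1, 2, 3], "kv-for-[1,2,3]")

prefix, state = cache.longest_match([1, 2, 3, 4, 5])
# prefix == (1, 2, 3): only tokens [4, 5] still need prefill,
# which is what shortens TTFT for requests sharing a long prompt.
```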
Lots of performance improvements with Marlin and quantization.
- layers.marlin into several files by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2292
- GPTQMarlinWeightLoader by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2300
- text-generation-benchmark to pure devshell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2431
- syrupy and update in Poetry by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2497
- ratatui not (deprecated) tui by @strickvl in https://github.com/huggingface/text-generation-inference/pull/2521
- --quantize is not needed for pre-quantized models by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2536

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.2.0...v2.3.0