Renamed `HUGGINGFACE_HUB_CACHE` to `HF_HOME`. This harmonizes environment variables across the HF ecosystem. As a result, data locations in the Docker image moved from `/data/models-....` to `/data/hub/models-....`.
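Concretely, one variable now points TGI and the rest of the Hugging Face tooling at the same cache. A minimal sketch, where `/data` stands in for whatever volume you mount into the container:

```shell
# HF_HOME replaces the older HUGGINGFACE_HUB_CACHE for TGI's model cache.
# /data is an example path, e.g. the volume mounted into the Docker container.
export HF_HOME=/data

# Downloaded models now land under $HF_HOME/hub (hence /data/hub/models-...).
echo "$HF_HOME/hub"
```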
Prefix caching by default! To speed up time-to-first-token (TTFT) on long-running queries, TGI will use prefix caching and reuse pre-existing entries in the KV cache. This should be totally transparent for most users; however, it required an intense rewrite of internals, so bugs can potentially exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that aren't supported by flashinfer).
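As a rough illustration of the idea (a toy sketch, not TGI's actual implementation): KV-cache state computed for a prompt prefix is keyed by the prefix's token ids, so a later request that shares the prefix only needs to prefill the remaining tokens.

```python
# Toy sketch of prefix caching: KV blocks computed for a prompt prefix are
# keyed by the prefix's token ids, so a request sharing that prefix can skip
# recomputing it and prefill only the tail.

def make_prefix_cache():
    cache = {}  # tuple of token ids -> precomputed "KV block" placeholder

    def lookup(tokens):
        # Find the longest cached prefix of `tokens`.
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in cache:
                return end, cache[key]
        return 0, None

    def store(tokens, kv):
        cache[tuple(tokens)] = kv

    return lookup, store

lookup, store = make_prefix_cache()
store([1, 2, 3], "kv-for-123")          # first request fills the cache
reused, kv = lookup([1, 2, 3, 4, 5])    # second request shares the prefix
# Only tokens[reused:] (here [4, 5]) still need a prefill pass.
```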
Lots of performance improvements with Marlin and quantization.
* `layers.marlin` into several files by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2292
* `GPTQMarlinWeightLoader` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2300
* `text-generation-benchmark` to pure devshell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2431
* `syrupy` and update in Poetry by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2497
* `ratatui`, not (deprecated) `tui` by @strickvl in https://github.com/huggingface/text-generation-inference/pull/2521
* `--quantize` is not needed for pre-quantized models by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2536

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.2.0...v2.3.0
* `flash_xxx.py` files by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2166
* `prefix` in model constructors by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2191
* `Weights` class by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2194
* `quantize` subcommand by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2120
* server quantize: expose groupsize option by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2225
* `quantize` argument in `get_weights_col_packed_qkv` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2237
* `VlmCausalLM` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2258

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.1.1...v2.2.0
* `object` field for regular completions by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2175

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.1.0...v2.1.1
New models: Gemma 2
Multi-LoRA adapters. You can now serve multiple LoRA adapters from the same TGI deployment: https://github.com/huggingface/text-generation-inference/pull/2010
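For example, a sketch assuming the `--lora-adapters` launcher flag and the `adapter_id` request parameter introduced for this feature (the model and adapter names below are hypothetical placeholders):

```shell
# Load several LoRA adapters at startup (adapter ids are example placeholders).
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mistral-7B-v0.1 \
    --lora-adapters some-org/adapter-one,some-org/adapter-two

# Then pick an adapter per request via the adapter_id parameter.
curl localhost:8080/generate \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Hello", "parameters": {"adapter_id": "some-org/adapter-one"}}'
```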
Faster GPTQ inference and Marlin support (up to 2x speedup).
Reworked the entire scheduling logic (better block allocation, enabling further speedups in upcoming releases).
Lots of ROCm support and bugfixes.
Lots of new contributors! Thanks a lot for these contributions.
* `AutoTokenizer` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1947
* `layers/attention` and make hardware differences more obvious with 1 file per hardware by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1986
* `tp>1` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2003
* `make install` work better by default by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2004
* `make install` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2008
* `text-generation-server quantize` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2103
* `HF_TOKEN` environment variable by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/2066

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.3...v2.1.0
* `AutoTokenizer` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1947

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.3...v2.0.4
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.2...v2.0.3
* `--cuda-graphs 0` work as expected (bis) by @fxmarty in https://github.com/huggingface/text-generation-inference/pull/1768
* `GenerateParameters` by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/1798
* `HF_HUB_OFFLINE` support in the router by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1789
* `tool_prompt` parameter to Python client by @maziyarpanahi in https://github.com/huggingface/text-generation-inference/pull/1825

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.1...v2.0.2
* `/v1/chat/completions` and `/v1/completions` by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/1747

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.0...v2.0.1
Try out Command R+ with Medusa heads on 4xA100s with:

```shell
model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model --speculate 3 --num-shard 4
```
* `--trust-remote-code` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1704

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.5...v2.0.0
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.4...v1.4.5
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.3...v1.4.4
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.2...v1.4.3
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.1...v1.4.2
* `name` field to OpenAI compatible API Messages by @amihalik in https://github.com/huggingface/text-generation-inference/pull/1563

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.0...v1.4.1
* `decoder_input_details` on OpenAI-compatible chat streaming, pass temp and top-k from API by @EndlessReform in https://github.com/huggingface/text-generation-inference/pull/1470
* `/tokenize` route to get the tokenized input by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1471

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.4...v1.4.0
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.3...v1.3.4
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.2...v1.3.3
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.1...v1.3.2
Hotfix Mixtral implementation
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.0...v1.3.1
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.2.0...v1.3.0