Renamed `HUGGINGFACE_HUB_CACHE` to `HF_HOME`. This harmonizes environment variables across the HF ecosystem. As a result, data locations in the Docker image moved from `/data/models-....` to `/data/hub/models-....`.
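Concretely, one variable now points TGI and the rest of the Hugging Face tooling at the same cache. A minimal sketch, where `/data` stands in for whatever volume you mount into the container:

```shell
# HF_HOME replaces the older HUGGINGFACE_HUB_CACHE for TGI's model cache.
# /data is an example path, e.g. the volume mounted into the Docker container.
export HF_HOME=/data

# Downloaded models now land under $HF_HOME/hub (hence /data/hub/models-...).
echo "$HF_HOME/hub"
```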
Prefix caching by default! To speed up time-to-first-token (TTFT) on long-running queries, TGI will use prefix caching and reuse pre-existing entries in the KV cache. This should be totally transparent for most users; however, it required an intense rewrite of internals, so bugs can potentially exist. We also changed kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that aren't supported by flashinfer).
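As a rough illustration of the idea (a toy sketch, not TGI's actual implementation): KV-cache state computed for a prompt prefix is keyed by the prefix's token ids, so a later request that shares the prefix only needs to prefill the remaining tokens.

```python
# Toy sketch of prefix caching: KV blocks computed for a prompt prefix are
# keyed by the prefix's token ids, so a request sharing that prefix can skip
# recomputing it and prefill only the tail.

def make_prefix_cache():
    cache = {}  # tuple of token ids -> precomputed "KV block" placeholder

    def lookup(tokens):
        # Find the longest cached prefix of `tokens`.
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in cache:
                return end, cache[key]
        return 0, None

    def store(tokens, kv):
        cache[tuple(tokens)] = kv

    return lookup, store

lookup, store = make_prefix_cache()
store([1, 2, 3], "kv-for-123")          # first request fills the cache
reused, kv = lookup([1, 2, 3, 4, 5])    # second request shares the prefix
# Only tokens[reused:] (here [4, 5]) still need a prefill pass.
```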
Lots of performance improvements with Marlin and quantization.
* `layers.marlin` into several files by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2292
* `GPTQMarlinWeightLoader` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2300
* `text-generation-benchmark` to pure devshell by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2431
* `syrupy` and update in Poetry by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2497
* `ratatui`, not (deprecated) `tui` by @strickvl in https://github.com/huggingface/text-generation-inference/pull/2521
* `--quantize` is not needed for pre-quantized models by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2536

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.2.0...v2.3.0
* `flash_xxx.py` files by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2166
* `prefix` in model constructors by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2191
* `Weights` class by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2194
* `quantize` subcommand by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2120
* server quantize: expose groupsize option by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2225
* `quantize` argument in `get_weights_col_packed_qkv` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2237
* `VlmCausalLM` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2258

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.1.1...v2.2.0
* `object` field for regular completions by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2175

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.1.0...v2.1.1
New models: Gemma 2
Multi-LoRA adapters. You can now serve multiple LoRA adapters from the same TGI deployment: https://github.com/huggingface/text-generation-inference/pull/2010
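For example, a sketch assuming the `--lora-adapters` launcher flag and the `adapter_id` request parameter introduced for this feature (the model and adapter names below are hypothetical placeholders):

```shell
# Load several LoRA adapters at startup (adapter ids are example placeholders).
docker run --gpus all --shm-size 1g -p 8080:80 -v $PWD/data:/data \
    ghcr.io/huggingface/text-generation-inference:latest \
    --model-id mistralai/Mistral-7B-v0.1 \
    --lora-adapters some-org/adapter-one,some-org/adapter-two

# Then pick an adapter per request via the adapter_id parameter.
curl localhost:8080/generate \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Hello", "parameters": {"adapter_id": "some-org/adapter-one"}}'
```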
Faster GPTQ inference and Marlin support (up to 2x speedup).
Reworked the entire scheduling logic (better block allocation, enabling further speedups in upcoming releases).
Lots of ROCm support and bugfixes.
Lots of new contributors! Thanks a lot for these contributions.
* `AutoTokenizer` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1947
* `layers/attention` and make hardware differences more obvious with 1 file per hardware by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1986
* `tp>1` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2003
* `make install` work better by default by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2004
* `make install` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2008
* `text-generation-server quantize` by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2103
* `HF_TOKEN` environment variable by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/2066

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.3...v2.1.0
* `AutoTokenizer` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1947

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.3...v2.0.4
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.2...v2.0.3
* `--cuda-graphs 0` work as expected (bis) by @fxmarty in https://github.com/huggingface/text-generation-inference/pull/1768
* `GenerateParameters` by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/1798
* `HF_HUB_OFFLINE` support in the router by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1789
* `tool_prompt` parameter to Python client by @maziyarpanahi in https://github.com/huggingface/text-generation-inference/pull/1825

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.1...v2.0.2
* `/v1/chat/completions` and `/v1/completions` by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/1747

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.0...v2.0.1
Try out Command R+ with Medusa heads on 4xA100s with:

```shell
model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.0 \
    --model-id $model --speculate 3 --num-shard 4
```
* `--trust-remote-code` by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1704

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.5...v2.0.0
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.4...v1.4.5
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.3...v1.4.4
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.2...v1.4.3
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.1...v1.4.2
* `name` field to OpenAI compatible API Messages by @amihalik in https://github.com/huggingface/text-generation-inference/pull/1563

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.0...v1.4.1
* `decoder_input_details` on OpenAI-compatible chat streaming, pass temp and top-k from API by @EndlessReform in https://github.com/huggingface/text-generation-inference/pull/1470
* `/tokenize` route to get the tokenized input by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1471

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.4...v1.4.0
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.3...v1.3.4
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.2...v1.3.3
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.1...v1.3.2
Hotfix Mixtral implementation
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.0...v1.3.1
Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.2.0...v1.3.0