
Text Generation Inference

Sep 20, 2024

Important changes

  • Renamed HUGGINGFACE_HUB_CACHE to HF_HOME, harmonizing environment variables across the HF ecosystem. As a result, data locations moved from /data/models-.... to /data/hub/models-.... in the Docker image.

  • Prefix caching by default! To help with long-running queries, TGI now uses prefix caching to reuse pre-existing entries in the KV cache, speeding up time to first token (TTFT). This should be completely transparent for most users; however, it required an intense rewrite of the internals, so bugs may exist. We also switched kernels from paged_attention to flashinfer (with flashdecoding as a fallback for some specific models that flashinfer doesn't support).

  • Lots of performance improvements with Marlin and quantization.
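The cache relocation above matters if you mount a host volume into the container. A minimal launch sketch, modeled on the project's own Docker examples (the model ID is a placeholder, not from this release):

```shell
# With HF_HOME in effect, model snapshots now land under /data/hub/
# inside the container instead of directly under /data/.
model=meta-llama/Llama-3.1-8B-Instruct   # placeholder model ID
volume=$PWD/data                         # reuse downloaded weights across runs

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:2.3.0 \
    --model-id $model
```

If you had scripts that looked for weights at the old `/data/models-....` paths, point them at `/data/hub/` after upgrading.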

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.2.0...v2.3.0

Jul 23, 2024

Notable changes

  • Llama 3.1 support (including 405B; FP8 support in many mixed configurations: FP8, AWQ, GPTQ, FP8+FP16).
  • Gemma2 softcap support.
  • Deepseek v2 support.
  • Lots of internal reworks/cleanup (allowing for cool features).
  • Lots of AWQ/GPTQ work with Marlin kernels (everything should be faster by default).
  • Flash decoding support (enabled with the FLASH_DECODING=1 environment variable), which will probably enable some nice improvements in the future.
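Since flash decoding is opt-in, enabling it is a matter of passing the environment variable named above into the container. A sketch (the model ID is a placeholder):

```shell
model=meta-llama/Meta-Llama-3.1-8B-Instruct  # placeholder model ID
volume=$PWD/data

# FLASH_DECODING=1 opts this deployment into the flash decoding kernels.
docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $volume:/data \
    -e FLASH_DECODING=1 \
    ghcr.io/huggingface/text-generation-inference:2.2.0 \
    --model-id $model
```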

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.1.1...v2.2.0

Jul 4, 2024

Main changes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.1.0...v2.1.1

Jun 28, 2024

Notable changes

  • New models: gemma2

  • Multi-LoRA adapters: you can now run multiple LoRAs on the same TGI deployment. https://github.com/huggingface/text-generation-inference/pull/2010

  • Faster GPTQ inference and Marlin support (up to 2x speedup).

  • Reworked the entire scheduling logic (better block allocation, allowing further speedups in future releases).

  • Lots of ROCm support and bug fixes.

  • Lots of new contributors! Thanks a lot for these contributions.
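As a rough sketch of the multi-LoRA workflow: adapters are declared at launch and then selected per request. The variable name LORA_ADAPTERS, the adapter_id request parameter, and the adapter/base-model IDs below are assumptions based on the linked PR, not taken from these notes; verify them against the current TGI documentation.

```shell
# Hypothetical adapter and base-model IDs, for illustration only.
model=mistralai/Mistral-7B-v0.1

docker run --gpus all --shm-size 1g -p 8080:80 \
    -v $PWD/data:/data \
    -e LORA_ADAPTERS=predibase/customer_support \
    ghcr.io/huggingface/text-generation-inference:2.1.0 \
    --model-id $model

# Route a single request to a specific adapter:
curl 127.0.0.1:8080/generate \
    -X POST -H 'Content-Type: application/json' \
    -d '{"inputs": "Hello", "parameters": {"adapter_id": "predibase/customer_support"}}'
```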

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.3...v2.1.0

May 24, 2024

Main changes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.3...v2.0.4

May 16, 2024

Important changes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.2...v2.0.3

May 1, 2024

Tl;dr

  • New models (idefics2, phi3)
  • Cleaner VLM support in the OpenAI-compatible layer
  • Upgraded to PyTorch 2.3.0

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.1...v2.0.2

Apr 18, 2024

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.0...v2.0.1

Apr 12, 2024

TGI is back to Apache 2.0!

Highlights

  • License was reverted to Apache 2.0
  • CUDA graphs are now used by default. They improve latency substantially on high-end nodes.
  • Llava-next was added. It is the second multimodal model available on TGI, after Idefics.
  • Cohere Command R+ support. TGI is the fastest open-source backend for Command R+.
  • FP8 support.
  • We now share the vocabulary for all medusa heads, greatly improving latency and memory use.

Try out Command R+ with Medusa heads on 4xA100s with:

model=text-generation-inference/commandrplus-medusa
volume=$PWD/data # share a volume with the Docker container to avoid downloading weights every run

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data ghcr.io/huggingface/text-generation-inference:2.0 --model-id $model --speculate 3 --num-shard 4
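Once the container above is up, it can be queried over TGI's standard /generate endpoint on the mapped port. A minimal request sketch (prompt and parameters are illustrative):

```shell
# Send one generation request to the deployment started above.
curl 127.0.0.1:8080/generate \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "Why is the sky blue?", "parameters": {"max_new_tokens": 64}}'
```

With --speculate 3, the Medusa heads draft several tokens per forward pass, so throughput on such requests should improve without changing the request format.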

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.5...v2.0.0

Mar 29, 2024
v1.4.5

Highlights

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.4...v1.4.5

Mar 22, 2024
v1.4.4

Highlights

  • CohereForAI/c4ai-command-r-v01 model support

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.3...v1.4.4

Feb 28, 2024

Highlights

  • Add support for Starcoder 2
  • Add support for Qwen2

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.2...v1.4.3

Feb 21, 2024

Highlights

  • Add support for Google Gemma models

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.1...v1.4.2

Feb 16, 2024

Highlights

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.4.0...v1.4.1

Jan 26, 2024

Highlights

  • OpenAI compatible API #1427
  • exllama v2 Tensor Parallel #1490
  • GPTQ support for AMD GPUs #1489
  • Phi support #1442
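The OpenAI-compatible API lets existing OpenAI clients point at a TGI deployment. A sketch of a chat completion request against a locally running server, assuming the conventional /v1/chat/completions route; the prompt and "tgi" model placeholder are illustrative:

```shell
# Chat completion request in the OpenAI wire format.
curl 127.0.0.1:8080/v1/chat/completions \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "tgi",
      "messages": [{"role": "user", "content": "What is deep learning?"}],
      "max_tokens": 64
    }'
```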

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.4...v1.4.0

Dec 22, 2023

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.3...v1.3.4

Dec 15, 2023

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.2...v1.3.3

Dec 12, 2023

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.3.1...v1.3.2

Dec 11, 2023
Latest: v3.3.7 · Tracking since: Feb 3, 2023 · Last fetched: Apr 19, 2026