releases.shpreview
Hugging Face/Text Generation Inference

Text Generation Inference

Mon
Wed
Fri
JunJulAugSepOctNovDecJanFebMarAprMay
Less
More
Releases2Avg0/wkVersionsv3.3.6 to v3.3.7

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.4...git

Neuron backend update.

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.2...v3.3.3

This release updates TGI to Torch 2.7 and CUDA 12.8.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.3.0...v3.3.1

Notable changes

  • Prefill chunking for VLMs.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.2.3...v3.3.0

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.2.0...v3.2.1

Important changes

  • BREAKING CHANGE: Lots of modifications around tool calling. Tool calling now respects fully OpenAI return results (arguments return type is a string instead of a real JSON object). Lots of improvements around the tool calling and side effects fixed.

  • Added Gemma 3 support.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.1.1...v3.2.0

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.1.0...v3.1.1

Important changes

Deepseek R1 is fully supported on both AMD and Nvidia !

docker run --gpus all --shm-size 1g -p 8080:80 -v $volume:/data \
    ghcr.io/huggingface/text-generation-inference:3.1.0 --model-id deepseek-ai/DeepSeek-R1

What's Changed

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.0.2...v3.1.0

Tl;dr

New transformers backend supporting flashattention at roughly same performance as pure TGI for all non officially supported models directly in TGI. Congrats @Cyrilvallez

New models unlocked: Cohere2, olmo, olmo2, helium.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v3.0.1...v3.0.2

TL;DR

Big new release

Details: https://huggingface.co/docs/text-generation-inference/conceptual/chunking

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.4.1...v3.0.0

Notable changes

  • Choose input/total tokens automatically based on available VRAM
  • Support Qwen2 VL
  • Decrease latency of very large batches (> 128)

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4.1

Notable changes

  • Experimental prefill chunking (PREFILL_CHUNKING=1)
  • Experimental FP8 KV cache support
  • Greatly decrease latency for large batches (> 128 requests)
  • Faster MoE kernels and support for GPTQ-quantized MoE
  • Faster implementation of MLLama

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.4

Important changes

  • Added support for Mllama (3.2, vision models). Flashed, unpadded.
  • FP8 performance improvements
  • Moe performance improvements
  • BREAKING CHANGE - When using tools, models could answer with a tool call notify_error with the content error, it will instead output regular generation.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.3.0...v2.3.1

Last Checked
35m ago
Tracking since Feb 3, 2023