- New models: Gemma 2
- Multi-LoRA adapters: you can now serve multiple LoRA adapters from a single TGI deployment. https://github.com/huggingface/text-generation-inference/pull/2010
- Faster GPTQ inference and Marlin kernel support (up to 2x speedup).
- Reworked the entire scheduling logic: better block allocation, enabling further speedups in upcoming releases.
- Many ROCm improvements and bugfixes.
- Lots of new contributors! Thanks a lot for these contributions.
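A minimal sketch of multi-LoRA serving, assuming the `--lora-adapters` launcher flag introduced in PR #2010 and the per-request `adapter_id` parameter; the model and adapter IDs below are placeholders, not from these release notes.

```shell
# Load several LoRA adapters on top of one base model at startup
# (adapter IDs are hypothetical placeholders).
text-generation-launcher \
  --model-id mistralai/Mistral-7B-v0.1 \
  --lora-adapters predibase/customer_support,predibase/magicoder

# Pick an adapter per request with the adapter_id generation parameter;
# omit it to query the base model.
curl 127.0.0.1:3000/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is LoRA?", "parameters": {"adapter_id": "predibase/customer_support"}}'
```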
- … AutoTokenizer. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1947
- … layers/attention and make hardware differences more obvious with 1 file per hardware. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1986
- … tp>1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2003
- … make install work better by default. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2004
- … make install. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2008
- … text-generation-server quantize by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2103
- … HF_TOKEN environment variable by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/2066

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.3...v2.1.0
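A quick sketch of the `HF_TOKEN` support added in PR #2066: TGI now reads the same `HF_TOKEN` environment variable as `huggingface_hub`, so one token works for gated or private models. The token value and model ID below are placeholders.

```shell
# HF_TOKEN is the variable huggingface_hub standardizes on; TGI now honors it
# when downloading gated/private models (token and model ID are placeholders).
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
text-generation-launcher --model-id meta-llama/Llama-2-7b-hf
```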