- New models: Gemma 2
- Multi-LoRA adapters: you can now serve multiple LoRA adapters from a single TGI deployment. https://github.com/huggingface/text-generation-inference/pull/2010
- Faster GPTQ inference and Marlin kernel support (up to 2x speedup).
- Reworked the entire scheduling logic: better block allocation, enabling further speedups in upcoming releases.
- Many ROCm improvements and bugfixes.
- Lots of new contributors! Thanks a lot for these contributions.
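A minimal sketch of multi-LoRA serving, assuming the `--lora-adapters` launcher flag introduced in PR #2010 and the per-request `adapter_id` parameter; the model and adapter IDs below are placeholders, not from these release notes.

```shell
# Load several LoRA adapters on top of one base model at startup
# (adapter IDs are hypothetical placeholders).
text-generation-launcher \
  --model-id mistralai/Mistral-7B-v0.1 \
  --lora-adapters predibase/customer_support,predibase/magicoder

# Pick an adapter per request with the adapter_id generation parameter;
# omit it to query the base model.
curl 127.0.0.1:3000/generate \
  -X POST \
  -H 'Content-Type: application/json' \
  -d '{"inputs": "What is LoRA?", "parameters": {"adapter_id": "predibase/customer_support"}}'
```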
- … AutoTokenizer. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1947
- … layers/attention and make hardware differences more obvious with 1 file per hardware. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/1986
- … tp>1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2003
- … make install work better by default. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2004
- … make install. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/2008
- … text-generation-server quantize by @danieldk in https://github.com/huggingface/text-generation-inference/pull/2103
- … HF_TOKEN environment variable by @Wauplin in https://github.com/huggingface/text-generation-inference/pull/2066

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v2.0.3...v2.1.0
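A quick sketch of the `HF_TOKEN` support added in PR #2066: TGI now reads the same `HF_TOKEN` environment variable as `huggingface_hub`, so one token works for gated or private models. The token value and model ID below are placeholders.

```shell
# HF_TOKEN is the variable huggingface_hub standardizes on; TGI now honors it
# when downloading gated/private models (token and model ID are placeholders).
export HF_TOKEN=hf_xxxxxxxxxxxxxxxx
text-generation-launcher --model-id meta-llama/Llama-2-7b-hf
```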