Hugging Face / Text Generation Inference

Text Generation Inference

Releases: 2 · Avg: 0/wk · Versions: v3.3.6 → v3.3.7
Nov 30, 2023
v1.2.0

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.1.1...v1.2.0

Nov 16, 2023
v1.1.1

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.1.0...v1.1.1

Sep 28, 2023
v1.1.0

Notable changes

  • Support for Mistral models (#1071)
  • AWQ quantization (#1019)
  • EETQ quantization (#1068)
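
The 4-bit schemes in this release cut weight memory roughly 4x versus fp16 by packing eight 4-bit values into each 32-bit word. A minimal pure-Python sketch of that storage layout (illustrative only: the function names are invented, and real GPTQ/AWQ kernels also store per-group scales and zero points):

```python
def pack_int4(values):
    # Pack eight 4-bit unsigned integers into one 32-bit word,
    # least-significant nibble first -- the storage layout that
    # GPTQ/AWQ-style 4-bit quantization uses for weight matrices.
    assert len(values) % 8 == 0 and all(0 <= v < 16 for v in values)
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= v << (4 * j)
        words.append(word)
    return words

def unpack_int4(words):
    # Recover the original 4-bit values from the packed words.
    return [(word >> (4 * j)) & 0xF for word in words for j in range(8)]
```

Dequantization then maps each recovered 4-bit index back to a float via its group's scale and zero point.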

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.0.3...v1.1.0

Aug 23, 2023
v1.0.2

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.0.1...v1.0.2

Aug 14, 2023
v1.0.1

Notable changes:

  • More GPTQ support
  • Rope scaling (linear + dynamic)
  • Bitsandbytes 4bits (both modes)
  • Added more documentation
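
The two rope-scaling modes extend a model's usable context in different ways: linear scaling divides positions by a constant factor, while dynamic (NTK-aware) scaling enlarges the rotary base as the sequence grows past the trained context. A sketch of the frequency computation, following the formulation common in transformers-style rotary embeddings (parameter names are illustrative, not TGI's API):

```python
def rope_inv_freqs(dim, base=10000.0, scaling=None, factor=1.0,
                   seq_len=None, max_position=2048):
    # Inverse rotary-embedding frequencies with optional scaling.
    # "linear" divides every frequency by `factor` (equivalent to
    # dividing positions); "dynamic" enlarges the base once the running
    # sequence exceeds the trained context (NTK-aware scaling).
    if scaling == "dynamic" and seq_len is not None and seq_len > max_position:
        base = base * ((factor * seq_len / max_position)
                       - (factor - 1)) ** (dim / (dim - 2))
    inv_freq = [base ** (-i / dim) for i in range(0, dim, 2)]
    if scaling == "linear":
        inv_freq = [f / factor for f in inv_freq]
    return inv_freq
```

With factor=2, linear scaling halves every frequency up front; dynamic scaling leaves sequences within max_position untouched and only stretches the base beyond it.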

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.0.0...v1.0.1

Jul 28, 2023
v1.0.0

License change

We are releasing TGI v1.0 under a new license: HFOIL 1.0. All prior versions of TGI remain licensed under Apache 2.0, the last Apache 2.0 release being v0.9.4.

HFOIL stands for Hugging Face Optimized Inference License, and it has been specifically designed for our optimized inference solutions. While the source code remains accessible, HFOIL is not a true open source license because we added a restriction: to sell a hosted or managed service built on top of TGI, we now require a separate agreement. You can consult the new license here.

What does this mean for you?

This change in source code licensing has no impact on the overwhelming majority of our user community, who use TGI for free. Additionally, both our Inference Endpoints customers and those of our commercial partners remain unaffected.

However, it will restrict non-partnered cloud service providers from offering TGI v1.0+ as a service without requesting a license.

To elaborate further:

  • If you are an existing user of TGI prior to v1.0, your current version is still Apache 2.0 and you can use it commercially without restrictions.

  • If you are using TGI for personal use or research purposes, the HFOIL 1.0 restrictions do not apply to you.

  • If you are using TGI for commercial purposes as part of an internal company project (that will not be sold to third parties as a hosted or managed service), the HFOIL 1.0 restrictions do not apply to you.

  • If you integrate TGI into a hosted or managed service that you sell to customers, then consider requesting a license to upgrade to v1.0 and later versions - you can email us at api-enterprise@huggingface.co with information about your service.

For more information, see: #726.

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.4...v1.0.0

Jul 27, 2023
v0.9.4

Features

Fix

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.3...v0.9.4

Jul 18, 2023
v0.9.3

Highlights

  • server: add support for flash attention v2
  • server: add support for llamav2

Features

  • launcher: add debug logs
  • server: rework the quantization to support all models

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.2...v0.9.3

Jul 14, 2023
v0.9.2

Features

  • server: harden the choice of which weights to save on disk
  • server: better errors for warmup and TP
  • server: Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE
  • server: Implements sharding for non-divisible vocab_size
  • launcher: add arg validation and drop subprocess
  • router: explicit warning if revision is not set
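
Sharding a vocabulary across tensor-parallel ranks requires every rank to own the same number of embedding rows, which fails when vocab_size is not divisible by the world size; the usual remedy is to pad the vocabulary up to the next multiple. A toy sketch of that bookkeeping (a hypothetical helper, not the server's actual code):

```python
def shard_vocab(vocab_size, world_size):
    # Pad the vocabulary to the next multiple of world_size so each
    # rank owns an equal [start, stop) slice of rows. The padded rows
    # at the tail are never produced by the tokenizer, so filling them
    # with zeros in the embedding / lm-head is harmless.
    rows_per_rank = -(-vocab_size // world_size)  # ceiling division
    padded = rows_per_rank * world_size
    shards = [(rank * rows_per_rank, (rank + 1) * rows_per_rank)
              for rank in range(world_size)]
    return padded, shards
```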

Fix

  • server: Fixing RW code (it's remote code, so the Arch checking doesn't work to see which weights to keep)
  • server: T5 weights names
  • server: Adding logger import to t5_modeling.py by @akowalsk
  • server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
  • server: GPTQ Env vars: catch correct type of error by @ssmi153
  • server: blacklist local files

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.1...v0.9.2

Jul 6, 2023
v0.9.1

Highlights

  • server: Non flash MPT
  • server: decrease memory fragmentation

Features

  • server: use latest flash attention
  • router: add argument for hostname in router
  • docs: Adding some help for the options in text-generation-benchmark

Fix

  • makefile: Update server/Makefile to include Makefile-vllm
  • server: Handle loading from local files for MPT
  • server: avoid errors for very small top_p values
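
The very-small-top_p failure mode is that the nucleus can end up empty, leaving nothing to sample from. The usual guard is to always retain at least the most likely token; an illustrative CPU-side sketch (the server operates on GPU tensors, so this shows only the logic):

```python
def top_p_filter(probs, top_p):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, but always retain at least the top-1 token so a
    # tiny top_p can never produce an empty candidate set.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized nucleus
```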

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.0...v0.9.1

Jul 1, 2023
v0.9.0

Highlights

  • server: add paged attention to flash models
  • server: Inference support for GPTQ (llama + falcon tested) + Quantization script
  • server: only compute prefill logprobs when asked
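
Paged attention replaces one large contiguous KV-cache allocation per sequence with fixed-size blocks handed out from a shared pool; each sequence keeps a "block table" mapping logical token positions to physical blocks. A toy allocator showing the idea (names invented; the real implementation lives in CUDA kernels):

```python
class PagedKVCache:
    # KV-cache memory carved into fixed-size blocks. Sequences of very
    # different lengths share one pool without large contiguous
    # allocations, which is what keeps fragmentation low.
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of owned block ids

    def append_token(self, seq_id, position):
        table = self.tables.setdefault(seq_id, [])
        if position % self.block_size == 0:   # owned blocks are full
            table.append(self.free.pop())     # grab a fresh block
        block = table[position // self.block_size]
        return block, position % self.block_size  # physical slot

    def release(self, seq_id):
        # Return a finished sequence's blocks to the shared pool.
        self.free.extend(self.tables.pop(seq_id, []))
```

Releasing a finished sequence immediately recycles its blocks for the next request in the continuous batch.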

Features

  • launcher: parse oom signals
  • server: batch tokenization for flash causal lm
  • server: Rework loading
  • server: optimize dist ops
  • router: add ngrok integration
  • server: improve flash attention import errors
  • server: Refactor conversion logic
  • router: add header option to disable buffering for the generate_stream response by @rkimball
  • router: add arg validation

Fix

  • docs: CUDA_VISIBLE_DEVICES comment by @antferdom
  • docs: Fix typo and use POSIX comparison in the makefile by @piratos
  • server: fix warpers on CPU
  • server: Fixing T5 in case the names are mixed up
  • router: add timeout on flume sends
  • server: Do not init process group if already initialized
  • server: Add the option to force another dtype than f16
  • launcher: fix issue where launcher does not properly report shard failures

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.8.2...v0.9.0

Jun 1, 2023
v0.8.2

Features

  • server: remove trust_remote_code requirement for falcon models
  • server: load santacoder/starcoder models with safetensors

Fix

  • server: fix has_position_ids

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.8.1...v0.8.2

May 31, 2023
v0.8.1

Features

  • server: add retry on download

Fix

  • server: fix bnb quantization for CausalLM models

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.8.0...v0.8.1

May 30, 2023
v0.8.0

Features

  • router: support vectorized warpers in flash causal lm (co-authored by @jlamypoirier)
  • proto: decrease IPC proto size
  • benchmarker: add summary tables
  • server: support RefinedWeb models

Fix

  • server: Fix issue when load AutoModelForSeq2SeqLM model (contributed by @CL-Shang)

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.7.0...v0.8.0

May 23, 2023
v0.7.0

Features

  • server: reduce vram requirements of continuous batching (contributed by @njhill)
  • server: Support BLOOMChat-176B (contributed by @njhill)
  • server: add watermarking tests (contributed by @ehsanmok)
  • router: Adding response schema for compat_generate (contributed by @gsaivinay)
  • router: use number of tokens in batch as input for dynamic batching (co-authored by @njhill)
  • server: improve download and decrease conversion to safetensors RAM requirements
  • server: optimize flash causal lm decode token
  • server: shard decode token
  • server: use cuda graph in logits warping
  • server: support trust_remote_code
  • tests: add snapshot testing
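
Scheduling on the number of tokens rather than the number of requests matters because a batch of a few long prompts can use as much memory as many short ones. A simplified FIFO admission loop under a token budget (illustrative; the router's real logic also accounts for decode steps and padding):

```python
def fill_batch(queue, token_budget):
    # Admit queued requests while the *total token count* of the batch
    # stays under budget -- a better proxy for memory use than the raw
    # request count, since prompt lengths vary wildly. FIFO order is
    # preserved: a too-large head request blocks until budget frees up.
    batch, used = [], 0
    while queue and used + queue[0]["tokens"] <= token_budget:
        request = queue.pop(0)
        batch.append(request)
        used += request["tokens"]
    return batch, used
```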

Fix

  • server: use float16
  • server: fix multinomial implem in Sampling
  • server: do not use device_map auto on single GPU

Misc

  • docker: use nvidia base image

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.6.0...v0.7.0

Apr 21, 2023

Features

  • server: flash attention past key values optimization (contributed by @njhill)
  • router: remove requests when client closes the connection (co-authored by @njhill)
  • server: support quantization for flash models
  • router: add info route
  • server: optimize token decode
  • server: support flash sharded santacoder
  • security: image signing with cosign
  • security: image analysis with trivy
  • docker: improve image size

Fix

  • server: check cuda capability before importing flash attention
  • server: fix hf_transfer issue with private repositories
  • router: add auth token for private tokenizers

Misc

  • rust: update to 1.69

Apr 11, 2023

Features

  • server: add flash-attention based version of Llama
  • server: add flash-attention based version of Santacoder
  • server: support OPT models
  • router: make router input validation optional
  • docker: improve layer caching

Fix

  • server: improve token streaming decoding
  • server: fix escape characters in stop sequences
  • router: fix NCCL desync issues
  • router: use buckets for metrics histograms
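
Improved streaming decoding typically means not decoding each token id in isolation: byte-pair pieces and multi-byte UTF-8 characters only become valid text once their neighbours arrive. A minimal sketch of the emit-only-the-new-suffix approach (a simplification; real implementations track offsets instead of re-decoding from scratch):

```python
def stream_text(decode, token_ids):
    # `decode` maps a list of token ids to a string. Decode the prefix
    # seen so far at every step and yield only the text that extends
    # what was already emitted; hold back chunks whose decode ends in
    # U+FFFD, the replacement character marking an incomplete byte
    # sequence.
    emitted = ""
    for n in range(1, len(token_ids) + 1):
        text = decode(token_ids[:n])
        if len(text) > len(emitted) and not text.endswith("\ufffd"):
            yield text[len(emitted):]
            emitted = text
```

With a tokenizer that splits "é" across two byte-level tokens, the partial first byte is held back and the full character is emitted one step later.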

Mar 30, 2023

Features

  • benchmark: tui based benchmarking tool
  • router: Clear cache on error
  • server: Add mypy-protobuf
  • server: reduce mlp and attn in one op for flash neox
  • image: aws sagemaker compatible image

Fix

  • router: fix OTLP distributed tracing initialization
  • server: avoid try/except to determine the kind of AutoModel
  • server: fix flash neox rotary embedding

Latest: v3.3.7
Tracking since: Feb 3, 2023
Last fetched: Apr 19, 2026