Hugging Face / Text Generation Inference

Text Generation Inference

Releases: 2 · Avg: 0/wk · Versions: v3.3.6 → v3.3.7
Nov 30, 2023
v1.2.0

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.1.1...v1.2.0

Nov 16, 2023
v1.1.1

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.1.0...v1.1.1

Sep 28, 2023
v1.1.0

Notable changes

  • Support for Mistral models (#1071)
  • AWQ quantization (#1019)
  • EETQ quantization (#1068)
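
The 4-bit schemes in this release cut weight memory roughly 4x versus fp16 by packing eight 4-bit values into each 32-bit word. A minimal pure-Python sketch of that storage layout (illustrative only: the function names are invented, and real GPTQ/AWQ kernels also store per-group scales and zero points):

```python
def pack_int4(values):
    # Pack eight 4-bit unsigned integers into one 32-bit word,
    # least-significant nibble first -- the storage layout that
    # GPTQ/AWQ-style 4-bit quantization uses for weight matrices.
    assert len(values) % 8 == 0 and all(0 <= v < 16 for v in values)
    words = []
    for i in range(0, len(values), 8):
        word = 0
        for j, v in enumerate(values[i:i + 8]):
            word |= v << (4 * j)
        words.append(word)
    return words

def unpack_int4(words):
    # Recover the original 4-bit values from the packed words.
    return [(word >> (4 * j)) & 0xF for word in words for j in range(8)]
```

Dequantization then maps each recovered 4-bit index back to a float via its group's scale and zero point.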

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.0.3...v1.1.0

Aug 23, 2023
v1.0.2

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.0.1...v1.0.2

Aug 14, 2023
v1.0.1

Notable changes:

  • More GPTQ support
  • Rope scaling (linear + dynamic)
  • Bitsandbytes 4bits (both modes)
  • Added more documentation
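
The two rope-scaling modes extend a model's usable context in different ways: linear scaling divides positions by a constant factor, while dynamic (NTK-aware) scaling enlarges the rotary base as the sequence grows past the trained context. A sketch of the frequency computation, following the formulation common in transformers-style rotary embeddings (parameter names are illustrative, not TGI's API):

```python
def rope_inv_freqs(dim, base=10000.0, scaling=None, factor=1.0,
                   seq_len=None, max_position=2048):
    # Inverse rotary-embedding frequencies with optional scaling.
    # "linear" divides every frequency by `factor` (equivalent to
    # dividing positions); "dynamic" enlarges the base once the running
    # sequence exceeds the trained context (NTK-aware scaling).
    if scaling == "dynamic" and seq_len is not None and seq_len > max_position:
        base = base * ((factor * seq_len / max_position)
                       - (factor - 1)) ** (dim / (dim - 2))
    inv_freq = [base ** (-i / dim) for i in range(0, dim, 2)]
    if scaling == "linear":
        inv_freq = [f / factor for f in inv_freq]
    return inv_freq
```

With factor=2, linear scaling halves every frequency up front; dynamic scaling leaves sequences within max_position untouched and only stretches the base beyond it.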

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v1.0.0...v1.0.1

Jul 28, 2023
v1.0.0

License change

We are releasing TGI v1.0 under a new license: HFOIL 1.0. All prior versions of TGI remain licensed under Apache 2.0, the last Apache 2.0 release being v0.9.4.

HFOIL stands for Hugging Face Optimized Inference License, and it has been specifically designed for our optimized inference solutions. While the source code remains accessible, HFOIL is not a true open source license because we added a restriction: to sell a hosted or managed service built on top of TGI, we now require a separate agreement. You can consult the new license here.

What does this mean for you?

This change in source code licensing has no impact on the overwhelming majority of our user community, who use TGI for free. Additionally, both our Inference Endpoints customers and those of our commercial partners remain unaffected.

However, it will restrict non-partnered cloud service providers from offering TGI v1.0+ as a service without requesting a license.

To elaborate further:

  • If you are an existing user of TGI prior to v1.0, your current version is still Apache 2.0 and you can use it commercially without restrictions.

  • If you are using TGI for personal use or research purposes, the HFOIL 1.0 restrictions do not apply to you.

  • If you are using TGI for commercial purposes as part of an internal company project (that will not be sold to third parties as a hosted or managed service), the HFOIL 1.0 restrictions do not apply to you.

  • If you integrate TGI into a hosted or managed service that you sell to customers, then consider requesting a license to upgrade to v1.0 and later versions - you can email us at api-enterprise@huggingface.co with information about your service.

For more information, see: #726.

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.4...v1.0.0

Jul 27, 2023
v0.9.4

Features

Fix

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.3...v0.9.4

Jul 18, 2023
v0.9.3

Highlights

  • server: add support for flash attention v2
  • server: add support for llamav2

Features

  • launcher: add debug logs
  • server: rework the quantization to support all models

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.2...v0.9.3

Jul 14, 2023
v0.9.2

Features

  • server: harden the choice of which weights to save on disk
  • server: better errors for warmup and TP
  • server: Support for env value for GPTQ_BITS and GPTQ_GROUPSIZE
  • server: Implements sharding for non-divisible vocab_size
  • launcher: add arg validation and drop subprocess
  • router: explicit warning if revision is not set
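
Sharding a vocabulary across tensor-parallel ranks requires every rank to own the same number of embedding rows, which fails when vocab_size is not divisible by the world size; the usual remedy is to pad the vocabulary up to the next multiple. A toy sketch of that bookkeeping (a hypothetical helper, not the server's actual code):

```python
def shard_vocab(vocab_size, world_size):
    # Pad the vocabulary to the next multiple of world_size so each
    # rank owns an equal [start, stop) slice of rows. The padded rows
    # at the tail are never produced by the tokenizer, so filling them
    # with zeros in the embedding / lm-head is harmless.
    rows_per_rank = -(-vocab_size // world_size)  # ceiling division
    padded = rows_per_rank * world_size
    shards = [(rank * rows_per_rank, (rank + 1) * rows_per_rank)
              for rank in range(world_size)]
    return padded, shards
```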

Fix

  • server: Fixing RW code (it's remote code, so the Arch checking doesn't work to see which weights to keep)
  • server: T5 weights names
  • server: Adding logger import to t5_modeling.py by @akowalsk
  • server: Bug fixes for GPTQ_BITS environment variable passthrough by @ssmi153
  • server: GPTQ Env vars: catch correct type of error by @ssmi153
  • server: blacklist local files

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.1...v0.9.2

Jul 6, 2023
v0.9.1

Highlights

  • server: Non flash MPT
  • server: decrease memory fragmentation

Features

  • server: use latest flash attention
  • router: add argument for hostname in router
  • docs: Adding some help for the options in text-generation-benchmark

Fix

  • makefile: Update server/Makefile to include Makefile-vllm
  • server: Handle loading from local files for MPT
  • server: avoid errors for very small top_p values
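
The very-small-top_p failure mode is that the nucleus can end up empty, leaving nothing to sample from. The usual guard is to always retain at least the most likely token; an illustrative CPU-side sketch (the server operates on GPU tensors, so this shows only the logic):

```python
def top_p_filter(probs, top_p):
    # Keep the smallest set of tokens whose cumulative probability
    # reaches top_p, but always retain at least the top-1 token so a
    # tiny top_p can never produce an empty candidate set.
    order = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}  # renormalized nucleus
```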

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.9.0...v0.9.1

Jul 1, 2023
v0.9.0

Highlights

  • server: add paged attention to flash models
  • server: Inference support for GPTQ (llama + falcon tested) + Quantization script
  • server: only compute prefill logprobs when asked
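
Paged attention replaces one large contiguous KV-cache allocation per sequence with fixed-size blocks handed out from a shared pool; each sequence keeps a "block table" mapping logical token positions to physical blocks. A toy allocator showing the idea (names invented; the real implementation lives in CUDA kernels):

```python
class PagedKVCache:
    # KV-cache memory carved into fixed-size blocks. Sequences of very
    # different lengths share one pool without large contiguous
    # allocations, which is what keeps fragmentation low.
    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free = list(range(num_blocks))
        self.tables = {}  # seq_id -> list of owned block ids

    def append_token(self, seq_id, position):
        table = self.tables.setdefault(seq_id, [])
        if position % self.block_size == 0:   # owned blocks are full
            table.append(self.free.pop())     # grab a fresh block
        block = table[position // self.block_size]
        return block, position % self.block_size  # physical slot

    def release(self, seq_id):
        # Return a finished sequence's blocks to the shared pool.
        self.free.extend(self.tables.pop(seq_id, []))
```

Releasing a finished sequence immediately recycles its blocks for the next request in the continuous batch.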

Features

  • launcher: parse oom signals
  • server: batch tokenization for flash causal lm
  • server: Rework loading
  • server: optimize dist ops
  • router: add ngrok integration
  • server: improve flash attention import errors
  • server: Refactor conversion logic
  • router: add header option to disable buffering for the generate_stream response by @rkimball
  • router: add arg validation

Fix

  • docs: CUDA_VISIBLE_DEVICES comment by @antferdom
  • docs: Fix typo and use POSIX comparison in the makefile by @piratos
  • server: fix warpers on CPU
  • server: Fixing T5 in case the names are mixed up
  • router: add timeout on flume sends
  • server: Do not init process group if already initialized
  • server: Add the option to force another dtype than f16
  • launcher: fix issue where launcher does not properly report shard failures

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.8.2...v0.9.0

Jun 1, 2023
v0.8.2

Features

  • server: remove trust_remote_code requirement for falcon models
  • server: load santacoder/starcoder models with safetensors

Fix

  • server: fix has_position_ids

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.8.1...v0.8.2

May 31, 2023
v0.8.1

Features

  • server: add retry on download

Fix

  • server: fix bnb quantization for CausalLM models

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.8.0...v0.8.1

May 30, 2023
v0.8.0

Features

  • router: support vectorized warpers in flash causal lm (co-authored by @jlamypoirier)
  • proto: decrease IPC proto size
  • benchmarker: add summary tables
  • server: support RefinedWeb models

Fix

  • server: Fix issue when load AutoModelForSeq2SeqLM model (contributed by @CL-Shang)

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.7.0...v0.8.0

May 23, 2023
v0.7.0

Features

  • server: reduce vram requirements of continuous batching (contributed by @njhill)
  • server: Support BLOOMChat-176B (contributed by @njhill)
  • server: add watermarking tests (contributed by @ehsanmok)
  • router: Adding response schema for compat_generate (contributed by @gsaivinay)
  • router: use number of tokens in batch as input for dynamic batching (co-authored by @njhill)
  • server: improve download and decrease conversion to safetensors RAM requirements
  • server: optimize flash causal lm decode token
  • server: shard decode token
  • server: use cuda graph in logits warping
  • server: support trust_remote_code
  • tests: add snapshot testing
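
Scheduling on the number of tokens rather than the number of requests matters because a batch of a few long prompts can use as much memory as many short ones. A simplified FIFO admission loop under a token budget (illustrative; the router's real logic also accounts for decode steps and padding):

```python
def fill_batch(queue, token_budget):
    # Admit queued requests while the *total token count* of the batch
    # stays under budget -- a better proxy for memory use than the raw
    # request count, since prompt lengths vary wildly. FIFO order is
    # preserved: a too-large head request blocks until budget frees up.
    batch, used = [], 0
    while queue and used + queue[0]["tokens"] <= token_budget:
        request = queue.pop(0)
        batch.append(request)
        used += request["tokens"]
    return batch, used
```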

Fix

  • server: use float16
  • server: fix multinomial implem in Sampling
  • server: do not use device_map auto on single GPU

Misc

  • docker: use nvidia base image

New Contributors

Full Changelog: https://github.com/huggingface/text-generation-inference/compare/v0.6.0...v0.7.0

Apr 21, 2023

Features

  • server: flash attention past key values optimization (contributed by @njhill)
  • router: remove requests when client closes the connection (co-authored by @njhill)
  • server: support quantization for flash models
  • router: add info route
  • server: optimize token decode
  • server: support flash sharded santacoder
  • security: image signing with cosign
  • security: image analysis with trivy
  • docker: improve image size

Fix

  • server: check cuda capability before importing flash attention
  • server: fix hf_transfer issue with private repositories
  • router: add auth token for private tokenizers

Misc

  • rust: update to 1.69

Apr 11, 2023

Features

  • server: add flash-attention based version of Llama
  • server: add flash-attention based version of Santacoder
  • server: support OPT models
  • router: make router input validation optional
  • docker: improve layer caching

Fix

  • server: improve token streaming decoding
  • server: fix escape characters in stop sequences
  • router: fix NCCL desync issues
  • router: use buckets for metrics histograms
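
Improved streaming decoding typically means not decoding each token id in isolation: byte-pair pieces and multi-byte UTF-8 characters only become valid text once their neighbours arrive. A minimal sketch of the emit-only-the-new-suffix approach (a simplification; real implementations track offsets instead of re-decoding from scratch):

```python
def stream_text(decode, token_ids):
    # `decode` maps a list of token ids to a string. Decode the prefix
    # seen so far at every step and yield only the text that extends
    # what was already emitted; hold back chunks whose decode ends in
    # U+FFFD, the replacement character marking an incomplete byte
    # sequence.
    emitted = ""
    for n in range(1, len(token_ids) + 1):
        text = decode(token_ids[:n])
        if len(text) > len(emitted) and not text.endswith("\ufffd"):
            yield text[len(emitted):]
            emitted = text
```

With a tokenizer that splits "é" across two byte-level tokens, the partial first byte is held back and the full character is emitted one step later.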

Mar 30, 2023

Features

  • benchmark: tui based benchmarking tool
  • router: Clear cache on error
  • server: Add mypy-protobuf
  • server: reduce mlp and attn in one op for flash neox
  • image: aws sagemaker compatible image

Fix

  • router: fix OTLP distributed tracing initialization
  • server: avoid try/except to determine the kind of AutoModel
  • server: fix flash neox rotary embedding

Latest: v3.3.7
Tracking since: Feb 3, 2023
Last fetched: Apr 19, 2026