
Text Generation Inference

Releases: 2 · Avg: 0/wk · Versions: v3.3.6 → v3.3.7

Mar 26, 2023

Features

  • server: New faster GPTNeoX implementation based on flash attention

Fix

  • server: fix input-length discrepancy between Rust and Python tokenizers
Mar 9, 2023

Features

  • router: support best_of sampling
  • router: support left truncation
  • server: support typical sampling
  • launcher: allow local models
  • clients: add text-generation Python client
  • launcher: allow parsing num_shard from CUDA_VISIBLE_DEVICES
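Typical sampling keeps the tokens whose surprisal is closest to the distribution's entropy, rather than simply the most probable ones. A minimal sketch of the idea in plain Python (the function name and defaults are illustrative, not the server's actual implementation):

```python
import math

def typical_filter(probs, mass=0.95):
    """Return the indices kept by typical sampling over a probability list."""
    # Entropy of the distribution: the "expected surprisal".
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    # Rank tokens by how close their surprisal is to the entropy.
    ranked = sorted(
        range(len(probs)),
        key=lambda i: abs(-math.log(probs[i]) - entropy) if probs[i] > 0 else float("inf"),
    )
    # Keep the most "typical" tokens until the target probability mass is covered.
    kept, total = [], 0.0
    for i in ranked:
        kept.append(i)
        total += probs[i]
        if total >= mass:
            break
    return sorted(kept)
```

With `probs=[0.5, 0.3, 0.1, 0.1]` and `mass=0.8`, the 0.3 token is the most typical (its surprisal is closest to the entropy), and adding the 0.5 token covers the mass, so sampling is restricted to those two.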

Fix

  • server: do not warp prefill logits
  • server: fix formatting issues in generate_stream tokens
  • server: fix galactica batch
  • server: fix index out of range issue with watermarking
Mar 3, 2023

Features

  • router: add support for huggingface api-inference
  • server: add logits watermarking following "A Watermark for Large Language Models"
  • server: use a fixed transformers commit
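The watermarking scheme in that paper seeds a pseudorandom "greenlist" of vocabulary tokens from the previous token and biases the greenlisted logits upward, so generated text carries a detectable statistical signature. A rough illustrative sketch (the `gamma`/`delta` names follow the paper; this is not the server's code):

```python
import random

def watermark_logits(logits, prev_token, gamma=0.5, delta=2.0):
    """Bias a gamma-fraction 'greenlist' of the vocabulary, seeded by prev_token."""
    # Seeding the PRNG with the previous token makes the greenlist reproducible,
    # which is what lets a detector recompute it later.
    rng = random.Random(prev_token)
    vocab = list(range(len(logits)))
    rng.shuffle(vocab)
    greenlist = set(vocab[: int(gamma * len(vocab))])
    # Add delta to greenlisted logits; leave the rest untouched.
    return [l + delta if i in greenlist else l for i, l in enumerate(logits)]
```

Because the greenlist depends only on the seed, re-running with the same `prev_token` yields the same bias pattern.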

Fix

  • launcher: add missing parameters to launcher
  • server: update to hf_transfer==0.1.2 to fix corrupted files issue
Feb 24, 2023

Features

  • server: allocate full attention mask to decrease latency
  • server: enable hf-transfer for insane download speeds
  • router: add CORS options

Fix

  • server: remove position_ids from galactica forward
Feb 16, 2023

Features

  • server: support t5 models
  • router: add max_total_tokens and empty_input validation
  • launcher: add the possibility to disable custom CUDA kernels
  • server: add automatic safetensors conversion
  • router: add prometheus scrape endpoint
  • server, router: add distributed tracing
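The router-side validation can be pictured as rejecting requests whose prompt is empty or whose prompt plus requested generation would exceed the token budget. A hypothetical sketch (names and the default budget are illustrative):

```python
def validate_request(input_tokens, max_new_tokens, max_total_tokens=2048):
    """Reject requests that are empty or that would blow the token budget."""
    # Empty-input validation: a prompt must contain at least one token.
    if input_tokens == 0:
        raise ValueError("empty input")
    # max_total_tokens validation: prompt + generation must fit the budget.
    if input_tokens + max_new_tokens > max_total_tokens:
        raise ValueError(
            f"input ({input_tokens}) + max_new_tokens ({max_new_tokens}) "
            f"exceeds max_total_tokens ({max_total_tokens})"
        )
    return True
```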

Fix

  • launcher: copy current env vars to subprocesses
  • docker: add note around shared memory
Feb 7, 2023

Fix

  • server: fix bug with repetition penalty when using GPUs and inference mode
Feb 3, 2023

Features

  • router: support token streaming using Server-Sent Events (SSE)
  • router: support seeding
  • server: support gpt-neox
  • server: support santacoder
  • server: support repetition penalty
  • server: allow the server to use a local weight cache
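A repetition penalty rescales the logits of tokens that have already appeared: in the common CTRL-style formulation, positive logits are divided by the penalty and negative logits multiplied by it, making repeats less likely either way. A small illustrative sketch (not the server's implementation):

```python
def apply_repetition_penalty(logits, seen_tokens, penalty=1.3):
    """Penalize the logits of previously generated token ids."""
    out = list(logits)
    for t in set(seen_tokens):
        # Dividing a positive logit or multiplying a negative one
        # both push the token's probability down.
        out[t] = out[t] / penalty if out[t] > 0 else out[t] * penalty
    return out
```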

Breaking changes

  • router: refactor Token API
  • router: modify /generate API to only return generated text

Misc

  • router: use background task to manage request queue
  • ci: docker build/push on update
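Managing the request queue with a background task means the HTTP handlers only enqueue work while a single long-lived task drains it. The router's actual queue is in Rust; the pattern can be sketched with Python's asyncio (all names here are illustrative):

```python
import asyncio

async def queue_worker(queue, handler):
    """Drain the queue in the background until a None sentinel arrives."""
    while True:
        request = await queue.get()
        if request is None:
            queue.task_done()
            break
        await handler(request)
        queue.task_done()

async def demo():
    results = []

    async def handler(req):
        # Stand-in for batching/inference work.
        results.append(req.upper())

    queue = asyncio.Queue()
    worker = asyncio.create_task(queue_worker(queue, handler))
    for req in ("hello", "world"):
        await queue.put(req)
    await queue.put(None)  # signal shutdown
    await worker
    return results
```

Requests are processed in order by the single worker, which is what makes centralized batching decisions possible.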
Latest: v3.3.7
Tracking since: Feb 3, 2023
Last checked: Apr 21, 2026