v1.9.0

Text Embeddings Inference, February 17, 2026

<img width="1800" height="972" alt="text-embeddings-inference-v1 9 0" src="https://github.com/user-attachments/assets/fe3751d1-1a3a-4b1f-8cf5-5c2326c14a62" />

What's changed?

🚨 Breaking changes

Default HiddenAct::Gelu to GeLU + tanh in favour of GeLU erf by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/753

The default GeLU implementation is now the GeLU + tanh approximation instead of exact GeLU (a.k.a. GeLU erf), so that CPU and CUDA embeddings match, since cuBLASLt only supports GeLU + tanh. This is a slight misalignment with Transformers, which uses GeLU erf when hidden_act="gelu" is set in config.json. The numerical differences between GeLU + tanh and GeLU erf should have negligible impact on inference quality.
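To get a feel for the size of that difference, here is a minimal sketch in plain Python that compares the two standard formulas. The function names (`gelu_erf`, `gelu_tanh`, `max_abs_diff`) and the sampled range are illustrative assumptions, not code from TEI or Transformers.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact GeLU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """GeLU with the usual tanh approximation of Phi(x)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


def max_abs_diff(lo: float = -6.0, hi: float = 6.0, steps: int = 1201) -> float:
    """Largest absolute gap between the two variants on an evenly spaced grid.

    The range and step count are arbitrary choices for illustration.
    """
    worst = 0.0
    for i in range(steps):
        x = lo + (hi - lo) * i / (steps - 1)
        worst = max(worst, abs(gelu_erf(x) - gelu_tanh(x)))
    return worst


if __name__ == "__main__":
    print(f"max |gelu_erf - gelu_tanh| on [-6, 6]: {max_abs_diff():.2e}")
```

On this grid the per-activation gap stays below 1e-3. How that propagates through a full model depends on the architecture, so this is a sanity check and not a guarantee of identical embeddings.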

Set --auto-truncate to true by default by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/829

--auto-truncate now defaults to true, so input sequences are truncated to the lower of --max-batch-tokens and the maximum model length. This avoids problems when --max-batch-tokens is lower than the maximum length the model actually supports.
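The rule described above can be sketched as follows. This only illustrates the "lower of the two limits" logic. The helper names (`effective_max_length`, `truncate`) and the `direction` parameter are hypothetical and do not mirror TEI's internal Rust code.

```python
from typing import List


def effective_max_length(max_batch_tokens: int, max_model_length: int) -> int:
    """Limit applied when auto-truncation is on: the lower of the two values."""
    return min(max_batch_tokens, max_model_length)


def truncate(token_ids: List[int], limit: int, direction: str = "right") -> List[int]:
    """Cut a token sequence down to at most `limit` tokens.

    Assumption: "right" drops tokens from the end, "left" drops them from
    the start. Both lowercase and capitalized spellings are accepted here.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    if len(token_ids) <= limit:
        return list(token_ids)
    side = direction.lower()
    if side == "right":
        return list(token_ids[:limit])
    if side == "left":
        return list(token_ids[-limit:])
    raise ValueError(f"unknown truncation direction: {direction!r}")


if __name__ == "__main__":
    limit = effective_max_length(max_batch_tokens=16384, max_model_length=512)
    tokens = list(range(1000))  # stand-in for real token ids
    print(limit, len(truncate(tokens, limit)))
```

With truncation on by default, an over-long input is shortened to the limit, where previously it would presumably have been rejected. Callers that relied on a rejection may want to set --auto-truncate explicitly.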

🎉 Additions

Add --served-model-name for OpenAI requests via HTTP by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/685
Extend download_onnx to download sharded ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/817
Add support for llama 2 by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/802
Add support for blackwell architecture (sm100, sm120) by @danielealbano in https://github.com/huggingface/text-embeddings-inference/pull/735
Add support for Llama 3 and Nemotron by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/805
Add support for DebertaV2 by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/746
Add bidirectional attention and projection layer support for Qwen3-based models by @williambarberjr in https://github.com/huggingface/text-embeddings-inference/pull/808

🐛 Fixes

Fix reading non-standard config for past_key_values in ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/751
Fix TruncationDirection to deserialize from lowercase and capitalized by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/755
Fix sagemaker-entrypoint* & remove SageMaker and Vertex from Dockerfile* by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/699
Bug: Critical accuracy bugs for model_type=qwen2: no causal attention and wrong tokenizer by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/762
Fix config.json reading w/ aliases for ORT by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/786
Fix HTTP error code for validation by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/818
Fix to acquire the permit in a blocking way by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/726
Read Hugging Face Hub token from cache if not provided by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/814
Align the normalize param between the gRPC and HTTP /embed interfaces by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/810

⚡ Improvements

Serialization in tokio thread instead of blocking thread, 50% reduction in latency for small models by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/767
Remove default --model-id argument by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/679
feat: better Tokenization # workers heuristic by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/766
add faster index select kernel by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/773
feat: speedup Parallel safetensors download by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/765
feat: startup time: add cloned tokenizer fix, saves ~1-20s cold start time by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/772
Adjust the warmup phase for CPU by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/792

📄 Other

Skip Gemma3 tests when HF_TOKEN not set by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/812
Bump Rust 1.92, CUDA 12.6, Ubuntu 24.04 and add Dockerfile-cuda-blackwell-all by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/823
Update rustc version to 1.92.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/826
Add use_flash_attn for better FA + FA2 feature gating by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/825
Update CUDA to 12.9 w/ cuda-compat-12-9 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/828
Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in https://github.com/huggingface/text-embeddings-inference/pull/782
Lint: cargo fmt and clippy fix warnings by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/776
Fix rustfmt on backend/candle/tests/*.rs files by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/800
Upgrade GitHub Actions to latest versions by @salmanmkc in https://github.com/huggingface/text-embeddings-inference/pull/783
Update version to 1.9.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/830

🆕 New Contributors

@salmanmkc made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/782
@danielealbano made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/735
@williambarberjr made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/808

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.3...v1.9.0

Fetched April 7, 2026