Text Embeddings Inference

v1.9.3 (Mar 23, 2026)

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.9.2...v1.9.3

v1.9.2 (Feb 25, 2026)

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.9.1...v1.9.2

v1.9.1 (Feb 17, 2026)

What's Changed

🚨 Fix

The ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 image was released with CUDA 12.9 and the cuda-compat-12-9 package on LD_LIBRARY_PATH. When that same container ran on instances with CUDA 13.0+ drivers, loading the compat libraries caused CUDA_ERROR_SYSTEM_DRIVER_MISMATCH (error 803). This is now solved with a custom entrypoint that adds cuda-compat to LD_LIBRARY_PATH dynamically, depending on the instance's CUDA version.
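The dynamic behaviour can be sketched roughly as follows. This is an illustrative sketch only, not the actual entrypoint shipped in the image:

```shell
#!/bin/sh
# Illustrative sketch: prepend the cuda-compat libraries only when the
# host driver reports a CUDA version older than the image's toolkit
# (12.9 here). Loading cuda-compat-12-9 on a CUDA 13.0+ driver is what
# triggered CUDA_ERROR_SYSTEM_DRIVER_MISMATCH (803).

# Returns success (0) when the compat libraries should be used.
needs_cuda_compat() {
    major=${1%%.*}    # "12.4" -> "12", "13.0" -> "13"
    [ -n "$major" ] && [ "$major" -lt 13 ]
}

# The driver's CUDA version appears in the nvidia-smi banner,
# e.g. "... Driver Version: 550.54  CUDA Version: 12.4 ...".
if command -v nvidia-smi >/dev/null 2>&1; then
    driver_cuda=$(nvidia-smi | sed -n 's/.*CUDA Version: *\([0-9.]*\).*/\1/p')
    if needs_cuda_compat "$driver_cuda"; then
        export LD_LIBRARY_PATH="/usr/local/cuda/compat:${LD_LIBRARY_PATH}"
    fi
fi
```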

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.9.0...v1.9.1

v1.9.0

<img width="1800" height="972" alt="text-embeddings-inference-v1 9 0" src="https://github.com/user-attachments/assets/fe3751d1-1a3a-4b1f-8cf5-5c2326c14a62" />

What's Changed

🚨 Breaking changes

The default GeLU implementation is now GeLU with the tanh approximation instead of exact GeLU (a.k.a. GeLU erf), to make sure that CPU and CUDA embeddings are identical, as cuBLASLt only supports GeLU + tanh. This is a slight misalignment with how Transformers handles it: when hidden_act="gelu" is set in config.json, GeLU erf should be used. The numerical differences between GeLU + tanh and GeLU erf should have a negligible impact on inference quality.
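For reference, the two variants are defined as:

```latex
% Exact GeLU (erf form), selected by hidden_act="gelu" in Transformers:
\mathrm{GeLU}(x) = x\,\Phi(x)
  = \frac{x}{2}\left[1 + \operatorname{erf}\!\left(\frac{x}{\sqrt{2}}\right)\right]

% tanh approximation, now the TEI default:
\mathrm{GeLU}_{\text{tanh}}(x)
  = \frac{x}{2}\left[1 + \tanh\!\left(\sqrt{\frac{2}{\pi}}\,\bigl(x + 0.044715\,x^{3}\bigr)\right)\right]
```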

--auto-truncate now defaults to true, meaning that input sequences are truncated to the lower of --max-batch-tokens and the maximum model length. This prevents failures caused by --max-batch-tokens being lower than the actual maximum supported sequence length.
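For deployments that relied on over-long inputs being rejected, the old behaviour can presumably be restored by disabling the flag explicitly. The flag-override syntax, image tag, and model id below are illustrative; check `text-embeddings-router --help` on your version:

```shell
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.9 \
    --model-id BAAI/bge-base-en-v1.5 --max-batch-tokens 16384 \
    --auto-truncate=false
```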

🎉 Additions

🐛 Fixes

⚡ Improvements

📄 Other

🆕 New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.3...v1.9.0

v1.8.3 (Oct 30, 2025)

What's Changed

Bug Fixes

Tests, Documentation & Release

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.2...v1.8.3

v1.8.2 (Sep 9, 2025)

🔧 Fixed Intel MKL Support

Since Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the candle dependency. Neither static-linking nor dynamic-linking worked correctly, which caused models using Intel MKL on CPU to fail with errors such as: "Intel oneMKL ERROR: Parameter 13 was incorrect on entry to SGEMM".

Starting with v1.8.2, this issue has been resolved by fixing how the intel-mkl-src dependency is defined. Both features, static-linking and dynamic-linking (the default), now work correctly, ensuring that Intel MKL libraries are properly linked.

This issue occurred in the following scenarios:

  • Users installing text-embeddings-router via cargo with the --features mkl flag. Although dynamic linking should have been used, it was not working as intended.
  • Users relying on the CPU Dockerfile when running models without ONNX weights. In these cases, Safetensors weights were used with candle as the backend (with MKL optimizations) instead of ort.
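For reference, a from-source install with MKL looks like the following (as in the TEI README; with v1.8.2+ both linking modes work correctly):

```shell
git clone https://github.com/huggingface/text-embeddings-inference
cd text-embeddings-inference
# Dynamic linking (the default) via the mkl feature:
cargo install --path router -F mkl
```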

The following table shows the affected versions and containers:

| Version | Image                                                     |
|---------|-----------------------------------------------------------|
| 1.7.0   | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0   |
| 1.7.1   | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.1   |
| 1.7.2   | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2   |
| 1.7.3   | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.3   |
| 1.7.4   | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.4   |
| 1.8.0   | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.0   |
| 1.8.1   | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1   |

More details: PR #715

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.1...v1.8.2

v1.8.1 (Sep 4, 2025)
<img width="1200" height="648" alt="text-embeddings-inference-v1 8 1-embedding-gemma(1)" src="https://github.com/user-attachments/assets/8ad8fb64-cee4-409f-8488-1d10f5ffe995" />

Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.

  • CPU:
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
  • CPU with ONNX Runtime:
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id onnx-community/embeddinggemma-300m-ONNX --dtype float32 --pooling mean
  • NVIDIA CUDA:
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
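Once one of the containers above is running, embeddings can be requested over HTTP via the /embed route (the trailing guard just makes the snippet harmless when no server is listening):

```shell
# Request an embedding from the running container via the /embed route.
payload='{"inputs": "What is deep learning?"}'
curl -s --max-time 5 127.0.0.1:8080/embed \
    -X POST -H 'Content-Type: application/json' \
    -d "$payload" || echo "no server listening on :8080"
```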

Notable Changes

  • Add support for Gemma3 (text-only) architecture
  • Intel updates to Synapse 1.21.3 and IPEX 2.8
  • Extend ONNX Runtime support in OrtRuntime
    • Support position_ids and past_key_values as inputs
    • Handle padding_side and pad_token_id

What's Changed

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.0...v1.8.1

v1.8.0 (Aug 5, 2025)
<img width="3600" height="1944" alt="text-embeddings-inference-v1 8 0(2)" src="https://github.com/user-attachments/assets/50df05b6-3821-4e2a-8de0-3e5c911b2a27" />

Notable Changes

  • Qwen3 support for the 0.6B, 4B, and 8B variants on CPU and MPS, plus FlashQwen3 on CUDA and Intel HPUs
  • NomicBert MoE support
  • JinaAI Re-Rankers V1 support
  • Matryoshka Representation Learning (MRL)
  • Dense layer module support (after pooling)

> [!NOTE]
> Some of the aforementioned changes were already released within the patch versions on top of v1.7.0, while Matryoshka Representation Learning (MRL) and Dense layer module support are included here for the first time.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.0...v1.8.0

v1.7.4 (Jul 7, 2025)

Notable Changes

Qwen3 was not working correctly on CPU / MPS when sending batched requests at FP16 precision: downcasting the FP32 minimum value led to null values (it is now manually set to the FP16 minimum instead), and a to_dtype call was missing on the attention_bias when working with batches.

What's Changed

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.3...v1.7.4

v1.7.3 (Jun 30, 2025)

Notable Changes

Qwen3 support added for Intel HPU, and fixed for CPU / Metal / CUDA.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.2...v1.7.3

v1.7.2 (Jun 16, 2025)

Notable Changes

  • Added support for Qwen3 embeddings

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.1...v1.7.2

v1.7.1 (Jun 3, 2025)

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.0...v1.7.1

v1.7.0 (Apr 8, 2025)

Notable changes

  • Major dependency upgrades (candle 0.5 -> 0.8 and related)
  • Added ModernBert support by @kozistr !

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.6.1...v1.7.0

v1.6.1 (Mar 28, 2025)

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.6.0...v1.6.1

v1.6.0 (Dec 13, 2024)

What's Changed

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.5.1...v1.6.0

v1.5.1 (Nov 5, 2024)

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.5.0...v1.5.1

v1.5.0 (Jul 10, 2024)

Notable Changes

  • ONNX Runtime for CPU deployments: greatly improves CPU deployment throughput
  • Add /similarity route
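A sketch of the new route. The request shape is assumed here from the Inference API sentence-similarity format; check the TEI API reference for your version:

```shell
# Score a source sentence against candidate sentences via /similarity.
payload='{"inputs": {"source_sentence": "What is deep learning?", "sentences": ["Deep learning is a subset of machine learning.", "The weather is nice today."]}}'
curl -s --max-time 5 127.0.0.1:8080/similarity \
    -X POST -H 'Content-Type: application/json' \
    -d "$payload" || echo "no server listening on :8080"
```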

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.4.0...v1.5.0

v1.4.0 (Jul 2, 2024)

Notable Changes

  • CUDA support for the Qwen2 model architecture

What's Changed

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.3.0...v1.4.0

v1.3.0 (Jun 28, 2024)

Notable changes

  • New truncation direction parameter
  • CUDA support for the JinaCode model architecture
  • CUDA support for the Mistral model architecture
  • CUDA support for the Alibaba GTE model architecture
  • New prompt name parameter: you can now add a prompt name to the body of your request to prepend a pre-prompt to your input, based on the Sentence Transformers configuration. You can also set a default prompt or prompt name so that a pre-prompt is always added to your requests.
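A sketch of the prompt name parameter in a request body. The prompt_name field name and the "query" prompt are illustrative; the prompt text itself comes from the model's Sentence Transformers configuration:

```shell
# Embed with the pre-prompt registered as "query" in the model's
# Sentence Transformers configuration prepended to the input.
payload='{"inputs": "What is deep learning?", "prompt_name": "query"}'
curl -s --max-time 5 127.0.0.1:8080/embed \
    -X POST -H 'Content-Type: application/json' \
    -d "$payload" || echo "no server listening on :8080"
```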

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.2.3...v1.3.0

v1.2.3 (Apr 25, 2024)

What's Changed

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.2.2...v1.2.3

Latest release: v1.9.3. Tracking since Oct 13, 2023; last fetched Apr 18, 2026.