- rust-toolchain.toml before rustup on Dockerfile-{cuda,cuda-all} by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/842
- version 1.9.3 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/849

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.9.2...v1.9.3
- pad_token_id as nullable & add support for rope_parameters by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/832

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.9.1...v1.9.2
When releasing ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 with CUDA 12.9 and `cuda-compat-12-9`, there was an issue when running that same container on instances with CUDA 13.0+: the `cuda-compat-12-9` path set in `LD_LIBRARY_PATH` was leading to a `CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803` error. This is now solved with a custom entrypoint that dynamically includes `cuda-compat` on the `LD_LIBRARY_PATH` depending on the instance's CUDA version.
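The entrypoint's decision can be sketched as follows. This is a minimal Python sketch of the idea, not the actual entrypoint code: the paths, the helper names, and the exact version comparison are assumptions. The compat libraries are only needed when the host driver supports an *older* CUDA than the toolkit the image was built against; on newer drivers (e.g. CUDA 13.0+) they must be left out.

```python
def needs_cuda_compat(driver_cuda: tuple, toolkit_cuda: tuple = (12, 9)) -> bool:
    """True when the driver is older than the image's CUDA toolkit,
    i.e. when the forward-compatibility (cuda-compat) libraries are needed."""
    return driver_cuda < toolkit_cuda


def build_ld_library_path(
    driver_cuda: tuple,
    base_path: str = "/usr/local/cuda/lib64",
    compat_path: str = "/usr/local/cuda/compat",
) -> str:
    # Prepend the compat libraries only when actually needed; on CUDA 13.0+
    # drivers they would trigger CUDA_ERROR_SYSTEM_DRIVER_MISMATCH (803).
    if needs_cuda_compat(driver_cuda):
        return f"{compat_path}:{base_path}"
    return base_path
```

For example, `build_ld_library_path((12, 4))` would prepend the compat path, while `build_ld_library_path((13, 0))` would leave it out.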
Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.9.0...v1.9.1
- HiddenAct::Gelu to GeLU + tanh in favour of GeLU erf by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/753

The default GeLU implementation is now the GeLU + tanh approximation instead of exact GeLU (a.k.a. GeLU erf), to make sure that the CPU and CUDA embeddings match (as cuBLASLt only supports GeLU + tanh). This is a slight misalignment with how Transformers handles it: when `hidden_act="gelu"` is set in `config.json`, GeLU erf should be used. The numerical differences between GeLU + tanh and GeLU erf should have negligible impact on inference quality.
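The difference between the two variants can be seen in a short sketch. The formulas below are the standard exact GeLU and the standard tanh approximation (not TEI's actual kernels, which live in candle/cuBLASLt):

```python
import math


def gelu_erf(x: float) -> float:
    # Exact GeLU: x * Phi(x), with Phi the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    # Tanh approximation (the variant cuBLASLt supports).
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))
```

Evaluating both over a typical activation range shows absolute differences on the order of 1e-3 or less, which is why the impact on embedding quality is expected to be negligible.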
- --auto-truncate to true by default by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/829

`--auto-truncate` now defaults to true, meaning that sequences will be truncated to the lower of `--max-batch-tokens` and the maximum model length, so that requests do not fail when `--max-batch-tokens` is lower than the actual maximum supported length.
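The resulting truncation limit can be sketched as a one-liner (a hypothetical helper for illustration, not TEI's actual code):

```python
def truncation_limit(max_batch_tokens: int, model_max_length: int) -> int:
    # Sequences are truncated to the lower of the two values, so a single
    # sequence can never exceed the batch token budget or the model's
    # supported context length.
    return min(max_batch_tokens, model_max_length)
```

For example, with `--max-batch-tokens 512` and a model maximum of 8192 tokens, sequences would be truncated to 512 tokens instead of erroring out.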
- --served-model-name for OpenAI requests via HTTP by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/685
- download_onnx to download sharded ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/817
- past_key_values in ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/751
- TruncationDirection to deserialize from lowercase and capitalized by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/755
- sagemaker-entrypoint* & remove SageMaker and Vertex from Dockerfile* by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/699
- config.json reading w/ aliases for ORT by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/786
- normalize param between the gRPC and HTTP /embed interfaces by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/810
- --model-id argument by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/679
- HF_TOKEN not set by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/812
- Dockerfile-cuda-blackwell-all by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/823
- rustc version to 1.92.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/826
- use_flash_attn for better FA + FA2 feature gating by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/825
- cuda-compat-12-9 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/828
- rustfmt on backend/candle/tests/*.rs files by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/800
- version to 1.9.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/830

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.3...v1.9.0
- max_input_length is bigger than max-batch-tokens by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/725
- modules.json for Dense modules in local models by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/738
- test_gemma3.rs for EmbeddingGemma by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/718
- HF_TOKEN in ApiBuilder for candle/tests by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/724
- cargo install commands for candle with CUDA by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/719
- version to 1.8.3 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/745

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.2...v1.8.3
Since Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the candle dependency. Neither static-linking nor dynamic-linking worked correctly, which caused models using Intel MKL on CPU to fail with errors such as: "Intel oneMKL ERROR: Parameter 13 was incorrect on entry to SGEMM".
Starting with v1.8.2, this issue has been resolved by fixing how the intel-mkl-src dependency is defined. Both features, static-linking and dynamic-linking (the default), now work correctly, ensuring that Intel MKL libraries are properly linked.
This issue occurred in the following scenarios:
- Installing `text-embeddings-router` via cargo with the `--features mkl` flag. Although dynamic-linking should have been used, it was not working as intended.
- Using the Dockerfile when running models without ONNX weights. In these cases, Safetensors weights were used with candle as the backend (with MKL optimizations), instead of ort.

The following table shows the affected versions and containers:
| Version | Image |
|---|---|
| 1.7.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0 |
| 1.7.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.1 |
| 1.7.2 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2 |
| 1.7.3 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.3 |
| 1.7.4 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.4 |
| 1.8.0 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.0 |
| 1.8.1 | ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 |
More details: PR #715
Full Changelog: v1.8.1...v1.8.2
Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.
```shell
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
```

```shell
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id onnx-community/embeddinggemma-300m-ONNX --dtype float32 --pooling mean
```

```shell
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
```
OrtRuntime:

- position_ids and past_key_values as inputs
- padding_side and pad_token_id

What's changed:

- extra_args to trufflehog to exclude unverified results by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/696
- USE_FLASH_ATTENTION by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/692
- position_ids and past_key_values in OrtBackend by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/700
- modules.json to identify default Dense modules by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/701
- padding_side and pad_token_id in OrtBackend by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/705
- docs/openapi.json for v1.8.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/708
- version to 1.8.1 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/712

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.0...v1.8.1
> [!NOTE]
> Some of the aforementioned changes were released within the patch versions on top of v1.7.0, whilst both Matryoshka Representation Learning (MRL) and Dense layer module support were only recently included and had not been released yet.
- README.md and supported_models.md by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/572
- sccache to 0.10.0 and sccache-action to 0.0.9 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/586
- head. prefix in the weight name in ModernBertClassificationHead by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/591
- text-embeddings-router --help output by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/603
- jinaai/jina-embeddings-v2-base-code to avoid auto_map to other repository by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/612
- Qwen3Model by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/627
- HiddenAct::Silu (remove serde alias) by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/631
- README.md and docs/ examples by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/641
- version to 1.7.3 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/666
- fmt by re-running pre-commit by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/671
- version to 1.7.4 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/677
- Dense layer for 2_Dense/ modules by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/660
- version to 1.8.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/686

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.0...v1.8.0
Qwen3 was not working correctly on CPU / MPS when sending batched requests with FP16 precision, due to the FP32 minimum value being downcast to FP16 (now manually set to the FP16 minimum value instead), leading to null values, as well as a missing `to_dtype` call on the `attention_bias` when working with batches.
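The downcast issue can be reproduced outside TEI. A minimal NumPy sketch (for illustration only, not the actual candle code) of why the FP32 minimum cannot be used as an FP16 mask fill value:

```python
import numpy as np

# Downcasting the FP32 minimum to FP16 overflows to -inf, which then
# produces NaN/null values once it flows through softmax-style arithmetic.
fp32_min = np.finfo(np.float32).min
downcast = np.float32(fp32_min).astype(np.float16)
print(downcast)  # -inf

# Using the FP16 minimum directly stays finite, which mirrors the fix.
fp16_min = np.float16(np.finfo(np.float16).min)
print(fp16_min)  # -65500.0, representable in FP16
```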
- fmt by re-running pre-commit by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/671
- version to 1.7.4 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/677

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.3...v1.7.4
Qwen3 support included for Intel HPU, and fixed for CPU / Metal / CUDA.
- README.md and docs/ examples by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/641
- version to 1.7.3 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/666

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.2...v1.7.3
- Qwen3Model by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/627
- HiddenAct::Silu (remove serde alias) by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/631

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.1...v1.7.2
- README.md and supported_models.md by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/572
- sccache to 0.10.0 and sccache-action to 0.0.9 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/586
- head. prefix in the weight name in ModernBertClassificationHead by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/591
- text-embeddings-router --help output by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/603
- jinaai/jina-embeddings-v2-base-code to avoid auto_map to other repository by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/612

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.0...v1.7.1
- sliding_window for Qwen2 optional by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/546
- serde deserializer for JinaBERT models by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/559
- ModernBert model by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/459
- {Bert,DistilBert}SpladeHead when loading from Safetensors by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/564
- docs/source/en/custom_container.md by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/568

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.6.1...v1.7.0
- cargo-chef installation to 0.1.62 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/469
- TRUST_REMOTE_CODE param to python backend by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/485
- MaskedLanguageModel class by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/513
- te_request_count metric by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/486
- README.md to include ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/507
- HF_HUB_USER_AGENT_ORIGIN by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/534
- --hf-token instead of --hf-api-token by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/535
- disable-spans to toggle span trace logging by @obloomfield in https://github.com/huggingface/text-embeddings-inference/pull/481
- VarBuilder handling in GTE e.g. gte-multilingual-reranker-base by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/538
- safetensor file by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/515
- match on onnx/model.onnx download by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/472

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.6.0...v1.6.1
Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.5.1...v1.6.0
- model.onnx_data by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/343
- ort crate version to 2.0.0-rc.4 to support onnx IR version 10 by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/361

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.5.0...v1.5.1
- /similarity route by @OlivierDehaene in https://github.com/huggingface/text-embeddings-inference/pull/331

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.4.0...v1.5.0
Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.3.0...v1.4.0
- HUGGING_FACE_HUB_TOKEN to HF_API_TOKEN in README by @kevinhu in https://github.com/huggingface/text-embeddings-inference/pull/263

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.2.3...v1.3.0
Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.2.2...v1.2.3