HiddenAct::Gelu to GeLU + tanh in favour of GeLU erf by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/753
The default GeLU implementation is now the GeLU + tanh approximation instead of exact GeLU (a.k.a. GeLU erf), so that CPU and CUDA embeddings match, since cuBLASLt only supports GeLU + tanh. This is a slight departure from how Transformers handles it: there, GeLU erf is used when hidden_act="gelu" is set in config.json. The numerical differences between GeLU + tanh and GeLU erf should have negligible impact on inference quality.
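To get a feel for how close the two variants are, here is a minimal sketch that evaluates the standard textbook formulas for exact GeLU (erf) and the tanh approximation and measures their largest gap over a sample range. The function names gelu_erf and gelu_tanh and the sampled interval are illustrative choices, not code from this repository.

```python
import math


def gelu_erf(x: float) -> float:
    """Exact GeLU: x * Phi(x), with Phi the standard normal CDF."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def gelu_tanh(x: float) -> float:
    """Tanh approximation of GeLU (the variant now used by default)."""
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))


# Sample [-6, 6] in steps of 0.01 (an arbitrary, illustrative range).
xs = [i / 100.0 for i in range(-600, 601)]
max_abs_diff = max(abs(gelu_erf(x) - gelu_tanh(x)) for x in xs)
print(f"max |gelu_erf - gelu_tanh| on [-6, 6]: {max_abs_diff:.2e}")
```

The gap is small compared with typical activation magnitudes, which is consistent with the note above that the change should have negligible impact on inference quality.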
--auto-truncate to true by default by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/829
--auto-truncate now defaults to true, meaning that sequences are truncated to the lower of --max-batch-tokens and the maximum model length. This guards against cases where --max-batch-tokens is lower than the model's actual maximum supported length.
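The truncation rule described above can be sketched as follows. This is an illustration only, not the server's implementation: effective_max_length and truncate are hypothetical helpers, and keeping the leading tokens is an assumption made for the example.

```python
from typing import List


def effective_max_length(max_batch_tokens: int, max_model_length: int) -> int:
    """Hypothetical helper: the limit is the lower of the two bounds."""
    return min(max_batch_tokens, max_model_length)


def truncate(token_ids: List[int], max_batch_tokens: int, max_model_length: int) -> List[int]:
    """Hypothetical helper: keep at most `limit` leading tokens (assumed direction)."""
    limit = effective_max_length(max_batch_tokens, max_model_length)
    return list(token_ids[:limit])


# A 10-token input with --max-batch-tokens 4 and a model limit of 8
# is cut to 4 tokens, because 4 is the lower bound.
print(truncate(list(range(10)), max_batch_tokens=4, max_model_length=8))
```

Inputs already within the limit pass through unchanged, so enabling truncation by default only affects requests that would otherwise exceed one of the two bounds.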
--served-model-name for OpenAI requests via HTTP by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/685
download_onnx to download sharded ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/817
past_key_values in ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/751
TruncationDirection to deserialize from lowercase and capitalized by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/755
sagemaker-entrypoint* & remove SageMaker and Vertex from Dockerfile* by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/699
config.json reading w/ aliases for ORT by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/786
normalize param between the gRPC and HTTP /embed interfaces by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/810
--model-id argument by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/679
HF_TOKEN not set by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/812
Dockerfile-cuda-blackwell-all by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/823
rustc version to 1.92.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/826
use_flash_attn for better FA + FA2 feature gating by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/825
cuda-compat-12-9 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/828
rustfmt on backend/candle/tests/*.rs files by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/800
version to 1.9.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/830

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.3...v1.9.0
Fetched April 7, 2026