{"id":"prod_SJzp_HiFczjtIySFAwlVd","name":"Inference","slug":"inference","orgId":"org_GDdYeYynEgCEBNBwy-m6s","url":null,"description":"Optimized inference servers for text and embeddings","category":"ai","kind":"platform","avatarUrl":null,"createdAt":"2026-04-10T16:06:55.274Z","embeddedAt":"2026-04-15T16:19:40.805Z","deletedAt":null,"sources":[{"id":"src_YM5oUOL-MSWOWf517en37","slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github","url":"https://github.com/huggingface/text-embeddings-inference","metadata":"{\"evaluatedMethod\":\"github\",\"evaluatedAt\":\"2026-04-07T17:19:19.545Z\",\"changelogDetectedAt\":\"2026-04-07T17:28:46.634Z\",\"wellKnownSweptAt\":\"2026-06-24T06:00:01.224Z\"}","kind":"platform"},{"id":"src_8VF2j2OWHfhvBPnI2jCTO","slug":"text-generation-inference","name":"Text Generation Inference","type":"github","url":"https://github.com/huggingface/text-generation-inference","metadata":"{\"evaluatedMethod\":\"github\",\"evaluatedAt\":\"2026-04-07T17:19:16.315Z\",\"changelogDetectedAt\":\"2026-04-07T17:28:40.142Z\",\"wellKnownSweptAt\":\"2026-06-24T06:00:01.224Z\"}","kind":"platform"}],"tags":["inference","rust","serving"],"aliases":[],"notice":null,"releases":[{"id":"rel_ZjYO-ZfPKrOPwOZSmU-m7","version":"v1.9.3","type":"feature","title":"v1.9.3","summary":"## What's Changed\r\n* Use `rust-toolchain.toml` before `rustup` on `Dockerfile-{cuda,cuda-all}` by @alvarobartt in https://github.com/huggingface/text-...","titleGenerated":null,"titleShort":null,"content":"## What's Changed\r\n* Use `rust-toolchain.toml` before `rustup` on `Dockerfile-{cuda,cuda-all}` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/842\r\n* fix(backend): replace bare except with Exception in device check by @llukito in https://github.com/huggingface/text-embeddings-inference/pull/821\r\n* Set `version` 1.9.3 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/849\r\n\r\n## New 
Contributors\r\n* @llukito made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/821\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.9.2...v1.9.3","publishedAt":"2026-03-23T11:57:19.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.9.3","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_u5i8Y5-E1Ty1fTgVeHnUB","version":"v1.9.2","type":"feature","title":"v1.9.2","summary":"## What's Changed\r\n\r\n* Fix auto-truncate false setting by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/836\r\n* Set `pad_to...","titleGenerated":null,"titleShort":null,"content":"## What's Changed\r\n\r\n* Fix auto-truncate false setting by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/836\r\n* Set `pad_token_id` as nullable & add support for `rope_parameters` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/832\r\n* docs: add Homebrew installation to README by @Peredery in https://github.com/huggingface/text-embeddings-inference/pull/834\r\n* feat: support pplx-embed-v1 by @mkrimmel-pplx in https://github.com/huggingface/text-embeddings-inference/pull/824\r\n\r\n## New Contributors\r\n* @Peredery made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/834\r\n* @mkrimmel-pplx made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/824\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/text-embeddings-inference/compare/v1.9.1...v1.9.2","publishedAt":"2026-02-25T11:17:59.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.9.2","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_jBoNJNG0GwO7Qj50CRnpr","version":"v1.9.1","type":"feature","title":"v1.9.1","summary":"## What's Changed\r\n\r\n### 🚨 Fix\r\n\r\n* Fix support for containers w/ CUDA 13.0+ by @alvarobartt in https://github.com/huggingface/text-embeddings-infere...","titleGenerated":null,"titleShort":null,"content":"## What's Changed\r\n\r\n### 🚨 Fix\r\n\r\n* Fix support for containers w/ CUDA 13.0+ by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/831\r\n> When releasing ghcr.io/huggingface/text-embeddings-inference:cuda-1.9 with CUDA 12.9 and `cuda-compat-12-9` there was an issue when running that same container on instances with CUDA 13.0+, as the `cuda-compat-12-9` set in `LD_LIBRARY_PATH` was leading to a `CUDA_ERROR_SYSTEM_DRIVER_MISMATCH = 803`, which is now solved with a custom entrypoint that dynamically includes the `cuda-compat` on the `LD_LIBRARY_PATH` depending on the instance CUDA version.\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.9.0...v1.9.1","publishedAt":"2026-02-17T20:59:31.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.9.1","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings 
Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_Y8D6Bpmw3UbyRwtU5a0H4","version":"v1.9.0","type":"feature","title":"v1.9.0","summary":"<img width=\"1800\" height=\"972\" alt=\"text-embeddings-inference-v1 9 0\" src=\"https://github.com/user-attachments/assets/fe3751d1-1a3a-4b1f-8cf5-5c2326c1...","titleGenerated":null,"titleShort":null,"content":"<img width=\"1800\" height=\"972\" alt=\"text-embeddings-inference-v1 9 0\" src=\"https://github.com/user-attachments/assets/fe3751d1-1a3a-4b1f-8cf5-5c2326c14a62\" />\r\n\r\n## What's changed?\r\n\r\n### 🚨 Breaking changes\r\n\r\n* Default `HiddenAct::Gelu` to GeLU + tanh in favour of GeLU erf  by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/753\r\n\r\n> Default GeLU implementation is now GeLU + tanh approximation instead of exact GeLU (aka. GeLU erf) to make sure that the CPU and CUDA embeddings are the same (as cuBLASlt only supports GeLU + tanh), which represents a slight misalignment from how Transformers handles it, as when `hidden_act=\"gelu\"` is set in `config.json`, GeLU erf should be used. 
The numerical differences between GeLU + tanh and GeLU erf should have negligible impact on inference quality.\r\n\r\n* Set `--auto-truncate` to `true` by default by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/829\r\n\r\n> `--auto-truncate` now defaults to true, meaning that the sequences will be truncated to the lower value between the `--max-batch-tokens` or the maximum model length, to prevent the `--max-batch-tokens` from being lower than the actual maximum supported length.\r\n\r\n### 🎉 Additions\r\n\r\n* Add `--served-model-name` for OpenAI requests via HTTP by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/685\r\n* Extend `download_onnx` to download sharded ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/817\r\n* Add support for llama 2 by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/802\r\n* Add support for blackwell architecture (sm100, sm120) by @danielealbano in https://github.com/huggingface/text-embeddings-inference/pull/735\r\n* Mf/add-support-for-llama-3-and-nemotron by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/805\r\n* Add support for DebertaV2 by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/746\r\n* Add bidirectional attention and projection layer support for Qwen3-based models by @williambarberjr in https://github.com/huggingface/text-embeddings-inference/pull/808\r\n\r\n### 🐛 Fixes\r\n\r\n* Fix reading non-standard config for `past_key_values` in ONNX by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/751\r\n* Fix `TruncationDirection` to deserialize from lowercase and capitalized by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/755\r\n* Fix `sagemaker-entrypoint*` & remove SageMaker and Vertex from `Dockerfile*` by @alvarobartt in 
https://github.com/huggingface/text-embeddings-inference/pull/699\r\n* Bug: Critical accuracy bugs for model_type=qwen2: no causal attention and wrong tokenizer by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/762\r\n* Fix `config.json` reading w/ aliases for ORT by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/786\r\n* Fix HTTP error code for validation by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/818\r\n* Fix to acquire the permit in a blocking way by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/726\r\n* Read Hugging Face Hub token from cache if not provided by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/814\r\n* Align the `normalize` param between the gRPC and HTTP /embed interfaces by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/810\r\n\r\n### ⚡ Improvements\r\n\r\n* Serialization in tokio thread instead of blocking thread, 50% reduction in latency for small models by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/767\r\n* Remove default `--model-id` argument by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/679\r\n* feat: better Tokenization # workers heuristic by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/766\r\n* add faster index select kernel by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/773\r\n* feat: speedup Parallel safetensors download by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/765\r\n* feat: startup time: add cloned tokenzier fix, saves ~1-20s cold start time by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/772\r\n* Adjust the warmup phase for CPU by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/792\r\n\r\n### 📄 Other\r\n\r\n* Skip Gemma3 
tests when `HF_TOKEN` not set by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/812\r\n* Bump Rust 1.92, CUDA 12.6, Ubuntu 24.04 and add `Dockerfile-cuda-blackwell-all` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/823\r\n* Update `rustc` version to 1.92.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/826\r\n* Add `use_flash_attn` for better FA + FA2 feature gating by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/825\r\n* Update CUDA to 12.9 w/ `cuda-compat-12-9` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/828\r\n* Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in https://github.com/huggingface/text-embeddings-inference/pull/782\r\n* Lint: cargo fmt and clippy fix warnings by @michaelfeil in https://github.com/huggingface/text-embeddings-inference/pull/776\r\n* Fix `rustfmt` on `backend/candle/tests/*.rs` files by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/800\r\n* Upgrade GitHub Actions to latest versions by @salmanmkc in https://github.com/huggingface/text-embeddings-inference/pull/783\r\n* Update `version` to 1.9.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/830\r\n\r\n## 🆕 New Contributors\r\n* @salmanmkc made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/782\r\n* @danielealbano made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/735\r\n* @williambarberjr made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/808\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/text-embeddings-inference/compare/v1.8.3...v1.9.0","publishedAt":"2026-02-17T13:42:14.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.9.0","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_gWHs7FC2nAiWHjn56O08J","version":"v3.3.7","type":"feature","title":"v3.3.7","summary":"## What's Changed\r\n* misc(gha): expose action cache url and runtime as secrets by @mfuntowicz in https://github.com/huggingface/text-generation-infere...","titleGenerated":null,"titleShort":null,"content":"## What's Changed\r\n* misc(gha): expose action cache url and runtime as secrets by @mfuntowicz in https://github.com/huggingface/text-generation-inference/pull/2964\r\n* feat: support max_image_fetch_size to limit by @drbh in https://github.com/huggingface/text-generation-inference/pull/3339\r\n* Maintenance mode by @LysandreJik in https://github.com/huggingface/text-generation-inference/pull/3344\r\n* Maintenance mode by @LysandreJik in https://github.com/huggingface/text-generation-inference/pull/3345\r\n* fix(num_devices): fix num_shard/num device auto compute when NVIDIA_VISIBLE_DEVICES == \"all\" or \"void\" by @oOraph in https://github.com/huggingface/text-generation-inference/pull/3346\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.6...v3.3.7","publishedAt":"2025-12-19T14:35:25.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.7","media":[],"prerelease":false,"source":{"slug":"text-generation-inference","name":"Text Generation 
Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_Nw3le7FfL8wAm4wA6OGOY","version":"v1.8.3","type":"feature","title":"v1.8.3","summary":"## What's Changed\r\n\r\n### Bug Fixes\r\n\r\n* Fix error code for empty requests by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull...","titleGenerated":null,"titleShort":null,"content":"## What's Changed\r\n\r\n### Bug Fixes\r\n\r\n* Fix error code for empty requests by @vrdn-23 in https://github.com/huggingface/text-embeddings-inference/pull/727\r\n* Fix the infinite loop when `max_input_length` is bigger than `max-batch-tokens` by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/725\r\n* Fix reading `modules.json` for `Dense` modules in local models by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/738\r\n\r\n### Tests, Documentation & Release\r\n\r\n* Add `test_gemma3.rs` for EmbeddingGemma by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/718\r\n* Fix OpenAI client usage example for embeddings by @ZahraDehghani99 in https://github.com/huggingface/text-embeddings-inference/pull/720\r\n* Handle `HF_TOKEN` in `ApiBuilder` for `candle/tests` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/724\r\n* Fix `cargo install` commands for `candle` with CUDA by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/719\r\n* Update `version` to 1.8.3 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/745\r\n\r\n## New Contributors\r\n* @ZahraDehghani99 made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/720\r\n* @vrdn-23 made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/727\r\n\r\n**Full 
Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.2...v1.8.3","publishedAt":"2025-10-30T09:08:18.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.8.3","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_5Fl1DiTs67D6onh213me0","version":"v3.3.6","type":"feature","title":"v3.3.6","summary":"## What's Changed\r\n* Add missing backslash by @philsupertramp in https://github.com/huggingface/text-generation-inference/pull/3311\r\n* Revert \"feat: b...","titleGenerated":null,"titleShort":null,"content":"## What's Changed\r\n* Add missing backslash by @philsupertramp in https://github.com/huggingface/text-generation-inference/pull/3311\r\n* Revert \"feat: bump flake including transformers and huggingface_hub versions\" by @drbh in https://github.com/huggingface/text-generation-inference/pull/3323\r\n* fix: remove azure by @drbh in https://github.com/huggingface/text-generation-inference/pull/3325\r\n* Fix mask passed to flashinfer by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3324\r\n* Update iframe sources for streaming demo by @coyotte508 in https://github.com/huggingface/text-generation-inference/pull/3327\r\n* Revert \"Revert \"feat: bump flake including transformers and huggingfa… by @drbh in https://github.com/huggingface/text-generation-inference/pull/3326\r\n* Revert \"feat: bump flake including transformers and huggingface_hub versions\" by @drbh in https://github.com/huggingface/text-generation-inference/pull/3330\r\n* Patch version 3.3.6 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3329\r\n\r\n## New Contributors\r\n* @philsupertramp made their first contribution in 
https://github.com/huggingface/text-generation-inference/pull/3311\r\n* @coyotte508 made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3327\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.5...v3.3.6","publishedAt":"2025-09-17T00:48:54.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.6","media":[],"prerelease":false,"source":{"slug":"text-generation-inference","name":"Text Generation Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_DtdRMbQKjMithLTXCaRWm","version":"v1.8.2","type":"feature","title":"v1.8.2","summary":"## 🔧 Fixed Intel MKL Support\r\n\r\nSince Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the `candle` depend...","titleGenerated":null,"titleShort":null,"content":"## 🔧 Fixed Intel MKL Support\r\n\r\nSince Text Embeddings Inference (TEI) v1.7.0, Intel MKL support had been broken due to changes in the `candle` dependency. Neither `static-linking` nor `dynamic-linking` worked correctly, which caused models using Intel MKL on CPU to fail with errors such as:  \"Intel oneMKL ERROR: Parameter 13 was incorrect on entry to SGEMM\".\r\n\r\nStarting with v1.8.2, this issue has been resolved by fixing how the `intel-mkl-src` dependency is defined. Both features, `static-linking` and `dynamic-linking` (the default), now work correctly, ensuring that Intel MKL libraries are properly linked.\r\n\r\nThis issue occurred in the following scenarios:\r\n- Users installing `text-embeddings-router` via `cargo` with the `--feature mkl` flag. Although `dynamic-linking` should have been used, it was not working as intended.\r\n- Users relying on the CPU `Dockerfile` when running models without ONNX weights. 
In these cases, Safetensors weights were used with `candle` as backend (with MKL optimizations), instead of `ort`.\r\n\r\nThe following table shows the affected versions and containers:\r\n\r\n| Version | Image |\r\n|---------|-------|\r\n| 1.7.0   | `ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.0` |\r\n| 1.7.1   | `ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.1` |\r\n| 1.7.2   | `ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.2` |\r\n| 1.7.3   | `ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.3` |\r\n| 1.7.4   | `ghcr.io/huggingface/text-embeddings-inference:cpu-1.7.4` |\r\n| 1.8.0   | `ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.0` |\r\n| 1.8.1   | `ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1` |\r\n\r\nMore details: [PR #715](https://github.com/huggingface/text-embeddings-inference/pull/715)\r\n\r\n**Full Changelog**: [v1.8.1...v1.8.2](https://github.com/huggingface/text-embeddings-inference/compare/v1.8.1...v1.8.2)","publishedAt":"2025-09-09T14:45:29.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.8.2","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_BwxGl_zBfEgFELWANCF50","version":"v1.8.1","type":"feature","title":"v1.8.1","summary":"<img width=\"1200\" height=\"648\" alt=\"text-embeddings-inference-v1 8 1-embedding-gemma(1)\" src=\"https://github.com/user-attachments/assets/8ad8fb64-cee4...","titleGenerated":null,"titleShort":null,"content":"<img width=\"1200\" height=\"648\" alt=\"text-embeddings-inference-v1 8 1-embedding-gemma(1)\" src=\"https://github.com/user-attachments/assets/8ad8fb64-cee4-409f-8488-1d10f5ffe995\" />\r\n\r\nToday, Google releases EmbeddingGemma, a state-of-the-art multilingual 
embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.\r\n\r\n- CPU:\r\n\r\n```bash\r\ndocker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \\\r\n    --model-id google/embeddinggemma-300m --dtype float32\r\n```\r\n\r\n- CPU with ONNX Runtime:\r\n\r\n```bash\r\ndocker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \\\r\n    --model-id onnx-community/embeddinggemma-300m-ONNX --dtype float32 --pooling mean\r\n```\r\n\r\n- NVIDIA CUDA:\r\n\r\n```bash\r\ndocker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.1 \\\r\n    --model-id google/embeddinggemma-300m --dtype float32\r\n```\r\n\r\n## Notable Changes\r\n\r\n* Add support for Gemma3 (text-only) architecture\r\n* Intel updates to Synapse 1.21.3 and IPEX 2.8\r\n* Extend ONNX Runtime support in `OrtRuntime`\r\n    * Support `position_ids` and `past_key_values` as inputs\r\n    * Handle `padding_side` and `pad_token_id`\r\n\r\n## What's Changed\r\n\r\n* Adjust HPU warmup: use dummy inputs with shape more close to real scenario  by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/689\r\n* Add `extra_args` to `trufflehog` to exclude unverified results by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/696\r\n* Update GitHub templates & fix mentions to Text Embeddings Inference by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/697\r\n* Disable Flash Attention with `USE_FLASH_ATTENTION` by @alvarobartt in 
https://github.com/huggingface/text-embeddings-inference/pull/692\r\n* Add support for `position_ids` and `past_key_values` in `OrtBackend` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/700\r\n* HPU upgrade to Synapse 1.21.3 by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/703\r\n* Upgrade to IPEX 2.8 by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/702\r\n* Parse `modules.json` to identify default `Dense` modules by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/701\r\n* Add `padding_side` and `pad_token_id` in `OrtBackend` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/705\r\n* Update `docs/openapi.json` for v1.8.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/708\r\n* Add Gemma3 architecture (text-only) by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/711\r\n* Update `version` to 1.8.1 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/712\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.0...v1.8.1","publishedAt":"2025-09-04T15:22:14.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.8.1","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_hjBniksH_PR3kY-c8fL3_","version":"v3.3.5","type":"feature","title":"v3.3.5","summary":"## What's Changed\r\n* [gaudi] Refine rope memory, do not need to keep sin/cos cache per layer by @sywangyi in https://github.com/huggingface/text-gener...","titleGenerated":null,"titleShort":null,"content":"## What's Changed\r\n* [gaudi] Refine 
rope memory, do not need to keep sin/cos cache per layer by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3274\r\n* Gaudi: add CI by @baptistecolle in https://github.com/huggingface/text-generation-inference/pull/3160\r\n* [gaudi] Gemma3 sliding window support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3280\r\n* xpu lora support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3232\r\n* Optimum neuron 0.2.2 by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3281\r\n* [gaudi] Remove unnecessary reinitialize to HeterogeneousNextTokenChooser to m… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3284\r\n* [gaudi] Deepseek v2 mla and add ep to unquantized moe by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3287\r\n* [gaudi] Fix the CI test errors by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3286\r\n* Hpu gptq gidx support by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3297\r\n* Migrate to V2 Pydantic interface by @emmanuel-ferdman in https://github.com/huggingface/text-generation-inference/pull/3262\r\n* Xccl by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3252\r\n* Multi modality fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3283\r\n* some gptq case could not be handled by ipex. 
but could be handle by t… by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3298\r\n* fix outline import issue by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3282\r\n* HuggingFaceM4/Idefics3-8B-Llama3 crash fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3267\r\n* Optimum neuron 0.3.0 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3308\r\n* Disable Cachix pushes by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3312\r\n* chore: prepare version 3.3.5 by @tengomucho in https://github.com/huggingface/text-generation-inference/pull/3314\r\n* feat: bump flake including transformers and huggingface_hub versions by @drbh in https://github.com/huggingface/text-generation-inference/pull/3313\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.4...git","publishedAt":"2025-09-02T15:02:33.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.5","media":[],"prerelease":false,"source":{"slug":"text-generation-inference","name":"Text Generation Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_hK3Gxqm8pCSBYATJ1Wdsa","version":"v1.8.0","type":"feature","title":"v1.8.0","summary":"<img width=\"3600\" height=\"1944\" alt=\"text-embeddings-inference-v1 8 0(2)\" src=\"https://github.com/user-attachments/assets/50df05b6-3821-4e2a-8de0-3e5c...","titleGenerated":null,"titleShort":null,"content":"<img width=\"3600\" height=\"1944\" alt=\"text-embeddings-inference-v1 8 0(2)\" src=\"https://github.com/user-attachments/assets/50df05b6-3821-4e2a-8de0-3e5c911b2a27\" />\r\n\r\n## Notable Changes\r\n\r\n- Qwen3 support for 0.6B, 4B and 8B on CPU, MPS, and FlashQwen3 on CUDA and Intel HPUs\r\n- 
NomicBert MoE support\r\n- JinaAI Re-Rankers V1 support\r\n- Matryoshka Representation Learning (MRL)\r\n- Dense layer module support (after pooling)\r\n\r\n> [!NOTE]\r\n> Some of the aforementioned changes were released within the patch versions on top of v1.7.0, whilst both Matryoshka Representation Learning (MRL) and Dense layer module support have been recently included and were not released yet.\r\n\r\n## What's Changed\r\n\r\n* [Docs] Update quick tour by @NielsRogge in https://github.com/huggingface/text-embeddings-inference/pull/574\r\n* Update `README.md` and `supported_models.md` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/572\r\n* Back with linting. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/577\r\n* [Docs] Add cloud run example by @NielsRogge in https://github.com/huggingface/text-embeddings-inference/pull/573\r\n* Fixup by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/578\r\n* Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/576\r\n* Removing requirements file. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/585\r\n* Removing candle-extensions to live on crates.io by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/583\r\n* Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/586\r\n* optimize the performance of FlashBert Path for HPU by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/575\r\n* Revert \"Removing requirements file. 
(#585)\" by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/588\r\n* Get opentelemetry trace id from request headers by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/425\r\n* Add argument for configuring Prometheus port by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/589\r\n* Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/591\r\n* Fixing the CI (grpc path). by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/593\r\n* fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/595\r\n* enable flash mistral model for HPU device by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/594\r\n* remove optimum-habana dependency by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/599\r\n* Support NomicBert MoE by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/596\r\n* Remove duplicate short option '-p' to fix router executable by @cebtenzzre in https://github.com/huggingface/text-embeddings-inference/pull/602\r\n* Update `text-embeddings-router --help` output by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/603\r\n* Warmup padded models too. 
by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/592\r\n* Add support for JinaAI Re-Rankers V1 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/582\r\n* Gte diffs by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/604\r\n* Fix the weight name in GTEClassificationHead by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/606\r\n* upgrade pytorch and ipex to 2.7 version by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/607\r\n* upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/608\r\n* Patch DistilBERT variants with different weight keys by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/614\r\n* add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/612\r\n* Add mean pooling strategy for Modernbert classifier by @kwnath in https://github.com/huggingface/text-embeddings-inference/pull/616\r\n* Using serde for pool validation. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/620\r\n* Preparing the update to 1.7.1 by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/623\r\n* Adding suggestions to fixing missing ONNX files. 
by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/624\r\n* Add `Qwen3Model` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/627\r\n* Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/631\r\n* Add CPU support for Qwen3-Embedding models by @randomm in https://github.com/huggingface/text-embeddings-inference/pull/632\r\n* refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/625\r\n* Support Qwen3 w/ fp32 on GPU by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/634\r\n* Preparing the release. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/639\r\n* Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/641\r\n* Fix Qwen3 by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/646\r\n* Add integration tests for Gaudi by @baptistecolle in https://github.com/huggingface/text-embeddings-inference/pull/598\r\n* Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in https://github.com/huggingface/text-embeddings-inference/pull/648\r\n* Fix FlashQwen3 by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/650\r\n* Make flake work on metal by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/654\r\n* Fixing metal backend. 
by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/655\r\n* Qwen3 hpu support by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/656\r\n* change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/659\r\n* Update `version` to 1.7.3 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/666\r\n* Add last token pooling support for ORT. by @tpendragon in https://github.com/huggingface/text-embeddings-inference/pull/664\r\n* Fix Qwen3 Embedding Float16 DType by @tpendragon in https://github.com/huggingface/text-embeddings-inference/pull/663\r\n* Fix `fmt` by re-running `pre-commit` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/671\r\n* Update `version` to 1.7.4 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/677\r\n* Support MRL (Matryoshka Representation Learning) by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/676\r\n* Add `Dense` layer for `2_Dense/` modules by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/660\r\n* Update `version` to 1.8.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/686\r\n\r\n## New Contributors\r\n* @NielsRogge made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/574\r\n* @cebtenzzre made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/602\r\n* @kwnath made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/616\r\n* @randomm made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/632\r\n* @lance-miles made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/648\r\n* @tpendragon made their first contribution in 
https://github.com/huggingface/text-embeddings-inference/pull/664\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.0...v1.8.0","publishedAt":"2025-08-05T08:31:22.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.8.0","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_luwUHLwo6gD4f51Qq-akB","version":"v1.7.4","type":"feature","title":"v1.7.4","summary":"## Noticeable Changes\r\n\r\nQwen3 was not working fine on CPU / MPS when sending batched requests on FP16 precision, due to the FP32 minimum value downca...","titleGenerated":null,"titleShort":null,"content":"## Noticeable Changes\r\n\r\nQwen3 was not working fine on CPU / MPS when sending batched requests on FP16 precision, due to the FP32 minimum value downcast (now manually set to FP16 minimum value instead) leading to `null` values, as well as a missing `to_dtype` call on the `attention_bias` when working with batches.\r\n\r\n## What's Changed\r\n\r\n* Fix Qwen3 Embedding Float16 DType by @tpendragon in https://github.com/huggingface/text-embeddings-inference/pull/663\r\n* Fix `fmt` by re-running `pre-commit` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/671\r\n* Update `version` to 1.7.4 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/677\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.3...v1.7.4","publishedAt":"2025-07-07T12:33:34.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.7.4","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings 
Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_lSE1FYm8XcqJNJ65WK6UW","version":"v1.7.3","type":"feature","title":"v1.7.3","summary":"## Noticeable Changes\r\n\r\nQwen3 support included for Intel HPU, and fixed for CPU / Metal / CUDA.\r\n\r\n## What's Changed\r\n\r\n* Default to Qwen3 in `README...","titleGenerated":null,"titleShort":null,"content":"## Noticeable Changes\r\n\r\nQwen3 support included for Intel HPU, and fixed for CPU / Metal / CUDA.\r\n\r\n## What's Changed\r\n\r\n* Default to Qwen3 in `README.md` and `docs/` examples by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/641\r\n* Fix Qwen3 by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/646\r\n* Add integration tests for Gaudi by @baptistecolle in https://github.com/huggingface/text-embeddings-inference/pull/598\r\n* Fix Qwen3-Embedding batch vs single inference inconsistency by @lance-miles in https://github.com/huggingface/text-embeddings-inference/pull/648\r\n* Fix FlashQwen3 by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/650\r\n* Make flake work on metal by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/654\r\n* Fixing metal backend. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/655\r\n* Qwen3 hpu support by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/656\r\n* change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/659\r\n* Update `version` to 1.7.3 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/666\r\n* Add last token pooling support for ORT. 
by @tpendragon in https://github.com/huggingface/text-embeddings-inference/pull/664\r\n\r\n## New Contributors\r\n\r\n* @lance-miles made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/648\r\n* @tpendragon made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/664\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.2...v1.7.3","publishedAt":"2025-06-30T10:54:30.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.7.3","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_hXE5FkBe7ER-2mbJzPZnx","version":"v3.3.4","type":"feature","title":"v3.3.4","summary":"Fix for Neuron models exported with batch_size 1.\r\n\r\n## What's Changed\r\n* [gaudi] gemma3 text and vlm model intial support. need to add sliding window...","titleGenerated":null,"titleShort":null,"content":"Fix for Neuron models exported with batch_size 1.\r\n\r\n## What's Changed\r\n* [gaudi] gemma3 text and vlm model intial support. 
need to add sliding window … by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3270\r\n* Neuron backend fix by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3273\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.3...v3.3.4","publishedAt":"2025-06-19T10:00:28.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.4","media":[],"prerelease":false,"source":{"slug":"text-generation-inference","name":"Text Generation Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_0YUG8Lu0T6s916Xqe6B2R","version":"v3.3.3","type":"feature","title":"v3.3.3","summary":"Neuron backend update.\r\n\r\n## What's Changed\r\n* Remove useless packages by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull...","titleGenerated":null,"titleShort":null,"content":"Neuron backend update.\r\n\r\n## What's Changed\r\n* Remove useless packages by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3253\r\n* Bump neuron SDK version by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3260\r\n* Perf opt by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3256\r\n* [gaudi] Vlm rebase and issue fix in benchmark test by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3263\r\n* Move the _update_cos_sin_cache into get_cos_sin by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3254\r\n* [Gaudi] Remove optimum-habana by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3261\r\n* [gaudi] HuggingFaceM4/idefics2-8b issue fix by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3264\r\n* [Gaudi] Enable Qwen3_moe 
model by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3244\r\n* [Gaudi]Fix the integration-test issues by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3265\r\n* [Gaudi] use pad_token_id to pad input id by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3268\r\n* chore: prepare release 3.3.3 by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3269\r\n* [gaudi] Refine logging for Gaudi warmup by @regisss in https://github.com/huggingface/text-generation-inference/pull/3222\r\n* doc: fix README by @dacorvo in https://github.com/huggingface/text-generation-inference/pull/3271\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.2...v3.3.3","publishedAt":"2025-06-18T13:11:39.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.3","media":[],"prerelease":false,"source":{"slug":"text-generation-inference","name":"Text Generation Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_5d_RnBqUCE_TD7jQgQ8eK","version":"v1.7.2","type":"feature","title":"v1.7.2","summary":"## Notable change\r\n\r\n* Added support for Qwen3 embeddings\r\n\r\n## What's Changed\r\n* Adding suggestions to fixing missing ONNX files. by @Narsil in https...","titleGenerated":null,"titleShort":null,"content":"## Notable change\r\n\r\n* Added support for Qwen3 embeddings\r\n\r\n## What's Changed\r\n* Adding suggestions to fixing missing ONNX files. 
by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/624\r\n* Add `Qwen3Model` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/627\r\n* Add `HiddenAct::Silu` (remove `serde` alias) by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/631\r\n* Add CPU support for Qwen3-Embedding models by @randomm in https://github.com/huggingface/text-embeddings-inference/pull/632\r\n* refactor the code and add wrap_in_hpu_graph to corner case by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/625\r\n* Support Qwen3 w/ fp32 on GPU by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/634\r\n* Preparing the release. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/639\r\n\r\n## New Contributors\r\n* @randomm made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/632\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.1...v1.7.2","publishedAt":"2025-06-16T06:44:57.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.7.2","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_C0_iE5ArfCPAYk8sIdB-Z","version":"v1.7.1","type":"feature","title":"v1.7.1","summary":"## What's Changed\r\n* [Docs] Update quick tour by @NielsRogge in https://github.com/huggingface/text-embeddings-inference/pull/574\r\n* Update `README.md...","titleGenerated":null,"titleShort":null,"content":"## What's Changed\r\n* [Docs] Update quick tour by @NielsRogge in https://github.com/huggingface/text-embeddings-inference/pull/574\r\n* Update `README.md` and `supported_models.md` by 
@alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/572\r\n* Back with linting. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/577\r\n* [Docs] Add cloud run example by @NielsRogge in https://github.com/huggingface/text-embeddings-inference/pull/573\r\n* Fixup by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/578\r\n* Fixing the tokenization routes token (offsets are in bytes, not in by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/576\r\n* Removing requirements file. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/585\r\n* Removing candle-extensions to live on crates.io by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/583\r\n* Bump `sccache` to 0.10.0 and `sccache-action` to 0.0.9 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/586\r\n* optimize the performance of FlashBert Path for HPU by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/575\r\n* Revert \"Removing requirements file. (#585)\" by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/588\r\n* Get opentelemetry trace id from request headers by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/425\r\n* Add argument for configuring Prometheus port by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/589\r\n* Adding missing `head.` prefix in the weight name in `ModernBertClassificationHead` by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/591\r\n* Fixing the CI (grpc path). 
by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/593\r\n* fix xpu env issue that cannot find right libur_loader.so.0 by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/595\r\n* enable flash mistral model for HPU device by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/594\r\n* remove optimum-habana dependency by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/599\r\n* Support NomicBert MoE by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/596\r\n* Remove duplicate short option '-p' to fix router executable by @cebtenzzre in https://github.com/huggingface/text-embeddings-inference/pull/602\r\n* Update `text-embeddings-router --help` output by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/603\r\n* Warmup padded models too. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/592\r\n* Add support for JinaAI Re-Rankers V1 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/582\r\n* Gte diffs by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/604\r\n* Fix the weight name in GTEClassificationHead by @kozistr in https://github.com/huggingface/text-embeddings-inference/pull/606\r\n* upgrade pytorch and ipex to 2.7 version by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/607\r\n* upgrade HPU FW to 1.21; upgrade transformers to 4.51.3 by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/608\r\n* Patch DistilBERT variants with different weight keys by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/614\r\n* add offline modeling for model `jinaai/jina-embeddings-v2-base-code` to avoid `auto_map` to other repository by @kaixuanliu in https://github.com/huggingface/text-embeddings-inference/pull/612\r\n* Add mean pooling strategy for Modernbert 
classifier by @kwnath in https://github.com/huggingface/text-embeddings-inference/pull/616\r\n* Using serde for pool validation. by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/620\r\n* Preparing the update to 1.7.1 by @Narsil in https://github.com/huggingface/text-embeddings-inference/pull/623\r\n\r\n## New Contributors\r\n* @NielsRogge made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/574\r\n* @cebtenzzre made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/602\r\n* @kwnath made their first contribution in https://github.com/huggingface/text-embeddings-inference/pull/616\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-embeddings-inference/compare/v1.7.0...v1.7.1","publishedAt":"2025-06-03T13:38:50.000Z","url":"https://github.com/huggingface/text-embeddings-inference/releases/tag/v1.7.1","media":[],"prerelease":false,"source":{"slug":"text-embeddings-inference","name":"Text Embeddings Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_bcZ1pQOZ_vFVGxP7XXMAI","version":"v3.3.2","type":"feature","title":"v3.3.2","summary":"Gaudi improvements.\r\n\r\n## What's Changed\r\n* upgrade to new vllm extension ops(fix issue in exponential bucketing) by @sywangyi in https://github.com/h...","titleGenerated":null,"titleShort":null,"content":"Gaudi improvements.\r\n\r\n## What's Changed\r\n* upgrade to new vllm extension ops(fix issue in exponential bucketing) by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3239\r\n* Nix: switch to hf-nix by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3240\r\n* Add Qwen3 by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3229\r\n* fp8 compressed_tensors w8a8 support 
by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3242\r\n* [Gaudi] Fix the OOM issue of Llama-4-Scout-17B-16E-Instruct by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3245\r\n* Fix the Llama-4-Maverick-17B-128E crash issue by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3246\r\n* Prepare for 3.3.2 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3249\r\n\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.1...v3.3.2","publishedAt":"2025-05-30T14:20:39.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.2","media":[],"prerelease":false,"source":{"slug":"text-generation-inference","name":"Text Generation Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_o-rKLY_QpJ4QF6UZJaPCq","version":"v3.3.1","type":"feature","title":"v3.3.1","summary":"This release updates TGI to Torch 2.7 and CUDA 12.8.\r\n\r\n## What's Changed\r\n* change HPU warmup logic: seq length should be with exponential growth by ...","titleGenerated":null,"titleShort":null,"content":"This release updates TGI to Torch 2.7 and CUDA 12.8.\r\n\r\n## What's Changed\r\n* change HPU warmup logic: seq length should be with exponential growth by @kaixuanliu in https://github.com/huggingface/text-generation-inference/pull/3217\r\n* adjust the `round_up_seq` logic to align with prefill warmup phase on… by @kaixuanliu in https://github.com/huggingface/text-generation-inference/pull/3224\r\n* Update to Torch 2.7.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3221\r\n* Enable Llama4 for gaudi backend by @yuanwu2017 in https://github.com/huggingface/text-generation-inference/pull/3223\r\n* fix: count gpu uuids if 
NVIDIA_VISIBLE_DEVICES env set to all by @drbh in https://github.com/huggingface/text-generation-inference/pull/3230\r\n* Deepseek r1 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3211\r\n* Refine warmup and upgrade to synapse AI 1.21.0 by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3234\r\n* fix the crash in default ATTENTION path by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3235\r\n* Switch to punica-sgmv kernel from the Hub by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3236\r\n* move input_ids to hpu and remove disposal of adapter_meta by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3237\r\n* Prepare for 3.3.1 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3238\r\n\r\n## New Contributors\r\n* @kaixuanliu made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3217\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.3.0...v3.3.1","publishedAt":"2025-05-22T07:49:07.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.1","media":[],"prerelease":false,"source":{"slug":"text-generation-inference","name":"Text Generation Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null},{"id":"rel_iew88pCm-clkg__gCCAnV","version":"v3.3.0","type":"feature","title":"v3.3.0","summary":"## Notable changes\r\n\r\n* Prefill chunking for VLMs.\r\n\r\n## What's Changed\r\n* Fixing Qwen 2.5 VL (32B). by @Narsil in https://github.com/huggingface/text...","titleGenerated":null,"titleShort":null,"content":"## Notable changes\r\n\r\n* Prefill chunking for VLMs.\r\n\r\n## What's Changed\r\n* Fixing Qwen 2.5 VL (32B). 
by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3157\r\n* Fixing tokenization like https://github.com/huggingface/text-embeddin… by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3156\r\n* Gaudi: clean cuda/rocm code in hpu backend, enable flat_hpu by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3113\r\n* L4 fixes by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3161\r\n* setuptools <= 70.0 is vulnerable: CVE-2024-6345 by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3171\r\n* transformers flash llm/vlm enabling in ipex by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3152\r\n* Upgrading the dependencies in Gaudi backend. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3170\r\n* Hotfixing gaudi deps. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3174\r\n* Hotfix gaudi2 with newer transformers. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3176\r\n* Support flashinfer for Gemma3 prefill by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3167\r\n* Get opentelemetry trace id from request headers instead of creating a new trace by @kozistr in https://github.com/huggingface/text-generation-inference/pull/2648\r\n* Bump `sccache` to 0.10.0 by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3179\r\n* Fixing CI by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3184\r\n* Add option to configure prometheus port by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3187\r\n* Warmup gaudi backend by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3172\r\n* Put more wiggle room. 
by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3189\r\n* Fixing the router + template for Qwen3. by @Narsil in https://github.com/huggingface/text-generation-inference/pull/3200\r\n* Skip `{% generation %}` and `{% endgeneration %}` template handling by @alvarobartt in https://github.com/huggingface/text-generation-inference/pull/3204\r\n* doc typo by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3206\r\n* Pr 2982 ci branch by @drbh in https://github.com/huggingface/text-generation-inference/pull/3046\r\n* fix: bump snaps for mllama by @drbh in https://github.com/huggingface/text-generation-inference/pull/3202\r\n* Update client SDK snippets by @julien-c in https://github.com/huggingface/text-generation-inference/pull/3207\r\n* Fix `HF_HUB_OFFLINE=1` for Gaudi backend by @regisss in https://github.com/huggingface/text-generation-inference/pull/3193\r\n* IPEX support FP8 kvcache/softcap/slidingwindow by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3144\r\n* forward and tokenize chooser use the same shape by @sywangyi in https://github.com/huggingface/text-generation-inference/pull/3196\r\n* Chunked Prefill VLM by @mht-sharma in https://github.com/huggingface/text-generation-inference/pull/3188\r\n* Prepare for 3.3.0 by @danieldk in https://github.com/huggingface/text-generation-inference/pull/3220\r\n\r\n## New Contributors\r\n* @kozistr made their first contribution in https://github.com/huggingface/text-generation-inference/pull/2648\r\n* @julien-c made their first contribution in https://github.com/huggingface/text-generation-inference/pull/3206\r\n\r\n**Full Changelog**: https://github.com/huggingface/text-generation-inference/compare/v3.2.3...v3.3.0","publishedAt":"2025-05-09T13:57:39.000Z","url":"https://github.com/huggingface/text-generation-inference/releases/tag/v3.3.0","media":[],"prerelease":false,"source":{"slug":"text-generation-inference","name":"Text Generation 
Inference","type":"github"},"product":{"slug":"inference","name":"Inference"},"groupSlug":"inference","groupName":"Inference","coverageCount":0,"contentChars":null,"contentTokens":null,"composition":null}],"pagination":{"nextCursor":"2025-05-09T13:57:39.000Z|2026-04-07T17:28:38.713Z|rel_iew88pCm-clkg__gCCAnV","limit":20}}