Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model designed for on-device use cases. Built for speed and efficiency, the model features a compact size of 308M parameters and a 2K-token context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and, at the time of writing, is the highest-ranking text-only multilingual embedding model under 500M parameters on the Massive Text Embedding Benchmark (MTEB).
# Serve the Safetensors weights on CPU
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32

# Serve the ONNX weights on CPU, with mean pooling
docker run -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-1.8.1 \
    --model-id onnx-community/embeddinggemma-300m-ONNX --dtype float32 --pooling mean

# Serve on GPU with the CUDA image
docker run --gpus all --shm-size 1g -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cuda-1.8.1 \
    --model-id google/embeddinggemma-300m --dtype float32
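Once one of the containers above is running, the server can be queried over HTTP. A minimal sketch, assuming the container is reachable on localhost port 8080 as mapped above; the route and payload follow TEI's standard `/embed` API, and the input sentence is just an illustration:

```shell
# POST a sentence to the TEI /embed route; the response is a JSON
# array containing one embedding vector per input string.
curl 127.0.0.1:8080/embed \
    -X POST \
    -H 'Content-Type: application/json' \
    -d '{"inputs": "What is the capital of France?"}'
```

The `inputs` field also accepts a list of strings, in which case TEI returns one embedding per entry in a single batched request.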
OrtRuntime
- `position_ids` and `past_key_values` as inputs
- `padding_side` and `pad_token_id`

What's Changed
- `extra_args` to trufflehog to exclude unverified results by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/696
- `USE_FLASH_ATTENTION` by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/692
- `position_ids` and `past_key_values` in OrtBackend by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/700
- `modules.json` to identify default Dense modules by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/701
- `padding_side` and `pad_token_id` in OrtBackend by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/705
- `docs/openapi.json` for v1.8.0 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/708
- version to 1.8.1 by @alvarobartt in https://github.com/huggingface/text-embeddings-inference/pull/712

Full Changelog: https://github.com/huggingface/text-embeddings-inference/compare/v1.8.0...v1.8.1