You can now compute CLIP text and vision embeddings separately, allowing for faster inference when you only need to query one of the modalities. We've also released a demo application for semantic image search to showcase this functionality.
Example: Compute text embeddings with CLIPTextModelWithProjection.
import { AutoTokenizer, CLIPTextModelWithProjection } from '@xenova/transformers';
// Load tokenizer and text model
const tokenizer = await AutoTokenizer.from_pretrained('Xenova/clip-vit-base-patch16');
const text_model = await CLIPTextModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');
// Run tokenization
let texts = ['a photo of a car', 'a photo of a football match'];
let text_inputs = tokenizer(texts, { padding: true, truncation: true });
// Compute embeddings
const { text_embeds } = await text_model(text_inputs);
// Tensor {
// dims: [ 2, 512 ],
// type: 'float32',
// data: Float32Array(1024) [ ... ],
// size: 1024
// }
Example: Compute vision embeddings with CLIPVisionModelWithProjection.
import { AutoProcessor, CLIPVisionModelWithProjection, RawImage} from '@xenova/transformers';
// Load processor and vision model
const processor = await AutoProcessor.from_pretrained('Xenova/clip-vit-base-patch16');
const vision_model = await CLIPVisionModelWithProjection.from_pretrained('Xenova/clip-vit-base-patch16');
// Read image and run processor
let image = await RawImage.read('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/football-match.jpg');
let image_inputs = await processor(image);
// Compute embeddings
const { image_embeds } = await vision_model(image_inputs);
// Tensor {
// dims: [ 1, 512 ],
// type: 'float32',
// data: Float32Array(512) [ ... ],
// size: 512
// }
We've updated the source code for our example browser extension, making the following improvements:
RawImage.save()Fetched April 7, 2026