Hugging Face/Transformers.js

Transformers.js

$npx -y @buildinternet/releases show transformers-js

Sun

Mon

Tue

Wed

Thu

Fri

Sat

AprMayJunJulAugSepOctNovDecJanFebMarApr

Less

Releases0Avg0/wk

Feb 6, 2025

What's new?

Bump onnxruntime-web and @huggingface/jinja in https://github.com/huggingface/transformers.js/pull/1183.

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.3.2...3.3.3

Jan 22, 2025

What's new?

Add support for Helium and Glm in https://github.com/huggingface/transformers.js/pull/1156
Improve build process and fix usage with certain bundlers in https://github.com/huggingface/transformers.js/pull/1158
Auto-detect wordpiece tokenizer when model.type is missing in https://github.com/huggingface/transformers.js/pull/1151
Update Moonshine config values for transformers v4.48.0 in https://github.com/huggingface/transformers.js/pull/1155
Support simultaneous tensor op execution in WASM in https://github.com/huggingface/transformers.js/pull/1162
Update react tutorial sample code in https://github.com/huggingface/transformers.js/pull/1152

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.3.1...3.3.2

Jan 15, 2025

What's new?

hotfix: Copy missing ort-wasm-simd-threaded.jsep.mjs to dist folder (https://github.com/huggingface/transformers.js/pull/1150)

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.3.0...3.3.1

🔥 Transformers.js v3.3 — StyleTTS 2 (Kokoro) for state-of-the-art text-to-speech, Grounding DINO for zero-shot object detection

🤖 New models: StyleTTS 2, Grounding Dino
- StyleTTS 2: High-quality speech synthesis
- Grounding DINO: Zero-shot object detection
🛠️ Other improvements
🤗 New contributors

<h2 id="new-models">🤖 New models: StyleTTS 2, Grounding DINO</h2> <h3 id="style_text_to_speech_2">StyleTTS 2 for high-quality speech synthesis</h3>

See https://github.com/huggingface/transformers.js/pull/1148 for more information and here for the list of supported models.

First, install the kokoro-js library, which uses Transformers.js, from NPM using:

npm i kokoro-js

You can then generate speech as follows:

import { KokoroTTS } from "kokoro-js";

const model_id = "onnx-community/Kokoro-82M-ONNX";
const tts = await KokoroTTS.from_pretrained(model_id, {
  dtype: "q8", // Options: "fp32", "fp16", "q8", "q4", "q4f16"
});

const text = "Life is like a box of chocolates. You never know what you're gonna get.";
const audio = await tts.generate(text, {
  // Use `tts.list_voices()` to list all available voices
  voice: "af_bella",
});
audio.save("audio.wav");

<h3 id="grounding-dino">Grounding DINO for zero-shot object detection</h3>

See https://github.com/huggingface/transformers.js/pull/1137 for more information and here for the list of supported models.

Example: Zero-shot object detection with onnx-community/grounding-dino-tiny-ONNX using the pipeline API.

import { pipeline } from "@huggingface/transformers";

const detector = await pipeline("zero-shot-object-detection", "onnx-community/grounding-dino-tiny-ONNX");

const url = "http://images.cocodataset.org/val2017/000000039769.jpg";
const candidate_labels = ["a cat."];
const output = await detector(url, candidate_labels, {
  threshold: 0.3,
});

<details> <summary>See example output</summary>

[
  { score: 0.45316222310066223, label: "a cat", box: { xmin: 343, ymin: 23, xmax: 637, ymax: 372 } },
  { score: 0.36190420389175415, label: "a cat", box: { xmin: 12, ymin: 52, xmax: 317, ymax: 472 } },
]

</details> <h2 id="other-improvements">🛠️ Other improvements</h2>

Add the RawAudio class by @Th3G33k in https://github.com/huggingface/transformers.js/pull/682
Update React guide for v3 by @sroussey in https://github.com/huggingface/transformers.js/pull/1128
Add option to skip special tokens in TextStreamer by @sroussey in https://github.com/huggingface/transformers.js/pull/1139

<h2 id="new-contributors">🤗 New contributors</h2>

@sroussey made their first contribution in https://github.com/huggingface/transformers.js/pull/1128

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.2.4...3.3.0

Dec 28, 2024

What's new?

Add support for visualizing self-attention heatmaps in https://github.com/huggingface/transformers.js/pull/1117

<table> <tr> <td rowspan="2"> <img src="https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg" alt="Cats" width="200"> </td> <td> <img src="https://github.com/user-attachments/assets/928c3d97-2c67-4ddb-9e9c-2a06745a532f" alt="Attention Head 0" width="200"> </td> <td> <img src="https://github.com/user-attachments/assets/e7725424-10fd-4a47-8350-8f367d21657d" alt="Attention Head 1" width="200"> </td> <td> <img src="https://github.com/user-attachments/assets/81790060-f4bf-4e5c-8d35-a9246acb9a36" alt="Attention Head 2" width="200"> </td> </tr> <tr> <td> <img src="https://github.com/user-attachments/assets/ebe44550-8a40-4e17-84eb-75fe6fce5df5" alt="Attention Head 3" width="200"> </td> <td> <img src="https://github.com/user-attachments/assets/32439d8d-7798-40e2-a4aa-d0e109afe1b5" alt="Attention Head 4" width="200"> </td> <td> <img src="https://github.com/user-attachments/assets/2faff471-fba1-4456-8332-e66a4a05bc5d" alt="Attention Head 5" width="200"> </td> </tr> </table> <details> <summary>Example code</summary>

import { AutoProcessor, AutoModelForImageClassification, interpolate_4d, RawImage } from "@huggingface/transformers";

// Load model and processor
const model_id = "onnx-community/dinov2-with-registers-small-with-attentions";
const model = await AutoModelForImageClassification.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);

// Load image from URL
const image = await RawImage.read("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg");

// Pre-process image
const inputs = await processor(image);

// Perform inference
const { logits, attentions } = await model(inputs);

// Get the predicted class
const cls = logits[0].argmax().item();
const label = model.config.id2label[cls];
console.log(`Predicted class: ${label}`);

// Set config values
const patch_size = model.config.patch_size;
const [width, height] = inputs.pixel_values.dims.slice(-2);
const w_featmap = Math.floor(width / patch_size);
const h_featmap = Math.floor(height / patch_size);
const num_heads = model.config.num_attention_heads;
const num_cls_tokens = 1;
const num_register_tokens = model.config.num_register_tokens ?? 0;

// Visualize attention maps
const selected_attentions = attentions
    .at(-1) // we are only interested in the attention maps of the last layer
    .slice(0, null, 0, [num_cls_tokens + num_register_tokens, null])
    .view(num_heads, 1, w_featmap, h_featmap);

const upscaled = await interpolate_4d(selected_attentions, {
    size: [width, height],
    mode: "nearest",
});

for (let i = 0; i < num_heads; ++i) {
    const head_attentions = upscaled[i];
    const minval = head_attentions.min().item();
    const maxval = head_attentions.max().item();
    const image = RawImage.fromTensor(
        head_attentions
            .sub_(minval)
            .div_(maxval - minval)
            .mul_(255)
            .to("uint8"),
    );
    await image.save(`attn-head-${i}.png`);
}

</details>

Add min, max, argmin, argmax tensor ops for dim=null
Add support for nearest-neighbour interpolation in interpolate_4d
Depth Estimation pipeline improvements (faster & returns resized depth map)
TypeScript improvements by @ocavue and @shrirajh in https://github.com/huggingface/transformers.js/pull/1081 and https://github.com/huggingface/transformers.js/pull/1122
Remove unused imports from tokenizers.js by @pratapvardhan in https://github.com/huggingface/transformers.js/pull/1116

New Contributors

@shrirajh made their first contribution in https://github.com/huggingface/transformers.js/pull/1122
@pratapvardhan made their first contribution in https://github.com/huggingface/transformers.js/pull/1116

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.2.3...3.2.4

Dec 25, 2024

What's new?

Fix setting of model_file_name for image feature extraction pipeline in https://github.com/huggingface/transformers.js/pull/1114. Thanks @xitanggg for reporting the issue!

Add support for dinov2 with registers in https://github.com/huggingface/transformers.js/pull/1110. Example usage:

import { pipeline } from '@huggingface/transformers';

// Create image classification pipeline
const classifier = await pipeline('image-classification', 'onnx-community/dinov2-with-registers-small-imagenet1k-1-layer');

// Classify an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
const output = await classifier(url);
console.log(output);
// [
//   { label: 'tabby, tabby cat', score: 0.8135351538658142 },
//   { label: 'tiger cat', score: 0.08967583626508713 },
//   { label: 'Egyptian cat', score: 0.06800546497106552 },
//   { label: 'radiator', score: 0.003501888597384095 },
//   { label: 'quilt, comforter, comfort, puff', score: 0.003408448537811637 },
// ]

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.2.2...3.2.3

Dec 23, 2024

What's new?

Fix env.backends.onnx.wasm.proxy = true: Clone tensor if using onnx wasm proxy in https://github.com/huggingface/transformers.js/pull/1108

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.2.1...3.2.2

Dec 19, 2024

What's new?

Add support for ModernBert in https://github.com/huggingface/transformers.js/pull/1104. Check out the blog post for more information!

Example:

import { pipeline } from '@huggingface/transformers';

const pipe = await pipeline('fill-mask', 'answerdotai/ModernBERT-base');
const answer = await pipe('The capital of France is [MASK].');
console.log(answer);

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.2.0...3.2.1

Dec 15, 2024

🔥 Transformers.js v3.2 — Moonshine for real-time speech recognition, Phi-3.5 Vision for multi-frame image understanding and reasoning, and more!

Table of contents:

🤖 New models: Moonshine, Phi-3.5 Vision, EXAONE
- Moonshine: Real-time speech recognition
- Phi-3.5 Vision: Multi-frame image understanding and reasoning
- EXAONE: Bilingual (English and Korean) text generation
🐛 Bug fixes
🛠️ Other improvements

<h2 id="new-models">🤖 New models: Moonshine, Phi-3.5 Vision, EXAONE</h2> <h3 id="moonshine">Moonshine for real-time speech recognition</h3>

Moonshine is a family of speech-to-text models optimized for fast and accurate automatic speech recognition (ASR) on resource-constrained devices. They are well-suited to real-time, on-device applications like live transcription and voice command recognition, and are perfect for in-browser usage (check out the online demo). See https://github.com/huggingface/transformers.js/pull/1099 for more information and here for the list of supported models.

Example: Automatic speech recognition w/ Moonshine tiny.

import { pipeline } from "@huggingface/transformers";

const transcriber = await pipeline("automatic-speech-recognition", "onnx-community/moonshine-tiny-ONNX");
const output = await transcriber("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav");
console.log(output);
// { text: 'And so my fellow Americans ask not what your country can do for you as what you can do for your country.' }

<details> <summary>See example using the MoonshineForConditionalGeneration API</summary>

import { MoonshineForConditionalGeneration, AutoProcessor, read_audio } from "@huggingface/transformers";

// Load model and processor
const model_id = "onnx-community/moonshine-tiny-ONNX";
const model = await MoonshineForConditionalGeneration.from_pretrained(model_id, {
    dtype: "q4",
});
const processor = await AutoProcessor.from_pretrained(model_id);

// Load audio and prepare inputs
const audio = await read_audio("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav", 16000);
const inputs = await processor(audio);

// Generate outputs
const outputs = await model.generate({ ...inputs, max_new_tokens: 100 });

// Decode outputs
const decoded = processor.batch_decode(outputs, { skip_special_tokens: true });
console.log(decoded[0]);
// And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.

</details> <h3 id="phi3_v">Phi-3.5 Vision for multi-frame image understanding and reasoning</h3>

Phi-3.5 Vision is a lightweight, state-of-the-art, open multimodal model that can be used for multi-frame image understanding and reasoning. See https://github.com/huggingface/transformers.js/pull/1094 for more information and here for the list of supported models.

Examples:

<table> <tr> <th style="width: 40%;">Input</th> <th style="width: 60%;">Output</th> </tr> <tr> <td> "What's funny about this image?" <img src="https://github.com/user-attachments/assets/631df9f7-6076-47f6-ba8a-8fd34eb34439" width="400" /> </td> <td>The humor in this image stems from the exaggerated depiction of human evolution, using the Shiba Inu dog breed to represent both ancient and modern humans. The left side shows a muscular, hunter-like figure labeled as 'Humans 100,000 years ago' with the caption 'me hungry me hunt mammoth,' suggesting a time when humans were physically robust and actively hunting. The right side contrasts this with a modern, slim Shiba Inu labeled as 'Humans today' with the caption 'why food delivery slow,' humorously commenting on the modern human's reliance on convenience and technology, such as food delivery services, rather than hunting for sustenance. The use of a dog, which is often associated with loyalty and companionship, adds a layer of irony and humor as it portrays humans in a more diminished, dependent state.</td> </tr> <tr> <td> "Summarize the deck of slides." <img src="https://github.com/user-attachments/assets/78f45879-fa75-4636-9a69-57fc33747a6e" width="400" /> <img src="https://github.com/user-attachments/assets/3616cc53-ac1c-4f7e-8fba-55542354fd9a" width="400" /> <img src="https://github.com/user-attachments/assets/a6897365-c502-4815-9160-1092bda679cf" width="400" /> </td> <td> <p>To summarize, the slides are composed of these sections:</p> <ul> <li> <strong>Introduction to Azure:</strong> <p>The presentation introduces Microsoft Azure, a cloud computing platform. It highlights Azure's three service tiers: Hyper-scale, Enterprise, and Hybrid. The presenter is Dinesh Kumar Wickramasinghe, a Senior Software Engineer from CMS Private Limited in Sri Lanka.</p> </li> <li> <strong>Azure Overview:</strong> <p>Azure is described as Microsoft's cloud computing platform, continuously expanding to meet current and future business challenges. It offers freedom to build, manage, and deploy applications on a global network using preferred tools and frameworks.</p> </li> <li> <strong>Cloud Computing Services:</strong> <p>The presentation outlines three types of cloud computing services provided by Azure: Infrastructure-as-a-Service (IaaS) with a 'host' component, Platform-as-a-Service (PaaS) with a 'build' component, and Software-as-a-Service (SaaS) with a 'consume' component.</p> </li> </ul> </td> </tr> </table> <details> <summary>See example code</summary>

Example: Single-frame (critique an image)

import {
  AutoProcessor,
  AutoModelForCausalLM,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
  legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: {
    vision_encoder: "q4", // 'q4' or 'q4f16'
    prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
    model: "q4f16", // 'q4f16'
  },
});

// Load image
const image = await load_image("https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/meme.png");

// Prepare inputs
const messages = [
  { role: "user", content: "<|image_1|>What's funny about this image?" },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
  tokenize: false,
  add_generation_prompt: true,
});
const inputs = await processor(prompt, image, { num_crops: 4 });

// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});

// Generate response
const output = await model.generate({
  ...inputs,
  streamer,
  max_new_tokens: 256,
});

Or, decode the output at the end:

// Decode and display the answer
const generated_ids = output.slice(null, [inputs.input_ids.dims[1], null]);
const answer = processor.batch_decode(generated_ids, {
  skip_special_tokens: true,
});
console.log(answer[0]);

Example: Multi-frame (summarize slides)

import {
  AutoProcessor,
  AutoModelForCausalLM,
  TextStreamer,
  load_image,
} from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Phi-3.5-vision-instruct";
const processor = await AutoProcessor.from_pretrained(model_id, {
  legacy: true, // Use legacy to match python version
});
const model = await AutoModelForCausalLM.from_pretrained(model_id, {
  dtype: {
    vision_encoder: "q4", // 'q4' or 'q4f16'
    prepare_inputs_embeds: "q4", // 'q4' or 'q4f16'
    model: "q4f16", // 'q4f16'
  },
});

// Load images
const urls = [
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-1-2048.jpg",
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-2-2048.jpg",
  "https://image.slidesharecdn.com/azureintroduction-191206101932/75/Introduction-to-Microsoft-Azure-Cloud-3-2048.jpg",
];
const images = await Promise.all(urls.map(load_image));

// Prepare inputs
const placeholder = images.map((_, i) => `<|image_${i + 1}|>\n`).join("");
const messages = [
  { role: "user", content: placeholder + "Summarize the deck of slides." },
];
const prompt = processor.tokenizer.apply_chat_template(messages, {
  tokenize: false,
  add_generation_prompt: true,
});
const inputs = await processor(prompt, images, { num_crops: 4 });

// (Optional) Set up text streamer
const streamer = new TextStreamer(processor.tokenizer, {
  skip_prompt: true,
  skip_special_tokens: true,
});

// Generate response
const output = await model.generate({
  ...inputs,
  streamer,
  max_new_tokens: 256,
});

</details> <h3 id="exaone">EXAONE 3.5 for bilingual (English and Korean) text generation</h3>

EXAONE 3.5 is a collection of instruction-tuned bilingual (English and Korean) generative models, developed and released by LG AI Research. See https://github.com/huggingface/transformers.js/pull/1084 for more information and here for the list of supported models.

Example: Text-generation w/ EXAONE-3.5-2.4B-Instruct:

import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/EXAONE-3.5-2.4B-Instruct",
  { dtype: "q4f16" },
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Tell me a joke." },
];

// Generate a response
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);

<details> <summary>See example output</summary>

Sure! Here's a light joke for you:

Why don't scientists trust atoms?

Because they make up everything! 

I hope you found that amusing! If you want another one, feel free to ask!

</details> <h2 id="bug-fixes">🐛 Bug fixes</h2>

Fix pyannote processor post_process_speaker_diarization in https://github.com/huggingface/transformers.js/pull/1082. Thanks to @patrick-ve for reporting the issue!

<h2 id="other-improvements">🛠️ Other improvements</h2>

Improve unit testing framework in https://github.com/huggingface/transformers.js/pull/1083 and https://github.com/huggingface/transformers.js/pull/1095, bring coverage up to 91% (from 84%).

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.1.2...3.2.0

Dec 7, 2024

🤖 New models

Add support for PaliGemma (& PaliGemma2) in https://github.com/huggingface/transformers.js/pull/1074

Example: Image captioning with onnx-community/paligemma2-3b-ft-docci-448.

import { AutoProcessor, PaliGemmaForConditionalGeneration, load_image } from '@huggingface/transformers';

// Load processor and model
const model_id = 'onnx-community/paligemma2-3b-ft-docci-448';
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await PaliGemmaForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        embed_tokens: 'fp16', // or 'q8'
        vision_encoder: 'fp16', // or 'q4', 'q8'
        decoder_model_merged: 'q4', // or 'q4f16'
    },
});

// Prepare inputs
const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg'
const raw_image = await load_image(url);
const prompt = '<image>caption en'; // Caption the image in English
const inputs = await processor(raw_image, prompt);

// Generate a response
const output = await model.generate({
    ...inputs,
    max_new_tokens: 100,
})

const generated_ids = output.slice(null, [inputs.input_ids.dims[1], null]);
const answer = processor.batch_decode(
    generated_ids,
    { skip_special_tokens: true },
);
console.log(answer[0]);
// A side view of a light blue 1970s Volkswagen Beetle parked on a gray cement road. It is facing to the right. It has a reflection on the side of it. Behind it is a yellow building with a brown double door on the right. It has a white frame around it. Part of a gray cement wall is visible on the far left.

List of supported models: https://huggingface.co/models?library=transformers.js&other=paligemma

Add support for I-JEPA in https://github.com/huggingface/transformers.js/pull/1073

Example: Image feature extraction with onnx-community/ijepa_vith14_1k.

import { pipeline, cos_sim } from "@huggingface/transformers";

// Create an image feature extraction pipeline
const extractor = await pipeline(
  "image-feature-extraction",
  "onnx-community/ijepa_vith14_1k",
  { dtype: "q8" },
);

// Compute image embeddings
const url_1 = "http://images.cocodataset.org/val2017/000000039769.jpg"
const url_2 = "http://images.cocodataset.org/val2017/000000219578.jpg"
const output = await extractor([url_1, url_2]);
const pooled_output = output.mean(1); // Apply mean pooling

// Compute cosine similarity
const similarity = cos_sim(pooled_output[0].data, pooled_output[1].data);
console.log(similarity); // 0.5168613045518973

List of supported models: https://huggingface.co/models?library=transformers.js&other=ijepa

Add support for OLMo2 in https://github.com/huggingface/transformers.js/pull/1076. List of supported models: https://huggingface.co/models?library=transformers.js&other=olmo2

🐛 Bug fixes

Fix whisper timestamp extraction for tokenizers with added tokens by @aravindMahadevan in https://github.com/huggingface/transformers.js/pull/804
Add missing 'ready' status in the ProgressInfo type by @ocavue in https://github.com/huggingface/transformers.js/pull/1070

🛠️ Other improvements

Add function to apply mask to RawImage by @BritishWerewolf in https://github.com/huggingface/transformers.js/pull/1020
Bump versions + webpack improvements in https://github.com/huggingface/transformers.js/pull/1075

🤗 New contributors

@aravindMahadevan made their first contribution in https://github.com/huggingface/transformers.js/pull/804

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.1.1...3.1.2

Dec 3, 2024

🤖 New models

Add support for Idefics3 (SmolVLM) in https://github.com/huggingface/transformers.js/pull/1059

import {
  AutoProcessor,
  AutoModelForVision2Seq,
  load_image,
} from "@huggingface/transformers";

// Initialize processor and model
const model_id = "HuggingFaceTB/SmolVLM-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModelForVision2Seq.from_pretrained(model_id, {
  dtype: {
    embed_tokens: "fp16", // "fp32", "fp16", "q8"
    vision_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
    decoder_model_merged: "q4", // "q8", "q4", "q4f16"
  }
});

// Load images
const image1 = await load_image("https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg");
const image2 = await load_image("https://huggingface.co/spaces/merve/chameleon-7b/resolve/main/bee.jpg");

// Create input messages
const messages = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "image" },
      { type: "text", text: "Can you describe the two images?" },
    ],
  },
];

// Prepare inputs
const text = processor.apply_chat_template(messages, { add_generation_prompt: true });
const inputs = await processor(text, [image1, image2], {
  // Set `do_image_splitting: true` to split images into multiple patches.
  // NOTE: This uses more memory, but can provide more accurate results.
  do_image_splitting: false,
});

// Generate outputs
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 500,
});
const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// ' In the first image, there is a green statue of liberty on a pedestal in the middle of the water. The water is surrounded by trees and buildings in the background. In the second image, there are pink and red flowers with a bee on the pink flower.'

🐛 Bug fixes

Fix repetition penalty logits processor in https://github.com/huggingface/transformers.js/pull/1062
Fix optional chaining for batch size calculation in PreTrainedModel by @emojiiii in https://github.com/huggingface/transformers.js/pull/1063

📝 Documentation improvements

Add an example and type enhancement for TextStreamer by @seonglae in https://github.com/huggingface/transformers.js/pull/1066
The smallest typo fix for webgpu.md by @JoramMillenaar in https://github.com/huggingface/transformers.js/pull/1068

🛠️ Other improvements

Only log warning if type not explicitly set to "custom" in https://github.com/huggingface/transformers.js/pull/1061
Improve browser vs. webworker detection in https://github.com/huggingface/transformers.js/pull/1067

🤗 New contributors

@emojiiii made their first contribution in https://github.com/huggingface/transformers.js/pull/1063
@seonglae made their first contribution in https://github.com/huggingface/transformers.js/pull/1066
@JoramMillenaar made their first contribution in https://github.com/huggingface/transformers.js/pull/1068

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.1.0...3.1.1

Nov 26, 2024

🚀 Transformers.js v3.1 — any-to-any, text-to-image, image-to-text, pose estimation, time series forecasting, and more!

Table of contents:

🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.
- Janus: Any-to-Any generation
- Qwen2-VL: Image-Text-to-Text
- JinaCLIP: Multimodal embeddings
- LLaVA-OneVision: Image-Text-to-Text
- ViTPose: Pose-estimation
- MGP-STR: Optical Character Recognition (OCR)
- PatchTST and PatchTSMixer: Time series forecasting.
🐛 Bug fixes
📝 Documentation improvements
🛠️ Other improvements
🤗 New contributors

<h2 id="new-models">🤖 New models: Janus, Qwen2-VL, JinaCLIP, LLaVA-OneVision, ViTPose, MGP-STR, PatchTST, PatchTSMixer.</h2> <h3 id="janus">Janus for Any-to-Any generation (e.g., image-to-text and text-to-image)</h3>

First of all, this release adds support for Janus, a novel autoregressive framework that unifies multimodal understanding and generation. The most popular model, deepseek-ai/Janus-1.3B, is tagged as an "any-to-any" model, and has specifically been trained for the following tasks:

Example: Image-Text-to-Text

import { AutoProcessor, MultiModalityCausalLM } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "<image_placeholder>\nConvert the formula into latex code.",
    images: ["https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/quadratic_formula.png"],
  },
];
const inputs = await processor(conversation);

// Generate response
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 150,
  do_sample: false,
});

// Decode output
const new_tokens = outputs.slice(null, [inputs.input_ids.dims.at(-1), null]);
const decoded = processor.batch_decode(new_tokens, { skip_special_tokens: true });
console.log(decoded[0]);

Sample output:

Sure, here is the LaTeX code for the given formula:

```
x = \frac{-b \pm \sqrt{b^2 - 4a c}}{2a}
```

This code represents the mathematical expression for the variable \( x \).

Example: Text-to-Image

import { AutoProcessor, MultiModalityCausalLM } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Janus-1.3B-ONNX";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await MultiModalityCausalLM.from_pretrained(model_id);

// Prepare inputs
const conversation = [
  {
    role: "User",
    content: "A cute and adorable baby fox with big brown eyes, autumn leaves in the background enchanting,immortal,fluffy, shiny mane,Petals,fairyism,unreal engine 5 and Octane Render,highly detailed, photorealistic, cinematic, natural colors.",
  },
];
const inputs = await processor(conversation, { chat_template: "text_to_image" });

// Generate response
const num_image_tokens = processor.num_image_tokens;
const outputs = await model.generate_images({
  ...inputs,
  min_new_tokens: num_image_tokens,
  max_new_tokens: num_image_tokens,
  do_sample: true,
});

// Save the generated image
await outputs[0].save("test.png");

Sample outputs:

What to play around with the model? Check out our online WebGPU demo! 👇

https://github.com/user-attachments/assets/513b3119-ba8c-4a2d-b5fe-6869be47abfa

<h3 id="qwen2vl">Qwen2-VL for Image-Text-to-Text</h3>

Example: Image-Text-to-Text

Next, we added support for Qwen2-VL, the multimodal large language model series developed by Qwen team, Alibaba Cloud. It introduces the Naive Dynamic Resolution mechanism, allowing the model to process images of varying resolutions and leading to more efficient and accurate visual representations.

import { AutoProcessor, Qwen2VLForConditionalGeneration, RawImage } from "@huggingface/transformers";

// Load processor and model
const model_id = "onnx-community/Qwen2-VL-2B-Instruct";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await Qwen2VLForConditionalGeneration.from_pretrained(model_id);

// Prepare inputs
const url = "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg";
const image = await (await RawImage.read(url)).resize(448, 448);
const conversation = [
  {
    role: "user",
    content: [
      { type: "image" },
      { type: "text", text: "Describe this image." },
    ],
  },
];
const text = processor.apply_chat_template(conversation, { add_generation_prompt: true });
const inputs = await processor(text, image);

// Perform inference
const outputs = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});

// Decode output
const decoded = processor.batch_decode(
  outputs.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(decoded[0]);
// The image depicts a serene beach scene with a woman and a dog. The woman is sitting on the sand, wearing a plaid shirt, and appears to be engaged in a playful interaction with the dog. The dog, which is a large breed, is sitting on its hind legs and appears to be reaching out to the woman, possibly to give her a high-five or a paw. The background shows the ocean with gentle waves, and the sky is clear, suggesting it might be either sunrise or sunset. The overall atmosphere is calm and relaxed, capturing a moment of connection between the woman and the dog.

<h3 id="jina_clip">JinaCLIP for multimodal embeddings</h3>

JinaCLIP is a series of general-purpose multilingual multimodal embedding models for text & images, created by Jina AI.

Example: Compute text and/or image embeddings with jinaai/jina-clip-v2:

import { AutoModel, AutoProcessor, RawImage, matmul } from "@huggingface/transformers";

// Load processor and model
const model_id = "jinaai/jina-clip-v2";
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await AutoModel.from_pretrained(model_id, { dtype: "q4" /* e.g., "fp16", "q8", or "q4" */ });

// Prepare inputs
const urls = ["https://i.ibb.co/nQNGqL0/beach1.jpg", "https://i.ibb.co/r5w8hG8/beach2.jpg"];
const images = await Promise.all(urls.map(url => RawImage.read(url)));
const sentences = [
    "غروب جميل على الشاطئ", // Arabic
    "海滩上美丽的日落", // Chinese
    "Un beau coucher de soleil sur la plage", // French
    "Ein wunderschöner Sonnenuntergang am Strand", // German
    "Ένα όμορφο ηλιοβασίλεμα πάνω από την παραλία", // Greek
    "समुद्र तट पर एक खूबसूरत सूर्यास्त", // Hindi
    "Un bellissimo tramonto sulla spiaggia", // Italian
    "浜辺に沈む美しい夕日", // Japanese
    "해변 위로 아름다운 일몰", // Korean
];

// Encode text and images
const inputs = await processor(sentences, images, { padding: true, truncation: true });
const { l2norm_text_embeddings, l2norm_image_embeddings } = await model(inputs);

// Encode query (text-only)
const query_prefix = "Represent the query for retrieving evidence documents: ";
const query_inputs = await processor(query_prefix + "beautiful sunset over the beach");
const { l2norm_text_embeddings: query_embeddings } = await model(query_inputs);

// Compute text-image similarity scores
const text_to_image_scores = await matmul(query_embeddings, l2norm_image_embeddings.transpose(1, 0));
console.log("text-image similarity scores", text_to_image_scores.tolist()[0]); // [0.29530206322669983, 0.3183615803718567]

// Compute image-image similarity scores
const image_to_image_score = await matmul(l2norm_image_embeddings[0], l2norm_image_embeddings[1]);
console.log("image-image similarity score", image_to_image_score.item()); // 0.9344457387924194

// Compute text-text similarity scores
const text_to_text_scores = await matmul(query_embeddings, l2norm_text_embeddings.transpose(1, 0));
console.log("text-text similarity scores", text_to_text_scores.tolist()[0]); // [0.5566609501838684, 0.7028406858444214, 0.582255482673645, 0.6648036241531372, 0.5462006330490112, 0.6791588068008423, 0.6192430257797241, 0.6258729100227356, 0.6453716158866882]

<h3 id="llava_onevision">LLaVA-OneVision for Image-Text-to-Text</h3>

LLaVA-OneVision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of SigLIP vision encoder and a Qwen2 language backbone.

Example: Multi-round conversations w/ PKV caching

import { AutoProcessor, AutoTokenizer, LlavaOnevisionForConditionalGeneration, RawImage } from '@huggingface/transformers';

// Load tokenizer, processor and model
const model_id = 'llava-hf/llava-onevision-qwen2-0.5b-ov-hf';

const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const model = await LlavaOnevisionForConditionalGeneration.from_pretrained(model_id, {
    dtype: {
        embed_tokens: 'fp16', // or 'fp32' or 'q8'
        vision_encoder: 'fp16', // or 'fp32' or 'q8'
        decoder_model_merged: 'q4', // or 'q8'
    },
    // device: 'webgpu',
});

// Prepare text inputs
const prompt = 'What does the text say?';
const messages = [
    { role: 'system', content: 'Answer the question.' },
    { role: 'user', content: `<image>\n${prompt}` }
]
const text = tokenizer.apply_chat_template(messages, { tokenize: false, add_generation_prompt: true });
const text_inputs = tokenizer(text);

// Prepare vision inputs
const url = 'https://huggingface.co/qnguyen3/nanoLLaVA/resolve/main/example_1.png';
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Generate response
const { past_key_values, sequences } = await model.generate({
    ...text_inputs,
    ...vision_inputs,
    do_sample: false,
    max_new_tokens: 64,
    return_dict_in_generate: true,
});

// Decode output
const answer = tokenizer.decode(
    sequences.slice(0, [text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(answer);
// The text says "small but mighty" in a playful font.

const new_messages = [
    ...messages,
    { role: 'assistant', content: answer },
    { role: 'user', content: 'How does the text correlate to the context of the image?' }
]
const new_text = tokenizer.apply_chat_template(new_messages, { tokenize: false, add_generation_prompt: true });
const new_text_inputs = tokenizer(new_text);

// Generate another response
const output = await model.generate({
    ...new_text_inputs,
    past_key_values,
    do_sample: false,
    max_new_tokens: 256,
});
const new_answer = tokenizer.decode(
    output.slice(0, [new_text_inputs.input_ids.dims[1], null]),
    { skip_special_tokens: true },
);
console.log(new_answer);
// The text "small but mighty" is likely a playful or humorous reference to the image of the blue mouse with the orange dumbbell. It could be used as a motivational phrase or a playful way to express the idea that even small things can be impressive or powerful.

<h3 id="vitpose">ViTPose for pose-estimation</h3>

A state-of-the-art pose estimation model which employs a standard, non-hierarchical vision transformer as a backbone for the task of keypoint estimation (combined with a simple decoder head to predict heatmaps from a given image).

Example: Pose estimation w/ onnx-community/vitpose-base-simple.

import { AutoModel, AutoImageProcessor, RawImage } from '@huggingface/transformers';

// Load model and processor
const model_id = 'onnx-community/vitpose-base-simple';
const model = await AutoModel.from_pretrained(model_id);
const processor = await AutoImageProcessor.from_pretrained(model_id);

// Load image and prepare inputs
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/ryan-gosling.jpg';
const image = await RawImage.read(url);
const inputs = await processor(image);

// Predict heatmaps
const { heatmaps } = await model(inputs);

// Post-process heatmaps to get keypoints and scores
const boxes = [[[0, 0, image.width, image.height]]];
const results = processor.post_process_pose_estimation(heatmaps, boxes)[0][0];
console.log(results);

<details> <summary>Optionally, visualize the outputs (Node.js usage shown here, using the node-canvas library):</summary>

import { createCanvas, createImageData } from 'canvas';

// Create canvas and draw image
const canvas = createCanvas(image.width, image.height);
const ctx = canvas.getContext('2d');
const imageData = createImageData(image.rgba().data, image.width, image.height);
ctx.putImageData(imageData, 0, 0);

// Draw edges between keypoints
const points = results.keypoints;
ctx.lineWidth = 4;
ctx.strokeStyle = 'blue';
for (const [i, j] of model.config.edges) {
    const [x1, y1] = points[i];
    const [x2, y2] = points[j];
    ctx.beginPath();
    ctx.moveTo(x1, y1);
    ctx.lineTo(x2, y2);
    ctx.stroke();
}

// Draw circle at each keypoint
ctx.fillStyle = 'red';
for (const [x, y] of points) {
    ctx.beginPath();
    ctx.arc(x, y, 8, 0, 2 * Math.PI);
    ctx.fill();
}

// Save image to file
import fs from 'fs';
const out = fs.createWriteStream('pose.png');
const stream = canvas.createPNGStream();
stream.pipe(out)
out.on('finish', () =>  console.log('The PNG file was created.'));

</details>

Input image	Output image

<h3 id="mgp-str">MGP-STR for Optical Character Recognition (OCR)</h3>

A simple yet powerful vision scene text recognition model, built upon the vision transformer (ViT).

Example: Optical Character Recognition (OCR) w/ onnx-community/mgp-str-base

import { MgpstrForSceneTextRecognition, MgpstrProcessor, RawImage } from '@huggingface/transformers';

const model_id = 'onnx-community/mgp-str-base';
const model = await MgpstrForSceneTextRecognition.from_pretrained(model_id);
const processor = await MgpstrProcessor.from_pretrained(model_id);

// Load image from the IIIT-5k dataset
const url = "https://i.postimg.cc/ZKwLg2Gw/367-14.png";
const image = await RawImage.read(url);

// Preprocess the image
const result = await processor(image);

// Perform inference
const outputs = await model(result);

// Decode the model outputs
const generated_text = processor.batch_decode(outputs.logits).generated_text;
console.log(generated_text); // [ 'ticket' ]

<h3 id="patchtst-and-patchtsmixer">PatchTST and PatchTSMixer for time series forecasting.</h3>

Example: Time series forecasting w/ onnx-community/granite-timeseries-patchtst

Models which can be used for multivariate time series forecasting.

import { PatchTSTForPrediction, Tensor } from "@huggingface/transformers";

const model_id = "onnx-community/granite-timeseries-patchtst";
const model = await PatchTSTForPrediction.from_pretrained(model_id, { dtype: "fp32" });

const dims = [64, 512, 7];
const prod = dims.reduce((a, b) => a * b, 1);
const past_values = new Tensor('float32',
    Float32Array.from({ length: prod }, (_, i) => i / prod),
    dims,
);
const { prediction_outputs } = await model({ past_values });
console.log(prediction_outputs);

Example: Time series forecasting w/ onnx-community/granite-timeseries-patchtsmixer

import { PatchTSMixerForPrediction, Tensor } from "@huggingface/transformers";

const model_id = "onnx-community/granite-timeseries-patchtsmixer";
const model = await PatchTSMixerForPrediction.from_pretrained(model_id, { dtype: "fp32" });

const dims = [64, 512, 7];
const prod = dims.reduce((a, b) => a * b, 1);
const past_values = new Tensor('float32',
    Float32Array.from({ length: prod }, (_, i) => i / prod),
    dims,
);
const { prediction_outputs } = await model({ past_values });
console.log(prediction_outputs);

<h2 id="bug-fixes">🐛 Bug fixes</h2>

When padding an image, the dimensions get stretched by @BritishWerewolf in https://github.com/huggingface/transformers.js/pull/1015
fix(scale): add missing scale element by @tosinamuda in https://github.com/huggingface/transformers.js/pull/1017

<h2 id="documentation-improvements">📝 Documentation improvements</h2>

Updated link to sentence similarity models. by @uzyn in https://github.com/huggingface/transformers.js/pull/893
fix(docs): fixed a broken link to quantization guide by @ThomasWT in https://github.com/huggingface/transformers.js/pull/1014
fix(docs): Fixed Typos in README and docs/snippets/6_supported-models.snippet by @hitchhiker3010 in https://github.com/huggingface/transformers.js/pull/1030

<h2 id="other-improvements">🛠️ Other improvements</h2>

Add option to maintain aspect ratio on resize by @BritishWerewolf in https://github.com/huggingface/transformers.js/pull/971
Add functionality to split RawImage into channels; Update slice documentation and tests by @BritishWerewolf in https://github.com/huggingface/transformers.js/pull/978
Avoid resizing images when they already have the desired size by @nemphys in https://github.com/huggingface/transformers.js/pull/1027
Add support for Split pretokenizer w/ behavior=removed & invert=false by @xenova in https://github.com/huggingface/transformers.js/pull/1033
Add type declaration for progress_callback by @ocavue in https://github.com/huggingface/transformers.js/pull/1034
Add support for op_block_list by @pdufour in https://github.com/huggingface/transformers.js/pull/1036

<h2 id="new-contributors">🤗 New contributors</h2>

@uzyn made their first contribution in https://github.com/huggingface/transformers.js/pull/893
@ThomasWT made their first contribution in https://github.com/huggingface/transformers.js/pull/1014
@tosinamuda made their first contribution in https://github.com/huggingface/transformers.js/pull/1017
@nemphys made their first contribution in https://github.com/huggingface/transformers.js/pull/1027
@hitchhiker3010 made their first contribution in https://github.com/huggingface/transformers.js/pull/1030
@pdufour made their first contribution in https://github.com/huggingface/transformers.js/pull/1036

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.0.2...3.1.0

Nov 4, 2024

What's new?

Add support for MobileLLM in https://github.com/huggingface/transformers.js/pull/1003

Example: Text generation with onnx-community/MobileLLM-125M.

import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/MobileLLM-125M",
  { dtype: "fp32" },
);

// Define the list of messages
const text = "Q: What is the capital of France?\nA: Paris\nQ: What is the capital of England?\nA:";

// Generate a response
const output = await generator(text, { max_new_tokens: 30 });
console.log(output[0].generated_text);

<details> <summary>Example output</summary>

Q: What is the capital of France?
A: Paris
Q: What is the capital of England?
A: London
Q: What is the capital of Scotland?
A: Edinburgh
Q: What is the capital of Wales?
A: Cardiff

</details>

Add support for OLMo in https://github.com/huggingface/transformers.js/pull/1011

Example: Text generation with onnx-community/AMD-OLMo-1B-SFT-DPO".

import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/AMD-OLMo-1B-SFT-DPO",
  { dtype: "q4" },
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Tell me a joke." },
];

// Generate a response
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);

<details> <summary>Example output</summary>

Why don't scientists trust atoms?

Because they make up everything!

</details>

Fix CommonJS bundling in https://github.com/huggingface/transformers.js/pull/1012. Thanks @jens-ghc for reporting!
Doc fixes by @roschler in https://github.com/huggingface/transformers.js/pull/1002
Remove duplicate gemma value from NO_PER_CHANNEL_REDUCE_RANGE_MODEL by @bekzod in https://github.com/huggingface/transformers.js/pull/1005

🤗 New contributors

@roschler made their first contribution in https://github.com/huggingface/transformers.js/pull/1002
@bekzod made their first contribution in https://github.com/huggingface/transformers.js/pull/1005

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.0.1...3.0.2

Oct 25, 2024

What's new?

Fix Document QA pipeline in https://github.com/huggingface/transformers.js/pull/987. Thanks @martinsomm for reporting!
Next.js 15 (code; demo) and SvelteKit 5 (code; demo) server-side templates
Minor documentation fixes

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.0.0...3.0.1

Oct 22, 2024

Transformers.js v3: WebGPU Support, New Models & Tasks, New Quantizations, Deno & Bun Compatibility, and More…

After more than a year of development, we're excited to announce the release of 🤗 Transformers.js v3!

You can get started by installing Transformers.js v3 from NPM using:

npm i @huggingface/transformers

Then, importing the library with

import { pipeline } from "@huggingface/transformers";

or, via a CDN

import { pipeline } from "https://cdn.jsdelivr.net/npm/@huggingface/transformers@3.0.0";

For more information, check out the documentation.

⚡ WebGPU support (up to 100x faster than WASM!)

WebGPU is a new web standard for accelerated graphics and compute. The API enables web developers to use the underlying system's GPU to carry out high-performance computations directly in the browser. WebGPU is the successor to WebGL and provides significantly better performance, because it allows for more direct interaction with modern GPUs. Lastly, it supports general-purpose GPU computations, which makes it just perfect for machine learning!

[!WARNING]
As of October 2024, global WebGPU support is around 70% (according to caniuse.com), meaning some users may not be able to use the API.

If the following demos do not work in your browser, you may need to enable it using a feature flag:

Firefox: with the dom.webgpu.enabled flag (see here).

Safari: with the WebGPU feature flag (see here).

Older Chromium browsers (on Windows, macOS, Linux): with the enable-unsafe-webgpu flag (see here).

Usage in Transformers.js v3

Thanks to our collaboration with ONNX Runtime Web, enabling WebGPU acceleration is as simple as setting device: 'webgpu' when loading a model. Let's see some examples!

Example: Compute text embeddings on WebGPU (demo)

import { pipeline } from "@huggingface/transformers";

// Create a feature-extraction pipeline
const extractor = await pipeline(
  "feature-extraction",
  "mixedbread-ai/mxbai-embed-xsmall-v1",
  { device: "webgpu" },
});

// Compute embeddings
const texts = ["Hello world!", "This is an example sentence."];
const embeddings = await extractor(texts, { pooling: "mean", normalize: true });
console.log(embeddings.tolist());
// [
//   [-0.016986183822155, 0.03228696808218956, -0.0013630966423079371, ... ],
//   [0.09050482511520386, 0.07207386940717697, 0.05762749910354614, ... ],
// ]

Example: Perform automatic speech recognition with OpenAI whisper on WebGPU (demo)

import { pipeline } from "@huggingface/transformers";

// Create automatic speech recognition pipeline
const transcriber = await pipeline(
  "automatic-speech-recognition",
  "onnx-community/whisper-tiny.en",
  { device: "webgpu" },
);

// Transcribe audio from a URL
const url = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav";
const output = await transcriber(url);
console.log(output);
// { text: ' And so my fellow Americans ask not what your country can do for you, ask what you can do for your country.' }

Example: Perform image classification with MobileNetV4 on WebGPU (demo)

import { pipeline } from "@huggingface/transformers";

// Create image classification pipeline
const classifier = await pipeline(
  "image-classification",
  "onnx-community/mobilenetv4_conv_small.e2400_r224_in1k",
  { device: "webgpu" },
);

// Classify an image from a URL
const url = "https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg";
const output = await classifier(url);
console.log(output);
// [
//   { label: 'tiger, Panthera tigris', score: 0.6149784922599792 },
//   { label: 'tiger cat', score: 0.30281734466552734 },
//   { label: 'tabby, tabby cat', score: 0.0019135422771796584 },
//   { label: 'lynx, catamount', score: 0.0012161266058683395 },
//   { label: 'Egyptian cat', score: 0.0011465961579233408 }
// ]

🔢 New quantization formats (dtypes)

Before Transformers.js v3, we used the quantized option to specify whether to use a quantized (q8) or full-precision (fp32) variant of the model by setting quantized to true or false, respectively. Now, we've added the ability to select from a much larger list with the dtype parameter.

The list of available quantizations depends on the model, but some common ones are: full-precision ("fp32"), half-precision ("fp16"), 8-bit ("q8", "int8", "uint8"), and 4-bit ("q4", "bnb4", "q4f16").

<p align="center"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/transformersjs-v3/dtypes-dark.jpg" style="max-width: 100%;"> <source media="(prefers-color-scheme: light)" srcset="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/transformersjs-v3/dtypes-light.jpg" style="max-width: 100%;"> <img alt="Available dtypes for mixedbread-ai/mxbai-embed-xsmall-v1" src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/transformersjs-v3/dtypes-dark.jpg" style="max-width: 100%;"> </picture> <a href="https://huggingface.co/mixedbread-ai/mxbai-embed-xsmall-v1/tree/main/onnx">(e.g., mixedbread-ai/mxbai-embed-xsmall-v1)</a> </p>

Basic usage

Example: Run Qwen2.5-0.5B-Instruct in 4-bit quantization (demo)

import { pipeline } from "@huggingface/transformers";

// Create a text generation pipeline
const generator = await pipeline(
  "text-generation",
  "onnx-community/Qwen2.5-0.5B-Instruct",
  { dtype: "q4", device: "webgpu" },
);

// Define the list of messages
const messages = [
  { role: "system", content: "You are a helpful assistant." },
  { role: "user", content: "Tell me a funny joke." },
];

// Generate a response
const output = await generator(messages, { max_new_tokens: 128 });
console.log(output[0].generated_text.at(-1).content);

Per-module dtypes

Some encoder-decoder models, like Whisper or Florence-2, are extremely sensitive to quantization settings: especially of the encoder. For this reason, we added the ability to select per-module dtypes, which can be done by providing a mapping from module name to dtype.

Example: Run Florence-2 on WebGPU (demo)

import { Florence2ForConditionalGeneration } from "@huggingface/transformers";

const model = await Florence2ForConditionalGeneration.from_pretrained(
  "onnx-community/Florence-2-base-ft",
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);

<p align="middle"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/transformersjs-v3/florence-2-webgpu.gif" alt="Florence-2 running on WebGPU" /> </p> <details> <summary> See full code example </summary>

import {
  Florence2ForConditionalGeneration,
  AutoProcessor,
  AutoTokenizer,
  RawImage,
} from "@huggingface/transformers";

// Load model, processor, and tokenizer
const model_id = "onnx-community/Florence-2-base-ft";
const model = await Florence2ForConditionalGeneration.from_pretrained(
  model_id,
  {
    dtype: {
      embed_tokens: "fp16",
      vision_encoder: "fp16",
      encoder_model: "q4",
      decoder_model_merged: "q4",
    },
    device: "webgpu",
  },
);
const processor = await AutoProcessor.from_pretrained(model_id);
const tokenizer = await AutoTokenizer.from_pretrained(model_id);

// Load image and prepare vision inputs
const url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg";
const image = await RawImage.fromURL(url);
const vision_inputs = await processor(image);

// Specify task and prepare text inputs
const task = "<MORE_DETAILED_CAPTION>";
const prompts = processor.construct_prompts(task);
const text_inputs = tokenizer(prompts);

// Generate text
const generated_ids = await model.generate({
  ...text_inputs,
  ...vision_inputs,
  max_new_tokens: 100,
});

// Decode generated text
const generated_text = tokenizer.batch_decode(generated_ids, {
  skip_special_tokens: false,
})[0];

// Post-process the generated text
const result = processor.post_process_generation(
  generated_text,
  task,
  image.size,
);
console.log(result);
// { '<MORE_DETAILED_CAPTION>': 'A green car is parked in front of a tan building. The building has a brown door and two brown windows. The car is a two door and the door is closed. The green car has black tires.' }

</details>

🏛 A total of 120 supported architectures

This release increases the total number of supported architectures to 120 (see full list), spanning a wide range of input modalities and tasks. Notable new names include: Phi-3, Gemma & Gemma 2, LLaVa, Moondream, Florence-2, MusicGen, Sapiens, Depth Pro, PyAnnote, and RT-DETR.

<p align="middle"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/transformersjs-v3/architectures.png" alt="Bubble diagram of new architectures in Transformers.js v3" /> </p> <details> <summary>List of new models</summary>

Cohere (from Cohere) released with the paper Command-R: Retrieval Augmented Generation at Production Scale by Cohere.
Decision Transformer (from Berkeley/Facebook/Google) released with the paper Decision Transformer: Reinforcement Learning via Sequence Modeling by Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Michael Laskin, Pieter Abbeel, Aravind Srinivas, Igor Mordatch.
Depth Pro (from Apple) released with the paper Depth Pro: Sharp Monocular Metric Depth in Less Than a Second by Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, Vladlen Koltun.
Florence2 (from Microsoft) released with the paper Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks by Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, Lu Yuan.
Gemma (from Google) released with the paper Gemma: Open Models Based on Gemini Technology and Research by the Gemma Google team.
Gemma2 (from Google) released with the paper Gemma2: Open Models Based on Gemini Technology and Research by the Gemma Google team.
Granite (from IBM) released with the paper Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox, Rameswar Panda.
GroupViT (from UCSD, NVIDIA) released with the paper GroupViT: Semantic Segmentation Emerges from Text Supervision by Jiarui Xu, Shalini De Mello, Sifei Liu, Wonmin Byeon, Thomas Breuel, Jan Kautz, Xiaolong Wang.
Hiera (from Meta) released with the paper Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.
JAIS (from Core42) released with the paper Jais and Jais-chat: Arabic-Centric Foundation and Instruction-Tuned Open Generative Large Language Models by Neha Sengupta, Sunil Kumar Sahu, Bokang Jia, Satheesh Katipomu, Haonan Li, Fajri Koto, William Marshall, Gurpreet Gosal, Cynthia Liu, Zhiming Chen, Osama Mohammed Afzal, Samta Kamboj, Onkar Pandit, Rahul Pal, Lalit Pradhan, Zain Muhammad Mujahid, Massa Baali, Xudong Han, Sondos Mahmoud Bsharat, Alham Fikri Aji, Zhiqiang Shen, Zhengzhong Liu, Natalia Vassilieva, Joel Hestness, Andy Hock, Andrew Feldman, Jonathan Lee, Andrew Jackson, Hector Xuguang Ren, Preslav Nakov, Timothy Baldwin, Eric Xing.
LLaVa (from Microsoft Research & University of Wisconsin-Madison) released with the paper Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.
MaskFormer (from Meta and UIUC) released with the paper Per-Pixel Classification is Not All You Need for Semantic Segmentation by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov.
MusicGen (from Meta) released with the paper Simple and Controllable Music Generation by Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi and Alexandre Défossez.
MobileCLIP (from Apple) released with the paper MobileCLIP: Fast Image-Text Models through Multi-Modal Reinforced Training by Pavan Kumar Anasosalu Vasu, Hadi Pouransari, Fartash Faghri, Raviteja Vemulapalli, Oncel Tuzel.
MobileNetV1 (from Google Inc.) released with the paper MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications by Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, Hartwig Adam.
MobileNetV2 (from Google Inc.) released with the paper MobileNetV2: Inverted Residuals and Linear Bottlenecks by Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, Liang-Chieh Chen.
MobileNetV3 (from Google Inc.) released with the paper Searching for MobileNetV3 by Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, Quoc V. Le, Hartwig Adam.
MobileNetV4 (from Google Inc.) released with the paper MobileNetV4 - Universal Models for the Mobile Ecosystem by Danfeng Qin, Chas Leichner, Manolis Delakis, Marco Fornoni, Shixin Luo, Fan Yang, Weijun Wang, Colby Banbury, Chengxi Ye, Berkin Akin, Vaibhav Aggarwal, Tenghui Zhu, Daniele Moro, Andrew Howard.
Moondream1 released in the repository moondream by vikhyat.
OpenELM (from Apple) released with the paper OpenELM: An Efficient Language Model Family with Open-source Training and Inference Framework by Sachin Mehta, Mohammad Hossein Sekhavat, Qingqing Cao, Maxwell Horton, Yanzi Jin, Chenfan Sun, Iman Mirzadeh, Mahyar Najibi, Dmitry Belenko, Peter Zatloukal, Mohammad Rastegari.
Phi3 (from Microsoft) released with the paper Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, Sébastien Bubeck, Martin Cai, Caio César Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dixon, Ronen Eldan, Dan Iter, Amit Garg, Abhishek Goswami, Suriya Gunasekar, Emman Haider, Junheng Hao, Russell J. Hewett, Jamie Huynh, Mojan Javaheripi, Xin Jin, Piero Kauffmann, Nikos Karampatziakis, Dongwoo Kim, Mahoud Khademi, Lev Kurilenko, James R. Lee, Yin Tat Lee, Yuanzhi Li, Chen Liang, Weishung Liu, Eric Lin, Zeqi Lin, Piyush Madan, Arindam Mitra, Hardik Modi, Anh Nguyen, Brandon Norick, Barun Patra, Daniel Perez-Becker, Thomas Portet, Reid Pryzant, Heyang Qin, Marko Radmilac, Corby Rosset, Sambudha Roy, Olatunji Ruwase, Olli Saarikivi, Amin Saied, Adil Salim, Michael Santacroce, Shital Shah, Ning Shang, Hiteshi Sharma, Xia Song, Masahiro Tanaka, Xin Wang, Rachel Ward, Guanhua Wang, Philipp Witte, Michael Wyatt, Can Xu, Jiahang Xu, Sonali Yadav, Fan Yang, Ziyi Yang, Donghan Yu, Chengruidong Zhang, Cyril Zhang, Jianwen Zhang, Li Lyna Zhang, Yi Zhang, Yue Zhang, Yunan Zhang, Xiren Zhou.
PVT (from Nanjing University, The University of Hong Kong etc.) released with the paper Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions by Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao.
PyAnnote released in the repository pyannote/pyannote-audio by Hervé Bredin.
RT-DETR (from Baidu), released together with the paper DETRs Beat YOLOs on Real-time Object Detection by Yian Zhao, Wenyu Lv, Shangliang Xu, Jinman Wei, Guanzhong Wang, Qingqing Dang, Yi Liu, Jie Chen.
Sapiens (from Meta AI) released with the paper Sapiens: Foundation for Human Vision Models by Rawal Khirodkar, Timur Bagautdinov, Julieta Martinez, Su Zhaoen, Austin James, Peter Selednik, Stuart Anderson, Shunsuke Saito.
ViTMAE (from Meta AI) released with the paper Masked Autoencoders Are Scalable Vision Learners by Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick.
ViTMSN (from Meta AI) released with the paper Masked Siamese Networks for Label-Efficient Learning by Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas.

</details>

📂 25 new example projects and templates

As part of the release, we've published 25 new example projects and templates, primarily focused on showcasing WebGPU support! This includes demos like Phi-3.5 WebGPU and Whisper WebGPU, as shown below.

[!NOTE]
We're in the process of moving all our example projects and demos to https://github.com/huggingface/transformers.js-examples, so stay tuned for updates on this!

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/transformersjs-v3/phi-3.5-webgpu.gif" style="max-height: 500px;" alt="Phi-3.5 running on WebGPU" />	<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/transformersjs-v3/whisper-turbo-webgpu.gif" style="max-height: 500px;" alt="Whisper Turbo running on WebGPU" />

🤖 Over 1200 pre-converted models on the Hugging Face Hub

As of today's release, the community has converted over 1200 models to be compatible with Transformers.js! You can find the full list of available models here.

If you'd like to convert your own models or fine-tunes, you can use our conversion script as follows:

python -m scripts.convert --quantize --model_id <model_name_or_path>

After uploading the generated files to the Hugging Face Hub, remember to add the transformers.js tag so others can easily find and use your model!

🌐 Node.js (ESM + CJS), Deno, and Bun compatibility

Transformers.js v3 is now compatible with the three most popular server-side JavaScript runtimes:

Runtime	Description	Examples
Node.js	A widely-used JavaScript runtime built on Chrome's V8. It has a large ecosystem and supports a wide range of libraries and frameworks.	ESM Example / CJS Example
Deno	A modern runtime for JavaScript and TypeScript that is secure by default. It uses ES modules and even features experimental WebGPU support.	Deno Example
Bun	A fast JavaScript runtime optimized for performance. It features a built-in bundler, transpiler, and package manager.	Bun Example

🏡 A new home on GitHub and NPM

Finally, we're delighted to announce that Transformers.js will now be published under the official Hugging Face organization on NPM as @huggingface/transformers (instead of @xenova/transformers, which was used for v1 and v2).

We've also moved the repository to the official Hugging Face organization on GitHub (https://github.com/huggingface/transformers.js), which will be our new home — come say hi! We look forward to hearing your feedback, responding to your issues, and reviewing your PRs!

This is a significant milestone and we're extremely grateful to the community for helping us achieve this long-term goal! None of this would be possible without all of you… thank you! 🤗

🤗 New contributors

@guschmue made their first contribution in https://github.com/huggingface/transformers.js/pull/631
@flatsiedatsie made their first contribution in https://github.com/huggingface/transformers.js/pull/686
@vlinder made their first contribution in https://github.com/huggingface/transformers.js/pull/809
@inisis made their first contribution in https://github.com/huggingface/transformers.js/pull/811
@taha-yassine made their first contribution in https://github.com/huggingface/transformers.js/pull/812
@fs-eire made their first contribution in https://github.com/huggingface/transformers.js/pull/864
@ibelem made their first contribution in https://github.com/huggingface/transformers.js/pull/890
@kallebysantos made their first contribution in https://github.com/huggingface/transformers.js/pull/947
@SeanGallen made their first contribution in https://github.com/huggingface/transformers.js/pull/949
@BritishWerewolf made their first contribution in https://github.com/huggingface/transformers.js/pull/966

Full Changelog: https://github.com/huggingface/transformers.js/compare/2.17.2...3.0.0

May 29, 2024

🚀 What's new?

Add support for MobileViTv2 in https://github.com/xenova/transformers.js/pull/721

import { pipeline } from '@xenova/transformers';

// Create an image classification pipeline
const classifier = await pipeline('image-classification', 'Xenova/mobilevitv2-1.0-imagenet1k-256', {
    quantized: false,
});

// Classify an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg';
const output = await classifier(url);
// [{ label: 'tiger, Panthera tigris', score: 0.6491137742996216 }]

See here for the full list of supported models.

Add support for FastViT in https://github.com/xenova/transformers.js/pull/749

import { pipeline } from '@xenova/transformers';

// Create an image classification pipeline
const classifier = await pipeline('image-classification', 'Xenova/fastvit_t12.apple_in1k', {
  quantized: false
});

// Classify an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/tiger.jpg';
const output = await classifier(url, { topk: 5 });
// [
//   { label: 'tiger, Panthera tigris', score: 0.6649345755577087 },
//   { label: 'tiger cat', score: 0.12454754114151001 },
//   { label: 'lynx, catamount', score: 0.0010689536575227976 },
//   { label: 'dhole, Cuon alpinus', score: 0.0010422508930787444 },
//   { label: 'silky terrier, Sydney silky', score: 0.0009548701345920563 }
// ]

See here for the full list of supported models.

Optimize FFT in https://github.com/xenova/transformers.js/pull/766
Auto rotate image by @KTibow in https://github.com/xenova/transformers.js/pull/737
Support reading data from blob URI by @hans00 in https://github.com/xenova/transformers.js/pull/645
Add sequence post processor in https://github.com/xenova/transformers.js/pull/771
Add model file name by @NawarA in https://github.com/xenova/transformers.js/pull/594
Update pipelines.js to allow for token_embeddings as well by @NikhilVerma in https://github.com/xenova/transformers.js/pull/770
Remove old import from stream/web for ReadableStream in https://github.com/xenova/transformers.js/pull/752
Update tokenizer playground by @xenova in https://github.com/xenova/transformers.js/pull/717
Use ungated version of mistral tokenizer by @xenova in https://github.com/xenova/transformers.js/pull/718
docs: update vanilla-js.md by @eltociear in https://github.com/xenova/transformers.js/pull/738
Fix CI by in https://github.com/xenova/transformers.js/pull/768
Update Next.js demos to 14.2.3 in https://github.com/xenova/transformers.js/pull/772

🤗 New contributors

@eltociear made their first contribution in https://github.com/xenova/transformers.js/pull/738
@KTibow made their first contribution in https://github.com/xenova/transformers.js/pull/737
@NawarA made their first contribution in https://github.com/xenova/transformers.js/pull/594
@NikhilVerma made their first contribution in https://github.com/xenova/transformers.js/pull/770

Full Changelog: https://github.com/xenova/transformers.js/compare/2.17.1...2.17.2

Apr 18, 2024

What's new?

Add ignore_merges option to BPE tokenizers in https://github.com/xenova/transformers.js/pull/716

Full Changelog: https://github.com/xenova/transformers.js/compare/2.17.0...2.17.1

Apr 11, 2024

What's new?

💬 Improved `text-generation` pipeline for conversational models

This version adds support for passing an array of chat messages (with "role" and "content" properties) to the text-generation pipeline (PR). Check out the list of supported models here.

Example: Chat with Xenova/Qwen1.5-0.5B-Chat.

import { pipeline } from '@xenova/transformers';

// Create text-generation pipeline
const generator = await pipeline('text-generation', 'Xenova/Qwen1.5-0.5B-Chat');

// Define the list of messages
const messages = [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Tell me a funny joke.' }
]

// Generate text
const output = await generator(messages, {
    max_new_tokens: 128,
    do_sample: false,
})
console.log(output[0].generated_text);
// [
//   { role: 'system', content: 'You are a helpful assistant.' },
//   { role: 'user', content: 'Tell me a funny joke.' },
//   { role: 'assistant', content: "Sure, here's one:\n\nWhy was the math book sad?\n\nBecause it had too many problems.\n\nI hope you found that joke amusing! Do you have any other questions or topics you'd like to discuss?" },
// ]

We also added the return_full_text parameter, which means if you set return_full_text=false, only the newly-generated tokens will be returned (only applicable if passing the raw text prompt to the pipeline).

🔢 Binary embedding quantization support

Transformers.js v2.17 adds two new parameters to the feature-extraction pipeline ("quantize" and "precision"), enabling you to generate binary embeddings. These can be used with certain embedding models to shrink the size of the document embeddings for retrieval. This results in reductions in index size/memory usage (for storage) and improvements in retrieval speed. Surprisingly, you can still achieve up to ~95% of the original performance, but at 32x storage savings and up to 32x retrieval speeds! 🤯 Thanks to @jonathanpv for this addition in https://github.com/xenova/transformers.js/pull/691!

import { pipeline } from '@xenova/transformers';

// Create feature-extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/all-MiniLM-L6-v2');

// Compute binary embeddings
const output = await extractor('This is a simple test.', { pooling: 'mean', quantize: true, precision: 'binary' });
// Tensor {
//   type: 'int8',
//   data: Int8Array [49, 108, 24, ...],
//   dims: [1, 48]
// }

As you can see, this produces a 32x smaller output tensor (a 4x reduction in data type with Float32Array → Int8Array, as well as an 8x reduction in dimensionality from 384 → 48). For more information, check out this PR in sentence-transformers, which inspired this update!

🛠️ Misc. improvements

Faster dot product by @pulsejet in https://github.com/xenova/transformers.js/pull/667
Update dependencies in https://github.com/xenova/transformers.js/pull/661, https://github.com/xenova/transformers.js/pull/665, https://github.com/xenova/transformers.js/pull/702, and https://github.com/xenova/transformers.js/pull/704.

🤗 New contributors

@pulsejet made their first contribution in https://github.com/xenova/transformers.js/pull/667
@jonathanpv made their first contribution in https://github.com/xenova/transformers.js/pull/691

Full Changelog: https://github.com/xenova/transformers.js/compare/2.16.1...2.17.0

Mar 20, 2024

What's new?

Add support for the image-feature-extraction pipeline in https://github.com/xenova/transformers.js/pull/650.

Example: Perform image feature extraction with Xenova/vit-base-patch16-224-in21k.

const image_feature_extractor = await pipeline('image-feature-extraction', 'Xenova/vit-base-patch16-224-in21k');
const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png';
const features = await image_feature_extractor(url);
// Tensor {
//   dims: [ 1, 197, 768 ],
//   type: 'float32',
//   data: Float32Array(151296) [ ... ],
//   size: 151296
// }

Example: Compute image embeddings with Xenova/clip-vit-base-patch32.

const image_feature_extractor = await pipeline('image-feature-extraction', 'Xenova/clip-vit-base-patch32');
const url = 'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png';
const features = await image_feature_extractor(url);
// Tensor {
//   dims: [ 1, 512 ],
//   type: 'float32',
//   data: Float32Array(512) [ ... ],
//   size: 512
// }

Fix channel format when padding non-square images for certain models in https://github.com/xenova/transformers.js/pull/655. This means you can now perform super-resolution for non-square images with APISR models:

Example: Upscale an image with Xenova/4x_APISR_GRL_GAN_generator-onnx.

import { pipeline } from '@xenova/transformers';

// Create image-to-image pipeline
const upscaler = await pipeline('image-to-image', 'Xenova/4x_APISR_GRL_GAN_generator-onnx', {
    quantized: false,
});

// Upscale an image
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/anime.png';
const output = await upscaler(url);
// RawImage {
//   data: Uint8Array(16588800) [ ... ],
//   width: 2560,
//   height: 1920,
//   channels: 3
// }

// (Optional) Save the upscaled image
output.save('upscaled.png');

<details> <summary>See example output</summary>

Input image:

Output image:

</details>

Update tokenizer apply_chat_template functionality in https://github.com/xenova/transformers.js/pull/647. This PR added functionality to support the new C4AI Command-R tokenizer.

<details> <summary>See example tool usage</summary>

import { AutoTokenizer } from "@xenova/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/c4ai-command-r-v01-tokenizer")

// define conversation input:
const conversation = [
  { role: "user", content: "Whats the biggest penguin in the world?" }
]
// Define tools available for the model to use:
const tools = [
  {
    name: "internet_search",
    description: "Returns a list of relevant document snippets for a textual query retrieved from the internet",
    parameter_definitions: {
      query: {
        description: "Query to search the internet with",
        type: "str",
        required: true
      }
    }
  },
  {
    name: "directly_answer",
    description: "Calls a standard (un-augmented) AI chatbot to generate a response given the conversation history",
    parameter_definitions: {}
  }
]


// render the tool use prompt as a string:
const tool_use_prompt = tokenizer.apply_chat_template(
  conversation,
  {
    chat_template: "tool_use",
    tokenize: false,
    add_generation_prompt: true,
    tools,
  }
)
console.log(tool_use_prompt)

</details> <details> <summary>See example RAG usage</summary>

import { AutoTokenizer } from "@xenova/transformers";

const tokenizer = await AutoTokenizer.from_pretrained("Xenova/c4ai-command-r-v01-tokenizer")

// define conversation input:
const conversation = [
  { role: "user", content: "Whats the biggest penguin in the world?" }
]
// define documents to ground on:
const documents = [
  { title: "Tall penguins", text: "Emperor penguins are the tallest growing up to 122 cm in height." },
  { title: "Penguin habitats", text: "Emperor penguins only live in Antarctica." }
]

// render the RAG prompt as a string:
const grounded_generation_prompt = tokenizer.apply_chat_template(
  conversation,
  {
    chat_template: "rag",
    tokenize: false,
    add_generation_prompt: true,

    documents,
    citation_mode: "accurate", // or "fast"
  }
)
console.log(grounded_generation_prompt);

</details>

Add support for EfficientNet in https://github.com/xenova/transformers.js/pull/639.

Example: Classify images with chriamue/bird-species-classifier

import { pipeline } from '@xenova/transformers';

// Create image classification pipeline
const classifier = await pipeline('image-classification', 'chriamue/bird-species-classifier', {
    quantized: false,      // Quantized model doesn't work
    revision: 'refs/pr/1', // Needed until the model author merges the PR
});

// Classify an image
const url = 'https://upload.wikimedia.org/wikipedia/commons/7/73/Short_tailed_Albatross1.jpg';
const output = await classifier(url);
console.log(output)
// [{ label: 'ALBATROSS', score: 0.9999023079872131 }]

Full Changelog: https://github.com/xenova/transformers.js/compare/2.16.0...2.16.1

Mar 7, 2024

What's new?

💬 StableLM text-generation models

This version adds support for the StableLM family of text-generation models (up to 1.6B params), developed by Stability AI. Huge thanks to @D4ve-R for this contribution in https://github.com/xenova/transformers.js/pull/616! See here for the full list of supported models.

Example: Text generation with Xenova/stablelm-2-zephyr-1_6b.

import { pipeline } from '@xenova/transformers';

// Create text generation pipeline
const generator = await pipeline('text-generation', 'Xenova/stablelm-2-zephyr-1_6b');

// Define the prompt and list of messages
const prompt = "Tell me a funny joke."
const messages = [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": prompt },
]

// Apply chat template
const inputs = generator.tokenizer.apply_chat_template(messages, {
    tokenize: false,
    add_generation_prompt: true,
});

// Generate text
const output = await generator(inputs, { max_new_tokens: 20 });
console.log(output[0].generated_text);
// "<|system|>\nYou are a helpful assistant.\n<|user|>\nTell me a funny joke.\n<|assistant|>\nHere's a joke for you:\n\nWhy don't scientists trust atoms?\n\nBecause they make up everything!"

Note: these models may be too large to run in your browser at the moment, so for now, we recommend using them in Node.js. Stay tuned for updates on this!

🔉 Speaker verification and diarization models

Example: Speaker verification w/ Xenova/wavlm-base-plus-sv.

import { AutoProcessor, AutoModel, read_audio, cos_sim } from '@xenova/transformers';

// Load processor and model
const processor = await AutoProcessor.from_pretrained('Xenova/wavlm-base-plus-sv');
const model = await AutoModel.from_pretrained('Xenova/wavlm-base-plus-sv');

// Helper function to compute speaker embedding from audio URL
async function compute_embedding(url) {
    const audio = await read_audio(url, 16000);
    const inputs = await processor(audio);
    const { embeddings } = await model(inputs);
    return embeddings.data;
}

// Generate speaker embeddings
const BASE_URL = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/sv_speaker';
const speaker_1_1 = await compute_embedding(`${BASE_URL}-1_1.wav`);
const speaker_1_2 = await compute_embedding(`${BASE_URL}-1_2.wav`);
const speaker_2_1 = await compute_embedding(`${BASE_URL}-2_1.wav`);
const speaker_2_2 = await compute_embedding(`${BASE_URL}-2_2.wav`);

// Compute similarity scores
console.log(cos_sim(speaker_1_1, speaker_1_2)); // 0.959439158881247 (Both are speaker 1)
console.log(cos_sim(speaker_1_2, speaker_2_1)); // 0.618130172602329 (Different speakers)
console.log(cos_sim(speaker_2_1, speaker_2_2)); // 0.962999814169370 (Both are speaker 2)

Example: Perform speaker diarization with Xenova/wavlm-base-plus-sd.

import { AutoProcessor, AutoModelForAudioFrameClassification, read_audio } from '@xenova/transformers';

// Read and preprocess audio
const processor = await AutoProcessor.from_pretrained('Xenova/wavlm-base-plus-sd');
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const audio = await read_audio(url, 16000);
const inputs = await processor(audio);

// Run model with inputs
const model = await AutoModelForAudioFrameClassification.from_pretrained('Xenova/wavlm-base-plus-sd');
const { logits } = await model(inputs);
// {
//   logits: Tensor {
//     dims: [ 1, 549, 2 ],  // [batch_size, num_frames, num_speakers]
//     type: 'float32',
//     data: Float32Array(1098) [-3.5301010608673096, ...],
//     size: 1098
//   }
// }

const labels = logits[0].sigmoid().tolist().map(
    frames => frames.map(speaker => speaker > 0.5 ? 1 : 0)
);
console.log(labels); // labels is a one-hot array of shape (num_frames, num_speakers)
// [
//     [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0],
//     [0, 0], [0, 0], [0, 0], [0, 0], [0, 0], [0, 0],
//     [0, 0], [0, 1], [0, 1], [0, 1], [0, 1], [0, 1],
//     ...
// ]

These additions were made possible thanks to the following PRs:

Add support for WavLMForXVector by @D4ve-R in https://github.com/xenova/transformers.js/pull/603
Add support for WavLMForAudioFrameClassification and Wav2Vec2ForAudioFrameClassification by @D4ve-R in https://github.com/xenova/transformers.js/pull/611
Add support for UniSpeech and UniSpeechSat models in https://github.com/xenova/transformers.js/pull/624

📝 Improved chat templating operation coverage

With this release, we're pleased to announce that Transformers.js is now able to parse every single valid chat template that is currently on the Hugging Face Hub! 🤯 As of 2024/03/05, this is around ~12k conversational models (of which there were ~250 unique templates). Of course, future models may introduce more complex chat templates, and we'll continue to add support for them!

For example, transformers.js can now generate the prompt for highly complex function-calling models (e.g., fireworks-ai/firefunction-v1):

import { AutoTokenizer } from '@xenova/transformers';

const tokenizer = await AutoTokenizer.from_pretrained('fireworks-ai/firefunction-v1')

const function_spec = [
    {
        name: 'get_stock_price',
        description: 'Get the current stock price',
        parameters: {
            type: 'object',
            properties: {
                symbol: {
                    type: 'string',
                    description: 'The stock symbol, e.g. AAPL, GOOG'
                }
            },
            required: ['symbol']
        }
    },
    {
        name: 'check_word_anagram',
        description: 'Check if two words are anagrams of each other',
        parameters: {
            type: 'object',
            properties: {
                word1: {
                    type: 'string',
                    description: 'The first word'
                },
                word2: {
                    type: 'string',
                    description: 'The second word'
                }
            },
            required: ['word1', 'word2']
        }
    }
]

const messages = [
    { role: 'functions', content: JSON.stringify(function_spec, null, 4) },
    { role: 'system', content: 'You are a helpful assistant with access to functions. Use them if required.' },
    { role: 'user', content: 'Hi, can you tell me the current stock price of AAPL?' }
]

const inputs = tokenizer.apply_chat_template(messages, { tokenize: false });
console.log(inputs);
// <s>SYSTEM: You are a helpful assistant ...

</details>

🎨 New example applications and demos

Create video object detection demo in https://github.com/xenova/transformers.js/pull/607 (try it out).
Create cross-encoder demo in https://github.com/xenova/transformers.js/pull/617 (try it out).
Add Claude 3 and Mistral to the tokenizer playground in https://github.com/xenova/transformers.js/pull/625 (try it out).

🛠️ Misc. improvements

Add support for the starcoder2 architecture in https://github.com/xenova/transformers.js/pull/622. Note: we haven't yet added transformers.js-compatible versions of the 3B and 7B models.
Check for existence of onnx_env.wasm before updating wasmPaths in https://github.com/xenova/transformers.js/pull/621

🤗 New contributors

@D4ve-R made their first contribution in https://github.com/xenova/transformers.js/pull/603

Full Changelog: https://github.com/xenova/transformers.js/compare/2.15.1...2.16.0

Previous 1 2 3 4 Next

Similar releases

Other sources from this team

Similar sources

Latest

4.0.0

Source

@huggingface/transformers.js

Tracking Since

May 15, 2023

Last fetched Apr 18, 2026

.json·.md·.atom

Transformers.js

What's new?

What's new?

What's new?

🔥 Transformers.js v3.3 — StyleTTS 2 (Kokoro) for state-of-the-art text-to-speech, Grounding DINO for zero-shot object detection

What's new?

New Contributors

What's new?

What's new?

What's new?

🔥 Transformers.js v3.2 — Moonshine for real-time speech recognition, Phi-3.5 Vision for multi-frame image understanding and reasoning, and more!

🤖 New models

🐛 Bug fixes

🛠️ Other improvements

🤗 New contributors

🤖 New models

🐛 Bug fixes

📝 Documentation improvements

🛠️ Other improvements

🤗 New contributors

🚀 Transformers.js v3.1 — any-to-any, text-to-image, image-to-text, pose estimation, time series forecasting, and more!

What's new?

🤗 New contributors

What's new?

Transformers.js v3: WebGPU Support, New Models & Tasks, New Quantizations, Deno & Bun Compatibility, and More…

⚡ WebGPU support (up to 100x faster than WASM!)

Usage in Transformers.js v3

🔢 New quantization formats (dtypes)

Basic usage

Per-module dtypes

🏛 A total of 120 supported architectures

📂 25 new example projects and templates

🤖 Over 1200 pre-converted models on the Hugging Face Hub

🌐 Node.js (ESM + CJS), Deno, and Bun compatibility

🏡 A new home on GitHub and NPM

🤗 New contributors

🚀 What's new?

🤗 New contributors

What's new?

What's new?

💬 Improved text-generation pipeline for conversational models

🔢 Binary embedding quantization support

🛠️ Misc. improvements

🤗 New contributors

What's new?

What's new?

💬 StableLM text-generation models

🔉 Speaker verification and diarization models

📝 Improved chat templating operation coverage

🎨 New example applications and demos

🛠️ Misc. improvements

🤗 New contributors

More from this team

Similar releases

Other sources from this team

Similar sources

More from this team

Similar releases

Other sources from this team

Similar sources

💬 Improved `text-generation` pipeline for conversational models