Removing backgrounds from images is now as easy as:
```js
import { pipeline } from "@huggingface/transformers";

const segmenter = await pipeline("background-removal", "onnx-community/BEN2-ONNX");

const output = await segmenter("input.png");
output[0].save("output.png"); // (Optional) Save the image
```
You can find the full list of compatible models here, which will continue to grow in the future! 🔥 For more information, check out https://github.com/huggingface/transformers.js/pull/1216.
<h2 id="new-models">🤖 New models</h2>

Ultravox for audio-text-to-text generation (https://github.com/huggingface/transformers.js/pull/1207). See here for the list of supported models.
<details>
<summary>See example usage</summary>

```js
import { UltravoxProcessor, UltravoxModel, read_audio } from "@huggingface/transformers";

const processor = await UltravoxProcessor.from_pretrained(
  "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
);
const model = await UltravoxModel.from_pretrained(
  "onnx-community/ultravox-v0_5-llama-3_2-1b-ONNX",
  {
    dtype: {
      embed_tokens: "q8", // "fp32", "fp16", "q8"
      audio_encoder: "q4", // "fp32", "fp16", "q8", "q4", "q4f16"
      decoder_model_merged: "q4", // "q8", "q4", "q4f16"
    },
  },
);

const audio = await read_audio("http://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/mlk.wav", 16000);
const messages = [
  {
    role: "system",
    content: "You are a helpful assistant.",
  },
  { role: "user", content: "Transcribe this audio:<|audio|>" },
];
const text = processor.tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  tokenize: false,
});

const inputs = await processor(text, audio);
const generated_ids = await model.generate({
  ...inputs,
  max_new_tokens: 128,
});

const generated_texts = processor.batch_decode(
  generated_ids.slice(null, [inputs.input_ids.dims.at(-1), null]),
  { skip_special_tokens: true },
);
console.log(generated_texts[0]);
// "I can transcribe the audio for you. Here's the transcription:\n\n\"I have a dream that one day this nation will rise up and live out the true meaning of its creed.\"\n\n- Martin Luther King Jr.\n\nWould you like me to provide the transcription in a specific format (e.g., word-for-word, character-for-character, or a specific font)?"
```
</details>
DAC and Mimi for audio tokenization/neural audio codecs (https://github.com/huggingface/transformers.js/pull/1215). See here for the list of supported DAC models and here for the list of supported Mimi models.
<details>
<summary>See example usage</summary>

DAC:

```js
import { DacModel, AutoFeatureExtractor } from '@huggingface/transformers';

const model_id = "onnx-community/dac_16khz-ONNX";
const model = await DacModel.from_pretrained(model_id);
const feature_extractor = await AutoFeatureExtractor.from_pretrained(model_id);

const audio_sample = new Float32Array(12000);

// pre-process the inputs
const inputs = await feature_extractor(audio_sample);
{
  // explicitly encode then decode the audio inputs
  const encoder_outputs = await model.encode(inputs);
  const { audio_values } = await model.decode(encoder_outputs);
  console.log(audio_values);
}
{
  // or the equivalent with a forward pass
  const { audio_values } = await model(inputs);
  console.log(audio_values);
}
```

Mimi:

```js
import { MimiModel, AutoFeatureExtractor } from '@huggingface/transformers';

const model_id = "onnx-community/kyutai-mimi-ONNX";
const model = await MimiModel.from_pretrained(model_id);
const feature_extractor = await AutoFeatureExtractor.from_pretrained(model_id);

const audio_sample = new Float32Array(12000);

// pre-process the inputs
const inputs = await feature_extractor(audio_sample);
{
  // explicitly encode then decode the audio inputs
  const encoder_outputs = await model.encode(inputs);
  const { audio_values } = await model.decode(encoder_outputs);
  console.log(audio_values);
}
{
  // or the equivalent with a forward pass
  const { audio_values } = await model(inputs);
  console.log(audio_values);
}
```
</details>
SmolVLM2, a lightweight multimodal model designed to analyze image and video content (https://github.com/huggingface/transformers.js/pull/1196). See here for the list of supported models. Usage is identical to SmolVLM.
LiteWhisper for automatic speech recognition (https://github.com/huggingface/transformers.js/pull/1219). See here for the list of supported models. Usage is identical to Whisper.
Full Changelog: https://github.com/huggingface/transformers.js/compare/3.3.3...3.4.0