This release adds support for a bunch of new model architectures, covering a wide range of use cases! In total, we now support 73 different model architectures!
Example: Image matting w/ Xenova/vitmatte-small-distinctions-646.
```js
import { AutoProcessor, VitMatteForImageMatting, RawImage } from '@xenova/transformers';
// Load processor and model
const processor = await AutoProcessor.from_pretrained('Xenova/vitmatte-small-distinctions-646');
const model = await VitMatteForImageMatting.from_pretrained('Xenova/vitmatte-small-distinctions-646');
// Load image and trimap
const image = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_image.png');
const trimap = await RawImage.fromURL('https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/vitmatte_trimap.png');
// Prepare image + trimap for the model
const inputs = await processor(image, trimap);
// Predict alpha matte
const { alphas } = await model(inputs);
// Tensor {
// dims: [ 1, 1, 640, 960 ],
// type: 'float32',
// size: 614400,
// data: Float32Array(614400) [ 0.9894027709960938, 0.9970508813858032, ... ]
// }
```
<details>
<summary>Visualization code</summary>

```js
import { Tensor, cat } from '@xenova/transformers';
// Visualize predicted alpha matte
const imageTensor = new Tensor(
'uint8',
new Uint8Array(image.data),
[image.height, image.width, image.channels]
).transpose(2, 0, 1);
// Convert float (0-1) alpha matte to uint8 (0-255)
const alphaChannel = alphas
.squeeze(0)
.mul_(255)
.clamp_(0, 255)
.round_()
.to('uint8');
// Concatenate original image with predicted alpha
const imageData = cat([imageTensor, alphaChannel], 0);
// Save output image
const outputImage = RawImage.fromTensor(imageData);
outputImage.save('output.png');
```
</details>
Inputs: the example image and its trimap.
Outputs: the predicted alpha matte (quantized and unquantized).
Example: Protein sequence classification w/ Xenova/esm2_t6_8M_UR50D_sequence_classifier_v1.
```js
import { pipeline } from '@xenova/transformers';
// Create text classification pipeline
const classifier = await pipeline('text-classification', 'Xenova/esm2_t6_8M_UR50D_sequence_classifier_v1');
// Suppose these are your new sequences that you want to classify
// Additional Family 0: Enzymes
const new_sequences_0 = [ 'ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK', 'GVALDECKALDYLPGKPLPMDGKVCQCGSKTPLRP', 'VLPGYTCGELDCKPGKPLPKCGADKTQVATPFLRG', 'TCGALVQYPSCADPPVLRGSDSSVKACKKLDPQDK', 'GALCEECKLCPGADYKPMDGDRLPAAATSKTRPVG', 'PAVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYG', 'VLGYTCGALDCKPGKPLPKCGADKTQVATPFLRGA', 'CGALVQYPSCADPPVLRGSDSSVKACKKLDPQDKT', 'ALCEECKLCPGADYKPMDGDRLPAAATSKTRPVGK', 'AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR' ]
// Additional Family 1: Receptor Proteins
const new_sequences_1 = [ 'VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD', 'KDQVLTVPTYACRCCPKMDSKGRVPSTLRVKSARS', 'PLAGVACGRGLDYRCPRKMVPGDLQVTPATQRPYG', 'CGVRLGYPGCADVPLRGRSSFAPRACMKKDPRVTR', 'RKGVAYLYECRKLRCRADYKPRGMDGRRLPKASTT', 'RPTGAVNCKQAKVYRGLPLPMMGKVPRVCRSRRPY', 'RLDGGYTCGQALDCKPGRKPPKMGCADLKSTVATP', 'LGTCRKLVRYPQCADPPVMGRSSFRPKACCRQDPV', 'RVGYAMCSPKLCSCRADYKPPMGDGDRLPKAATSK', 'QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY' ]
// Additional Family 2: Structural Proteins
const new_sequences_2 = [ 'VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT', 'KDPTVMTVGTYSCQCPKQDSRGSVQPTSRVKTSRSK', 'PLVGKACGRSSDYKCPGQMVSGGSKQTPASQRPSYD', 'CGKKLVGYPSSKADVPLQGRSSFSPKACKKDPQMTS', 'RKGVASLYCSSKLSCKAQYSKGMSDGRSPKASSTTS', 'RPKSAASCEQAKSYRSLSLPSMKGKVPSKCSRSKRP', 'RSDVSYTSCSQSKDCKPSKPPKMSGSKDSSTVATPS', 'LSTCSKKVAYPSSKADPPSSGRSSFSMKACKKQDPPV', 'RVGSASSEPKSSCSVQSYSKPSMSGDSSPKASSTSK', 'QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ' ]
// Merge all sequences
const new_sequences = [...new_sequences_0, ...new_sequences_1, ...new_sequences_2];
// Get the predicted class for each sequence
const predictions = await classifier(new_sequences);
// Output the predicted class for each sequence
for (let i = 0; i < predictions.length; ++i) {
console.log(`Sequence: ${new_sequences[i]}, Predicted class: '${predictions[i].label}'`)
}
// Sequence: ACGYLKTPKLADPPVLRGDSSVTKAICKPDPVLEK, Predicted class: 'Enzymes'
// ... (truncated)
// Sequence: AVDCKKALVYLPKPLPMDGKVCRGSKTPKTRPYGR, Predicted class: 'Enzymes'
// Sequence: VGQRFYGGRQKNRHCELSPLPSACRGSVQGALYTD, Predicted class: 'Receptor Proteins'
// ... (truncated)
// Sequence: QPKAVNCRKAMVYRPKPLPMDKGVPVCRSKRPRPY, Predicted class: 'Receptor Proteins'
// Sequence: VGKGFRYGSSQKRYLHCQKSALPPSCRRGKGQGSAT, Predicted class: 'Structural Proteins'
// ... (truncated)
// Sequence: QPSASNCEKMSSYRPSLPSMSKGVPSSRSKSSPPYQ, Predicted class: 'Structural Proteins'
```
Example: Speech command recognition w/ Xenova/hubert-base-superb-ks.
```js
import { pipeline } from '@xenova/transformers';
// Create audio classification pipeline
const classifier = await pipeline('audio-classification', 'Xenova/hubert-base-superb-ks');
// Classify audio
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/speech-commands_down.wav';
const output = await classifier(url, { topk: 5 });
// [
// { label: 'down', score: 0.9954305291175842 },
// { label: 'go', score: 0.004518700763583183 },
// { label: '_unknown_', score: 0.00005029444946558215 },
// { label: 'no', score: 4.877569494965428e-7 },
// { label: 'stop', score: 5.504634081887616e-9 }
// ]
```
Example: Automatic speech recognition w/ Xenova/hubert-large-ls960-ft.
```js
import { pipeline } from '@xenova/transformers';
// Create automatic speech recognition pipeline
const transcriber = await pipeline('automatic-speech-recognition', 'Xenova/hubert-large-ls960-ft');
// Transcribe audio
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/jfk.wav';
const output = await transcriber(url);
// { text: 'AND SO MY FELLOW AMERICA ASK NOT WHAT YOUR COUNTRY CAN DO FOR YOU ASK WHAT YOU CAN DO FOR YOUR COUNTRY' }
```
Example: Zero-shot image classification w/ Xenova/chinese-clip-vit-base-patch16.
```js
import { pipeline } from '@xenova/transformers';
// Create zero-shot image classification pipeline
const classifier = await pipeline('zero-shot-image-classification', 'Xenova/chinese-clip-vit-base-patch16');
// Set image url and candidate labels
const url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/pikachu.png';
const candidate_labels = ['杰尼龟', '妙蛙种子', '小火龙', '皮卡丘'] // Squirtle, Bulbasaur, Charmander, Pikachu in Chinese
// Classify image
const output = await classifier(url, candidate_labels);
console.log(output);
// [
// { score: 0.9926728010177612, label: '皮卡丘' }, // Pikachu
// { score: 0.003480620216578245, label: '妙蛙种子' }, // Bulbasaur
// { score: 0.001942147733643651, label: '杰尼龟' }, // Squirtle
// { score: 0.0019044597866013646, label: '小火龙' } // Charmander
// ]
```
Example: Image classification w/ Xenova/dinov2-small-imagenet1k-1-layer.
```js
import { pipeline } from '@xenova/transformers';
// Create image classification pipeline
const classifier = await pipeline('image-classification', 'Xenova/dinov2-small-imagenet1k-1-layer');
// Classify an image
const url = 'http://images.cocodataset.org/val2017/000000039769.jpg';
const output = await classifier(url);
console.log(output)
// [{ label: 'tabby, tabby cat', score: 0.8088238835334778 }]
```
Example: Feature extraction w/ Xenova/conv-bert-small.
```js
import { pipeline } from '@xenova/transformers';
// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/conv-bert-small');
// Perform feature extraction
const output = await extractor('This is a test sentence.');
console.log(output)
// Tensor {
// dims: [ 1, 8, 256 ],
// type: 'float32',
// data: Float32Array(2048) [ -0.09434918314218521, 0.5715903043746948, ... ],
// size: 2048
// }
```
Example: Feature extraction w/ Xenova/electra-small-discriminator.
```js
import { pipeline } from '@xenova/transformers';
// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/electra-small-discriminator');
// Perform feature extraction
const output = await extractor('This is a test sentence.');
console.log(output)
// Tensor {
// dims: [ 1, 8, 256 ],
// type: 'float32',
// data: Float32Array(2048) [ 0.5410046577453613, 0.18386700749397278, ... ],
// size: 2048
// }
```
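The outputs above are per-token hidden states. If you want a single fixed-size sentence embedding instead, the feature-extraction pipeline also accepts pooling options. A minimal sketch (the printed dims are illustrative):

```js
import { pipeline } from '@xenova/transformers';

// Create feature extraction pipeline
const extractor = await pipeline('feature-extraction', 'Xenova/electra-small-discriminator');

// Mean-pool over tokens and L2-normalize to get one 256-dimensional sentence embedding
const embedding = await extractor('This is a test sentence.', { pooling: 'mean', normalize: true });
console.log(embedding.dims); // [ 1, 256 ]
```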
NOTE: This only adds support for the Phi architecture. Once the external data format is supported in ONNX Runtime, we will make an update that includes converted versions of the available Phi models.
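For reference, once converted checkpoints are published, loading a Phi model should work like any other text-generation model in the library. A minimal sketch (the model ID is hypothetical until the converted weights are available):

```js
import { pipeline } from '@xenova/transformers';

// Hypothetical model ID: converted Phi checkpoints are not yet published
const generator = await pipeline('text-generation', 'Xenova/phi-1_5');

// Generate a completion
const output = await generator('def fibonacci(n):', { max_new_tokens: 50 });
console.log(output[0].generated_text);
```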
In the last release, we added support for CLAP models (CLIP, but for audio). In this release, we're publishing a simple demo application that shows how to use a CLAP model to perform real-time semantic music search! For simplicity, we implemented everything in vanilla JavaScript, but feel free to adapt it to your framework of choice. As always, the source code is open source! 🥳 PR: https://github.com/xenova/transformers.js/pull/442
https://github.com/xenova/transformers.js/assets/26504141/72e09f8c-d6e9-4430-a56c-7994737966db
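If you'd rather wire this up yourself, the core of the demo boils down to embedding the query text and each track with CLAP's two projection heads, then ranking tracks by cosine similarity. A minimal sketch using the CLAP checkpoint from the previous release (the track URL is a placeholder):

```js
import {
    AutoTokenizer, AutoProcessor,
    ClapTextModelWithProjection, ClapAudioModelWithProjection,
    read_audio, cos_sim,
} from '@xenova/transformers';

// Load the tokenizer/processor and the text and audio towers
const model_id = 'Xenova/clap-htsat-unfused';
const tokenizer = await AutoTokenizer.from_pretrained(model_id);
const text_model = await ClapTextModelWithProjection.from_pretrained(model_id);
const processor = await AutoProcessor.from_pretrained(model_id);
const audio_model = await ClapAudioModelWithProjection.from_pretrained(model_id);

// Embed the search query
const text_inputs = tokenizer('calm piano music', { padding: true, truncation: true });
const { text_embeds } = await text_model(text_inputs);

// Embed a candidate track (CLAP expects 48kHz audio); placeholder URL
const audio = await read_audio('https://example.com/track.wav', 48000);
const audio_inputs = await processor(audio);
const { audio_embeds } = await audio_model(audio_inputs);

// Rank tracks by cosine similarity between query and audio embeddings
const score = cos_sim(text_embeds.data, audio_embeds.data);
console.log(score);
```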
SpeechT5ForSpeechToText in https://github.com/xenova/transformers.js/pull/438

Full Changelog: https://github.com/xenova/transformers.js/compare/2.10.1...2.11.0