What's new?

😍 Exciting new tasks!

Transformers.js v2.9.0 adds support for three new tasks: (1) Depth estimation, (2) Zero-shot object detection, and (3) Optical document understanding.

🕵️‍♂️ Depth Estimation

Depth estimation is the task of predicting the depth of objects present in an image. See here for more information.

import { pipeline } from '@xenova/transformers';

// Create depth estimation pipeline
let depth_estimator = await pipeline('depth-estimation', 'Xenova/dpt-hybrid-midas');

// Predict depth for image
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
let output = await depth_estimator(url);

<details> <summary>Raw output</summary>

// {
//   predicted_depth: Tensor {
//     dims: [ 384, 384 ],
//     type: 'float32',
//     data: Float32Array(147456) [ 542.859130859375, 545.2833862304688, 546.1649169921875, ... ],
//     size: 147456
//   },
//   depth: RawImage {
//     data: Uint8Array(307200) [ 86, 86, 86, ... ],
//     width: 640,
//     height: 480,
//     channels: 1
//   }
// }

</details>
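
The `depth` field above is already an 8-bit image, but `predicted_depth` holds raw float values. If you want to visualize that tensor yourself, one option is to rescale it to the 0-255 range. The following is a minimal sketch under that assumption: `depthToGrayscale` is a hypothetical helper, not part of Transformers.js, and min-max scaling is not necessarily how the library derives `depth`.

```javascript
// Hypothetical helper (not part of Transformers.js): min-max scale raw depth
// values into 0-255 so they can be viewed as a grayscale image.
function depthToGrayscale(depthData) {
  let min = Infinity;
  let max = -Infinity;
  for (const v of depthData) {
    if (v < min) min = v;
    if (v > max) max = v;
  }

  const range = max - min;
  const pixels = new Uint8Array(depthData.length);
  for (let i = 0; i < depthData.length; ++i) {
    // A constant depth map has no range; map it to 0 to avoid dividing by zero
    pixels[i] = range === 0 ? 0 : Math.round(((depthData[i] - min) / range) * 255);
  }
  return pixels;
}

// Stand-in values; with the pipeline above this would be output.predicted_depth.data
const sample = new Float32Array([0, 250, 500, 1000]);
const gray = depthToGrayscale(sample);
```

The result has one byte per input value, so it keeps the dimensions of `predicted_depth` (384 × 384 in the raw output above) rather than those of the input image.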

🎯 Zero-shot Object Detection

Zero-shot object detection is the task of identifying objects of classes that are unseen during training. See here for more information.

import { pipeline } from '@xenova/transformers';

// Create zero-shot object detection pipeline
let detector = await pipeline('zero-shot-object-detection', 'Xenova/owlvit-base-patch32');

// Predict bounding boxes
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/astronaut.png';
let candidate_labels = ['human face', 'rocket', 'helmet', 'american flag'];
let output = await detector(url, candidate_labels);

<details> <summary>Raw output</summary>

// [
//   {
//     score: 0.24392342567443848,
//     label: 'human face',
//     box: { xmin: 180, ymin: 67, xmax: 274, ymax: 175 }
//   },
//   {
//     score: 0.15129457414150238,
//     label: 'american flag',
//     box: { xmin: 0, ymin: 4, xmax: 106, ymax: 513 }
//   },
//   {
//     score: 0.13649864494800568,
//     label: 'helmet',
//     box: { xmin: 277, ymin: 337, xmax: 511, ymax: 511 }
//   },
//   {
//     score: 0.10262022167444229,
//     label: 'rocket',
//     box: { xmin: 352, ymin: -1, xmax: 463, ymax: 287 }
//   }
// ]

</details>
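
Zero-shot scores tend to be low, and boxes can extend slightly past the image edges (note `ymin: -1` and `ymax: 513` above). A small post-processing step can help before drawing the results. This is a rough sketch: `filterDetections` is a hypothetical helper, and the 512 × 512 image size and 0.12 cutoff are assumptions chosen for illustration.

```javascript
// Hypothetical helper (not part of Transformers.js): keep detections whose
// score is at least minScore and clamp each box to [0, width] x [0, height].
function filterDetections(detections, width, height, minScore) {
  const clamp = (v, lo, hi) => Math.min(Math.max(v, lo), hi);
  return detections
    .filter((d) => d.score >= minScore)
    .map((d) => ({
      ...d,
      box: {
        xmin: clamp(d.box.xmin, 0, width),
        ymin: clamp(d.box.ymin, 0, height),
        xmax: clamp(d.box.xmax, 0, width),
        ymax: clamp(d.box.ymax, 0, height),
      },
    }));
}

// Detections copied from the raw output above
const detections = [
  { score: 0.24392342567443848, label: 'human face', box: { xmin: 180, ymin: 67, xmax: 274, ymax: 175 } },
  { score: 0.15129457414150238, label: 'american flag', box: { xmin: 0, ymin: 4, xmax: 106, ymax: 513 } },
  { score: 0.13649864494800568, label: 'helmet', box: { xmin: 277, ymin: 337, xmax: 511, ymax: 511 } },
  { score: 0.10262022167444229, label: 'rocket', box: { xmin: 352, ymin: -1, xmax: 463, ymax: 287 } },
];

// Assumed image size (512 x 512) and cutoff (0.12)
const kept = filterDetections(detections, 512, 512, 0.12);
```

The helper returns new objects rather than mutating the pipeline output, so the original detections remain available for comparison.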

📝 Optical Document Understanding (image-to-text)

This task involves translating images of scientific PDFs to markdown, making their content easier to access. See here for more information.

import { pipeline } from '@xenova/transformers';

// Create image-to-text pipeline
let pipe = await pipeline('image-to-text', 'Xenova/nougat-small');

// Generate markdown
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/nougat_paper.png';
let output = await pipe(url, {
  min_length: 1,
  max_new_tokens: 40,
  bad_words_ids: [[pipe.tokenizer.unk_token_id]],
});
// [{ generated_text: "# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucur" }]
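
Because the result is plain markdown, it can be processed with ordinary string handling. As a small illustration, the sketch below pulls heading lines out of the generated text; `extractHeadings` is a hypothetical helper, not part of Transformers.js, and it only recognizes `#`-style headings.

```javascript
// Hypothetical helper (not part of Transformers.js): return the text of every
// '#'-style heading found in a markdown string.
function extractHeadings(markdown) {
  return markdown
    .split('\n')
    .filter((line) => /^#{1,6}\s/.test(line))
    .map((line) => line.replace(/^#{1,6}\s+/, '').trim());
}

// Output copied from the example above
const generated = [{ generated_text: "# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucur" }];
const headings = extractHeadings(generated[0].generated_text);
```

Note that `max_new_tokens: 40` cuts generation short, so later headings in the document would only appear with a larger token budget.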


💻 New architectures: Nougat, DPT, GLPN, OwlViT

We added support for 4 new architectures, bringing the total up to 61!

DPT for depth estimation. See here for the list of available models.
GLPN for depth estimation. See here for the list of available models.
OwlViT for zero-shot object detection. See here for the list of available models.
Nougat for optical understanding of academic documents (image-to-text). See here for the list of available models.

🔨 Other improvements

Add support for Grouped Query Attention on Llama Model by @felladrin in https://github.com/xenova/transformers.js/pull/393
Implement max character check by @samlhuillier in https://github.com/xenova/transformers.js/pull/398
Add CLIPFeatureExtractor (and tests) in https://github.com/xenova/transformers.js/pull/387
Add jsDelivr stats to README in https://github.com/xenova/transformers.js/pull/395
Update sharp dependency version in https://github.com/xenova/transformers.js/pull/400

🐛 Bug fixes

Move tensor clone to fix Worker ownership NaN issue by @kungfooman in https://github.com/xenova/transformers.js/pull/404
Add default token_type_ids for multilingual-e5-* models by @do-me in https://github.com/xenova/transformers.js/pull/403
Ensure WASM fallback does not crash in GH actions in https://github.com/xenova/transformers.js/pull/402

🤗 New contributors

@felladrin made their first contribution in https://github.com/xenova/transformers.js/pull/393
@samlhuillier made their first contribution in https://github.com/xenova/transformers.js/pull/398
@do-me made their first contribution in https://github.com/xenova/transformers.js/pull/403

Full Changelog: https://github.com/xenova/transformers.js/compare/2.8.0...2.9.0