Transformers.js v2.9.0 adds support for three new tasks: (1) Depth estimation, (2) Zero-shot object detection, and (3) Optical document understanding.
Depth estimation is the task of predicting the depth of objects present in an image. See here for more information.
import { pipeline } from '@xenova/transformers';
// Create depth estimation pipeline
let depth_estimator = await pipeline('depth-estimation', 'Xenova/dpt-hybrid-midas');
// Predict depth for image
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/cats.jpg';
let output = await depth_estimator(url);
// {
// predicted_depth: Tensor {
// dims: [ 384, 384 ],
// type: 'float32',
// data: Float32Array(147456) [ 542.859130859375, 545.2833862304688, 546.1649169921875, ... ],
// size: 147456
// },
// depth: RawImage {
// data: Uint8Array(307200) [ 86, 86, 86, ... ],
// width: 640,
// height: 480,
// channels: 1
// }
// }
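The raw output above contains float depths in `predicted_depth` alongside an 8-bit, single-channel `depth` image. As a rough illustration of how float depths can be mapped to grayscale pixels, here is a minimal sketch using min-max scaling. `depthToGrayscale` is a hypothetical helper, not part of Transformers.js, and the library's own conversion may differ (for example, it also resizes the prediction to the input image size).

```javascript
// Hypothetical helper (not part of Transformers.js): min-max scale raw depth
// values into 8-bit grayscale pixel values.
function depthToGrayscale(depthValues) {
  // Find the value range with a loop; spreading a large array into
  // Math.min/Math.max can exceed the argument limit.
  let min = Infinity;
  let max = -Infinity;
  for (const v of depthValues) {
    if (v < min) min = v;
    if (v > max) max = v;
  }

  const range = max - min;
  const pixels = new Uint8Array(depthValues.length);
  if (range === 0) return pixels; // flat input: leave every pixel at 0

  for (let i = 0; i < depthValues.length; ++i) {
    pixels[i] = Math.round(((depthValues[i] - min) / range) * 255);
  }
  return pixels;
}

// Tiny stand-in for `output.predicted_depth.data`
const sampleDepth = new Float32Array([0, 250, 500, 1000]);
console.log(depthToGrayscale(sampleDepth)); // Uint8Array(4) [ 0, 64, 128, 255 ]
```

With a real pipeline result, the same function could be called on `output.predicted_depth.data`, assuming it is a `Float32Array` as shown in the raw output.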
Zero-shot object detection is the task of identifying objects of classes that are unseen during training. See here for more information.
import { pipeline } from '@xenova/transformers';
// Create zero-shot object detection pipeline
let detector = await pipeline('zero-shot-object-detection', 'Xenova/owlvit-base-patch32');
// Predict bounding boxes
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/astronaut.png';
let candidate_labels = ['human face', 'rocket', 'helmet', 'american flag'];
let output = await detector(url, candidate_labels);
// [
// {
// score: 0.24392342567443848,
// label: 'human face',
// box: { xmin: 180, ymin: 67, xmax: 274, ymax: 175 }
// },
// {
// score: 0.15129457414150238,
// label: 'american flag',
// box: { xmin: 0, ymin: 4, xmax: 106, ymax: 513 }
// },
// {
// score: 0.13649864494800568,
// label: 'helmet',
// box: { xmin: 277, ymin: 337, xmax: 511, ymax: 511 }
// },
// {
// score: 0.10262022167444229,
// label: 'rocket',
// box: { xmin: 352, ymin: -1, xmax: 463, ymax: 287 }
// }
// ]
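The scores above are fairly low, and some boxes extend slightly past the image edges (for example `ymin: -1` and `ymax: 513`). A minimal post-processing sketch might filter by score and clamp the boxes. `filterDetections` is a hypothetical helper, not part of Transformers.js, and the 512x512 image size and the 0.14 threshold are assumptions chosen for illustration.

```javascript
// Hypothetical helper (not part of Transformers.js): keep detections at or
// above a score threshold and clamp their boxes to the image bounds.
function filterDetections(detections, { threshold = 0.1, width, height }) {
  const clamp = (value, lo, hi) => Math.min(Math.max(value, lo), hi);
  return detections
    .filter((d) => d.score >= threshold)
    .map((d) => ({
      ...d,
      box: {
        xmin: clamp(d.box.xmin, 0, width),
        ymin: clamp(d.box.ymin, 0, height),
        xmax: clamp(d.box.xmax, 0, width),
        ymax: clamp(d.box.ymax, 0, height),
      },
    }));
}

// Values copied from the example output (scores rounded)
const sampleDetections = [
  { score: 0.2439, label: 'human face', box: { xmin: 180, ymin: 67, xmax: 274, ymax: 175 } },
  { score: 0.1513, label: 'american flag', box: { xmin: 0, ymin: 4, xmax: 106, ymax: 513 } },
  { score: 0.1365, label: 'helmet', box: { xmin: 277, ymin: 337, xmax: 511, ymax: 511 } },
  { score: 0.1026, label: 'rocket', box: { xmin: 352, ymin: -1, xmax: 463, ymax: 287 } },
];

// Assumed image size of 512x512 and an arbitrary threshold of 0.14
const kept = filterDetections(sampleDetections, { threshold: 0.14, width: 512, height: 512 });
console.log(kept.map((d) => d.label)); // [ 'human face', 'american flag' ]
console.log(kept[1].box.ymax); // 512
```

The right threshold depends on the model and the labels, so it is worth tuning on your own images.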
Optical document understanding involves translating images of scientific PDFs to markdown, enabling easier access to them. See here for more information.
import { pipeline } from '@xenova/transformers';
// Create image-to-text pipeline
let pipe = await pipeline('image-to-text', 'Xenova/nougat-small');
// Generate markdown
let url = 'https://huggingface.co/datasets/Xenova/transformers.js-docs/resolve/main/nougat_paper.png';
let output = await pipe(url, {
min_length: 1,
max_new_tokens: 40,
bad_words_ids: [[pipe.tokenizer.unk_token_id]],
});
// [{ generated_text: "# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lblecher@meta.com\n\nGuillem Cucur" }]
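Since the pipeline returns markdown as a plain string in `generated_text`, it can be handled with ordinary string processing. As a small sketch, the following pulls the headings out of the generated markdown. `extractHeadings` is a hypothetical helper, not part of Transformers.js, and the sample string is a shortened copy of the output above.

```javascript
// Hypothetical helper (not part of Transformers.js): collect ATX-style
// markdown headings ("# Title", "## Section", ...) from generated text.
function extractHeadings(markdown) {
  const headings = [];
  for (const line of markdown.split('\n')) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (match) {
      headings.push({ level: match[1].length, text: match[2].trim() });
    }
  }
  return headings;
}

// Shortened copy of the pipeline output shown above
const sampleOutput = [
  {
    generated_text:
      '# Nougat: Neural Optical Understanding for Academic Documents\n\nLukas Blecher\n\nCorrespondence to: lblecher@meta.com',
  },
];

console.log(extractHeadings(sampleOutput[0].generated_text));
// [ { level: 1, text: 'Nougat: Neural Optical Understanding for Academic Documents' } ]
```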
We added support for 4 new architectures, bringing the total up to 61!
`image-to-text`). See here for the list of available models.

`CLIPFeatureExtractor` (and tests) in https://github.com/xenova/transformers.js/pull/387

`multilingual-e5-*` models by @do-me in https://github.com/xenova/transformers.js/pull/403

Full Changelog: https://github.com/xenova/transformers.js/compare/2.8.0...2.9.0