A new model is added to transformers: Kyutai-STT
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-Kyutai-STT-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.52.4-Kyutai-STT-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Kyutai-STT model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.
Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, and a Moshi-like autoregressive decoder. Kyutai’s lab has released two model checkpoints:
Kyutai-STT can be found on the Huggingface Hub.
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)
# 2. load audio samples
ds = load_dataset(
"hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# 3. prepare the model inputs
inputs = processor(
ds[0]["audio"]["array"],
)
inputs.to(torch_device)
# 4. infer the model
output_tokens = model.generate(**inputs)
# 5. decode the generated tokens
print(processor.batch_decode(output_tokens, skip_special_tokens=True))
import torch
from datasets import load_dataset, Audio
from transformers import KyutaiSpeechToTextProcessor, KyutaiSpeechToTextForConditionalGeneration
# 1. load the model and the processor
torch_device = "cuda" if torch.cuda.is_available() else "cpu"
model_id = "kyutai/stt-2.6b-en"
processor = KyutaiSpeechToTextProcessor.from_pretrained(model_id)
model = KyutaiSpeechToTextForConditionalGeneration.from_pretrained(model_id, device_map=torch_device)
# 2. load audio samples
ds = load_dataset(
"hf-internal-testing/librispeech_asr_dummy", "clean", split="validation"
)
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# 3. prepare the model inputs
audio_arrays = [ds[i]["audio"]["array"] for i in range(4)]
inputs = processor(audio_arrays, return_tensors="pt", padding=True)
inputs = inputs.to(torch_device)
# 4. infer the model
output_tokens = model.generate(**inputs)
# 5. decode the generated tokens
decoded_outputs = processor.batch_decode(output_tokens, skip_special_tokens=True)
for output in decoded_outputs:
print(output)
A new model is added to transformers: V-JEPA 2
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-VJEPA-2-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.52.4-VJEPA-2-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the VJEPA-2 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.
V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
VJEPA-2 can be found on the Huggingface Hub. V-JEPA 2 is intended to represent any video (and image) to perform video classification, retrieval, or as a video encoder for VLMs.
The snippet below shows how to load the V-JEPA 2 model using the AutoModel class.
import torch
from torchcodec.decoders import VideoDecoder
import numpy as np
processor = AutoVideoProcessor.from_pretrained("facebook/vjepa2-vitl-fpc64-256")
model = AutoModel.from_pretrained(
"facebook/vjepa2-vitl-fpc64-256",
torch_dtype=torch.float16,
device_map="auto",
attn_implementation="sdpa"
)
video_url = "https://huggingface.co/datasets/nateraw/kinetics-mini/resolve/main/val/archery/-Qz25rXdMjE_000014_000024.mp4"
vr = VideoDecoder(video_url)
frame_idx = np.arange(0, 64) # choosing some frames. here, you can define more complex sampling strategy
video = vr.get_frames_at(indices=frame_idx).data # T x C x H x W
video = processor(video, return_tensors="pt").to(model.device)
outputs = model(**video)
# V-JEPA 2 encoder outputs, same as calling `model.get_vision_features()`
encoder_outputs = outputs.last_hidden_state
# V-JEPA 2 predictor outputs
predictor_outputs = outputs.predictor_output.last_hidden_state
A new model is added to transformers: ColQwen2
It is added on top of the v4.52.4 release, and can be installed from the following tag: v4.52.4-ColQwen2-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.52.4-ColQwen2-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the ColQwen2 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.53.0.
ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
ColQwen2 can be found on the Huggingface Hub.
import requests
import torch
from PIL import Image
from transformers import ColQwen2ForRetrieval, ColQwen2Processor
from transformers.utils.import_utils import is_flash_attn_2_available
# Load the model and the processor
model_name = "vidore/colqwen2-v1.0-hf"
model = ColQwen2ForRetrieval.from_pretrained(
model_name,
torch_dtype=torch.bfloat16,
device_map="auto", # "cpu", "cuda", or "mps" for Apple Silicon
attn_implementation="flash_attention_2" if is_flash_attn_2_available() else "sdpa",
)
processor = ColQwen2Processor.from_pretrained(model_name)
# The document page screenshots from your corpus
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"
images = [
Image.open(requests.get(url1, stream=True).raw),
Image.open(requests.get(url2, stream=True).raw),
]
# The queries you want to retrieve documents for
queries = [
"When was the United States Declaration of Independence proclaimed?",
"Who printed the edition of Romeo and Juliet?",
]
# Process the inputs
inputs_images = processor(images=images).to(model.device)
inputs_text = processor(text=queries).to(model.device)
# Forward pass
with torch.no_grad():
image_embeddings = model(**inputs_images).embeddings
query_embeddings = model(**inputs_text).embeddings
# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)
print("Retrieval scores (query x image):")
print(scores)
If you have issue with loading the images with PIL, you can use the following code to create dummy images:
images = [
Image.new("RGB", (128, 128), color="white"),
Image.new("RGB", (64, 32), color="black"),
]
Quantization reduces the memory burden of large models by representing the weights in a lower precision. Refer to the Quantization overview for more available quantization backends.
The example below uses bitsandbytes to quantize the weights to int4.
import requests
import torch
from PIL import Image
from transformers import BitsAndBytesConfig, ColQwen2ForRetrieval, ColQwen2Processor
model_name = "vidore/colqwen2-v1.0-hf"
# 4-bit quantization configuration
bnb_config = BitsAndBytesConfig(
load_in_4bit=True,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type="nf4",
bnb_4bit_compute_dtype=torch.float16,
)
model = ColQwen2ForRetrieval.from_pretrained(
model_name,
quantization_config=bnb_config,
device_map="cuda",
).eval()
processor = ColQwen2Processor.from_pretrained(model_name)
url1 = "https://upload.wikimedia.org/wikipedia/commons/8/89/US-original-Declaration-1776.jpg"
url2 = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/4c/Romeoandjuliet1597.jpg/500px-Romeoandjuliet1597.jpg"
images = [
Image.open(requests.get(url1, stream=True).raw),
Image.open(requests.get(url2, stream=True).raw),
]
queries = [
"When was the United States Declaration of Independence proclaimed?",
"Who printed the edition of Romeo and Juliet?",
]
# Process the inputs
inputs_images = processor(images=images, return_tensors="pt").to(model.device)
inputs_text = processor(text=queries, return_tensors="pt").to(model.device)
# Forward pass
with torch.no_grad():
image_embeddings = model(**inputs_images).embeddings
query_embeddings = model(**inputs_text).embeddings
# Score the queries against the images
scores = processor.score_retrieval(query_embeddings, image_embeddings)
print("Retrieval scores (query x image):")
print(scores)
The following commits are included in that patch release:
We had to protect the imports again, a series of bad events. Here are the two prs for the patch:
We had to revert #37877 because of a missing flag that was overriding the device map. We re-introduced the changes because they allow native 3D parallel training in Transformers. Sorry everyone for the troubles! 🤗
The Qwen2.5-Omni model is a unified multiple modalities model proposed in Qwen2.5-Omni Technical Report from Qwen team, Alibaba Group.
The abstract from the technical report is the following:
We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model.
Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture.
In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.
Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.
SAM-HQ (High-Quality Segment Anything Model) was proposed in Segment Anything in High Quality by Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu.
The model is an enhancement to the original SAM model that produces significantly higher quality segmentation masks while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability.
SAM-HQ introduces several key improvements over the original SAM model:
The abstract from the paper is the following:
The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.
Tips:
The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
The D-FINE model was proposed in D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement by Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu
The abstract from the paper is the following:
We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model released by Sesame. It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
Model Architecture: CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model Mimi, introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.
<div class="flex justify-center"> <img src="https://huggingface.co/datasets/eustlb/documentation-images/resolve/main/csm_architecture.png"/> </div>Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).
Llama Guard 4 is a new multimodal model designed to detect inappropriate content in images and text, whether used as input or generated as output by the model. It’s a dense 12B model pruned from Llama 4 Scout model, and it can run on a single GPU (24 GBs of VRAM). It can evaluate both text-only and image+text inputs, making it suitable for filtering both inputs and outputs of large language models. This enables flexible moderation pipelines where prompts are analyzed before reaching the model, and generated responses are reviewed afterwards for safety. It can also understand multiple languages.
TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in A decoder-only foundation model for time-series forecasting by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder only model that uses non-overlapping patches of time-series data as input and outputs some output patch length prediction in an autoregressive fashion.
The abstract from the paper is the following:
Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
The MLCD models were released by the DeepGlint-AI team in unicom, which focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used for multimodal visual large language models, such as LLaVA.
The Janus Model was originally proposed in Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation by DeepSeek AI team and later refined in Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Janus is a vision-language model that can generate both image and text output, it can also take both images and text as input.
[!NOTE] The model doesn't generate both images and text in an interleaved format. The user has to pass a parameter indicating whether to generate text or image.
The abstract from the original paper is the following:
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
The abstract from the aforementioned Janus-Pro paper, released afterwards, is the following:
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strate (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
The InternVL3 family of Visual Language Models was introduced in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.
The abstract from the paper is the following:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/internvl_architecture.png" alt="drawing" width="600"/><small> Overview of InternVL3 models architecture, which is the same as InternVL2.5. Taken from the <a href="https://huggingface.co/OpenGVLab/InternVL3-1B">original checkpoint.</a> </small>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/internvl_overview_performance.png" alt="drawing" width="600"/><small> Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the <a href="https://huggingface.co/OpenGVLab/InternVL3-1B">original checkpoint.</a> </small>
We integrate some kernels in the transformers library via the kernels package: https://github.com/huggingface/kernels
We start with some kernels in the Llama model, and we iterate to identify the best performance optimizations
In the previous release, we've added TP support in order to run distributed inference. However, this is not supported for all quantization methods. We are progressively adding support to it. Right now, only compressed-tensors, fp8 and fp8-fbgemm support it.
From the AutoRound contributors:
AutoRound is an advanced quantization algorithm that delivers strong accuracy, even at 2-bit precision. It leverages sign gradient descent to fine-tune both rounding values and min-max clipping thresholds in just 200 steps ... More details here: https://github.com/intel/auto-round
We have added two new sections to better understand and get started with quantization:
We've added GGUF support to gemma3 family models.
Most Vision Models and VLMs in Transformers can now benefit from fast image processors. By utilizing torch/torchvision functional transforms, these processors offer a substantial speedup when processing images compared to PiL/numpy functions, and support processing on both CPU and CUDA.
The new @auto_docstring decorator makes it easier to add proper documentation when contributing a model without bloating the modeling code:
@auto_docstring: AutoDocstringgenerateWe now support custom generate methods to be loaded from model.generate. The custom generate methods can be stored on the Hub, enabling quick distribution of experiments regarding new caches, decoding methods, heuristics, ...
from transformers import AutoModelForCausalLM, AutoTokenizer
# `generate` with `custom_generate` -> `generate` uses custom code
# note: calling the custom method prints "✨ using a custom generation method ✨"
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")
inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
gen_out = model.generate(**inputs, custom_generate="transformers-community/custom_generate_example", trust_remote_code=True)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))
You can find the docs here, and all custom generation methods by searching for the custom_generate tag.
The transformers-cli command is updated to be simpler and cleaner, specifically for its chat variant.
The following is now possible and recommended:
transformers chat Qwen/Qwen2.5-3B-Instruct
Additionally, almost any generate flag can now be passed as a positional argument, present and future, as opposed to being limited to a set of hardcoded flags, for example:
transformers chat Qwen/Qwen2.5-0.5B-Instruct do_sample=False max_new_tokens=10
chat] generate parameterization powered by GenerationConfig and UX-related changes by @gante in #38047The agents folder is finally removed from transformers in favour of using smolagents.
We are moving away from torch 2.0 as it has been released more than two years ago.
init empty weights without accelerate by @Cyrilvallez in #37337_init_weights by @Cyrilvallez in #37341GenerationMixin inheritance by default in PreTrainedModel by @gante in #37173_pytree._register_pytree_node and torch.cpu.amp.autocast by @bzhong-solink in #37372kernels to 0.4.3 by @ArthurZucker in #37419rms_norm_eps for the L2Norm for Llama4 by @ArthurZucker in #37418tests/models/ by @ydshieh in #37415fsspec dependency which isn't directly used by transformers by @cyyever in #37318_init_weights() issues - make it work for composite models by @Cyrilvallez in #37070num_logits_to_keep by @Cyrilvallez in #37149from_pretrained by @Cyrilvallez in #37216attn_temperature_tuning by @gmlwns2000 in #37501test_offloaded_cache_implementation on XPU by @yao-matrix in #37514as_tensor) by @ydshieh in #37551test_can_load_with_global_device_set using a subprocess by @ydshieh in #37553xxx_token_id for multimodal tokens by @zucchini-nlp in #37573test_past_key_values_format by @gante in #37614/scripts 🧹 🧹 by @gante in #37676en docs in push CI by @gante in #37677siglip.md to Korean by @devxaitist in #37145Qwen2_5OmniConfig.get_text_config by @shahruk10 in #37690/model_cards 🧹 🧹 by @gante in #37685sacrebleu (and document why) by @gante in #37700qwen2_5_omni] fix flaky tests by @gante in #37721test_nemotron_8b_generation_sdpa by @faaany in #37665embeds_to_talker device in Qwen2.5-Omni by @BakerBunker in #37739AriaForConditionalGenerationIntegrationTest on T4 by @ydshieh in #37746MllamaForConditionalGenerationIntegrationTest by @ydshieh in #37750HybridCache init when device is passed by @gante in #37718GPT2Model StaticCache support by @poedator in #35761torch version by @gante in #37760roberta.md to Korean by @garongkim in #37069keypoint_detection.md to Korean by @rlaalsrl0922 in #36649hub.py by @srai9 in #37796test_generate_continue_from_past_key_values by @gante in #37724electra.md to Korean by @Kim-Ju-won in #36763torch.compile test by @gante in #37894AOPerModuleConfig and include_embedding by @jerryzh168 in #37802load_state_dict by @woct0rdho in #37902gpu_selection.md to Korean by @nsbg in #36757vocab_size access for multimodal models by @kurzdev in #37937max_memory argument when factoring in unused reserved memory by @gante in #37982pad image transform for batched inputs by @sebasv in #37544Optional typing by @qubvel in #38018test_push_to_hub_with_saves_each_epoch for now by @ydshieh in #38022torchscript.md by @Madghostek in #38004test_speculative_decoding_non_distil device-agnostic by @faaany in #38010AutoDocstring] Based on inspect parsing of the signature by @ArthurZucker and @yonigozlan in #33771ready for review by @ydshieh in #37885Trigger CircleCI via GitHub Actions when ready for review` by @ydshieh in #38038Trigger CircleCI via GitHub Actions when "ready for review" by @ydshieh in #37885)kernels from docker images by @ydshieh in #38083require_read_token by @ydshieh in #38093librispeech_asr dataset by @faaany in #38073lr_scheduler_kwargs options to create LR Scheduler when LayerWiseDummyOptimizer is used by @BlackNoodle in #34559past_key_values type hint in model output types by @ChengLyu in #37953check_bad commit.py gives wrong results by @ydshieh in #38107manueldeprada to run_slow whitelist by @manueldeprada in #38126include_embedding flag by @jerryzh168 in #37935SinusoidsPositionEmbedding precision by @BakerBunker in #38151Trigger CircleCI by ready for review by @ydshieh in #38171convert to draft workflow by @ydshieh in #38177fetch_tests CircleCI job by @ydshieh in #38176test_sdpa_equivalence (redundant) by @gante in #37911The following contributors have made significant changes to the library over the last release:
fsspec dependency which isn't directly used by transformers (#37318)test_offloaded_cache_implementation on XPU (#37514)embeds_to_talker device in Qwen2.5-Omni (#37739)SinusoidsPositionEmbedding precision (#38151)siglip.md to Korean (#37145)A new model is added to transformers: CSM
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-CSM-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-CSM-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the CSM model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
The Conversational Speech Model (CSM) is the first open-source contextual text-to-speech model released by Sesame. It is designed to generate natural-sounding speech with or without conversational context. This context typically consists of multi-turn dialogue between speakers, represented as sequences of text and corresponding spoken audio.
Model Architecture: CSM is composed of two LLaMA-style auto-regressive transformer decoders: a backbone decoder that predicts the first codebook token and a depth decoder that generates the remaining tokens. It uses the pretrained codec model Mimi, introduced by Kyutai, to encode speech into discrete codebook tokens and decode them back into audio.
The original csm-1b checkpoint is available under the Sesame organization on Hugging Face.
<div class="flex justify-center"> <img src="https://huggingface.co/datasets/eustlb/documentation-images/resolve/main/csm_architecture.png"/> </div>CSM can be found on the Huggingface Hub.
CSM can be used to simply generate speech from a text prompt:
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# prepare the inputs
text = "[0]The past is just a story we tell ourselves." # `[0]` for speaker id 0
inputs = processor(text, add_special_tokens=True).to(device)
# another equivalent way to prepare the inputs
conversation = [
{"role": "0", "content": [{"type": "text", "text": "The past is just a story we tell ourselves."}]},
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
return_dict=True,
).to(device)
# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_without_context.wav")
CSM can be used to generate speech given a conversation, allowing consistency in the voices and content-aware generation:
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio
model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []
# 1. context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
conversation.append(
{
"role": f"{speaker_id}",
"content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
}
)
# 2. text prompt
conversation.append({"role": f"{ds[4]['speaker_id']}", "content": [{"type": "text", "text": ds[4]["text"]}]})
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
return_dict=True,
).to(device)
# infer the model
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, "example_with_context.wav")
CSM supports batched inference!
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio
model_id = "eustlb/csm-1b"
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
# here a batch with two prompts
conversation = [
[
{
"role": f"{ds[0]['speaker_id']}",
"content": [
{"type": "text", "text": ds[0]["text"]},
{"type": "audio", "path": ds[0]["audio"]["array"]},
],
},
{
"role": f"{ds[1]['speaker_id']}",
"content": [
{"type": "text", "text": ds[1]["text"]},
],
},
],
[
{
"role": f"{ds[0]['speaker_id']}",
"content": [
{"type": "text", "text": ds[0]["text"]},
],
}
],
]
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
return_dict=True,
).to(device)
audio = model.generate(**inputs, output_audio=True)
processor.save_audio(audio, [f"speech_batch_idx_{i}.wav" for i in range(len(audio))])
CSM supports full-graph compilation with CUDA graphs!
import torch
import copy
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset
model_id = "eustlb/csm-1b"
device = "cuda"
# set logs to ensure no recompilation and graph breaks
torch._logging.set_logs(graph_breaks=True, recompiles=True, cudagraphs=True)
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
# use static cache, enabling automatically torch compile with fullgraph and reduce-overhead
model.generation_config.max_length = 250 # big enough to avoid recompilation
model.generation_config.max_new_tokens = None # would take precedence over max_length
model.generation_config.cache_implementation = "static"
model.depth_decoder.generation_config.cache_implementation = "static"
# generation kwargs
gen_kwargs = {
"do_sample": False,
"depth_decoder_do_sample": False,
"temperature": 1.0,
"depth_decoder_temperature": 1.0,
}
# Define a timing decorator
class TimerContext:
def __init__(self, name="Execution"):
self.name = name
self.start_event = None
self.end_event = None
def __enter__(self):
# Use CUDA events for more accurate GPU timing
self.start_event = torch.cuda.Event(enable_timing=True)
self.end_event = torch.cuda.Event(enable_timing=True)
self.start_event.record()
return self
def __exit__(self, *args):
self.end_event.record()
torch.cuda.synchronize()
elapsed_time = self.start_event.elapsed_time(self.end_event) / 1000.0
print(f"{self.name} time: {elapsed_time:.4f} seconds")
# prepare the inputs
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
conversation = [
{
"role": f"{ds[0]['speaker_id']}",
"content": [
{"type": "text", "text": ds[0]["text"]},
{"type": "audio", "path": ds[0]["audio"]["array"]},
],
},
{
"role": f"{ds[1]['speaker_id']}",
"content": [
{"type": "text", "text": ds[1]["text"]},
{"type": "audio", "path": ds[1]["audio"]["array"]},
],
},
{
"role": f"{ds[2]['speaker_id']}",
"content": [
{"type": "text", "text": ds[2]["text"]},
],
},
]
padded_inputs_1 = processor.apply_chat_template(
conversation,
tokenize=True,
return_dict=True,
).to(device)
print("\n" + "="*50)
print("First generation - compiling and recording CUDA graphs...")
with TimerContext("First generation"):
_ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)
print("\n" + "="*50)
print("Second generation - fast !!!")
with TimerContext("Second generation"):
_ = model.generate(**padded_inputs_1, **gen_kwargs)
print("="*50)
# now with different inputs
conversation = [
{
"role": f"{ds[0]['speaker_id']}",
"content": [
{"type": "text", "text": ds[2]["text"]},
{"type": "audio", "path": ds[2]["audio"]["array"]},
],
},
{
"role": f"{ds[1]['speaker_id']}",
"content": [
{"type": "text", "text": ds[3]["text"]},
{"type": "audio", "path": ds[3]["audio"]["array"]},
],
},
{
"role": f"{ds[2]['speaker_id']}",
"content": [
{"type": "text", "text": ds[4]["text"]},
],
},
]
padded_inputs_2 = processor.apply_chat_template(
conversation,
tokenize=True,
return_dict=True,
).to(device)
print("\n" + "="*50)
print("Generation with other inputs!")
with TimerContext("Generation with different inputs"):
_ = model.generate(**padded_inputs_2, **gen_kwargs)
print("="*50)
CSM Transformers integration supports training!
from transformers import CsmForConditionalGeneration, AutoProcessor
from datasets import load_dataset, Audio
model_id = "eustlb/csm-1b"
device = "cuda"
# load the model and the processor
processor = AutoProcessor.from_pretrained(model_id)
model = CsmForConditionalGeneration.from_pretrained(model_id, device_map=device)
model.train()
ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
# ensure the audio is 24kHz
ds = ds.cast_column("audio", Audio(sampling_rate=24000))
conversation = []
# context
for text, audio, speaker_id in zip(ds[:4]["text"], ds[:4]["audio"], ds[:4]["speaker_id"]):
conversation.append(
{
"role": f"{speaker_id}",
"content": [{"type": "text", "text": text}, {"type": "audio", "path": audio["array"]}],
}
)
inputs = processor.apply_chat_template(
conversation,
tokenize=True,
return_dict=True,
output_labels=True,
).to(device)
out = model(**inputs)
out.loss.backward()
A new model is added to transformers: GraniteMoeHybrid
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-GraniteMoeHybrid-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-GraniteMoeHybrid-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the GraniteMoeHybrid model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
GraniteMoeHybrid can be found on the Huggingface Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "ibm-granite/granite-4.0-tiny-preview"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()
# change input text as desired
prompt = "Write a code to find the maximum value in a list of numbers."
# tokenize the text
input_tokens = tokenizer(prompt, return_tensors="pt")
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# loop over the batch to print, in this example the batch size is 1
for i in output:
print(i)
A new model is added to transformers: D-FINE
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-D-FINE-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-D-FINE-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the D-FINE model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
The D-FINE model was proposed in D-FINE: Redefine Regression Task in DETRs as Fine-grained Distribution Refinement by Yansong Peng, Hebei Li, Peixi Wu, Yueyi Zhang, Xiaoyan Sun, Feng Wu
The abstract from the paper is the following:
We introduce D-FINE, a powerful real-time object detector that achieves outstanding localization precision by redefining the bounding box regression task in DETR models. D-FINE comprises two key components: Fine-grained Distribution Refinement (FDR) and Global Optimal Localization Self-Distillation (GO-LSD). FDR transforms the regression process from predicting fixed coordinates to iteratively refining probability distributions, providing a fine-grained intermediate representation that significantly enhances localization accuracy. GO-LSD is a bidirectional optimization strategy that transfers localization knowledge from refined distributions to shallower layers through self-distillation, while also simplifying the residual prediction tasks for deeper layers. Additionally, D-FINE incorporates lightweight optimizations in computationally intensive modules and operations, achieving a better balance between speed and accuracy. Specifically, D-FINE-L / X achieves 54.0% / 55.8% AP on the COCO dataset at 124 / 78 FPS on an NVIDIA T4 GPU. When pretrained on Objects365, D-FINE-L / X attains 57.1% / 59.3% AP, surpassing all existing real-time detectors. Furthermore, our method significantly enhances the performance of a wide range of DETR models by up to 5.3% AP with negligible extra parameters and training costs. Our code and pretrained models: this https URL.
D-FINE can be found on the Huggingface Hub.
>>> import torch
>>> from transformers.image_utils import load_image
>>> from transformers import DFineForObjectDetection, AutoImageProcessor
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = load_image(url)
>>> image_processor = AutoImageProcessor.from_pretrained("ustc-community/dfine_x_coco")
>>> model = DFineForObjectDetection.from_pretrained("ustc-community/dfine_x_coco")
>>> inputs = image_processor(images=image, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> results = image_processor.post_process_object_detection(outputs, target_sizes=[(image.height, image.width)], threshold=0.5)
>>> for result in results:
... for score, label_id, box in zip(result["scores"], result["labels"], result["boxes"]):
... score, label = score.item(), label_id.item()
... box = [round(i, 2) for i in box.tolist()]
... print(f"{model.config.id2label[label]}: {score:.2f} {box}")
cat: 0.96 [344.49, 23.4, 639.84, 374.27]
cat: 0.96 [11.71, 53.52, 316.64, 472.33]
remote: 0.95 [40.46, 73.7, 175.62, 117.57]
sofa: 0.92 [0.59, 1.88, 640.25, 474.74]
remote: 0.89 [333.48, 77.04, 370.77, 187.3]
A new model is added to transformers: SAM-HQ
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-SAM-HQ-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-SAM-HQ-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the SAM-HQ model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
SAM-HQ (High-Quality Segment Anything Model) was proposed in Segment Anything in High Quality by Lei Ke, Mingqiao Ye, Martin Danelljan, Yifan Liu, Yu-Wing Tai, Chi-Keung Tang, Fisher Yu.
The model is an enhancement to the original SAM model that produces significantly higher quality segmentation masks while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability.
SAM-HQ introduces several key improvements over the original SAM model:
The abstract from the paper is the following:
The recent Segment Anything Model (SAM) represents a big leap in scaling up segmentation models, allowing for powerful zero-shot capabilities and flexible prompting. Despite being trained with 1.1 billion masks, SAM's mask prediction quality falls short in many cases, particularly when dealing with objects that have intricate structures. We propose HQ-SAM, equipping SAM with the ability to accurately segment any object, while maintaining SAM's original promptable design, efficiency, and zero-shot generalizability. Our careful design reuses and preserves the pre-trained model weights of SAM, while only introducing minimal additional parameters and computation. We design a learnable High-Quality Output Token, which is injected into SAM's mask decoder and is responsible for predicting the high-quality mask. Instead of only applying it on mask-decoder features, we first fuse them with early and final ViT features for improved mask details. To train our introduced learnable parameters, we compose a dataset of 44K fine-grained masks from several sources. HQ-SAM is only trained on the introduced dataset of 44k masks, which takes only 4 hours on 8 GPUs.
Tips:
SAM-HQ can be found on the Huggingface Hub.
import torch
from PIL import Image
import requests
from transformers import SamHQModel, SamHQProcessor
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("sushmanth/sam_hq_vit_b").to(device)
processor = SamHQProcessor.from_pretrained("sushmanth/sam_hq_vit_b")
img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
input_points = [[[450, 600]]] # 2D location of a window in the image
inputs = processor(raw_image, input_points=input_points, return_tensors="pt").to(device)
with torch.no_grad():
outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(
outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
You can also process your own masks alongside the input images in the processor to be passed to the model:
import torch
from PIL import Image
import requests
from transformers import SamHQModel, SamHQProcessor
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SamHQModel.from_pretrained("sushmanth/sam_hq_vit_b").to(device)
processor = SamHQProcessor.from_pretrained("sushmanth/sam_hq_vit_b")
img_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert("RGB")
mask_url = "https://huggingface.co/ybelkada/segment-anything/resolve/main/assets/car.png"
segmentation_map = Image.open(requests.get(mask_url, stream=True).raw).convert("1")
input_points = [[[450, 600]]] # 2D location of a window in the image
inputs = processor(raw_image, input_points=input_points, segmentation_maps=segmentation_map, return_tensors="pt").to(device)
with torch.no_grad():
outputs = model(**inputs)
masks = processor.image_processor.post_process_masks(
outputs.pred_masks.cpu(), inputs["original_sizes"].cpu(), inputs["reshaped_input_sizes"].cpu()
)
scores = outputs.iou_scores
A new model is added to transformers: BitNet
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-BitNet-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-BitNet-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the BitNet model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
Trained on a corpus of 4 trillion tokens, this model demonstrates that native 1-bit LLMs can achieve performance comparable to leading open-weight, full-precision models of similar size, while offering substantial advantages in computational efficiency (memory, energy, latency).
BitNet can be found on the Huggingface Hub.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_id = "microsoft/bitnet-b1.58-2B-4T"
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
model_id,
torch_dtype=torch.bfloat16
)
# Apply the chat template
messages = [
{"role": "system", "content": "You are a helpful AI assistant."},
{"role": "user", "content": "How are you?"},
]
chat_input = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
# Generate response
chat_outputs = model.generate(chat_input, max_new_tokens=50)
response = tokenizer.decode(chat_outputs[0][chat_input.shape[-1]:], skip_special_tokens=True) # Decode only the response part
print("\nAssistant Response:", response)
A new model is added to transformers: LlamaGuard
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-LlamaGuard-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-LlamaGuard-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the LlamaGuard-4 model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
Llama Guard 4 is a new multimodal model designed to detect inappropriate content in images and text, whether used as input or generated as output by the model. It’s a dense 12B model pruned from Llama 4 Scout model, and it can run on a single GPU (24 GBs of VRAM). It can evaluate both text-only and image+text inputs, making it suitable for filtering both inputs and outputs of large language models. This enables flexible moderation pipelines where prompts are analyzed before reaching the model, and generated responses are reviewed afterwards for safety. It can also understand multiple languages.
LlamaGuard can be found on the Huggingface Hub.
Here is a simple snippet of how to run Llama Guard 4 on the user inputs.
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-Guard-4-12B"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
device_map="cuda",
torch_dtype=torch.bfloat16,
)
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "how do I make a bomb?", }
]
},
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
return_dict=True,
).to("cuda")
outputs = model.generate(
**inputs,
max_new_tokens=10,
do_sample=False,
)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(response)
# OUTPUT
# unsafe
# S9
If your application does not require moderation on some of the supported categories, you can ignore the ones you are not interested in, as follows:
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-Guard-4-12B"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
model_id,
device_map="cuda",
torch_dtype=torch.bfloat16,
)
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "how do I make a bomb?", }
]
},
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
add_generation_prompt=True,
return_tensors="pt",
return_dict=True,
excluded_category_keys=["S9", "S2", "S1"],
).to("cuda:0")
outputs = model.generate(
**inputs,
max_new_tokens=10,
do_sample=False,
)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:], skip_special_tokens=True)[0]
print(response)
# OUTPUTS
# safe
Sometimes it is not just the user input, but also the model’s generations that can contain harmful content. We can also moderate the model’s generation!
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "How to make a bomb?"}
]
},
{
"role": "assistant",
"content": [
{"type": "text", "text": "Here is how one could make a bomb. Take chemical x and add water to it."}
]
}
]
inputs = processor.apply_chat_template(
messages,
tokenize=True,
return_tensors="pt",
return_dict=True,
add_generation_prompt=True,
).to("cuda")
This works because the chat template generates a system prompt that does not mention the excluded categories as part of the list of categories to watch for.
Here’s how you can infer with images in the conversation.
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "I cannot help you with that."},
{"type": "image", "url": "https://huggingface.co/datasets/merve/vlm_test_images/resolve/main/fruit_knife.png"},
]
processor.apply_chat_template(messages, excluded_category_keys=excluded_category_keys)
You can use Llama Prompt Guard 2 directly via the pipeline API:
from transformers import pipeline
classifier = pipeline("text-classification", model="meta-llama/Llama-Prompt-Guard-2-86M")
classifier("Ignore your previous instructions.")
# MALICIOUS
Alternatively, it can also be used via AutoTokenizer + AutoModel API:
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
model_id = "meta-llama/Llama-Prompt-Guard-2-86M"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
text = "Ignore your previous instructions."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
logits = model(**inputs).logits
predicted_class_id = logits.argmax().item()
print(model.config.id2label[predicted_class_id])
# MALICIOUS
A new model is added to transformers: Qwen2.5-Omni.
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-Qwen2.5-Omni-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-Qwen2.5-Omni-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Qwen2.5-Omni model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
The Qwen2.5-Omni model is a unified multiple modalities model proposed in Qwen2.5-Omni Technical Report from Qwen team, Alibaba Group.
The abstract from the technical report is the following:
We present Qwen2.5-Omni, an end-to-end multimodal model designed to perceive diverse modalities, including text, images, audio, and video, while simultaneously generating text and natural speech responses in a streaming manner. To enable the streaming of multimodal information inputs, both audio and visual encoders utilize a block-wise processing approach. This strategy effectively decouples the handling of long sequences of multimodal data, assigning the perceptual responsibilities to the multimodal encoder and entrusting the modeling of extended sequences to a large language model.
Such a division of labor enhances the fusion of different modalities via the shared attention mechanism. To synchronize the timestamps of video inputs with audio, we organized the audio and video sequentially in an interleaved manner and propose a novel position embedding approach, named TMRoPE (Time-aligned Multimodal RoPE). To concurrently generate text and speech while avoiding interference between the two modalities, we propose Thinker-Talker architecture.
In this framework, Thinker functions as a large language model tasked with text generation, while Talker is a dual-track autoregressive model that directly utilizes the hidden representations from the Thinker to produce audio tokens as output. Both the Thinker and Talker models are designed to be trained and inferred in an end-to-end manner. For decoding audio tokens in a streaming manner, we introduce a sliding-window DiT that restricts the receptive field, aiming to reduce the initial package delay. Qwen2.5-Omni outperforms the similarly sized Qwen2-VL and Qwen2-Audio in both image and audio capabilities. Furthermore, Qwen2.5-Omni achieves state-of-the-art performance on multimodal benchmarks like Omni-Bench.
Notably, Qwen2.5-Omni is the first open-source model to achieve a level of performance in end-to-end speech instruction following that is comparable to its capabilities with text inputs, as evidenced by benchmarks such as MMLU and GSM8K. As for speech generation, Qwen2.5-Omni’s streaming Talker outperform most existing streaming and non-streaming alternatives in robustness and naturalness.
Qwen2.5-Omni can be found on the Huggingface Hub.
The model can accept text, images, audio and videos as input. Here's an example code for inference.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
torch_dtype="auto",
device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
conversation = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [
{"type": "video", "video": "/path/to/video.mp4"},
{"type": "text", "text": "What cant you hear and see in this video?"},
],
},
]
inputs = processor.apply_chat_template(
conversations,
load_audio_from_video=True,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
video_fps=1,
# kwargs to be passed to `Qwen2-5-OmniProcessor`
padding=True,
use_audio_in_video=True,
).to(model.device)
# Generation params for audio or text can be different and have to be prefixed with `thinker_` or `talker_`
text_ids, audio = model.generate(**inputs, use_audio_in_video=True, thinker_do_sample=False, talker_do_sample=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
sf.write(
"output.wav",
audio.reshape(-1).detach().cpu().numpy(),
samplerate=24000,
)
print(text)
To generate only text output and save compute by not loading the audio generation model, we can use Qwen2_5OmniThinkerForConditionalGeneration model.
from transformers import Qwen2_5OmniThinkerForConditionalGeneration, Qwen2_5OmniProcessor
model = Qwen2_5OmniThinkerForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
torch_dtype="auto",
device_map="auto",
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
conversation = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [
{"type": "video", "video": "/path/to/video.mp4"},
{"type": "text", "text": "What cant you hear and see in this video?"},
],
},
]
inputs = processor.apply_chat_template(
conversations,
load_audio_from_video=True,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
video_fps=1,
# kwargs to be passed to `Qwen2-5-OmniProcessor`
padding=True,
use_audio_in_video=True,
).to(model.device)
text_ids = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
sf.write(
"output.wav",
audio.reshape(-1).detach().cpu().numpy(),
samplerate=24000,
)
print(text)
The model can batch inputs composed of mixed samples of various types such as text, images, audio and videos as input when using Qwen2_5OmniThinkerForConditionalGeneration model. Here is an example.
import soundfile as sf
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
torch_dtype="auto",
device_map="auto"
)
processor = Qwen2_5OmniProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B")
# Conversation with video only
conversation1 = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [
{"type": "video", "path": "/path/to/video.mp4"},
]
}
]
# Conversation with audio only
conversation2 = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [
{"type": "audio", "path": "/path/to/audio.wav"},
]
}
]
# Conversation with pure text
conversation3 = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [{"type": "text", "text": "who are you?"}],
}
]
# Conversation with mixed media
conversation4 = [
{
"role": "system",
"content": [
{"type": "text", "text": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech."}
],
},
{
"role": "user",
"content": [
{"type": "image", "path": "/path/to/image.jpg"},
{"type": "video", "path": "/path/to/video.mp4"},
{"type": "audio", "path": "/path/to/audio.wav"},
{"type": "text", "text": "What are the elements can you see and hear in these medias?"},
],
}
]
conversations = [conversation1, conversation2, conversation3, conversation4]
inputs = processor.apply_chat_template(
conversations,
load_audio_from_video=True,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
video_fps=1,
# kwargs to be passed to `Qwen2-5-OmniProcessor`
padding=True,
use_audio_in_video=True,
).to(model.thinker.device)
text_ids = model.generate(**inputs, use_audio_in_video=True)
text = processor.batch_decode(text_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(text)
The model supports a wide range of resolution inputs. By default, it uses the native resolution for input, but higher resolutions can enhance performance at the cost of more computation. Users can set the minimum and maximum number of pixels to achieve an optimal configuration for their needs.
min_pixels = 128*28*28
max_pixels = 768*28*28
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-Omni-7B", min_pixels=min_pixels, max_pixels=max_pixels)
If users need audio output, the system prompt must be set as "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.", otherwise the audio output may not work as expected.
{
"role": "system",
"content": "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, capable of perceiving auditory and visual inputs, as well as generating text and speech.",
}
The model supports both text and audio outputs, if users do not need audio outputs, they can set enable_audio_output in the from_pretrained function. This option will save about ~2GB of GPU memory but the return_audio option for generate function will only allow to be set at False.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
torch_dtype="auto",
device_map="auto",
enable_audio_output=False,
)
In order to obtain a flexible experience, we recommend that users set enable_audio_output at True when initializing the model through from_pretrained function, and then decide whether to return audio when generate function is called. When return_audio is set to False, the model will only return text outputs to get text responses faster.
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
torch_dtype="auto",
device_map="auto",
enable_audio_output=True,
)
...
text_ids = model.generate(**inputs, return_audio=False)
Qwen2.5-Omni supports the ability to change the voice of the output audio. Users can use the spk parameter of generate function to specify the voice type. The "Qwen/Qwen2.5-Omni-7B" checkpoint support two voice types: Chelsie and Ethan, while Chelsie is a female voice and Ethan is a male voice. By defalut, if spk is not specified, the default voice type is Chelsie.
text_ids, audio = model.generate(**inputs, spk="Chelsie")
text_ids, audio = model.generate(**inputs, spk="Ethan")
First, make sure to install the latest version of Flash Attention 2:
pip install -U flash-attn --no-build-isolation
Also, you should have hardware that is compatible with FlashAttention 2. Read more about it in the official documentation of the flash attention repository. FlashAttention-2 can only be used when a model is loaded in torch.float16 or torch.bfloat16.
To load and run a model using FlashAttention-2, add attn_implementation="flash_attention_2" when loading the model:
from transformers import Qwen2_5OmniForConditionalGeneration
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
"Qwen/Qwen2.5-Omni-7B",
device_map="auto",
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
)
A new model is added to transformers: InternVL (2.5 & 3)
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-InternVL-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-InternVL-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the InternVL model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
The InternVL3 family of Visual Language Models was introduced in InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models.
The abstract from the paper is the following:
We introduce InternVL3, a significant advancement in the InternVL series featuring a native multimodal pre-training paradigm. Rather than adapting a text-only large language model (LLM) into a multimodal large language model (MLLM) that supports visual inputs, InternVL3 jointly acquires multimodal and linguistic capabilities from both diverse multimodal data and pure-text corpora during a single pre-training stage. This unified training paradigm effectively addresses the complexities and alignment challenges commonly encountered in conventional post-hoc training pipelines for MLLMs. To further improve performance and scalability, InternVL3 incorporates variable visual position encoding (V2PE) to support extended multimodal contexts, employs advanced post-training techniques such as supervised fine-tuning (SFT) and mixed preference optimization (MPO), and adopts test-time scaling strategies alongside an optimized training infrastructure. Extensive empirical evaluations demonstrate that InternVL3 delivers superior performance across a wide range of multi-modal tasks. In particular, InternVL3-78B achieves a score of 72.2 on the MMMU benchmark, setting a new state-of-the-art among open-source MLLMs. Its capabilities remain highly competitive with leading proprietary models, including ChatGPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Pro, while also maintaining strong pure-language proficiency. In pursuit of open-science principles, we will publicly release both the training data and model weights to foster further research and development in next-generation MLLMs.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/internvl_architecture.png" alt="drawing" width="600"/><small> Overview of InternVL3 models architecture, which is the same as InternVL2.5. Taken from the <a href="https://huggingface.co/OpenGVLab/InternVL3-1B">original checkpoint.</a> </small>
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/internvl_overview_performance.png" alt="drawing" width="600"/><small> Comparison of InternVL3 performance on OpenCompass against other SOTA VLLMs. Taken from the <a href="https://huggingface.co/OpenGVLab/InternVL3-1B">original checkpoint.</a> </small>
InternVL can be found on the Huggingface Hub.
Here is how you can use the image-text-to-text pipeline to perform inference with the InternVL3 models in just a few lines of code:
>>> from transformers import pipeline
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "image",
... "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
... },
... {"type": "text", "text": "Describe this image."},
... ],
... },
... ]
>>> pipe = pipeline("image-text-to-text", model="OpenGVLab/InternVL3-1B-hf")
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image showcases a vibrant scene of nature, featuring several flowers and a bee. \n\n1. **Foreground Flowers**: \n - The primary focus is on a large, pink cosmos flower with a prominent yellow center. The petals are soft and slightly r'
This example demonstrates how to perform inference on a single image with the InternVL models using chat templates.
[!NOTE] Note that the model has been trained with a specific prompt format for chatting. Use
processor.apply_chat_template(my_conversation_dict)to correctly format your prompts.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
... {"type": "text", "text": "Please describe the image explicitly."},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The image shows two cats lying on a pink blanket. The cat on the left is a tabby with a mix of brown, black, and white fur, and it appears to be sleeping with its head resting on the blanket. The cat on the'
This example shows how to generate text using the InternVL model without providing any image input.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "text", "text": "Write a haiku"},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(torch_device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> print(decoded_output)
"Whispers of dawn,\nSilent whispers of the night,\nNew day's light begins."
InternVL models also support batched image and text inputs.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\nDescribe this image\nassistant\nThe image shows a street scene with a traditional Chinese archway, known as a "Chinese Gate" or "Chinese Gate of']
This implementation of the InternVL models supports batched text-images inputs with different number of images for each text.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
>>> ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace.",
'user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nYes, these images depict the Statue of Liberty and the Golden Gate Bridge.']
InternVL models can also handle video inputs. Here is an example of how to perform inference on a video input using chat templates.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> model_checkpoint = "OpenGVLab/InternVL3-8B-hf"
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, quantization_config=quantization_config)
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "video",
... "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4",
... },
... {"type": "text", "text": "What type of shot is the man performing?"},
... ],
... }
>>> ]
>>> inputs = processor.apply_chat_template(
... messages,
... return_tensors="pt",
... add_generation_prompt=True,
... tokenize=True,
... return_dict=True,
>>> ).to(model.device, dtype=torch.float16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_output = processor.decode(output[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
'The man is performing a forehand shot.'
This example showcases how to handle a batch of chat conversations with interleaved image and video inputs using chat template.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "OpenGVLab/InternVL3-1B-hf"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "video", "url": "https://huggingface.co/datasets/hf-internal-testing/fixtures_videos/resolve/main/tennis.mp4"},
... {"type": "text", "text": "What type of shot is the man performing?"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
>>> ]
>>> inputs = processor.apply_chat_template(
... messages,
... padding=True,
... add_generation_prompt=True,
... tokenize=True,
... return_dict=True,
... return_tensors="pt",
>>> ).to(model.device, dtype=torch.bfloat16)
>>> outputs = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(outputs, skip_special_tokens=True)
>>> decoded_outputs
['user\n\n\nThese images depict two different landmarks. Can you identify them?\nassistant\nThe images depict the Statue of Liberty and the Golden Gate Bridge.',
'user\nFrame1: \nFrame2: \nFrame3: \nFrame4: \nFrame5: \nFrame6: \nFrame7: \nFrame8: \nWhat type of shot is the man performing?\nassistant\nA forehand shot',
"user\n\nWrite a haiku for this image\nassistant\nSilky lake, \nWooden pier, \nNature's peace."]
A new model is added to transformers: Janus
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-Janus-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-Janus-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Janus model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
The Janus Model was originally proposed in Janus: Decoupling Visual Encoding for Unified Multimodal Understanding and Generation by DeepSeek AI team and later refined in Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling. Janus is a vision-language model that can generate both image and text output, it can also take both images and text as input.
[!NOTE] The model doesn't generate both images and text in an interleaved format. The user has to pass a parameter indicating whether to generate text or image.
The abstract from the original paper is the following:
In this paper, we introduce Janus, an autoregressive framework that unifies multimodal understanding and generation. Prior research often relies on a single visual encoder for both tasks, such as Chameleon. However, due to the differing levels of information granularity required by multimodal understanding and generation, this approach can lead to suboptimal performance, particularly in multimodal understanding. To address this issue, we decouple visual encoding into separate pathways, while still leveraging a single, unified transformer architecture for processing. The decoupling not only alleviates the conflict between the visual encoder's roles in understanding and generation, but also enhances the framework's flexibility. For instance, both the multimodal understanding and generation components can independently select their most suitable encoding methods. Experiments show that Janus surpasses previous unified model and matches or exceeds the performance of task-specific models. The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models.
The abstract from the aforementioned Janus-Pro paper, released afterwards, is the following:
In this work, we introduce Janus-Pro, an advanced version of the previous work Janus. Specifically, Janus-Pro incorporates (1) an optimized training strate (2) expanded training data, and (3) scaling to larger model size. With these improvements, Janus-Pro achieves significant advancements in both multimodal understanding and text-to-image instruction-following capabilities, while also enhancing the stability of text-to-image generation. We hope this work will inspire further exploration in the field. Code and models are publicly available.
Janus can be found on the Huggingface Hub.
Here is the example of visual understanding with a single image.
[!NOTE] Note that the model has been trained with a specific prompt format for chatting. Use
processor.apply_chat_template(my_conversation_dict)to correctly format your prompts.
import torch
from PIL import Image
import requests
from transformers import JanusForConditionalGeneration, JanusProcessor
model_id = "deepseek-community/Janus-Pro-1B"
# Prepare Input for generation.
messages = [
{
"role": "user",
"content": [
{'type':'image', 'url': 'http://images.cocodataset.org/val2017/000000039769.jpg'},
{'type':"text", "text":"What do you see in this image?."}
]
},
]
# Set generation mode to `text` to perform text generation.
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(model_id,
torch_dtype=torch.bfloat16,
device_map="auto")
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
generation_mode="text",
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
output = model.generate(**inputs, max_new_tokens=40,generation_mode='text',do_sample=True)
text = processor.decode(output[0], skip_special_tokens=True)
print(text)
Janus can perform inference with multiple images as input, where images can belong to the same prompt or different prompts in batched inference, where the model processes many conversations in parallel. Here is how you can do it:
import torch
from PIL import Image
import requests
from transformers import JanusForConditionalGeneration, JanusProcessor
model_id = "deepseek-community/Janus-Pro-1B"
image_urls = [
"http://images.cocodataset.org/val2017/000000039769.jpg",
"https://www.ilankelman.org/stopsigns/australia.jpg",
"https://huggingface.co/microsoft/kosmos-2-patch14-224/resolve/main/snowman.jpg"
]
messages = [
[
{
"role": "user",
"content": [
{"type": "text", "text": "What’s the difference between"},
{"type": "image", "url": image_urls[0]},
{"type": "text", "text": " and "},
{"type": "image", "url": image_urls[1]}
]
}
],
[
{
"role": "user",
"content": [
{"type": "image", "url": image_urls[2]},
{"type": "text", "text": "What do you see in this image?"}
]
}
]
]
# Load model and processor
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(
model_id, torch_dtype=torch.bfloat16, device_map="auto"
)
inputs = processor.apply_chat_template(
messages,
add_generation_prompt=True,
generation_mode="text",
tokenize=True,
padding=True,
return_dict=True,
return_tensors="pt"
).to(model.device, dtype=torch.bfloat16)
# Generate response
output = model.generate(**inputs, max_new_tokens=40, generation_mode='text', do_sample=False)
text = processor.batch_decode(output, skip_special_tokens=True)
print(text)
Janus can also generate images given a prompt.
import torch
from transformers import JanusForConditionalGeneration, JanusProcessor
# Set generation mode to `image` to prepare inputs for image generation..
model_id = "deepseek-community/Janus-Pro-1B"
processor = JanusProcessor.from_pretrained(model_id)
model = JanusForConditionalGeneration.from_pretrained(model_id,
torch_dtype=torch.bfloat16,
device_map="auto")
messages = [
{
"role": "user",
"content": [
{"type": "text", "text": "A dog running under the rain."},
],
}
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt,generation_mode="image",return_tensors="pt").to(model.device, dtype=torch.bfloat16)
# Set num_return_sequence parameter to generate multiple images per prompt.
model.generation_config.num_return_sequences = 2
outputs = model.generate(**inputs,
generation_mode="image",
do_sample=True,
use_cache=True,
)
# Perform post-processing on the generated token ids.
decoded_image = model.decode_image_tokens(outputs)
images = processor.postprocess(list(decoded_image.float()),return_tensors="PIL.Image.Image")
# Save the image
for i, image in enumerate(images['pixel_values']):
image.save(f"result{i}.png")
A new model is added to transformers: TimesFM
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-TimesFM-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-TimesFM-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the TimesFM model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
TimesFM (Time Series Foundation Model) is a pretrained time-series foundation model proposed in A decoder-only foundation model for time-series forecasting by Abhimanyu Das, Weihao Kong, Rajat Sen, and Yichen Zhou. It is a decoder only model that uses non-overlapping patches of time-series data as input and outputs some output patch length prediction in an autoregressive fashion.
The abstract from the paper is the following:
Motivated by recent advances in large language models for Natural Language Processing (NLP), we design a time-series foundation model for forecasting whose out-of-the-box zero-shot performance on a variety of public datasets comes close to the accuracy of state-of-the-art supervised forecasting models for each individual dataset. Our model is based on pretraining a patched-decoder style attention model on a large time-series corpus, and can work well across different forecasting history lengths, prediction lengths and temporal granularities.
TimesFM can be found on the Huggingface Hub.
import torch
from transformers import TimesFmModelForPrediction
model = TimesFmModelForPrediction.from_pretrained(
"google/timesfm-2.0-500m-pytorch",
torch_dtype=torch.bfloat16,
attn_implementation="sdpa",
device_map="cuda" if torch.cuda.is_available() else None
)
# Create dummy inputs
forecast_input = [
np.sin(np.linspace(0, 20, 100)),
np.sin(np.linspace(0, 20, 200)),
np.sin(np.linspace(0, 20, 400)),
]
frequency_input = [0, 1, 2]
# Convert inputs to sequence of tensors
forecast_input_tensor = [
torch.tensor(ts, dtype=torch.bfloat16).to("cuda" if torch.cuda.is_available() else "cpu")
for ts in forecast_input
]
frequency_input_tensor = torch.tensor(frequency_input, dtype=torch.long).to(
"cuda" if torch.cuda.is_available() else "cpu"
)
# Get predictions from the pre-trained model
with torch.no_grad():
outputs = model(past_values=forecast_input_tensor, freq=frequency_input_tensor, return_dict=True)
point_forecast_conv = outputs.mean_predictions.float().cpu().numpy()
quantile_forecast_conv = outputs.full_predictions.float().cpu().numpy()
A new model is added to transformers: MLCD
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-MLCD-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-MLCD-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the MLCD model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.52.0.
The MLCD models were released by the DeepGlint-AI team in unicom, which focuses on building foundational visual models for large multimodal language models using large-scale datasets such as LAION400M and COYO700M, and employs sample-to-cluster contrastive learning to optimize performance. MLCD models are primarily used for multimodal visual large language models, such as LLaVA.
MLCD can be found on the Huggingface Hub.
import requests
from PIL import Image
from transformers import AutoProcessor, MLCDVisionModel
# Load model and processor
model = MLCDVisionModel.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
processor = AutoProcessor.from_pretrained("DeepGlint-AI/mlcd-vit-bigG-patch14-448")
# Process single image
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)
inputs = processor(images=image, return_tensors="pt")
# Generate outputs
with torch.no_grad():
outputs = model(**inputs)
# Get visual features
features = outputs.last_hidden_state
print(f"Extracted features shape: {features.shape}")
A mix of bugs were fixed in this patch; very exceptionally, we diverge from semantic versioning to merge GLM-4 in this patch release.
This is another round of bug fixes, but they are a lot more minor and outputs were not really affected!