The Qwen3-Next series represents the Qwen team's next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency. The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:
Built on this architecture, they trained and open-sourced Qwen3-Next-80B-A3B — 80B total parameters, only 3B active — achieving extreme sparsity and efficiency.
Despite its ultra-efficiency, it outperforms Qwen3-32B on downstream tasks — while requiring less than 1/10 of the training cost. Moreover, it delivers over 10x higher inference throughput than Qwen3-32B when handling contexts longer than 32K tokens.
For more details, please see the Qwen3-Next blog post.
VaultGemma is a text-only decoder model derived from Gemma 2. Notably, it drops the norms after the Attention and MLP blocks, and uses full attention in all layers instead of alternating between full attention and local sliding attention. VaultGemma is available as a pretrained model with 1B parameters and a 1024-token sequence length.
VaultGemma was trained from scratch with sequence-level differential privacy (DP). Its training data includes the same mixture as the Gemma 2 models, consisting of a number of documents of varying lengths. Additionally, it is trained using DP stochastic gradient descent (DP-SGD) and provides a (ε ≤ 2.0, δ ≤ 1.1e-10)-sequence-level DP guarantee, where a sequence consists of 1024 consecutive tokens extracted from heterogeneous data sources. Specifically, the privacy unit of the guarantee is for the sequences after sampling and packing of the mixture.
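DP-SGD, which the guarantee above relies on, clips each per-example (here, per-sequence) gradient to a fixed L2 norm and adds calibrated Gaussian noise before the parameter update. A minimal plain-Python sketch of one such step (illustrative only, not VaultGemma's training code; gradients are flat lists):

```python
import math
import random

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD step (sketch): clip each per-example gradient to
    L2 norm <= clip_norm, sum them, add Gaussian noise with standard
    deviation clip_norm * noise_multiplier per coordinate, average,
    and apply the gradient update."""
    n = len(per_example_grads)
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0  # clipping factor
        for i, g in enumerate(grad):
            summed[i] += g * scale
    noisy_avg = [
        (s + rng.gauss(0.0, clip_norm * noise_multiplier)) / n for s in summed
    ]
    return [p - lr * g for p, g in zip(params, noisy_avg)]
```

The (ε, δ) bound then follows from privacy accounting over the noise multiplier, the sampling rate, and the number of steps.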
Qwen3-VL is a multimodal vision-language model series, encompassing both dense and MoE variants, as well as Instruct and Thinking versions.
Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure text capabilities. Key architectural advancements include: enhanced MRope with interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based time alignment—evolving from T-RoPE to text timestamp alignment for more precise temporal grounding.
These innovations collectively enable Qwen3-VL to achieve superior performance in complex multimodal tasks.
The LongCatFlash model was proposed in LongCat-Flash Technical Report by the Meituan LongCat Team. LongCat-Flash is a 560B parameter Mixture-of-Experts (MoE) model that activates 18.6B-31.3B parameters dynamically (average ~27B). The model features a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and advanced reasoning capabilities.
The abstract from the paper is the following:
We present LongCat-Flash, a 560 billion parameter Mixture-of-Experts (MoE) language model featuring a dynamic computation mechanism that activates 18.6B-31.3B parameters based on context (average ~27B). The model incorporates a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and demonstrates strong performance across multiple benchmarks including 89.71% accuracy on MMLU and exceptional agentic tool use capabilities.
FlexOlmo is a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets.
You can find all the original FlexOlmo checkpoints under the FlexOlmo collection.
LFM2-VL is the first series of vision-language foundation models developed by Liquid AI. These multimodal models are designed for low-latency, device-aware deployment. LFM2-VL extends the LFM2 family of open-weight Liquid Foundation Models (LFMs) into the vision-language space, supporting both text and image inputs with variable resolutions.
LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:
The encoder processes images at their native resolution up to 512×512 pixels, efficiently handling smaller images without upscaling and supporting non-standard aspect ratios without distortion. Larger images are split into non-overlapping square patches of 512×512 each, preserving detail. In LFM2-VL-1.6B, the model also receives a thumbnail (a small, downscaled version of the original image capturing the overall scene) to enhance global context understanding and alignment. Special tokens mark each patch’s position and indicate the thumbnail’s start. The multimodal connector is a 2-layer MLP connector with pixel unshuffle to reduce image token count.
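The tiling arithmetic described above can be sketched as follows; the function name and return structure are illustrative assumptions, not LFM2-VL's actual API:

```python
import math

TILE = 512  # maximum native resolution handled by the encoder (pixels)

def image_patch_plan(width, height, use_thumbnail=False):
    """Sketch of the tiling scheme: images up to 512x512 are kept at
    native resolution; larger images are split into non-overlapping
    512x512 patches via ceiling division. Optionally one extra
    thumbnail patch carries global context (as in LFM2-VL-1.6B)."""
    if width <= TILE and height <= TILE:
        return {"grid": (1, 1), "patches": 1, "thumbnail": False}
    cols = math.ceil(width / TILE)
    rows = math.ceil(height / TILE)
    return {
        "grid": (cols, rows),
        "patches": cols * rows + (1 if use_thumbnail else 0),
        "thumbnail": use_thumbnail,
    }
```

So a 1024×768 image would be covered by a 2×2 grid of patches, plus a thumbnail when enabled.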
The BLT model was proposed in Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer. BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.
The abstract from the paper is the following:
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
Dual Model Architecture: BLT consists of two separately trained models:
Dynamic Patching: The model uses entropy-based dynamic patching where:
Local Encoder: Processes byte sequences with cross-attention to patch embeddings
Global Transformer: Processes patch-level representations with full attention across patches
Local Decoder: Generates output with cross-attention back to the original byte sequence
Byte-Level Tokenizer: Unlike traditional tokenizers that use learned vocabularies, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token ID. There is no need for a vocabulary.
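The entropy-based patching idea above can be sketched as follows, using toy next-byte distributions in place of the trained entropy model; the names and threshold value are illustrative:

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def segment_into_patches(byte_seq, next_byte_dists, threshold):
    """Sketch of entropy-based dynamic patching: walk the byte
    sequence and start a new patch whenever the entropy of the
    predicted next-byte distribution exceeds the threshold, so
    hard-to-predict regions get shorter patches (more compute)."""
    patches, current = [], []
    for byte, dist in zip(byte_seq, next_byte_dists):
        if current and entropy(dist) > threshold:
            patches.append(current)
            current = []
        current.append(byte)
    if current:
        patches.append(current)
    return patches
```

The byte-level "tokenizer" itself then reduces to `list(text.encode("utf-8"))`: each UTF-8 byte is its own ID, with no vocabulary.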
The Qwen2.5-Omni model is a unified multiple modalities model proposed in Qwen2.5-Omni Technical Report from Qwen team, Alibaba Group.
- Use [Qwen2_5OmniForConditionalGeneration] to generate audio and text output. To generate only one output type, use [Qwen2_5OmniThinkerForConditionalGeneration] for text-only and [Qwen2_5OmniTalkerForConditionalGeneration] for audio-only outputs.
- [Qwen2_5OmniForConditionalGeneration] supports only a batch size of 1 at the moment.
- Image resolution can be capped via processor.max_pixels. By default the maximum is set to a very large value, and high-resolution visuals will not be resized unless their resolution exceeds processor.max_pixels.
- Use the [~ProcessorMixin.apply_chat_template] method to convert chat messages to model inputs.

Parakeet models, introduced by NVIDIA NeMo, combine a Fast Conformer encoder with a connectionist temporal classification (CTC), recurrent neural network transducer (RNNT), or token-and-duration transducer (TDT) decoder for automatic speech recognition.
Model Architecture
(See [ParakeetEncoder] for the encoder implementation and details.)

The EdgeTAM model was proposed in EdgeTAM: On-Device Track Anything Model by Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, Bilge Soran.
EdgeTAM is an efficient adaptation of SAM 2 that introduces a 2D Spatial Perceiver architecture to optimize memory attention mechanisms for real-time video segmentation on mobile devices.
More details to come soon :eyes:
We are introducing Continuous Batching (CB) in this release, and we consider it a stable feature. The main use case for CB is batched generation, which makes it very efficient in the context of GRPO training or evaluation. Thanks to CB, researchers and model developers are now free to use transformers in these contexts without having to spin up an additional inference engine.
CB currently supports both full attention and sliding window attention: this means that the vast majority of models are supported, like llama, gemma3, gpt-oss.
CB is also integrated with transformers serve, which means that you can deploy transformers as an OpenAI-compatible HTTP server. Here is a small snippet on how to use it:
import datasets
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", dtype=torch.bfloat16, _attn_implementation="sdpa_paged", device_map="auto"
)
model.generation_config.max_new_tokens = 32
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", padding_side="left")
dataset = datasets.load_dataset("openai/gsm8k", "socratic", split="test")
tokenized_datasets = dataset.map(lambda x: tokenizer(x["question"]), batched=True)
simple_batch_inputs = [item["input_ids"] for item in tokenized_datasets]
batch_outputs = model.generate_batch(inputs=simple_batch_inputs)
for request in batch_outputs:
    print(tokenizer.decode(batch_outputs[request].generated_tokens))
"""
Let's break down the problem step by step:
1. **Total eggs laid per day**:
Janet’s ducks lay **16 eggs per day**
Let's break down the problem step by step:
1. **Blue fiber**: The robe takes **2 bolts** of blue fiber.
2. **White fiber
To determine Josh's profit from flipping the house, let's go step by step.
---
### Step 1: Initial cost of the house
Josh buys the
To find the total distance James runs in a week, we can break down the problem step by step:
1. **Sprints per session**: James runs
To determine how many cups of feed Wendi needs to give her chickens in the final meal of the day, let's go step by step.
"""
- check_model_inputs in core VLMs by @zucchini-nlp in #40342
- input_feature length and attention_mask length in WhisperFeatureExtractor by @BakerBunker in #39221
- _prepare_generation_config by @manueldeprada in #40715
- center_crop fast equivalent to slow by @yonigozlan in #40856
- pytest-rerunfailures<16.0 by @ydshieh in #40561
- test_all_params_have_gradient=False for DeepseekV2ModelTest by @ydshieh in #40566
- test_eager_matches_sdpa_inference not run for CLIP by @ydshieh in #40581
- remi-or to run-slow by @ydshieh in #40590
- get_*_features methods + update doc snippets by @qubvel in #40555
- TvpImageProcessingTest::test_slow_fast_equivalence by @ydshieh in #40593
- siglip flaky test_eager_matches_sdpa_inference by @ydshieh in #40584
- [Tests] Fixup duplicated mrope logic by @vasqu in #40592
- TokenizerTesterMixin temporarily by @ydshieh in #40611
- transformers serve by @McPatate in #40479
- too many request caused by AutoModelTest::test_dynamic_saving_from_local_repo by @ydshieh in #40614
- JambaModelTest.test_load_balancing_loss by @ydshieh in #40617
- deepseek_v3.md to Korean by @ssum21 in #39649
- too many requests in TestMistralCommonTokenizer by @ydshieh in #40623
- test_prompt_lookup_decoding_matches_greedy_search for voxtral by @ydshieh in #40643
- LongformerModelTest::test_attention_outputs as flaky by @ydshieh in #40655
- custom_generate Callables and unify generation args structure by @manueldeprada in #40586
- check_determinism inside test_determinism by @ydshieh in #40661
- test_fast_is_faster_than_slow for Owlv2ImageProcessingTest by @ydshieh in #40663
- test_prompt_lookup_decoding_matches_greedy_search for qwen2_audio by @ydshieh in #40664
- GitModelTest::test_beam_search_generate by @ydshieh in #40666
- tolist instead of list comprehension calling .item() by @McPatate in #40646
- Aimv2ModelTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels as flaky by @ydshieh in #40683
- T5GemmaModelTest::test_eager_matches_sdpa_inference being flaky by @ydshieh in #40702
- hf_hub_download by @ydshieh in #40710
- self in post-process methods by @framonmar7 in #40711
- or for grounding dino mask by @lmarshall12 in #40625
- [Gemma Embedding] Fix SWA by @vasqu in #40700
- VitMatteImageProcessingTest::test_fast_is_faster_than_slow by @ydshieh in #40713
- request_id to headers by @McPatate in #40722
- and/or_mask_function by @Cyrilvallez in #40753
- --continuous_batching by @McPatate in #40618
- continue_final_message in apply_chat_template to prevent substring matching issues by @abdokaseb in #40732
- public.cloud.experiment_url api error by @Zeyi-Lin in #40763
- PromptLookupCandidateGenerator won't generate forbidden tokens by @gante in #40726
- test_past_key_values_format and delete overwrites by @gante in #40701
- generate by @gante in #40375
- [Jetmoe] Fix RoPE by @vasqu in #40819
- self.loss_function by @qubvel in #40764
- test_modeling_common.py by @gante in #40854
- past_key_values by @gante in #40803
- rsqrt by @thalahors in #40848
- [VaultGemma] Update expectations in integration tests by @vasqu in #40855
- imageprocessor.md to Korean by @HyunZ118 in #39557
- Gemma3nAudioFeatureExtractionTest::test_dither by @ydshieh in #40902
- get_mask_sizes by @Cyrilvallez in #40907
- Glm4vIntegrationTest by @ydshieh in #40905
- runner_map by @ydshieh in #40880
- test_fast_is_faster_than_slow by @ydshieh in #40909
- Gemma3ForConditionalGeneration compatible with assisted generation by @gante in #40791
- image_sizes arg and deprecate vision_feature_layer by @yaswanth19 in #40832
- import torch.utils.checkpoint by @gante in #40934
- Glm4vMoeIntegrationTest by @ydshieh in #40930
- Glm4vModelTest::test_eager_matches_fa2_generate by @ydshieh in #40947
- test_speculative_generation by @ydshieh in #40949

The following contributors have made significant changes to the library over the last release:
- imageprocessor.md to Korean (#39557)

A new model is added to transformers: Vault-Gemma
It is added on top of the v4.56.1 release, and can be installed from the following tag: v4.56.1-Vault-Gemma-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.56.1-Vault-Gemma-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Vault-Gemma model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.57.0.
VaultGemma is a text-only decoder model derived from Gemma 2. Notably, it drops the norms after the Attention and MLP blocks, and uses full attention in all layers instead of alternating between full attention and local sliding attention. VaultGemma is available as a pretrained model with 1B parameters and a 1024-token sequence length.
VaultGemma was trained from scratch with sequence-level differential privacy (DP). Its training data includes the same mixture as the Gemma 2 models, consisting of a number of documents of varying lengths. Additionally, it is trained using DP stochastic gradient descent (DP-SGD) and provides a (ε ≤ 2.0, δ ≤ 1.1e-10)-sequence-level DP guarantee, where a sequence consists of 1024 consecutive tokens extracted from heterogeneous data sources. Specifically, the privacy unit of the guarantee is for the sequences after sampling and packing of the mixture.
The example below demonstrates how to chat with the model with pipeline:
from transformers import pipeline
pipe = pipeline(
    task="text-generation",
    model="google/vaultgemma-1b",
    dtype="auto",
    device_map="auto",
)
text = "Tell me an unknown interesting biology fact about the brain."
outputs = pipe(text, max_new_tokens=32)
response = outputs[0]["generated_text"]
print(response)
Or generate directly with the AutoModelForCausalLM class:
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "google/vaultgemma-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype="auto")
text = "Tell me an unknown interesting biology fact about the brain."
input_ids = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
or with transformers chat:
transformers chat google/vaultgemma-1b
This patch most notably fixes an issue with the new dtype argument (replacing torch_dtype) in pipelines!
A new model is added to transformers: Embedding Gemma
It is added on top of the v4.56.0 release, and can be installed from the following tag: v4.56.0-Embedding-Gemma-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the EmbeddingGemma model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.57.0.
Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.
EmbeddingGemma can be found on the Hugging Face Hub. It is integrated in sentence-transformers, which depends on transformers.
See below for sentence-transformers examples using the model:
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("google/embeddinggemma-300m")
# Run inference with queries and documents
query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (768,) (4, 768)
# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3011, 0.6359, 0.4930, 0.4889]])
# Convert similarities to a ranking
ranking = similarities.argsort(descending=True)[0]
print(ranking)
# tensor([1, 2, 3, 0])
DINOv3 is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models.
You can find all the original DINOv3 checkpoints under the DINOv3 collection.
<img width="814" height="658" alt="image" src="https://github.com/user-attachments/assets/740a5c3d-a5a1-45d9-9e4c-d9117837205d" />

The X-Codec model was proposed in Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model by Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue.
The X-Codec model is a neural audio codec that integrates semantic information from self-supervised models (e.g., HuBERT) alongside traditional acoustic information. This enables:
Ovis2 is an updated version of the Ovis model developed by the AIDC-AI team at Alibaba International Digital Commerce Group.
Ovis2 is the latest advancement in multi-modal large language models (MLLMs), succeeding Ovis1.6. It retains the architectural design of the Ovis series, which focuses on aligning visual and textual embeddings, and introduces major improvements in data curation and training methods.
<img src="https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/XB-vgzDL6FshrSNGyZvzc.png" width="600">

MetaCLIP 2 is a replication of the original CLIP model trained on 300+ languages. It achieves state-of-the-art (SOTA) results on multilingual benchmarks (e.g., XM3600, CVQA, Babel‑ImageNet), surpassing previous SOTA such as mSigLIP and SigLIP‑2. The authors show that English and non-English worlds can mutually benefit and elevate each other.
<img width="805" height="408" alt="image" src="https://github.com/user-attachments/assets/72eaa441-9362-4a6a-a834-f505d6727a2a" />

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages the FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.
<img width="864" height="565" alt="image" src="https://github.com/user-attachments/assets/d09dfe3a-6dda-45a3-8dd3-0254d8503b4e" />

SAM2 (Segment Anything Model 2) was proposed in Segment Anything in Images and Videos by Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.
The model can be used to predict segmentation masks of any object of interest given an input image or video, and input points or bounding boxes.
<img width="960" height="540" alt="image" src="https://github.com/user-attachments/assets/0ab42e5c-6951-4cbc-9d5d-ff8bf0c2dbf1" />

The Kosmos-2.5 model was proposed in KOSMOS-2.5: A Multimodal Literate Model by Microsoft.
The abstract from the paper is the following:
We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png" alt="drawing" width="600"/>
More information at release 🤗
Beyond a large refactor of the caching system in Transformers, making it much more practical and general, models using sliding window or chunked attention no longer waste memory when caching past states. This was made possible most notably by:
See the following improvements on memory usage for Mistral (using only sliding layers) and GPT-OSS (1 out of 2 layers is sliding) respectively: <img width="569" height="431" alt="image" src="https://github.com/user-attachments/assets/7f1688f4-b077-4840-a62c-bfa6131fe806" /> <img width="574" height="431" alt="image" src="https://github.com/user-attachments/assets/bb4a284f-961e-413d-b7e1-783bb5d8fb39" />
Beyond memory usage, it will also improve generation/forward speed by a large margin for large contexts, as only necessary states are passed to the attention computation, which is very sensitive to the sequence length.
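A back-of-the-envelope sketch of why sliding layers save memory: a full-attention layer caches keys/values for the whole context, while a sliding layer only ever needs the last `window` tokens. The function below is illustrative only; the dimensions are hypothetical, not any model's actual config:

```python
def kv_cache_bytes(context_len, n_layers_full, n_layers_sliding, window,
                   n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV-cache size: 2 (K and V) * tokens_cached * n_kv_heads
    * head_dim * bytes_per_elem, summed over layers. Sliding layers
    cache at most `window` tokens; full layers cache the whole context."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem
    full = n_layers_full * context_len * per_token
    sliding = n_layers_sliding * min(context_len, window) * per_token
    return full + sliding
```

For a 131K-token context with a 4096-token window, a model whose layers are all sliding would cache 32x fewer KV states than the same model with all-full layers, matching the order of improvement shown in the plots above.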
Since the GPT-OSS release, which introduced the MXFP4 quantization type, several improvements have been made to its support, which should now stabilize.
- swiglu_limit not passed in for MXFP4 by @danielhanchen in #40197
- [Mxfp4] Add a way to save with a quantization method by @ArthurZucker in #40176

Now that we have deprecated TensorFlow and JAX, we felt that torch_dtype was not only misaligned with torch, but also redundant and hard to remember. For this reason, we switched to the much more standard dtype argument!
torch_dtype will still be a valid usage for as long as needed to ensure a smooth transition, but new code should use dtype, and we encourage you to update older code as well!
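This is not the actual transformers implementation, but a sketch of how such a backward-compatible rename is typically handled: prefer the new argument, accept the legacy one with a deprecation warning, and reject conflicting values:

```python
import warnings

def resolve_dtype(dtype=None, torch_dtype=None):
    """Sketch of a backward-compatible argument rename (illustrative,
    not the transformers code): `dtype` wins; `torch_dtype` still
    works but emits a FutureWarning; passing conflicting values for
    both is an error."""
    if torch_dtype is not None:
        warnings.warn(
            "`torch_dtype` is deprecated, use `dtype` instead.",
            FutureWarning,
        )
        if dtype is not None and dtype != torch_dtype:
            raise ValueError("Pass either `dtype` or `torch_dtype`, not both.")
        return torch_dtype
    return dtype
```

This pattern keeps old call sites working during the transition while nudging users toward the new spelling.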
The following commits are breaking changes in workflows that were either buggy or not working as expected.
On models where the hub checkpoint specifies cache_implementation="hybrid" (static sliding window hybrid cache), this value is now unset, so the model uses dynamic sliding window layers by default.
The old default caused widespread, very slow first generate calls on models with hybrid caches; this should no longer be the case.
- cache_implementation="hybrid" hub defaults by @gante in #40135

The computation of sine positional embeddings for MaskFormer is now cached, resulting in a 6% performance improvement.
Adds explicit cache initialization to prepare for the deprecation of the from_legacy_cache utility.
fullgraph=False

Having fullgraph set to True during compilation ended up being very restrictive, especially with the arrival of widely-used MoEs.
The DoLa decoding strategy has been moved to the following remote-code repository a few versions ago: https://huggingface.co/transformers-community/dola
The Contrastive Search decoding strategy has been moved to the following remote-code repository a few versions ago: https://huggingface.co/transformers-community/contrastive-search
Both have now been removed from the library as a result.
Flash Attention used sliding window sizes that were off by one. This affected generations whose initial context was larger than the sliding window size.
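To make the off-by-one concrete, here is a sketch of a causal sliding-window visibility rule; the convention shown (each query sees itself plus the previous window − 1 tokens) is an assumption for illustration, not the Flash Attention source:

```python
def sliding_window_allowed(query_pos, key_pos, window):
    """Causal sliding-window visibility: query at position q may
    attend key at position k iff k <= q and q - k < window, i.e.
    exactly `window` keys (itself plus window - 1 previous ones).
    An off-by-one such as `q - k <= window` would silently admit
    window + 1 keys instead."""
    return key_pos <= query_pos and query_pos - key_pos < window

def visible_keys(query_pos, window):
    """All key positions visible to the query under the rule above."""
    return [k for k in range(query_pos + 1)
            if sliding_window_allowed(query_pos, k, window)]
```

Errors like this only show up once the context exceeds the window, which is why short-context tests did not catch it.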
- [Flash Attention] Fix sliding window size by @vasqu in #40163

Torch 2.1 support has been unreliable for some time, so we've now made it official and bumped our minimum version to 2.2.
- GptOss fixes for green CI by @gante in #39929
- utils/check_bad_commit.py failing due to rate limit (requesting api.github.com) by @ydshieh in #39918
- torch.device('cpu').index being None by @manueldeprada in #39933
- torchcodec is updated by @ydshieh in #39951
- triton_kernels dep with kernels instead by @SunMarc in #39926
- fix_and_overwrite mode of utils/check_docstring.py by @manueldeprada in #39369
- find_file_type by @yonigozlan in #39897
- past_key_value to past_key_valueS everywhere by @Cyrilvallez in #39956
- notification_service.py about time_spent by @ydshieh in #40037
- notification_service.py about time_spent by @ydshieh in #40044
- torchcodec==0.5.0 and use torch 2.8 on daily CI by @ydshieh in #40072
- time_spent in notification_service.py by @ydshieh in #40081
- [GPT Big Code] Fix attention scaling by @vasqu in #40041
- ForConditionalGeneration by @qgallouedec in #39973
- is_fast to ImageProcessor by @MilkClouds in #39603
- logger.warning with logger.warning_once in GradientCheckpointingLayer by @qgallouedec in #40091
- [Flash Attention] Fix flash attention integration by @vasqu in #40002
- custom_generate collections by @gante in #39894
- tiny_agents.md to Korean by @AhnJoonSung in #39913
- content inputs for LLMs by @gante in #39829
- decoding_method argument in generate by @manueldeprada in #40085
- generation_config by @gante in #40127
- main_classes/processors.md to Korean by @TaskerJang in #39519
- jamba.md to Korean by @skwh54 in #39890
- main_classes/optimizer_schedules.md to Korean by @luckyvickyricky in #39713
- gpt2.md to Korean by @taemincode in #39808
- optimizers.md to Korean by @chelsseeey in #40011
- pipelines.md to Korean by @xhaktm00 in #39577
- gemma3.md to Korean by @seopp in #39865
- torch_compile_test and torch_export_test by @ydshieh in #39950
- self.tokenizer by self.processing_class by @qgallouedec in #40119
- too long with no output by @ydshieh in #40201
- model_input_names for PixtralImageProcessor by @rohitrango in #40226
- chat_template (jinja2) as an extra dependency by @tboerstad in #40128
- [CI] Fix repo consistency by @vasqu in #40249
- k_proj weight and bias slicing in D-FINE by @notkisk in #40257
- id=usage to <hfoptions> tag in LayoutLM model card by @Jin-HoMLee in #40273
- torch.compile tests with fullgraph=True by @ydshieh in #40164
- [FA] Fix dtype in varlen with position ids by @vasqu in #40295
- [fix] Pass adamw optimizer parameters to StableAdamW by @emapco in #40184
- find_executable_batch_size to match new 0.9 ratio by @MilkClouds in #40206
- [Flash Attention] Fix sliding window size by @vasqu in #40163
- _tp_plan attribute by @rishub-tamirisa in #39944
- natten by @ydshieh in #40287
- [GPT OSS] Refactor the tests as it was not properly checking the outputs by @ArthurZucker in #40288
- get_placeholder_mask in Ovis2 by @thisisiron in #40280
- /en/model_doc by @gante in #40311
- /en/model_doc by @gante in #40344
- test_spm_converter_bytefallback_warning by @ydshieh in #40284
- [FA] Fix some model tests by @vasqu in #40350
- label_names as an argument to TrainingArguments by @huzaifa-jawad367 in #40353
- skip_special_tokens in the main text generation pipelines by @gante in #40356
- dtype instead of torch_dtype everywhere! by @Cyrilvallez in #39782
- tokenizer_kwargs argument to the text generation pipeline by @Joshua-Chin in #40364
- transformers TF classes/methods by @gante in #40429
- models.md to Korean by @Judy-Choi in #39518
- main by @ydshieh in #40451
- qwen2_moe tests by @ydshieh in #40494
- merge to main by @ydshieh in #40503

The following contributors have made significant changes to the library over the last release:
- get_placeholder_mask in Ovis2 (#40280)

There was a mix-up on our side when cherry-picking commit #40197, which led to a wrong commit in the patch! Sorry everyone 😭
This patch is just the official fix for #40197!
Focused on stabilizing FlashAttention-2 on Ascend NPU, improving FSDP behavior for generic-task models, and fixing MXFP4 integration for GPT-OSS.
This broke FA2 generations! 😢 Well, sorry everyone, sometimes shit can happen...
4.55.1 was broken because of 🥁 git merge conflict.
I cherry-picked https://github.com/huggingface/transformers/pull/40002 without https://github.com/huggingface/transformers/pull/40029, so from ..modeling_flash_attention_utils import prepare_fa_kwargs_from_position_ids was missing; and since this is a slow test, nothing caught it.
Will work to remediate and write the post-mortem when yanking the release.
Mostly focused on stabilizing MXFP4 for the GPT-OSS model!
New model added by the Z.ai team to transformers!
GLM-4.5V is a new multimodal reasoning model based on GLM-4.5-Air, which has 106B total and 12B active parameters.
It's performant across 42 benchmarks across various categories:
To use it, install the transformers preview release:

```shell
pip install transformers-v4.55.0-GLM-4.5V-preview
```
Then you can run:
```python
from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration
import torch

MODEL_PATH = "zai-org/GLM-4.5V"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png"
            },
            {
                "type": "text",
                "text": "describe this image"
            }
        ],
    }
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = Glm4vMoeForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(output_text)
```
For more detailed information about this model, we recommend reading the following blogpost: https://huggingface.co/blog/welcome-openai-gpt-oss
GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a big one with 117B parameters (gpt-oss-120b), and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoEs) and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16GB of memory and is perfect for consumer hardware and on-device applications.
The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or ~48 GB in bfloat16.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to `pip install --upgrade kernels` and add the following line to your snippet:
```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Flash Attention with Sinks
+   attn_implementation="kernels-community/vllm-flash-attn3",
)
messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
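For intuition, an attention sink can be pictured as an extra logit that competes in the softmax but contributes no value vector, so some probability mass is absorbed instead of being forced onto real tokens. The following is a toy numpy sketch of that idea only (not the vLLM flash-attn3 kernel, and the sink here is a plain scalar rather than a learned per-head parameter):

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    """Softmax over attention scores plus a 'sink' logit.

    The sink competes in the denominator but produces no value output,
    so the attention weights over real tokens can sum to less than 1.
    """
    logits = np.concatenate([scores, [sink_logit]])
    logits -= logits.max()              # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum()
    return probs[:-1]                   # drop the sink slot

scores = np.array([2.0, 1.0, 0.5])
weights = softmax_with_sink(scores, sink_logit=3.0)
print(weights.sum())  # less than 1: part of the mass went to the sink
```

With a very negative sink logit the function degenerates to an ordinary softmax, which is a handy sanity check.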
Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with torchrun --nproc_per_node=4 generate.py on a system with 4 GPUs:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
device_map = {
    "tp_plan": "auto",  # Enable Tensor Parallelism
}
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)
messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())
```
If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!
[!TIP] If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:
```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+   use_kernels=True,
)
messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
[!TIP] MegaBlocks optimized MoE kernels require the model to run on bfloat16, so memory consumption will be higher than running on mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True.
You can use transformers serve to experiment locally with the models, without any other dependencies. You can launch the server with just: transformers serve
To which you can send requests using the Responses API.
```shell
# responses API
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'
```
You can also send requests using the standard Completions API:
```shell
# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'
```
Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.
The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.
Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
The MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang.
MM Grounding DINO improves upon Grounding DINO with a refined contrastive class head and by removing parameter sharing in the decoder, boosting zero-shot detection performance on both COCO (50.6 (+2.2) AP) and LVIS (31.9 (+11.8) val AP and 41.4 (+12.6) minival AP).
You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.
- FastSpeech2Conformer by @bvantuan in #39689
- `classmethod` by @zucchini-nlp in #38812
- [CI] Add Eric to comment slow ci by @vasqu in #39601
- `QAPipelineTests::test_large_model_course` after #39193 by @ydshieh in #39666
- `Glm4MoeModelTest::test_torch_compile_for_training` by @ydshieh in #39670
- `Qwen2AudioForConditionalGeneration.forward()` and `test_flash_attn_kernels_inference_equivalence` by @ebezzam in #39503
- `models/__init__.py` for typo checking by @hebangwen in #39745
- `GemmaIntegrationTest::test_model_2b_bf16_dola` again by @ydshieh in #39731
- `--gpus all` in workflow files by @ydshieh in #39752
- `libcst` to `extras["testing"]` in setup.py by @ydshieh in #39761
- main_classes/peft.md by @luckyvickyricky in #39515
- tvp.md to Korean by @Kim-Ju-won in #39578
- tokenizer.md to Korean by @seopp in #39532
- pipeline_gradio.md to Korean by @AhnJoonSung in #39520
- perf_train_gpu_one.md to Korean by @D15M4S in #39552
- how_to_hack_models.md to Korean by @skwh54 in #39536
- `run_name` when none by @qgallouedec in #39695
- model_results.json by @ydshieh in #39783
- [attn_implementation] remove recursive, allows custom kernels with wrappers by @ArthurZucker in #39823
- `plot_keypoint_matching`, make `visualize_keypoint_matching` as a standard by @sbucaille in #39830
- `TrackioCallback` to work when pynvml is not installed by @qgallouedec in #39851
- `is_wandb_available` function to verify WandB installation by @qgallouedec in #39875
- `sub_configs` by @qubvel in #39855
- `Tokenizer` with `PreTrainedTokenizerFast` in `ContinuousBatchProcessor` by @qgallouedec in #39858
- `torch.backends.cudnn.allow_tf32 = False` for CI by @ydshieh in #39885
- `AutoModelForCausalLM` and `AutoModelForImageTextToText` by @qubvel in #39881
- `ModernBertForMultipleChoice` by @netique in #39232
- [Exaone4] Fixes the attn implementation! by @ArthurZucker in #39906

The following contributors have made significant changes to the library over the last release:
We had quite a lot of bugs that got through! The release was a bit rushed, sorry everyone! 🤗 Mostly cache fixes, as we now have a layered cache, and fixes to distributed training.
In order to become the source of truth, we recognize that we need to address two common and long-heard critiques about transformers:
- transformers is bloated
- transformers is slow

Our team has focused on improving both aspects, and we are now ready to announce this.
The modeling files for the standard Llama models are down to 500 LOC and should be much more readable, keeping just the core of the modeling and hiding the "powerful transformers features."
<img width="1583" height="974" alt="image" src="https://github.com/user-attachments/assets/f1075598-d63e-4184-b3af-c0d4b31cdde5" />
The MoEs are getting some kernel magic, enabling the use of the efficient megablocks kernels, setting a good precedent to allow the community to leverage any of the most powerful kernels developed for quantization as well!
It should also be much more convenient to use with any attention implementation you want. This opens the door to some optimizations such as leveraging flash-attention on Metal (MPS Torch backend).
<img width="2050" height="752" alt="image" src="https://github.com/user-attachments/assets/23ebfb20-7626-46a5-b264-76ffb8b8c811" />
This is but the tip of the iceberg: with the work on kernels that we're heavily pushing forward, expect speed-ups on several backends in the coming months!!
This release also includes the first steps to enabling efficient distributed training natively in transformers. Loading a 100B model takes ~3 seconds on our cluster — we hope this will be the norm for everyone! We are working on distributed checkpointing as well, and want to make sure our API can be easily used for any type of parallelism.
We want the community to benefit from all of the advances, and as always, include all hardware and platforms! We believe the kernels library will give the tools to optimize everything, making a big difference for the industry!
The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes. This model specifically targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.
Other models from the family can be found at Ernie 4.5 MoE.
<div class="flex justify-center"> <img src="https://ernie.baidu.com/blog/posts/ernie4.5/overview.png"/> </div>

- [Ernie 4.5] Add ernie text models by @vasqu in #39228

Voxtral is an upgrade of Ministral 3B and Mistral Small 3B, extending its language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
You can read more in Mistral's release blog post.
The model is available in two checkpoints:
Voxtral builds on Ministral-3B by adding audio processing capabilities:
LFM2 represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.
The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.
The DeepSeek-V2 model was proposed in DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by DeepSeek-AI Team.
The model uses Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture for efficient inference and cost-effective training. After pre-training on 8.1 trillion tokens, the model went through Supervised Fine-Tuning and Reinforcement Learning stages and can be used for various language tasks.
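For intuition, MLA's core trick is caching a small low-rank latent per position instead of the full keys and values, which are re-materialized on demand via up-projections. Below is a toy numpy sketch of that compression with made-up dimensions (illustrative only, not the actual DeepSeek-V2 implementation, which also splits off a RoPE branch):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 256, 8, 32, 16   # hypothetical sizes

# Down-projection to a shared low-rank latent, and per-use up-projections
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)

hidden = rng.standard_normal((128, d_model))          # 128 cached positions

latent_cache = hidden @ W_down                        # this is all that gets cached
k = latent_cache @ W_up_k                             # re-materialized on demand
v = latent_cache @ W_up_v

naive_cache = 128 * n_heads * d_head * 2              # full K and V entries
print(latent_cache.size / naive_cache)                # 0.03125, i.e. 32x smaller
```

The ratio shows why MLA shrinks the KV cache so dramatically: only the latent is stored per position.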
ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
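For intuition on one of those improvements: rotary position embeddings encode position by rotating query/key feature pairs, so attention scores depend only on the relative offset between tokens. A minimal numpy sketch of this property (illustrative, not the ModernBERT implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector of even length."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation speeds
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # 2D rotation of each (x1[i], x2[i]) pair by its own angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# The q.k score depends only on the relative offset between positions
s1 = rope(q, 10) @ rope(k, 3)      # offset 7
s2 = rope(q, 107) @ rope(k, 100)   # offset 7, shifted far along the sequence
print(np.isclose(s1, s2))  # True
```

This relative-position property is what lets RoPE models extrapolate the same attention pattern anywhere in a long sequence.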
The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.
<div style="text-align: center;"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/eomt_architecture.png" alt="drawing" width="500"/> </div>

Doge is a series of small language models based on the Doge architecture, which aims to combine the advantages of state-space and self-attention algorithms. It calculates dynamic masks from cached value states using the zero-order hold method, addressing the problem of existing mainstream language models getting lost in context. It uses the wsd_scheduler scheduler to pre-train on the smollm-corpus, and can continue training on new datasets or add sparse activation feedforward networks from stable-stage checkpoints.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/model_doc/doge_architecture.png" alt="drawing" width="600"/>
The AIMv2 model was proposed in Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
The abstract from the paper is the following:
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of a vision encoder with a small scale (<8B parameters) LLM decoder.
The EfficientLoFTR model was proposed in Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.
This model consists of matching two images together by finding pixel correspondences. It can be used to estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
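As a sketch of what such pixel correspondences enable, here is a minimal Direct Linear Transform (DLT) homography fit from four noise-free point pairs (illustrative numpy code for the downstream geometry, unrelated to the model's internals):

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate the 3x3 homography mapping src -> dst via the DLT algorithm."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H's entries
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.array(A))
    H = vt[-1].reshape(3, 3)        # null-space vector of the constraint matrix
    return H / H[2, 2]

# Ground-truth homography, recovered exactly from 4 noise-free correspondences
H_true = np.array([[1.2, 0.1, 3.0], [-0.2, 0.9, 1.0], [0.001, 0.002, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
pts = np.c_[src, np.ones(4)] @ H_true.T
dst = pts[:, :2] / pts[:, 2:]

H = fit_homography(src, dst)
print(np.allclose(H, H_true, atol=1e-6))  # True
```

With real, noisy matches one would use many correspondences inside a RANSAC loop, but the linear core is the same.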
Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.
Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.
The xLSTM model was proposed in xLSTM: Extended Long Short-Term Memory by Maximilian Beck*, Korbinian Pöppel*, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter and Sepp Hochreiter. xLSTM updates the original LSTM architecture to be competitive with Transformer models by introducing exponential gating, matrix memory expansion, and parallelizable training and ingestion.
The 7B model variant was trained by the xLSTM team Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick Blies, Sebastian Böck and Sepp Hochreiter at NXAI.
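The exponential gating can be sketched as a toy scalar update in the spirit of the paper's sLSTM cell, with the standard log-space stabilizer that keeps the exponentials from overflowing (an illustrative sketch, not the NXAI 7B implementation):

```python
import numpy as np

def slstm_step(c, n, m, z, i_tilde, f_tilde):
    """One scalar sLSTM-style update with exponential gating.

    m is a running log-scale stabilizer so exp() never overflows;
    c is the cell state and n a normalizer for the output.
    """
    m_new = max(f_tilde + m, i_tilde)
    i = np.exp(i_tilde - m_new)          # stabilized exponential input gate
    f = np.exp(f_tilde + m - m_new)      # stabilized forget gate
    c_new = f * c + i * z
    n_new = f * n + i
    h = c_new / n_new                    # normalized hidden output
    return c_new, n_new, m_new, h

c, n, m = 0.0, 0.0, -np.inf
for z, it, ft in [(1.0, 50.0, 2.0), (2.0, 60.0, 3.0), (0.5, 55.0, 1.0)]:
    c, n, m, h = slstm_step(c, n, m, z, it, ft)
print(h)  # stays finite despite huge gate pre-activations
```

Without the stabilizer `m`, gate pre-activations of 50-60 would overflow `exp()` immediately; this is the trick that makes exponential gating practical.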
EXAONE 4.0 is a language model that integrates a non-reasoning mode and a reasoning mode, combining the excellent usability of EXAONE 3.5 with the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.
The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.
We've added Expert Parallel support for Llama4; the next release will include it for all models! You can simply set a distributed_config with enable_expert_parallel=True. This enables efficient training of sparse Mixture-of-Experts (MoE) models across multiple devices: each expert in the MoE layer runs in parallel (instead of the previous TP approach, which requires more communication), significantly improving scalability and memory efficiency.
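The idea can be sketched in a few lines of numpy: each "device" owns one complete expert and only processes the tokens routed to it, rather than every device holding a shard of every weight (a single-process simulation of the dispatch/combine pattern, not the transformers implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, n_experts = 16, 8, 4

tokens = rng.standard_normal((n_tokens, d))
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
router_logits = tokens @ rng.standard_normal((d, n_experts))
assignment = router_logits.argmax(axis=-1)     # top-1 routing

# Expert parallelism: each "device" owns one whole expert and processes
# only the tokens routed to it (all-to-all dispatch, then combine).
out = np.zeros_like(tokens)
for e in range(n_experts):                     # conceptually runs concurrently
    mask = assignment == e
    out[mask] = tokens[mask] @ experts[e]      # local matmul, no weight sharding

print(out.shape)
```

Because each expert's weights live whole on one device, the only communication needed is moving token activations, which is what makes EP cheaper than TP for sparse MoE layers.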
FP-Quant is a quantization method optimized for Blackwell-generation Nvidia GPUs, supporting efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs using MXFP4 and NVFP4 formats.
Currently, only PTQ with MXFP4 is available. You can quantize models on-the-fly using transformers:
```python
import torch
from transformers import AutoModelForCausalLM, FPQuantConfig

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen3-8B",
    quantization_config=FPQuantConfig(),
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
```
FP-Quant requires a Blackwell GPU and runs via the QuTLASS library. No Blackwell GPU? Use FPQuantConfig(pseudoquant=True) to emulate quantization (no QuTLASS needed).
The following results show the inference speedup of QuTLASS MXFP4 over PyTorch BF16 in Transformers. MXFP4 gives consistent speedups across all batch sizes, reaching up to 4× faster at larger scales.
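For intuition about where those speedups come from: MXFP4 stores each block of values as 4-bit FP4 (E2M1) codes sharing one power-of-two scale, so weights shrink roughly 4x versus bf16. The following is a toy pseudo-quantization round-trip in numpy (illustrative only, not the QuTLASS kernels; real MXFP4 packs the 4-bit codes and E8M0 scales into bytes):

```python
import numpy as np

# The values representable in FP4 E2M1: sign x {0, .5, 1, 1.5, 2, 3, 4, 6}
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1 = np.concatenate([E2M1, -E2M1])

def mxfp4_pseudoquant(x, block=32):
    """Toy round-trip: per-block power-of-two scale + nearest FP4 value."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared E8M0-style scale: smallest power of two with amax <= 6 * scale
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0 + 1e-30))
    idx = np.abs(x / scale - E2M1[:, None, None]).argmin(axis=0)
    return (E2M1[idx] * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(128).astype(np.float32)
wq = mxfp4_pseudoquant(w)
print(np.abs(w - wq).max())   # small but nonzero quantization error
```

This is essentially what `FPQuantConfig(pseudoquant=True)` emulates: the numerics of the 4-bit round-trip without the packed storage or fast kernels.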
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/qwen3-8b-end-to-end-prefill-speedup-mxfp4-vs-bf16-on-rtx5090.svg" alt="drawing" width="600">

The kernels project aims to become the single trusted source for high-performance kernels in the Transformers ecosystem. We're working toward centralizing all kernels on the Hub, so updates, bug fixes, and improvements can happen in one place—no more scattered repos and no compilation headaches!
You can already try it out today by setting use_kernels=True in from_pretrained. Any contributor can build their kernel, register it and use it right away—no extra setup, more on this here
Even better: want to use Flash Attention 3? No need to deal with tricky compilation and missing symbols issues! Just drop in:
```python
model.set_attn_implementation("kernels-community/flash-attn3")
```
This automatically fetches the right build for your setup (e.g. CUDA and PyTorch versions).
We’re also teaming up with amazing kernel devs from Unsloth, Liger, vLLM, and more to bring their work directly to the Hub—making it easier than ever to access amazing performance with a single line of code.
https://github.com/user-attachments/assets/9928f62b-543c-4b8a-b81b-4a6e262c229e
Over the past few months, we have been putting more and more functionality in the transformers chat utility, which offers a CLI-based app to chat with chat models. We've chosen to push this further by splitting the backend of transformers chat in a new, separate utility called transformers serve.
This is ideal for experimentation purposes, or to run models locally for personal and private use. It does not aim to compete with dedicated inference engines such as vLLM or SGLang.
Models of diverse modalities supported by transformers may be served with the transformers serve CLI. It spawns a local server that offers compatibility with the OpenAI SDK, which is the de-facto standard for LLM conversations and other related tasks. This way, you can use the server from many third party applications, or test it using the transformers chat CLI (docs).
The server supports the following REST APIs:
- /v1/chat/completions
- /v1/responses
- /v1/audio/transcriptions
- /v1/models

Relevant commits:
- transformers chat and transformers serve by @LysandreJik in #38443
- transformers serve by @LysandreJik in #39149
- generation_config by @gante in #39230
- transformers serve by @LysandreJik in #39155
- (/v1/audio/transcriptions) by @gante in #39434

Significant refactors have been underway in transformers, aiming to reduce the complexity of the code. One metric we track to see how the refactors impact our code is the number of lines in a given model file; we try to reduce it as much as possible, while keeping everything related to the forward pass and model definition in that file.
See the evolution here:
<img width="1200" height="600" alt="image" src="https://github.com/user-attachments/assets/c232bc8d-7d7c-4192-baa8-a60efe5eb2ff" />

Some notable refactors:
KV caches are now defined per layer, enabling new hybrid caches that mix different attention types. CacheProcessors also encapsulate cache quantization and offloading, making them easy to customize.
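The per-layer design can be pictured with a toy structure in which each layer owns its own storage policy, e.g. full attention next to a sliding window (hypothetical names for illustration, not the actual transformers Cache API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LayerCache:
    """Toy per-layer KV cache: each layer owns its own storage policy."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)
    window: Optional[int] = None       # sliding-window layers keep only a suffix

    def update(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if self.window is not None:
            self.keys = self.keys[-self.window:]
            self.values = self.values[-self.window:]
        return self.keys, self.values

# A hybrid cache mixing a full-attention layer and a sliding-window layer
cache = [LayerCache(), LayerCache(window=4)]
for step in range(10):
    for layer in cache:
        layer.update(f"k{step}", f"v{step}")

print(len(cache[0].keys), len(cache[1].keys))  # 10 4
```

Because each layer is self-contained, policies like quantization or offloading can be swapped in per layer without touching the others, which is what the CacheProcessor abstraction enables.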
`output_attentions` / `output_hidden_states`

Such attributes require very specific handling within the forward call, while they're not important for understanding how the model works. We removed that code but kept the functionality by providing a better utility to handle it.
We refactor the way to explicitly set the attention implementation so that it has a method dedicated to it.
- `average_tokens_across_devices` by default in `TrainingArguments` by @Krish0909 in #39395
- [Flex Attn] Fix torch 2.5.1 incompatibilities by @vasqu in #37406
- `test_compare_unprocessed_logit_scores` by @ydshieh in #39053
- `t5gemma` tests by @ydshieh in #39052
- `layoutlmv3` tests by @ydshieh in #39050
- `Gemma3nProcessorTest` by @ydshieh in #39068
- `mistral3` tests by @ydshieh in #38989
- `dots1` tests by @ydshieh in #39088
- `test_is_split_into_words` in test_pipelines_token_classification.py by @st81 in #39079
- `test_sdpa_can_dispatch_on_flash` by @ydshieh in #39092
- `@lru_cache()` to `@lru_cache` to match styles from #38883 by @rasmi in #39093
- `run-slow` by @ydshieh in #39100
- `llama` tests by @ydshieh in #39161
- [Dia] Change ckpt path in docs by @vasqu in #39181
- `from_pretrained` by @qubvel in #39184
- `fastspeech2_conformer` tests by @ydshieh in #39229
- `is not None` -> `isinstance(..., dict)` by @qubvel in #39145
- `segmentation_maps` support to `MobileNetV2ImageProcessor` by @simonreise in #37312
- tests/generation/test_utils.py by @ydshieh in #39254
- `test_eager_matches` sdpa generate and update an integration test for blip-like models by @ydshieh in #39248
- `smollm3` by @gante in #39271
- `PretrainedConfig.__init__` method to make it more explicit by @qubvel in #39158
- `test_generate_compile_model_forward` by @ydshieh in #39276
- datasets 4.0 by @lhoestq in #39156
- `aria` tests by @ydshieh in #39277
- `test_torchscript_*` for now until the majority of the community ask for it by @ydshieh in #39307
- stevhliu to the list in self-comment-ci.yml by @ydshieh in #39315
- `src/` for doctest (for now) by @ydshieh in #39316
- `max_length_q` and `max_length_k` types to `flash_attn_varlen_func` by @HollowMan6 in #37206
- `phi3` tests by @ydshieh in #39312
- `position_ids` in `masking_utils` by @Cyrilvallez in #39310
- `test_sdpa_can_dispatch_on_flash` by @ydshieh in #39259
- `timm` (for perception_lm) by @ydshieh in #39380
- `/v1/models` output payload by @alvarobartt in #39414
- `set_tracer_provider` and `set_meter_provider` calls by @McPatate in #39422
- `JetMoeForCausalLM` by @Phoenix-Shen in #37830
- `ContinuousBatchProcessor` by @qgallouedec in #39372
- [CI] Fix partially red CI by @vasqu in #39448
- `GemmaIntegrationTest::test_model_2b_bf16_dola` by @ydshieh in #39362
- datasets pin by @gante in #39500
- args_doc.py to auto_docstring.py by @yonigozlan in #39439
- `_supports_flash_attn_2` in examples and tests by @zucchini-nlp in #39471
- `TypeError` instead of `ValueError` for invalid types by @Sai-Suraj-27 in #38660
- `MambaCache` to modeling_mamba.py by @manueldeprada in #38086
- perf_infer_gpu_multi.md to Korean by @luckyvickyricky in #39441
- [CI] Fix post merge ernie 4.5 by @vasqu in #39561
- docs/source/ko/_toctree.yml by @jungnerd in #39516
- `supports_static_cache` to `can_compile_fullgraph` by @zucchini-nlp in #39505
- `device_mesh` have multiple dim by @S1ro1 in #38949
- `test_export_static_cache` by @gante in #39662
- [Ernie 4.5] Post merge adaptations by @vasqu in #39664
- `kyutai` tests by @ydshieh in #39416
- `typing.Literal` as type of tool parameters or return value by @grf53 in #39633

The following contributors have made significant changes to the library over the last release:
- `segmentation_maps` support to `MobileNetV2ImageProcessor` (#37312)
- docs/source/ko/_toctree.yml (#39516)

Two new models are added to transformers: Ernie 4.5, and its MoE variant, Ernie 4.5 MoE.
They are added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-Ernie-4.5-preview.
In order to install this version, please install with the following command:
```shell
pip install git+https://github.com/huggingface/transformers@v4.53.2-Ernie-4.5-preview
```
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Ernie-4.5 models. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.
The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes.
This model specifically targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.
This model specifically targets the text models with mixture of experts (MoE): one with 21B total and 3B active parameters, and another with 300B total and 47B active parameters. They use the standard Llama architecture at their core, combined with a specialized MoE based on Mixtral with additional shared experts.
Ernie-4.5 can be found on the Hugging Face Hub.
Generating text with Ernie:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baidu/ERNIE-4.5-0.3B-PT"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# prepare the model input
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# decode the generated ids
generate_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(generate_text)
```
See below for an example leveraging the MoE variant:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "baidu/ERNIE-4.5-21B-A3B-PT"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
torch_dtype=torch.bfloat16,
)
# prepare the model input
inputs = tokenizer("Hey, are you conscious? Can you talk to me?", return_tensors="pt")
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# decode the generated ids
generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(generated_text)
A small patch for OpenTelemetry fixes! Sorry for the delay!
* refactor: remove set_tracer_provider and set_meter_provider calls (https://github.com/huggingface/transformers/pull/39422) from @McPatate
A new model is added to transformers: ModernBERT Decoder
It is added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-modernbert-decoder-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.53.2-modernbert-decoder-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.
As the tag implies, this tag is a preview of the ModernBERT Decoder model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.
ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
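The causal (unidirectional) attention that separates the decoder from the bidirectional encoder can be illustrated with a minimal mask builder in plain Python; this only shows the masking pattern, not the full attention computation:

```python
def causal_mask(seq_len):
    # True means "may attend": position i sees only positions j <= i,
    # unlike a bidirectional encoder where every position sees all others.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print(["x" if ok else "." for ok in row])
```

Each generated token can therefore condition only on its prefix, which is what makes autoregressive generation possible with the otherwise-unchanged ModernBERT architecture.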
ModernBERT Decoder can be found on the Hugging Face Hub.
Using pipeline:
import torch
from transformers import pipeline
generator = pipeline(
task="text-generation",
model="blab-jhu/test-32m-dec",
torch_dtype=torch.float16,
device=0
)
generator("The future of artificial intelligence is", max_length=50, num_return_sequences=1)
# For sequence classification
classifier = pipeline(
task="text-classification",
model="blab-jhu/test-32m-dec",
torch_dtype=torch.float16,
device=0
)
classifier("This movie is really great!")
Using AutoModel:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("blab-jhu/test-32m-dec")
model = AutoModelForCausalLM.from_pretrained(
"blab-jhu/test-32m-dec",
torch_dtype=torch.float16,
device_map="auto",
)
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_length=50,
num_return_sequences=1,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")
# For sequence classification
from transformers import AutoModelForSequenceClassification
classifier_model = AutoModelForSequenceClassification.from_pretrained(
"blab-jhu/test-32m-dec",
torch_dtype=torch.float16,
device_map="auto",
num_labels=2
)
text = "This movie is really great!"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = classifier_model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1)
print(f"Predicted class: {predicted_class.item()}")
print(f"Prediction probabilities: {predictions}")
Using the transformers CLI:
echo "The future of artificial intelligence is" | transformers run --task text-generation --model your-username/modernbert-decoder-base --device 0
This patch contains the following bug fixes:
* smollm3 (#39271)
* position_ids in masking_utils (#39310)
This patch contains several bug fixes. The following commits are included:
Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.
Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.
from transformers import pipeline
import torch
pipe = pipeline(
"image-text-to-text",
torch_dtype=torch.bfloat16,
model="google/gemma-3n-e4b",
device="cuda",
)
output = pipe(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
text="<image_soft_token> in this image, there is"
)
print(output)
Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs. It can generate highly realistic dialogue from a transcript, including nonverbal communication such as laughter and coughing. Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning).
Model Architecture: Dia is an encoder-decoder transformer based on the original transformer architecture. However, some more modern features such as rotary positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is utilized, while for the audio portion (decoder), a pretrained codec model, DAC, is used - DAC encodes speech into discrete codebook tokens and decodes them back into audio.
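Byte-level tokenization of the kind used for the text encoder can be sketched in a few lines of plain Python; this is a generic illustration of the idea, not Dia's exact tokenizer:

```python
def byte_tokenize(text: str) -> list[int]:
    # Every UTF-8 byte becomes one token id (0-255): a 256-entry vocabulary
    # with no out-of-vocabulary text, at the cost of longer sequences.
    return list(text.encode("utf-8"))

def byte_detokenize(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

ids = byte_tokenize("(laughs) Hi!")
print(ids)
print(byte_detokenize(ids))
```

A byte vocabulary keeps the text side trivially robust to transcript annotations like "(laughs)", since any string round-trips losslessly.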
Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, paired with a Moshi-like autoregressive decoder. Kyutai has released two model checkpoints:
Read more about the model in the documentation
V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
Read more about the model in the documentation.
Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.
The Arcee model is architecturally similar to Llama but uses x * relu(x) in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
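As a minimal sketch of the activation (not Arcee's actual module), x * relu(x) coincides with relu(x)² for every input, since both vanish when x <= 0:

```python
def relu2(x: float) -> float:
    # ReLU-squared: x * relu(x); zero for x <= 0, x**2 for x > 0.
    return x * max(x, 0.0)

print(relu2(3.0))   # 9.0
print(relu2(-2.0))  # 0.0 -- negative inputs are fully gated, as with ReLU
```

In the model this simply replaces SiLU inside the otherwise-standard Llama MLP block.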
Read more about the model in the documentation.
ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
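The pairwise late-interaction (MaxSim) scoring can be sketched in plain Python; the tiny 2-d embeddings below are purely illustrative stand-ins for real query-token and page-patch vectors:

```python
def maxsim_score(query_embs, doc_embs):
    # For each query vector, take the maximum dot product over all document
    # vectors, then sum these maxima: each query token "picks" its best match.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

query = [[1.0, 0.0], [0.0, 1.0]]             # two query-token embeddings
doc = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]   # three page-patch embeddings
print(maxsim_score(query, doc))  # 1.0 + 2.0 = 3.0
```

Because each query token is matched independently, the score stays sensitive to fine-grained visual details (a table cell, a chart label) rather than one pooled page vector.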
Read more about the model in the documentation.
MiniMax is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the long-context capabilities of the model, MiniMax adopts a hybrid architecture that combines Lightning Attention, softmax attention, and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), MiniMax extends its training context length to 1 million tokens and can handle contexts of up to 4 million tokens during inference. On various academic benchmarks, MiniMax also demonstrates the performance of a top-tier model.
The architecture of MiniMax is briefly described as follows:
For more details refer to the release blog post.
Read more about the model in the documentation.
T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large language models, developed by adapting pretrained decoder-only models into encoder-decoder ones. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on the transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.
T5Gemma has two groups of model sizes: 1) Gemma 2 sizes (2B-2B, 9B-2B, and 9B-9B), which are based on the official Gemma 2 models (2B and 9B); and 2) T5 sizes (Small, Base, Large, and XL), which are pretrained under the Gemma 2 framework following the T5 configuration. In addition, we also provide a model at ML size (medium-large, ~2B in total), which sits between T5 Large and T5 XL.
The pretrained variants are trained with two objectives: prefix language modeling with knowledge distillation (PrefixLM) and UL2, separately. We release both variants for each model size. The instruction-tuned variants were post-trained with supervised fine-tuning and reinforcement learning.
Read more about the model in the documentation.
The GLM-4.1V model architecture is added to transformers; no models have yet been released with that architecture. Stay tuned for the GLM team's upcoming releases!
Read more about the model in the documentation.
The FalconH1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series in this website.
Read more about the model in the documentation.
The LightGlue model was proposed in LightGlue: Local Feature Matching at Light Speed by Philipp Lindenberger, Paul-Edouard Sarlin and Marc Pollefeys.
Similar to SuperGlue, this model matches two sets of local features extracted from two images; its goal is to be faster than SuperGlue. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching and homography estimation.
The abstract from the paper is the following:
We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. Cumulatively, they make LightGlue more efficient - in terms of both memory and computation, more accurate, and much easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like 3D reconstruction. The code and trained models are publicly available at this https URL
Read more about the model in the documentation.
The abstract from the report is the following:
Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.
Read more about the model in the documentation.
SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a transformer decoder architecture with Grouped Query Attention (GQA) to reduce the KV cache size, and NoPE (rotary position embeddings removed in some layers), enabling improved performance on long-context tasks. It is trained using a multi-stage training approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very large context lengths. The instruct variant is optimized for reasoning and tool use.
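A back-of-the-envelope calculation shows why GQA shrinks the KV cache: each key/value head is shared by a group of query heads, so fewer KV heads need to be cached. The layer and head counts below are illustrative assumptions, not SmolLM3's actual configuration:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values are each cached as [num_kv_heads, seq_len, head_dim]
    # per layer; bytes_per_elem=2 assumes bf16/fp16 storage.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative config: 32 layers, head_dim 128, 64K-token context.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=65536)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=4, head_dim=128, seq_len=65536)
print(mha // gqa)  # cutting KV heads 32 -> 4 shrinks the cache 8x
```

At long context lengths the KV cache, not the weights, dominates memory, which is why this reduction matters for deployment.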
Read more about the model in the documentation.
In previous versions, installing the kernels library would automatically activate the custom kernels added to transformers, because the @use_kernel_forward_from_the_hub decorator directly swapped out the model's forward method. This implicit behavior caused several issues for users, including problems with torch.compile, non-determinism, and inconsistent outputs.
To address this, we've introduced a new opt-in mechanism called kernelize. You can now enable kernel usage explicitly by passing use_kernels=True to from_pretrained. The use_kernel_forward_from_the_hub decorator now simply stores the kernel name that the user wants to use, and kernelize handles the rest under the hood.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-1B-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda",
use_kernels=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
prompt = "Hello"
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device).input_ids
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
More kernels will be added over time — this will be a collaborative, community-driven effort to make transformers lighter and faster 🤗
Support for Flash Attention 3 is added across the most popular models.
Several efforts refactoring the repository are happening in parallel. The direction is to greatly simplify the library, removing unnecessary codepaths. While the efforts are spread across the library, they're particularly visible in individual models, where non-modeling-specific code will be simplified and eventually removed.
We operate under the assumption that model-agnostic utilities shouldn't be in the modeling code. Things like attention outputs, hidden states, and router logits are important for end users but don't need to be explicitly displayed in the modeling code.
Several minimal breaking changes aiming to bring clearer defaults while greatly simplifying the library have been merged.
* dtype for pipelines to auto by @Vaibhavs10 in #38882
* output_attentions=True and the attn implementation is wrong by @ArthurZucker in #38288
* [Attention] Refactor Attention Interface for Bart-based Models by @vasqu in #38108
* [Attention] Attention refactor for Whisper-based models by @vasqu in #38235
* [compile] re-enable for Qwen-VL models by @zucchini-nlp in #38127
* forced_decoder_ids by @gante in #38232
* liger-kernel to docker file by @ydshieh in #38292
* transformers env output by @yao-matrix in #38274
* forced_decoder_ids deletion by @gante in #38316
* beam_indices by @gante in #38259
* custom_generate and trust_remote_code by @gante in #38304
* vasqu to self-comment-ci.yml by @ydshieh in #38324
* [FlexAttention] Reenable flex for encoder-decoder and make the test more robust by @vasqu in #38321
* kernels for AMD docker images by @ydshieh in #38354
* [OPT] Fix attention scaling by @vasqu in #38290
* get_default_device for torch<2.3 by @Cyrilvallez in #38376
* utils/notification_service.py by @ydshieh in #38379
* initialize_weights by @Cyrilvallez in #38382
* tokenizer -> tokenize by @foldl in #38357
* generation_config.json as base parameterization by @gante in #38330
* test_offloaded_cache_implementation by @gante in #37896
* pixel_values with inputs_embeds by @dxoigmn in #38334
* CsmForConditionalGenerationIntegrationTest by @ydshieh in #38424
* huggingface/transformers by @ydshieh in #38413
* from_pretrained by @pstjohn in #38155
* from_args_and_dict ProcessorMixin by @yonigozlan in #38296
* microsoft/python-type-stubs (post dropping support for Python 3.8) by @Avasam in #38335
* BatchFeature and BatchEncoding by @lgeiger in #38459
* Gemma3IntegrationTest by @ydshieh in #38471
* SinkCache to a custom_generate repo by @gante in #38399
* Gemma2IntegrationTest by @ydshieh in #38492
* av by @ydshieh in #38548
* python3 by @S1ro1 in #38555
* utils/notification_service.py by @ydshieh in #38556
* chameleon tests by @ydshieh in #38565
* utils/notification_service.py for AMD vs Nvidia by @ydshieh in #38563
* deepseekv3 by @ydshieh in #38562
* [FlexAttn] Fix models with unique characteristics by @vasqu in #38433
* repository field to benchmarks table by @McPatate in #38582
* mlm_probability to be set to None when mlm=False in DataCollatorForLanguageModeling by @KameniAlexNea in #38522
* isort from dependencies by @Sai-Suraj-27 in #38616
* return_dict=False giving errors in a few VLM models by @ydshieh in #38519
* MiniMax (docs and integration tests checkpoint) by @geetu040 in #38575
* test_initialization by @ydshieh in #38607
* ColQwen2ModelIntegrationTest by @ydshieh in #38583
* test_initialization for SwiftFormer by @ydshieh in #38636
* AriaForConditionalGenerationModelTest on CircleCI by @ydshieh in #38615
* InternVL integration test by @ydshieh in #38612
* aya_vision test by @ydshieh in #38674
* is_bitsandbytes_available() by @ved1beta in #38528
* llava tests by @ydshieh in #38722
* None instead of try/except by @zucchini-nlp in #38561
* average_tokens_across_devices=True and world size = 1 by @qgallouedec in #38785
* qwen_2_5 omni by @ydshieh in #38658
* llava_onevision tests by @ydshieh in #38791
* mllama by @ydshieh in #38704
* low_cpu_mem_usage by @Cyrilvallez in #38792
* llava_next tests by @ydshieh in #38813
* wandb.run.url instead of wandb.run.get_url() (deprecated) by @qgallouedec in #38817
* align_to_words=True in QuestionAnsweringPipeline can lead to duplicate answers by @yushi2006 in #38761
* qwen2_5_vl tests by @ydshieh in #38845
* auxiliary_in_channels default behavior in UperNet by @simonreise in #37540
* qwen3 tests by @ydshieh in #38862
* phi4_multimodal tests by @ydshieh in #38816
* qwen3_moe tests by @ydshieh in #38865
* raise from e in hub.py utility by @Wauplin in #37241
* fsmt tests by @ydshieh in #38904
* FalconMambaIntegrationTests by @ydshieh in #38566
* ALL_LAYERNORM_LAYERS by @Cyrilvallez in #38922
* test_initialization by @ydshieh in #38932
* mistral and mistral3 tests by @ydshieh in #38978
* is_split_into_words in the TokenClassificationPipeline by @yushi2006 in #38818
* rag by @ydshieh in #38585
* [Attention] Small fix on output attentions by @vasqu in #38948
* require_tf by @gante in #38944

The following contributors have made significant changes to the library over the last release:
* liger-kernel to docker file (#38292)
* vasqu to self-comment-ci.yml (#38324)
* kernels for AMD docker images (#38354)
* utils/notification_service.py (#38379)
* CsmForConditionalGenerationIntegrationTest (#38424)
* huggingface/transformers (#38413)
* Gemma3IntegrationTest (#38471)
* Gemma2IntegrationTest (#38492)
* av (#38548)
* utils/notification_service.py (#38556)
* chameleon tests (#38565)
* utils/notification_service.py for AMD vs Nvidia (#38563)
* deepseekv3 (#38562)
* return_dict=False giving errors in a few VLM models (#38519)
* test_initialization (#38607)
* ColQwen2ModelIntegrationTest (#38583)
* test_initialization for SwiftFormer (#38636)
* AriaForConditionalGenerationModelTest on CircleCI (#38615)
* InternVL integration test (#38612)
* aya_vision test (#38674)
* llava tests (#38722)
* qwen_2_5 omni (#38658)
* llava_onevision tests (#38791)
* mllama (#38704)
* llava_next tests (#38813)
* qwen2_5_vl tests (#38845)
* qwen3 tests (#38862)
* phi4_multimodal tests (#38816)
* qwen3_moe tests (#38865)
* fsmt tests (#38904)
* FalconMambaIntegrationTests (#38566)
* test_initialization (#38932)
* mistral and mistral3 tests (#38978)
* rag (#38585)
* output_attentions=True and the attn implementation is wrong (#38288)
* transformers env output (#38274)
* [Attention] Refactor Attention Interface for Bart-based Models (#38108)
* [FlexAttention] Reenable flex for encoder-decoder and make the test more robust (#38321)
* [OPT] Fix attention scaling (#38290)
* [Attention] Attention refactor for Whisper-based models (#38235)
* [FlexAttn] Fix models with unique characteristics (#38433)
* [Attention] Small fix on output attentions (#38948)
* microsoft/python-type-stubs (post dropping support for Python 3.8) (#38335)
* MiniMax (docs and integration tests checkpoint) (#38575)