Hugging Face/Transformers

Oct 3, 2025
v4.57.0: Qwen3-Next, Vault Gemma, Qwen3 VL, LongCat Flash, Flex OLMO, LFM2 VL, BLT, Qwen3 OMNI MoE, Parakeet, EdgeTAM, OLMO3

New model additions

Qwen3 Next

<img width="1200" height="511" alt="image" src="https://github.com/user-attachments/assets/3abad6c4-5650-412d-a831-f8a30a5d962e" />

The Qwen3-Next series represents the Qwen team's next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency. The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:

  • Hybrid Attention: Replaces standard attention with a combination of Gated DeltaNet and Gated Attention, enabling efficient context modeling.
  • High-Sparsity MoE: Achieves an extremely low activation ratio of 1:50 in MoE layers — drastically reducing FLOPs per token while preserving model capacity.
  • Multi-Token Prediction (MTP): Boosts pretraining performance and accelerates inference.
  • Other Optimizations: Includes techniques such as zero-centered and weight-decayed layernorm, Gated Attention, and other stabilizing enhancements for robust training.

Built on this architecture, they trained and open-sourced Qwen3-Next-80B-A3B — 80B total parameters, only 3B active — achieving extreme sparsity and efficiency.

Despite its ultra-efficiency, it outperforms Qwen3-32B on downstream tasks — while requiring less than 1/10 of the training cost. Moreover, it delivers over 10x higher inference throughput than Qwen3-32B when handling contexts longer than 32K tokens.

For more details, please see the Qwen3-Next blog post.

  • Adding Support for Qwen3-Next by @bozheng-hit in #40771
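The high-sparsity routing described above can be sketched with a toy top-k router in plain Python. The expert count and top-k below are illustrative assumptions chosen to give a roughly 1:50 activation ratio, not the model's actual configuration:

```python
import math

def route_tokens(scores, k):
    """Pick the top-k experts for one token from router scores (toy example)."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Illustrative numbers only: with 512 routed experts and top-10 routing,
# each token activates ~2% of the expert pool, in the spirit of
# Qwen3-Next's high-sparsity MoE (real configs may differ).
num_experts, top_k = 512, 10
scores = [math.sin(i * 0.37) for i in range(num_experts)]  # fake router logits
active = route_tokens(scores, top_k)
print(len(active), len(active) / num_experts)
```

Only the selected experts run their FFN for that token, which is where the per-token FLOP savings come from.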

Vault Gemma

<img width="1282" height="392" alt="image" src="https://github.com/user-attachments/assets/9412905b-4083-4994-9000-aa0dbf97eb6f" />

VaultGemma is a text-only decoder model derived from Gemma 2. Notably, it drops the norms after the attention and MLP blocks, and uses full attention in all layers instead of alternating between full attention and local sliding attention. VaultGemma is available as a pretrained model with 1B parameters and a 1024-token sequence length.

VaultGemma was trained from scratch with sequence-level differential privacy (DP). Its training data includes the same mixture as the Gemma 2 models, consisting of a number of documents of varying lengths. Additionally, it is trained using DP stochastic gradient descent (DP-SGD) and provides a (ε ≤ 2.0, δ ≤ 1.1e-10)-sequence-level DP guarantee, where a sequence consists of 1024 consecutive tokens extracted from heterogeneous data sources. Specifically, the privacy unit of the guarantee is for the sequences after sampling and packing of the mixture.

  • add: differential privacy research model by @RyanMullins in #40851
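The DP-SGD recipe mentioned above (per-example gradient clipping plus calibrated Gaussian noise) can be sketched in a few lines of plain Python. The clip norm and noise multiplier here are arbitrary illustrative values, not VaultGemma's actual hyperparameters:

```python
import math
import random

def dp_sgd_step(per_example_grads, clip_norm, noise_multiplier, rng):
    """One DP-SGD aggregation: clip each per-example gradient to clip_norm,
    sum, add Gaussian noise scaled to the clip norm, then average."""
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for g in per_example_grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / (norm + 1e-12))  # clip, never up-scale
        for j in range(dim):
            total[j] += g[j] * scale
    sigma = noise_multiplier * clip_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return [x / len(per_example_grads) for x in noisy]

rng = random.Random(0)
grads = [[rng.uniform(-1, 1) for _ in range(4)] for _ in range(8)]
update = dp_sgd_step(grads, clip_norm=1.0, noise_multiplier=0.5, rng=rng)
print(update)
```

Clipping bounds each sequence's influence on the update, and the noise gives the (ε, δ) guarantee via the DP accounting.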

Qwen3 VL

<img width="3544" height="1886" alt="image" src="https://github.com/user-attachments/assets/5afa70cb-506e-4d56-baa3-30e7522ac653" />

Qwen3-VL is a multimodal vision-language model series, encompassing both dense and MoE variants, as well as Instruct and Thinking versions.

Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure text capabilities. Key architectural advancements include: enhanced MRope with interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based time alignment—evolving from T-RoPE to text timestamp alignment for more precise temporal grounding.

These innovations collectively enable Qwen3-VL to achieve superior performance in complex multimodal tasks.

  • Adding Support for Qwen3-VL Series by @JJJYmmm in #40795

Longcat Flash

<img width="763" height="468" alt="image" src="https://github.com/user-attachments/assets/289d33e0-6c71-458d-ae07-b7d454ac2adf" />

The LongCatFlash model was proposed in LongCat-Flash Technical Report by the Meituan LongCat Team. LongCat-Flash is a 560B parameter Mixture-of-Experts (MoE) model that activates 18.6B-31.3B parameters dynamically (average ~27B). The model features a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and advanced reasoning capabilities.

The abstract from the paper is the following:

We present LongCat-Flash, a 560 billion parameter Mixture-of-Experts (MoE) language model featuring a dynamic computation mechanism that activates 18.6B-31.3B parameters based on context (average ~27B). The model incorporates a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and demonstrates strong performance across multiple benchmarks including 89.71% accuracy on MMLU and exceptional agentic tool use capabilities.

Tips:

  • LongCat-Flash uses a unique shortcut-connected MoE architecture that enables faster inference compared to traditional MoE models
  • The model supports up to 128k context length for long-form tasks
  • Dynamic parameter activation makes it computationally efficient while maintaining high performance
  • Best suited for applications requiring strong reasoning, coding, and tool-calling capabilities
  • The MoE architecture includes zero experts (nn.Identity modules) which act as skip connections, allowing tokens to bypass expert computation when appropriate
  • Add LongCat-Flash by @molbap in #40730
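The "zero experts" mentioned in the tips above can be illustrated with a toy dispatch in plain Python (the expert transforms are hypothetical stand-ins; in the real model the identity expert is an nn.Identity module):

```python
def identity_expert(x):
    # "Zero expert": returns the token unchanged, like nn.Identity.
    return x

def ffn_expert(scale):
    # Hypothetical stand-in for a real FFN expert.
    return lambda x: [scale * v for v in x]

experts = [identity_expert, ffn_expert(2.0), ffn_expert(0.5)]

def dispatch(token, expert_idx):
    """Route one token to its chosen expert; index 0 bypasses computation."""
    return experts[expert_idx](token)

print(dispatch([1.0, 2.0], 0))  # zero expert: token passes through -> [1.0, 2.0]
print(dispatch([1.0, 2.0], 1))  # FFN expert -> [2.0, 4.0]
```

When the router picks a zero expert, the token effectively takes a skip connection, which is one source of the model's dynamic 18.6B-31.3B activated-parameter range.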

Flex Olmo

<img width="700" height="414" alt="image" src="https://github.com/user-attachments/assets/7b92ee0f-5f5a-459c-ad4d-e01b5c10202e" />

FlexOlmo is a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets.

You can find all the original FlexOlmo checkpoints under the FlexOlmo collection.

  • Add FlexOlmo model by @2015aroras in #40921

LFM2 VL

<img width="2300" height="1400" alt="image" src="https://github.com/user-attachments/assets/ef0605cd-9512-458c-915a-62316e14d90c" />

LFM2-VL is the first series of vision-language foundation models developed by Liquid AI. These multimodal models are designed for low-latency and device-aware deployment. LFM2-VL extends the LFM2 family of open-weight Liquid Foundation Models (LFMs) into the vision-language space, supporting both text and image inputs at variable resolutions.

Architecture

LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:

  • Shape-optimized (400M) for more fine-grained vision capabilities for LFM2-VL-1.6B
  • Base (86M) for fast image processing for LFM2-VL-450M

The encoder processes images at their native resolution up to 512×512 pixels, efficiently handling smaller images without upscaling and supporting non-standard aspect ratios without distortion. Larger images are split into non-overlapping square patches of 512×512 each, preserving detail. In LFM2-VL-1.6B, the model also receives a thumbnail (a small, downscaled version of the original image capturing the overall scene) to enhance global context understanding and alignment. Special tokens mark each patch’s position and indicate the thumbnail’s start. The multimodal connector is a 2-layer MLP connector with pixel unshuffle to reduce image token count.

  • Add new model LFM2-VL by @zucchini-nlp in #40624
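The 512×512 tiling described above can be sketched as a small helper that counts non-overlapping patches (a toy calculation; the real processor also handles resizing, special position tokens, and the thumbnail):

```python
import math

def num_patches(width, height, tile=512):
    """Count non-overlapping tile×tile patches covering an image.
    Images at or below the tile size are processed natively as one patch,
    without upscaling."""
    if width <= tile and height <= tile:
        return 1
    return math.ceil(width / tile) * math.ceil(height / tile)

print(num_patches(400, 300))    # 1: native resolution, no upscaling
print(num_patches(1024, 1024))  # 4
print(num_patches(1300, 700))   # 3 * 2 = 6
```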

BLT

<img width="1448" height="1062" alt="image" src="https://github.com/user-attachments/assets/af1fbb09-082c-4331-9217-357adb506cbf" />

The BLT model was proposed in Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, and Srinivasan Iyer. BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.

The abstract from the paper is the following:

We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.

Usage Tips:

  • Dual Model Architecture: BLT consists of two separate trained models:

    • Patcher (Entropy Model): A smaller transformer model that predicts byte-level entropy to determine patch boundaries and segment input.
    • Main Transformer Model: The primary model that processes the patches through a Local Encoder, Global Transformer, and Local Decoder.
  • Dynamic Patching: The model uses entropy-based dynamic patching where:

    • High-entropy regions (complex data) get shorter patches with more computational attention
    • Low-entropy regions (predictable data) get longer patches for efficiency
    • This allows the model to allocate compute resources where they're most needed
  • Local Encoder: Processes byte sequences with cross-attention to patch embeddings

  • Global Transformer: Processes patch-level representations with full attention across patches

  • Local Decoder: Generates output with cross-attention back to the original byte sequence

  • Byte-Level Tokenizer: Unlike traditional tokenizers that use learned vocabularies, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token ID. There is no need for a vocabulary.

  • blt wip by @itazap in #38579
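The entropy-based dynamic patching above can be illustrated with a toy segmenter that uses a simple frequency-based entropy estimate over a trailing window. This is only a heuristic sketch; BLT's patcher is a small trained transformer, and the threshold and window here are arbitrary:

```python
import math
from collections import Counter

def entropy_patches(data: bytes, threshold: float = 2.5, window: int = 8):
    """Segment bytes into patches, starting a new patch when the empirical
    entropy (in bits) of the trailing window exceeds `threshold`.
    Toy heuristic standing in for BLT's trained entropy model."""
    patches, start = [], 0
    for i in range(1, len(data)):
        recent = data[max(0, i - window):i]
        total = len(recent)
        counts = Counter(recent)
        h = -sum(c / total * math.log2(c / total) for c in counts.values())
        if h > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

# A predictable run stays in one long patch; varied bytes get short patches.
text = b"aaaaaaaaaaaaaaaa" + bytes(range(64, 96))
pieces = entropy_patches(text)
print([len(p) for p in pieces])
```

Low-entropy (predictable) spans yield long patches and cheap computation; high-entropy spans yield short patches, concentrating model capacity where the data is complex.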

Qwen3 Omni MoE

<img width="14084" height="7429" alt="image" src="https://github.com/user-attachments/assets/20d46a43-15f2-42bf-9703-9575f5ca4430" />

The Qwen2.5-Omni model is a unified multimodal model proposed in the Qwen2.5-Omni Technical Report from the Qwen team, Alibaba Group.

Notes

  • Use [Qwen2_5OmniForConditionalGeneration] to generate audio and text output. To generate only one output type, use [Qwen2_5OmniThinkerForConditionalGeneration] for text-only and [Qwen2_5OmniTalkerForConditionalGeneration] for audio-only outputs.
  • Audio generation with [Qwen2_5OmniForConditionalGeneration] supports only a batch size of 1 at the moment.
  • In case of out-of-memory errors when working with video input, decrease processor.max_pixels. By default the maximum is set to a very large value, and high-resolution visuals will not be resized unless their resolution exceeds processor.max_pixels.
  • The processor has its own [~ProcessorMixin.apply_chat_template] method to convert chat messages to model inputs.
  • Adding support for Qwen3Omni by @BakerBunker in #41025

Parakeet

<img width="1431" height="527" alt="image" src="https://github.com/user-attachments/assets/e831f451-9be3-4b5c-a222-b833a50ceb2a" />

Parakeet models, introduced by NVIDIA NeMo, are models that combine a Fast Conformer encoder with connectionist temporal classification (CTC), recurrent neural network transducer (RNNT) or token and duration transducer (TDT) decoder for automatic speech recognition.

Model Architecture

  • Fast Conformer Encoder: A linearly scalable Conformer architecture that processes mel-spectrogram features and reduces sequence length through subsampling. This is a more efficient version of the Conformer encoder found in FastSpeech2Conformer (see [ParakeetEncoder] for the encoder implementation and details).
  • ParakeetForCTC: a Fast Conformer Encoder + a CTC decoder
    • CTC Decoder: Simple but effective decoder consisting of:
      • 1D convolution projection from encoder hidden size to vocabulary size (for optimal NeMo compatibility).
      • CTC loss computation for training.
      • Greedy CTC decoding for inference.
  • Add Parakeet by @nithinraok in #39062
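The greedy CTC decoding step listed above can be sketched in a few lines: take the argmax id at each frame, collapse consecutive repeats, and drop blanks. This is a toy version over precomputed frame ids; [ParakeetForCTC] applies the same rule to real logits:

```python
def ctc_greedy_decode(frame_ids, blank_id=0):
    """Standard greedy CTC: collapse repeated frame predictions, remove blanks."""
    out, prev = [], None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out

# Per-frame argmax ids; 0 is the CTC blank. A blank between two identical
# ids keeps both (the "7, 7" below), while adjacent repeats collapse.
frames = [0, 7, 7, 0, 7, 3, 3, 0, 0, 5]
print(ctc_greedy_decode(frames))  # [7, 7, 3, 5]
```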

EdgeTAM

<img width="949" height="537" alt="image" src="https://github.com/user-attachments/assets/5ca4e73d-5aa9-487d-96e1-92d4f2f4739f" />

The EdgeTAM model was proposed in EdgeTAM: On-Device Track Anything Model by Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, and Bilge Soran.

EdgeTAM is an efficient adaptation of SAM 2 that introduces a 2D Spatial Perceiver architecture to optimize memory attention mechanisms for real-time video segmentation on mobile devices.

  • Add EdgeTAM by @yonigozlan in #39800

OLMO3

More details to come soon :eyes:

  • Add Olmo3 model by @2015aroras in #40778

Continuous batching

We are introducing Continuous Batching (CB) in this release, and we consider it a stable feature. The main use case for CB is batched generation, which makes it very efficient in the context of GRPO training or evaluation. Thanks to CB, researchers and model developers are now free to use transformers in these contexts without having to spin up an additional inference engine.

CB currently supports both full attention and sliding window attention: this means that the vast majority of models are supported, like llama, gemma3, gpt-oss.

CB is also integrated with transformers serve, which means that you can deploy transformers as an OpenAI-compatible HTTP server. Here is a small snippet on how to use it:

import datasets
import torch

from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the model with the paged SDPA attention implementation required for CB
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", dtype=torch.bfloat16, _attn_implementation="sdpa_paged", device_map="auto"
)
model.generation_config.max_new_tokens = 32

# Tokenize the GSM8K test split into a list of prompts
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", padding_side="left")
dataset = datasets.load_dataset("openai/gsm8k", "socratic", split="test")
tokenized_datasets = dataset.map(lambda x: tokenizer(x["question"]), batched=True)
simple_batch_inputs = [item["input_ids"] for item in tokenized_datasets]

# Generate for the whole batch with continuous batching
batch_outputs = model.generate_batch(inputs=simple_batch_inputs)
for request in batch_outputs:
    print(tokenizer.decode(batch_outputs[request].generated_tokens))
"""
 Let's break down the problem step by step:

1. **Total eggs laid per day**:  
   Janet’s ducks lay **16 eggs per day**
 Let's break down the problem step by step:

1. **Blue fiber**: The robe takes **2 bolts** of blue fiber.
2. **White fiber
 To determine Josh's profit from flipping the house, let's go step by step.

---

### Step 1: Initial cost of the house
Josh buys the
 To find the total distance James runs in a week, we can break down the problem step by step:

1. **Sprints per session**: James runs 
 To determine how many cups of feed Wendi needs to give her chickens in the final meal of the day, let's go step by step.
"""

Breaking changes

  • 🚨 Remove Group Beam Search decoding strategy by @manueldeprada in #40495
  • 🚨 Remove Constrained Beam Search decoding strategy by @manueldeprada in #40518
  • 🚨 Allow check_model_inputs in core VLMs by @zucchini-nlp in #40342
  • 🔴 Update Glm4V to use config values by @zucchini-nlp in #40712
  • 🚨 Fix Inconsistant input_feature length and attention_mask length in WhisperFeatureExtractor by @BakerBunker in #39221
  • ⚠️ 🔴 Add ministral model by @manueldeprada in #40247
  • 🔴 Move variable output controls to _prepare_generation_config by @manueldeprada in #40715
  • 🔴 Make center_crop fast equivalent to slow by @yonigozlan in #40856

Bugfixes and improvements

  • Fix collated reports upload filename by @ivarflakstad in #40556
  • pin pytest-rerunfailures<16.0 by @ydshieh in #40561
  • remove the redundant non maintained jieba and use rjieba instead by @divyanshsinghvi in #40383
  • Set test_all_params_have_gradient=False for DeepseekV2ModelTest by @ydshieh in #40566
  • processor tests - use dummy videos by @zucchini-nlp in #40537
  • [qwen-vl] fix position ids by @zucchini-nlp in #40490
  • Fix test_eager_matches_sdpa_inference not run for CLIP by @ydshieh in #40581
  • Fix CircleCI step passes in the case of pytest worker crash at test collection time by @ydshieh in #40552
  • Allow remi-or to run-slow by @ydshieh in #40590
  • Fix llava image processor by @zucchini-nlp in #40588
  • Update get_*_features methods + update doc snippets by @qubvel in #40555
  • Fix custom generate relative imports by @manueldeprada in #40480
  • Support batch size > 1 image-text inference by @hiyouga in #36682
  • Fix typos by @cyyever in #40585
  • Skip TvpImageProcessingTest::test_slow_fast_equivalence by @ydshieh in #40593
  • Fix inexistent imports by @cyyever in #40580
  • Add Copilot instructions by @Rocketknight1 in #40432
  • Fix siglip flaky test_eager_matches_sdpa_inference by @ydshieh in #40584
  • Fix for missing default values in encoder decoder by @remi-or in #40517
  • Fix quite a lot of FA tests by @Cyrilvallez in #40548
  • [Tests] Fixup duplicated mrope logic by @vasqu in #40592
  • Reduce more test data fetch by @ydshieh in #40595
  • Pin torchcodec to 0.5 in AMD docker by @remi-or in #40598
  • Multiple fixes to FA tests in AMD by @remi-or in #40498
  • Disable cache for TokenizerTesterMixin temporarily by @ydshieh in #40611
  • fix: continuous batching in transformers serve by @McPatate in #40479
  • Fix processor chat template by @zucchini-nlp in #40613
  • Avoid too many request caused by AutoModelTest::test_dynamic_saving_from_local_repo by @ydshieh in #40614
  • Fix flaky JambaModelTest.test_load_balancing_loss by @ydshieh in #40617
  • Add collated reports job to Nvidia CI by @ahadnagy in #40470
  • Remove unnecessary pillow version check by @cyyever in #40604
  • Fix invalid typing by @cyyever in #40612
  • Enable more ruff UP rules by @cyyever in #40579
  • Support TF32 flag for MUSA backend by @fmo-mt in #33187
  • Remove random flag by @Cyrilvallez in #40629
  • 🌐 [i18n-KO] Translated deepseek_v3.md to Korean by @ssum21 in #39649
  • Fix too many requests in TestMistralCommonTokenizer by @ydshieh in #40623
  • fix: gas for gemma fixed by @yevvonlim in #40591
  • [auto-model] propagate kwargs by @zucchini-nlp in #40491
  • [CP] Add attention_mask to the buffer when the mask is causal by @kashif in #40619
  • Fix: PIL image load in Processing utils apply_chat_template by @abdokaseb in #40622
  • Skip test_prompt_lookup_decoding_matches_greedy_search for voxtral by @ydshieh in #40643
  • add DeepseekV3ForTokenClassification by @bzantium in #40641
  • fix MetaCLIP 2 wrong link & wrong model names in the docstrings by @voidism in #40565
  • Remove TF/Flax examples by @Rocketknight1 in #40654
  • Mark LongformerModelTest::test_attention_outputs as flaky by @ydshieh in #40655
  • fix pipeline dtype by @jiqing-feng in #40638
  • feat(serving): add healthcheck by @McPatate in #40653
  • Fix Metaclip modular conversion by @Rocketknight1 in #40660
  • Avoid attention_mask copy in qwen2.5 by @cyyever in #40658
  • Allow custom args in custom_generate Callables and unify generation args structure by @manueldeprada in #40586
  • Update check_determinism inside test_determinism by @ydshieh in #40661
  • Skip test_fast_is_faster_than_slow for Owlv2ImageProcessingTest by @ydshieh in #40663
  • Fix warning for output_attentions=True by @qubvel in #40597
  • Skip test_prompt_lookup_decoding_matches_greedy_search for qwen2_audio by @ydshieh in #40664
  • Remove overwritten GitModelTest::test_beam_search_generate by @ydshieh in #40666
  • refactor: use tolist instead of list comprehension calling .item() by @McPatate in #40646
  • Benchmarking V2: framework impl by @ahadnagy in #40486
  • Even more test data cached by @ydshieh in #40636
  • Skip more fast v.s slow image processor tests by @ydshieh in #40675
  • Avoid night torch CI not run because of irrelevant docker image failing to build by @ydshieh in #40677
  • Mark Aimv2ModelTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels as flaky by @ydshieh in #40683
  • CircleCI docker images cleanup / update / fix by @ydshieh in #40681
  • Add sequence classification support for small Gemma 3 text models by @abdokaseb in #40562
  • Add codebook_dim attribute to DacVectorQuantize for DacResidualVectorQuantize.from_latents() by @flavioialongo in #40665
  • fix broken offline mode when loading tokenizer from hub by @winglian in #40669
  • Load a tiny video to make CI faster by @zucchini-nlp in #40684
  • Final test data cache - inside CI docker images by @ydshieh in #40689
  • add: embedding model by @RyanMullins in #40694
  • feat: support request cancellation by @McPatate in #40599
  • Fixing bug in Voxtral when merging text and audio embeddings by @rcogill in #40671
  • Change docker image to preview for the MI355 CI by @ahadnagy in #40693
  • Fix backward compatibility with accelerate in Trainer by @qgallouedec in #40668
  • Fix self.dropout_p is not defined for SamAttention/Sam2Attention by @yonigozlan in #40667
  • [Glm4.5V] fix vLLM support by @zucchini-nlp in #40696
  • Fix broken Llama4 accuracy in MoE part by @nvpohanh in #40609
  • Avoid T5GemmaModelTest::test_eager_matches_sdpa_inference being flaky by @ydshieh in #40702
  • Align assisted generate for unified signature in decoding methods by @manueldeprada in #40657
  • Fetch one missing test data by @ydshieh in #40703
  • Add Fast Image Processor for ImageGPT by @agamjots05 in #39592
  • Fetch more test data with hf_hub_download by @ydshieh in #40710
  • feat(serve): add healthcheck test by @McPatate in #40697
  • Fix parent classes of ProcessingKwargs by @cyyever in #40676
  • [tests] fix blip2 edge case by @gante in #40699
  • [moduar] Add missing self in post-process methods by @framonmar7 in #40711
  • [onnx] use logical or for grounding dino mask by @lmarshall12 in #40625
  • Fix parent classes of AllKwargsForChatTemplate by @cyyever in #40685
  • Fix arguments by @cyyever in #40605
  • [serve] re-enable tests by @gante in #40717
  • [tests] remove overwrites of removed test by @gante in #40720
  • Add Optional typing by @cyyever in #40686
  • [Gemma Embedding] Fix SWA by @vasqu in #40700
  • Keypoint matching docs by @merveenoyan in #40541
  • Skip VitMatteImageProcessingTest::test_fast_is_faster_than_slow by @ydshieh in #40713
  • refactor(serve): move request_id to headers by @McPatate in #40722
  • [Continous Batching] fix do_Sample=True in continuous batching by @kashif in #40692
  • Fix order of mask functions when using and/or_mask_function by @Cyrilvallez in #40753
  • Fix np array typing by @cyyever in #40741
  • Set accepts_loss_kwargs to False for ConvNext(|V2)ForImageClassification by @clinty in #40746
  • Add BF16 support check for MUSA backend by @fmo-mt in #40576
  • remove gemmas eager training warning by @August-murr in #40744
  • remove FSDP prefix when using save_pretrained with FSDP2 by @winglian in #40207
  • feat: err when unsupported attn impl is set w/ --continuous_batching by @McPatate in #40618
  • docs: add continuous batching to serving by @McPatate in #40758
  • Remove unnecessary tildes from documentation by @st81 in #40748
  • Fix more typos by @cyyever in #40627
  • Fix inconsistency in SeamlessM4T and SeamlessM4Tv2 docs by @clinty in #39364
  • Fix continue_final_message in apply_chat_template to prevent substring matching issues by @abdokaseb in #40732
  • 🌐 [i18n-KO] Translated 'xclip.md' to Korean by @ssum21 in #39594
  • Fix Bark failing tests by @ebezzam in #39478
  • Add EfficientLoFTRImageProcessorFast for GPU-accelerated image processing by @LawJarp-A in #40215
  • Fix: swanlab public.cloud.experiment_url api error by @Zeyi-Lin in #40763
  • [generate] PromptLookupCandidateGenerator won't generate forbidden tokens by @gante in #40726
  • Support sliding window in CB by @remi-or in #40688
  • [deprecations] Remove generate-related deprecations up to v4.56 by @gante in #40729
  • rm src/transformers/convert_pytorch_checkpoint_to_tf2.py by @gante in #40718
  • [tests] update test_past_key_values_format and delete overwrites by @gante in #40701
  • [RoPE] run RoPE tests when the model uses RoPE by @gante in #40630
  • Fix crash when executing MambaCache sample code by @torotoki in #40557
  • [pipeline] ASR pipeline kwargs are forwared to generate by @gante in #40375
  • [docs] CPU install by @stevhliu in #40631
  • Adding Support for Qwen3-Next by @bozheng-hit in #40771
  • Fix gpt-oss router_indices in EP by @jiqing-feng in #40545
  • Remove reference of video_load_backend and video_fps for processor by @cyyever in #40719
  • [processors] Unbloating simple processors by @zucchini-nlp in #40377
  • Enable ruff on benchmark and scripts by @cyyever in #40634
  • Fix doc for PerceptionLMForConditionalGeneration forward. by @shuminghu in #40733
  • Fix typos in tests and util by @cyyever in #40780
  • Fix invalid PipelineParallel member by @cyyever in #40789
  • Use functools.cached_property by @cyyever in #40607
  • Read config pattern for Qwen3Next by @Cyrilvallez in #40792
  • Fix dotted model names by @August-murr in #40745
  • Fix the issue that csm model cannot work with pipeline mode. by @yuanwu2017 in #39349
  • Move num_items_in_batch to correct device before accelerator.gather by @ssharpe42 in #40773
  • Remove use_ipex option from Trainer by @cyyever in #40784
  • fix_image_processing_fast_for_glm4v by @lambertwjh in #40483
  • [Docs] Add missing class documentation for optimizer_schedules by @jijihuny in #31870, #23010
  • Fix DeepSpeed mixed precision precedence over Accelerate defaults by @notkisk in #39856
  • feature: Add robust token counting with padding exclusion by @PrathmeshAdsod in #40416
  • Fix edge case for tokenize by @wangzhen0518 in #36277
  • Fix config dtype parsing for Emu3 edge case by @Isotr0py in #40766
  • Align torch implementation of Gated DeltaNet in Qwen3-Next with fla library. by @bozheng-hit in #40807
  • Fix typos in src by @cyyever in #40782
  • add general hub test for Fast Image Processors in test_image_processing_utils by @namgyu-youn in #40086
  • Push generation config along with checkpoints by @qgallouedec in #40804
  • [Jetmoe] Fix RoPE by @vasqu in #40819
  • 🌐 [i18n-KO] Translated clipseg.md to Korean by @HyunZ118 in #39903
  • Improve torch_dtype checks by @cyyever in #40808
  • Add VideoProcessors to auto-backend requirements by @Cyrilvallez in #40843
  • Adds Causal Conv 1D kernel for mamba models by @MekkCyber in #40765
  • Update no split modules in T5Gemma model by @npuichigo in #40810
  • Replace image classification loss functions to self.loss_function by @qubvel in #40764
  • Fix the misalignment between the l2norm in GDN of Qwen3-Next and the implementation in the FLA library. by @bozheng-hit in #40842
  • Fixes for continuous batching by @remi-or in #40828
  • [tests] re-enable aria fast tests by @gante in #40846
  • [SAM2] Fix inconsistent results with original implementation with input boxes by @yonigozlan in #40800
  • [Sam2Video] Fix video inference with batched boxes and add test by @yonigozlan in #40797
  • add: differential privacy research model by @RyanMullins in #40851
  • [test] Fix test_eager_matches_sdpa incorrectly skipped by @eustlb in #40852
  • [tests] move generative tests away from test_modeling_common.py by @gante in #40854
  • [generate] Always use decoder config to init cache by @gante in #40772
  • Use checkpoint in auto_class_docstring by @cyyever in #40844
  • Fix TrainingArguments.parallelism_config NameError with accelerate<1.10.1 by @albertvillanova in #40818
  • Redirect MI355 CI results to dummy dataset by @ahadnagy in #40862
  • [Bug fix #40813] Fix base_model_tp_plan of Starcoder2 model. by @greg-kwasniewski1 in #40814
  • [docstrings / type hints] Update outdated annotations for past_key_values by @gante in #40803
  • fix florence kwargs by @SunMarc in #40826
  • fix: XIELU act parameters not being casted to correct dtype by @NanoCode012 in #40812
  • Update model tags and integration references in bug report by @ArthurZucker in #40881
  • [Qwen3 Next] Use numerically stable rsqrt by @thalahors in #40848
  • Adding Support for Qwen3-VL Series by @JJJYmmm in #40795
  • [VaultGemma] Update expectations in integration tests by @vasqu in #40855
  • Fix modular consistency by @Cyrilvallez in #40883
  • Clarify passing is_causal in sdpa_attention_paged_forward by @cyyever in #40838
  • Use torch.expm1 and torch.log1p for better numerical results by @cyyever in #40860
  • Add Fast PromptDepthAnything Processor by @SamuelBarryCS in #40602
  • Fix deta loading & dataclass by @Cyrilvallez in #40878
  • Remove dict branch of attention_mask in sdpa_attention_paged_forward by @cyyever in #40882
  • 🌐 [i18n-KO] Translated smolvlm.md to Korean by @HyunZ118 in #40414
  • 🌐 [i18n-KO] Translated imageprocessor.md to Korean by @HyunZ118 in #39557
  • [generate] remove docs of a feature that no longer exists by @gante in #40895
  • Make debugging failing tests (check and update expect output values) easier 🔥 by @ydshieh in #40727
  • Fixing the call to kernelize by @MekkCyber in #40628
  • Fix getter regression by @molbap in #40824
  • Fix flaky Gemma3nAudioFeatureExtractionTest::test_dither by @ydshieh in #40902
  • [cache] Merge static sliding and static chunked layer by @Cyrilvallez in #40893
  • Harmonize CacheLayer names by @Cyrilvallez in #40892
  • [cache] Only use scalars in get_mask_sizes by @Cyrilvallez in #40907
  • Set seed for Glm4vIntegrationTest by @ydshieh in #40905
  • Add Olmo3 model by @2015aroras in #40778
  • remove dummy EncodingFast by @cyyever in #40864
  • Improve module name handling for local custom code by @XuehaiPan in #40809
  • Remove runner_map by @ydshieh in #40880
  • disable test_fast_is_faster_than_slow by @ydshieh in #40909
  • [gemma3] Gemma3ForConditionalGeneration compatible with assisted generation by @gante in #40791
  • [generate] misc fixes by @gante in #40906
  • Fix dtype in Paligemma by @zucchini-nlp in #40912
  • [Docs] Adding documentation of MXFP4 Quantization by @ariG23498 in #40885
  • Processor load with multi-processing by @zucchini-nlp in #40786
  • [Llama4] Remove image_sizes arg and deprecate vision_feature_layer by @yaswanth19 in #40832
  • Fix #40067: Add dedicated UMT5 support to GGUF loader (config, tokenizer, test) by @akshay-babbar in #40218
  • [torchao safetensors] renaming get_state_dict function by @liangel-02 in #40774
  • Adding activation kernels by @MekkCyber in #40890
  • Minor fix for #40727 by @ydshieh in #40929
  • Add support for Florence-2 training by @ducviet00 in #40914
  • Add LongCat-Flash by @molbap in #40730
  • [DOC] Add missing dates in model cards by @yonigozlan in #40922
  • [models] remove unused import torch.utils.checkpoint by @gante in #40934
  • Intel CPU dockerfile by @jiqing-feng in #40806
  • docs(i18n): Correct the descriptive text in the README_zh-hans.md by @lilin-1 in #40941
  • Fix trainer tests by @SunMarc in #40823
  • Fix Glm4vMoeIntegrationTest by @ydshieh in #40930
  • Raise error instead of warning when using meta device in from_pretrained by @Cyrilvallez in #40942
  • Consistent naming for images kwargs by @zucchini-nlp in #40834
  • Remove nested import logic for torchvision by @yonigozlan in #40940
  • Fix Glm4vModelTest::test_eager_matches_fa2_generate by @ydshieh in #40947
  • Update expected values for some test_speculative_generation by @ydshieh in #40949
  • Standardize audio embedding function name for audio multimodal models by @jackzhxng in #40919
  • Add FlexOlmo model by @2015aroras in #40921
  • Don't list dropout in eager_paged_attention_forward by @cyyever in #40924

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @hiyouga
    • Support batch size > 1 image-text inference (#36682)
  • @cyyever
    • Fix typos (#40585)
    • Fix inexistent imports (#40580)
    • Remove unnecessary pillow version check (#40604)
    • Fix invalid typing (#40612)
    • Enable more ruff UP rules (#40579)
    • Avoid attention_mask copy in qwen2.5 (#40658)
    • Fix parent classes of ProcessingKwargs (#40676)
    • Fix parent classes of AllKwargsForChatTemplate (#40685)
    • Fix arguments (#40605)
    • Add Optional typing (#40686)
    • Fix np array typing (#40741)
    • Fix more typos (#40627)
    • Remove reference of video_load_backend and video_fps for processor (#40719)
    • Enable ruff on benchmark and scripts (#40634)
    • Fix typos in tests and util (#40780)
    • Fix invalid PipelineParallel member (#40789)
    • Use functools.cached_property (#40607)
    • Remove use_ipex option from Trainer (#40784)
    • Fix typos in src (#40782)
    • Improve torch_dtype checks (#40808)
    • Use checkpoint in auto_class_docstring (#40844)
    • Clarify passing is_causal in sdpa_attention_paged_forward (#40838)
    • Use torch.expm1 and torch.log1p for better numerical results (#40860)
    • Remove dict branch of attention_mask in sdpa_attention_paged_forward (#40882)
    • remove dummy EncodingFast (#40864)
    • Don't list dropout in eager_paged_attention_forward (#40924)
    • Benchmarking V2: framework impl (#40486)
    • Change docker image to preview for the MI355 CI (#40693)
    • Redirect MI355 CI results to dummy dataset (#40862)
  • @voidism
    • fix MetaCLIP 2 wrong link & wrong model names in the docstrings (#40565)
  • @RyanMullins
    • add: embedding model (#40694)
    • add: differential privacy research model (#40851)
  • @LawJarp-A
    • Add EfficientLoFTRImageProcessorFast for GPU-accelerated image processing (#40215)
  • @bozheng-hit
    • Adding Support for Qwen3-Next (#40771)
    • Align torch implementation of Gated DeltaNet in Qwen3-Next with fla library. (#40807)
    • Fix the misalignment between the l2norm in GDN of Qwen3-Next and the implementation in the FLA library. (#40842)
  • @wangzhen0518
    • Fix edge case for tokenize (#36277) (#36555)
  • @HyunZ118
    • 🌐 [i18n-KO] Translated clipseg.md to Korean (#39903)
    • 🌐 [i18n-KO] Translated smolvlm.md to Korean (#40414)
    • 🌐 [i18n-KO] Translated imageprocessor.md to Korean (#39557)
  • @JJJYmmm
    • Adding Support for Qwen3-VL Series (#40795)
  • @SamuelBarryCS
    • Add Fast PromptDepthAnything Processor (#40602)
  • @2015aroras
    • Add Olmo3 model (#40778)
    • Add FlexOlmo model (#40921)
Sep 17, 2025
Patch release v4.56.2
  • Processor load with multi-processing (#40786)
  • [Jetmoe] Fix RoPE (#40819)
  • Fix getter regression (#40824)
  • Fix config dtype parsing for Emu3 edge case (#40766)
Sep 12, 2025
Vault-Gemma (based on v4.56.1)

A new model is added to transformers: Vault-Gemma. It is added on top of the v4.56.1 release, and can be installed from the following tag: v4.56.1-Vault-Gemma-preview.

To install this version, use the following command:

pip install git+https://github.com/huggingface/transformers@v4.56.1-Vault-Gemma-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

As its name implies, this tag is a preview of the Vault-Gemma model. It is cut from the main branch and does not follow semantic versioning. The model will be included in the next minor release: v4.57.0.

Vault-Gemma

VaultGemma is a text-only decoder model derived from Gemma 2; notably, it drops the norms after the attention and MLP blocks, and uses full attention for all layers instead of alternating between full attention and local sliding attention. VaultGemma is available as a pretrained model with 1B parameters that uses a 1024-token sequence length.

VaultGemma was trained from scratch with sequence-level differential privacy (DP). Its training data includes the same mixture as the Gemma 2 models, consisting of a number of documents of varying lengths. Additionally, it is trained using DP stochastic gradient descent (DP-SGD) and provides a (ε ≤ 2.0, δ ≤ 1.1e-10)-sequence-level DP guarantee, where a sequence consists of 1024 consecutive tokens extracted from heterogeneous data sources. Specifically, the privacy unit of the guarantee is for the sequences after sampling and packing of the mixture.
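The DP-SGD step described above can be sketched in a few lines: clip each per-example gradient to a fixed norm, average, and add Gaussian noise calibrated to the clipping bound. This is a toy illustration only; the function name and constants below are made up and unrelated to VaultGemma's actual training code.

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, rng=None):
    """Toy DP-SGD update: clip each example's gradient, average, add noise."""
    rng = rng or np.random.default_rng(0)
    clipped = [
        g * min(1.0, clip_norm / max(np.linalg.norm(g), 1e-12))
        for g in per_example_grads
    ]
    avg = np.mean(clipped, axis=0)
    # Gaussian noise calibrated to the clipping bound and batch size
    noise = rng.normal(0.0, noise_multiplier * clip_norm / len(clipped), size=avg.shape)
    return avg + noise

grads = [np.array([3.0, 4.0]), np.array([0.3, 0.4])]  # norms 5.0 and 0.5
update = dp_sgd_step(grads)  # first gradient is clipped to norm 1.0, second kept as-is
```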

The example below demonstrates how to chat with the model using pipeline:

from transformers import pipeline

pipe = pipeline(
    task="text-generation",
    model="google/vaultgemma-1b",
    dtype="auto",
    device_map="auto",
)

text = "Tell me an unknown interesting biology fact about the brain."
outputs = pipe(text, max_new_tokens=32)
response = outputs[0]["generated_text"]
print(response)

or with the AutoModelForCausalLM class:

# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "google/vaultgemma-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype="auto")

text = "Tell me an unknown interesting biology fact about the brain."
input_ids = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))

or with transformers chat:

transformers chat google/vaultgemma-1b
Sep 4, 2025
Patch release v4.56.1

This patch most notably fixes an issue with the new dtype argument (replacing torch_dtype) in pipelines!

Bug Fixes & Improvements

  • Fix broken Llama4 accuracy in MoE part (#40609)
  • fix pipeline dtype (#40638)
  • Fix self.dropout_p is not defined for SamAttention/Sam2Attention (#40667)
  • Fix backward compatibility with accelerate in Trainer (#40668)
  • fix broken offline mode when loading tokenizer from hub (#40669)
  • [Glm4.5V] fix vLLM support (#40696)
Embedding Gemma (based on v4.56.0)

A new model is added to transformers: Embedding Gemma. It is added on top of the v4.56.0 release, and can be installed from the following tag: v4.56.0-Embedding-Gemma-preview.

To install this version, use the following command:

pip install git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

As its name implies, this tag is a preview of the EmbeddingGemma model. It is cut from the main branch and does not follow semantic versioning. The model will be included in the next minor release: v4.57.0.

Embedding-Gemma

<img width="1048" height="548" alt="image" src="https://github.com/user-attachments/assets/7f40f9dc-353b-472e-8914-2dbf02709ffb" />

Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.

Usage example

EmbeddingGemma can be found on the Hugging Face Hub. It is integrated into sentence-transformers, which depends on transformers.

See below for sentence-transformers examples using the model:

from sentence_transformers import SentenceTransformer

# Download from the 🤗 Hub
model = SentenceTransformer("google/embeddinggemma-300m")

# Run inference with queries and documents
query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (768,) (4, 768)

# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3011, 0.6359, 0.4930, 0.4889]])

# Convert similarities to a ranking
ranking = similarities.argsort(descending=True)[0]
print(ranking)
# tensor([1, 2, 3, 0])
Aug 29, 2025
v4.56: Dino v3, X-Codec, Ovis 2, MetaCLIP 2, Florence 2, SAM 2, Kosmos 2.5, HunYuan, GLMV-4.5

New model additions

Dino v3

DINOv3 is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models.

You can find all the original DINOv3 checkpoints under the DINOv3 collection.

<img width="814" height="658" alt="image" src="https://github.com/user-attachments/assets/740a5c3d-a5a1-45d9-9e4c-d9117837205d" />
  • Add Dino v3 by @qubvel in #40167

X-Codec

The X-Codec model was proposed in Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model by Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue.

The X-Codec model is a neural audio codec that integrates semantic information from self-supervised models (e.g., HuBERT) alongside traditional acoustic information. This enables:

  • Music continuation: better modeling of musical semantics yields more coherent continuations.
  • Text-to-sound synthesis: X-Codec captures semantic alignment between text prompts and generated audio.
  • Semantic-aware audio tokenization: X-Codec is used as the audio tokenizer in the YuE lyrics-to-song generation model.
<img width="1958" height="949" alt="image" src="https://github.com/user-attachments/assets/e36552d0-6465-4921-8208-3f7d3c9087f1" />
  • Add X-Codec model by @Manalelaidouni in #38248

Ovis 2

Ovis2 is an updated version of the Ovis model developed by the AIDC-AI team at Alibaba International Digital Commerce Group.

Ovis2 is the latest advancement in multi-modal large language models (MLLMs), succeeding Ovis1.6. It retains the architectural design of the Ovis series, which focuses on aligning visual and textual embeddings, and introduces major improvements in data curation and training methods.

<img src="https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/XB-vgzDL6FshrSNGyZvzc.png" width="600">
  • Add Ovis2 model and processor implementation by @thisisiron in #37088

MetaCLIP 2

MetaCLIP 2 is a replication of the original CLIP model trained on 300+ languages. It achieves state-of-the-art (SOTA) results on multilingual benchmarks (e.g., XM3600, CVQA, Babel‑ImageNet), surpassing previous SOTA such as mSigLIP and SigLIP‑2. The authors show that English and non-English worlds can mutually benefit and elevate each other.

<img width="805" height="408" alt="image" src="https://github.com/user-attachments/assets/72eaa441-9362-4a6a-a834-f505d6727a2a" />
  • Add MetaCLIP 2 by @NielsRogge in #39826

Florence 2

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages the FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.

<img width="864" height="565" alt="image" src="https://github.com/user-attachments/assets/d09dfe3a-6dda-45a3-8dd3-0254d8503b4e" />
  • Add support for Florence-2 by @ducviet00 in #38188

SAM 2

SAM2 (Segment Anything Model 2) was proposed in Segment Anything in Images and Videos by Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.

The model can be used to predict segmentation masks of any object of interest given an input image or video, and input points or bounding boxes.

<img width="960" height="540" alt="image" src="https://github.com/user-attachments/assets/0ab42e5c-6951-4cbc-9d5d-ff8bf0c2dbf1" />
  • Add Segment Anything 2 (SAM2) by @SangbumChoi in #32317

Kosmos 2.5

The Kosmos-2.5 model was proposed in KOSMOS-2.5: A Multimodal Literate Model by Microsoft.

The abstract from the paper is the following:

We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png" alt="drawing" width="600"/>

  • Add Kosmos-2.5 by @tic-top in #31711

HunYuan

<img width="674" height="402" alt="image" src="https://github.com/user-attachments/assets/230f83f0-870c-4b31-b8b7-738116761457" />

More information at release 🤗

  • HunYuan opensource by @yjc9696 in #39606

Seed OSS

<img width="858" height="537" alt="image" src="https://github.com/user-attachments/assets/29ccc3c2-9b85-4d89-935a-1e1c28d173fd" />

More information at release 🤗

  • Adding ByteDance Seed Seed-OSS by @Fazziekey in #40272

GLM-4.5V

More information at release 🤗

  • GLM-4.5V Model Support by @zRzRzRzRzRzRzR in #39805

Cache

Beyond a large refactor of the caching system in Transformers, which makes it much more practical and general, models using sliding window or chunked attention no longer waste memory when caching past states. This was enabled most notably by:

  • New DynamicSlidingWindowLayer & associated Cache by @Cyrilvallez in #40039

See the following improvements on memory usage for Mistral (using only sliding layers) and GPT-OSS (1 out of 2 layers is sliding) respectively: <img width="569" height="431" alt="image" src="https://github.com/user-attachments/assets/7f1688f4-b077-4840-a62c-bfa6131fe806" /> <img width="574" height="431" alt="image" src="https://github.com/user-attachments/assets/bb4a284f-961e-413d-b7e1-783bb5d8fb39" />

Beyond memory usage, it will also improve generation/forward speed by a large margin for large contexts, as only necessary states are passed to the attention computation, which is very sensitive to the sequence length.
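Conceptually, a sliding-window cache layer keeps only the last `window` key/value states, so the cache stops growing once the context exceeds the window. Below is a minimal sketch of that idea; the class name is illustrative and not the actual DynamicSlidingWindowLayer implementation.

```python
from collections import deque

class SlidingWindowLayerCache:
    """Toy cache layer that retains at most `window` past key/value states."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)    # oldest entries are evicted automatically
        self.values = deque(maxlen=window)

    def update(self, key, value):
        self.keys.append(key)
        self.values.append(value)
        return list(self.keys), list(self.values)

cache = SlidingWindowLayerCache(window=4)
for t in range(10):  # simulate 10 decoding steps
    keys, values = cache.update(f"k{t}", f"v{t}")

# Memory is bounded by the window: 4 entries retained, not 10
```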

Quantization

MXFP4

Since the GPT-OSS release, which introduced the MXFP4 quantization type, several improvements have been made to its support, which should now stabilize.

  • Fix MXFP4 quantizer validation to allow CPU inference with dequantize option by @returnL in #39953
  • Enable gpt-oss mxfp4 on older hardware (sm75+) by @matthewdouglas in #39940
  • Fix typo and improve GPU kernel check error message in MXFP4 quantization by @akintunero in #40349
  • Default to dequantize if cpu in device_map for mxfp4 by @MekkCyber in #39993
  • Fix GPT-OSS swiglu_limit not passed in for MXFP4 by @danielhanchen in #40197
  • [Mxfp4] Add a way to save with a quantization method by @ArthurZucker in #40176
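For intuition, MXFP4 stores each element as a 4-bit float (E2M1, magnitudes 0 through 6) with a power-of-two scale shared per block. The sketch below is a toy fake-quantizer illustrating the format only; it is not the kernel code transformers actually uses.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes

def quantize_block(block):
    """Toy MXFP4-style fake-quantization of one block of values."""
    amax = np.abs(block).max()
    # Shared power-of-two scale that fits the block's max inside the grid
    scale = 2.0 ** np.ceil(np.log2(amax / FP4_GRID[-1])) if amax > 0 else 1.0
    scaled = block / scale
    # Snap each element to the nearest signed grid point, then rescale
    signed_grid = np.sign(scaled)[:, None] * FP4_GRID[None, :]
    idx = np.abs(scaled[:, None] - signed_grid).argmin(axis=1)
    return np.sign(scaled) * FP4_GRID[idx] * scale

block = np.array([0.1, -0.7, 1.3, 5.9])
quantize_block(block)  # values snap to the FP4 grid: [0.0, -0.5, 1.5, 6.0]
```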

New standard

Now that we have deprecated TensorFlow and JAX, we felt that torch_dtype was not only misaligned with torch, but also redundant and hard to remember. For this reason, we switched to a much more standard dtype argument!

  • ⚠️⚠️ Use dtype instead of torch_dtype everywhere! by @Cyrilvallez in #39782

torch_dtype will still be a valid usage for as long as needed to ensure a smooth transition, but new code should use dtype, and we encourage you to update older code as well!
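As a hypothetical sketch of what such a transition looks like, the old keyword can simply be aliased to the new one. The helper below is illustrative and not a transformers API.

```python
def normalize_dtype_kwargs(**kwargs):
    """Accept the legacy `torch_dtype` keyword but normalize it to `dtype`."""
    if "torch_dtype" in kwargs and "dtype" not in kwargs:
        kwargs["dtype"] = kwargs.pop("torch_dtype")
    return kwargs

normalize_dtype_kwargs(torch_dtype="bfloat16")  # {'dtype': 'bfloat16'}
normalize_dtype_kwargs(dtype="auto")            # unchanged: {'dtype': 'auto'}
```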

Breaking changes

The following commits are breaking changes in workflows that were either buggy or not working as expected.

Saner hub-defaults for hybrid cache implementation

On models whose hub checkpoint specifies cache_implementation="hybrid" (static sliding-window hybrid cache), generate now unsets this value, making the model use dynamic sliding-window layers by default.

The old default caused widespread, very slow first generate calls on models with hybrid caches, which should no longer be the case.

  • 🚨🚨 [generate] ignore cache_implementation="hybrid" hub defaults by @gante in #40135

Sine positional embeddings for MaskFormer & LRU cache

Caches the computation of sine positional embeddings for MaskFormer, resulting in a 6% performance improvement.

  • 🚨 Use lru_cache for sine pos embeddings MaskFormer by @yonigozlan in #40007
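The trick generalizes: sine/cosine position embeddings depend only on the sequence length and dimension, so memoizing the builder turns repeated forward passes into cache hits. A minimal sketch follows; the function shape is illustrative rather than MaskFormer's actual implementation.

```python
import math
from functools import lru_cache

@lru_cache(maxsize=8)
def sine_position_embeddings(seq_len, dim, temperature=10000.0):
    """Build a (seq_len, dim) sine/cosine table once per unique shape."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(dim // 2):
            angle = pos / temperature ** (2 * i / dim)
            row += [math.sin(angle), math.cos(angle)]
        table.append(tuple(row))  # tuples keep the cached value immutable
    return tuple(table)

emb = sine_position_embeddings(128, 64)
# A second call with the same shape returns the cached object, skipping recomputation
assert sine_position_embeddings(128, 64) is emb
```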

Explicit cache initialization

Adds explicit cache initialization to prepare for the deprecation of the from_legacy_cache utility.

  • 🚨 Always return Cache objects in modelings (to align with generate) by @manueldeprada in #39765

Default compilation with fullgraph=False

Having fullgraph set to True during compilation ended up being very restrictive, especially with the arrival of widely-used MoEs.

  • 🚨🚨 Switch default compilation to fullgraph=False by @Cyrilvallez in #40137

Remove decoding strategies

The DoLa decoding strategy was moved a few versions ago to the following remote-code repository: https://huggingface.co/transformers-community/dola

The Contrastive Search decoding strategy was moved a few versions ago to the following remote-code repository: https://huggingface.co/transformers-community/contrastive-search

Both have now been removed from the library as a result.

  • 🚨 Remove DoLa decoding strategy by @manueldeprada in #40082
  • 🚨 Remove Contrastive Search decoding strategy by @manueldeprada in #40428

Fix sliding window in flash attention

Flash attention had been using sliding window sizes that were off by one. This affected generations whose initial context was larger than the sliding window size.

  • :rotating_light: [Flash Attention] Fix sliding window size by @vasqu in #40163
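To see the off-by-one concretely: with the usual convention, window size W means each query position i attends to the W keys in [i - W + 1, i] (itself included); using [i - W, i] instead attends to W + 1 positions. A toy mask makes the correct bound explicit:

```python
import numpy as np

def sliding_window_mask(seq_len, window):
    """Causal mask where position i attends to keys in [i - window + 1, i]."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(seq_len=8, window=3)
# From row 2 onward, each query attends to exactly `window` = 3 positions
mask.sum(axis=1)  # [1, 2, 3, 3, 3, 3, 3, 3]
```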

Minimum Torch version is now 2.2

Torch 2.1 support has been unreliable for some time, so we've now made it official and bumped our minimum version to 2.2.

  • byebye torch 2.1 by @Rocketknight1 in #40317

Bugfixes and improvements

  • [CI] post-GptOss fixes for green CI by @gante in #39929
  • Avoid utils/check_bad_commit.py failing due to rate limit (requesting api.github.com) by @ydshieh in #39918
  • Fix CI: Tests failing on CPU due to torch.device('cpu').index being None by @manueldeprada in #39933
  • circleci: pin torch 2.7.1 until torchcodec is updated by @ydshieh in #39951
  • [docs] ko toc fix by @gante in #39927
  • docs: fix typo in 'quantization-aware training' by @luckyvickyricky in #39904
  • Fix grammatical error in MoE variable name: expert_hitted → expert_hit, hitted_experts → hit_experts by @Mihonarium in #39959
  • fix typo by @Tialo in #39936
  • [image processor] fix glm4v by @KeyKy in #39964
  • remove triton_kernels dep with kernels instead by @SunMarc in #39926
  • Fix fix_and_overwrite mode of utils/check_docstring.py by @manueldeprada in #39369
  • [bugfix] fix flash_attention_2 unavailable error on Ascend NPU by @FightingZhen in #39844
  • chore: update Deformable_Detr model card by @arpon-kapuria in #39902
  • Modular fix: remove the model name in find_file_type by @yonigozlan in #39897
  • Gemma3 fixes by @remi-or in #39960
  • [superglue] Fixed the way batch mask was applied to the scores before match assignment computation by @sbucaille in #39968
  • Support input_embeds in torch exportable decoders by @jackzhxng in #39836
  • Various test fixes for AMD by @remi-or in #39978
  • [Idefics] fix device mismatch by @zucchini-nlp in #39981
  • Fix gemma3n feature extractor's incorrect squeeze by @Isotr0py in #39919
  • [typing] Fix return typehint for decoder and inv_freq annotation by @qubvel in #39610
  • Fix consistency by @Cyrilvallez in #39995
  • Update expected output values after #39885 (part 1) by @ydshieh in #39990
  • Fix int4 quantized model cannot work with cpu by @yuanwu2017 in #39724
  • Fix missing video inputs for PerceptionLM. by @shuminghu in #39971
  • fix: remove CHAT_TEMPLATE import in tests for deepseek-vl by @geetu040 in #40003
  • Fix HGNetV2 Model Card and Image Classification Pipeline Usage Tips by @ducviet00 in #39965
  • Fix default values of getenv by @cyyever in #39867
  • FA2 can continue generation from cache by @zucchini-nlp in #39843
  • unpin torch<2.8 on circleci by @ydshieh in #40012
  • docs: fix duplication in 'en/optimizers.md' by @luckyvickyricky in #40014
  • Raising error when quantizing a quantized model by @MekkCyber in #39998
  • Update expected output values after #39885 (part 2) by @ydshieh in #40015
  • pin torchcodec==0.5.0 for now with torch 2.7.1 on daily CI by @ydshieh in #40013
  • Fix broken image inference for Fuyu model by @Isotr0py in #39915
  • Higgs modules_to_not_convert standardization by @MekkCyber in #39989
  • Fix an annoying flaky test by @zucchini-nlp in #40000
  • Harmonize past_key_value to past_key_valueS everywhere by @Cyrilvallez in #39956
  • Fix missing None default values for Gemma3n model in get_placeholder_mask by @Znerual in #39991
  • [core] Refactor the Cache logic to make it simpler and more general by @Cyrilvallez in #39797
  • Tie weights recursively on all submodels by @Cyrilvallez in #39996
  • Bnb failling tests by @MekkCyber in #40026
  • fix notification_service.py about time_spent by @ydshieh in #40037
  • Revert "fix notification_service.py about time_spent" by @ydshieh in #40044
  • Update HuBERT model card according to template by @reedrya in #39742
  • unpin torchcodec==0.5.0 and use torch 2.8 on daily CI by @ydshieh in #40072
  • fix: resolve triton version check compatibility on windows by @Tsumugii24 in #39986
  • [qwen-vl] fix beam search with videos by @zucchini-nlp in #39726
  • [gemma3] update conversion key mapping by @zucchini-nlp in #39778
  • fix: move super().init after vision_config init in Mistral3Config by @starcatmeow in #40063
  • Remove deprecated cache-related objects by @Cyrilvallez in #40035
  • guard on model.eval when using torch.compile + FSDP2 by @winglian in #37413
  • Fix repo consistency by @zucchini-nlp in #40077
  • added Textnet fast image processor by @rahzaazhar in #39884
  • Fix time_spent in notification_service.py. by @ydshieh in #40081
  • chore: standardize DeBERTa model card by @Shoumik-Gandre in #37409
  • [GPT Big Code] Fix attention scaling by @vasqu in #40041
  • feat: extract rev in attn_implementation kernels via @ by @drbh in #40009
  • Update notification service MI325 by @ivarflakstad in #40078
  • Fix PerceptionLM image preprocessing for non-tiled image input. by @shuminghu in #40006
  • Revert FA2 kwargs construction by @zucchini-nlp in #40029
  • [fix] batch inference for llava_onevision by @cyr0930 in #40021
  • [docs] Zero Shot Object Detection Task by @ariG23498 in #40096
  • Update Glm4V processor and add tests by @zucchini-nlp in #39988
  • Add glm4.5&&glm4.5V doc by @lambertwjh in #40095
  • Causal loss for ForConditionalGeneration by @qgallouedec in #39973
  • Audio encodings now match conv2d weight dtype in Gemma3nAudioSSCPConvBlock by @Malav-P in #39743
  • New DynamicSlidingWindowLayer & associated Cache by @Cyrilvallez in #40039
  • Enable SIM rules by @cyyever in #39806
  • feat: add is_fast to ImageProcessor by @MilkClouds in #39603
  • Re-apply make style by @Cyrilvallez in #40106
  • Replace logger.warning with logger.warning_once in GradientCheckpointingLayer by @qgallouedec in #40091
  • Fix regression in mllama vision encoder by @Isotr0py in #40083
  • Switch the order of args in StaticCache (for BC and future logic) by @Cyrilvallez in #40100
  • Fix Qwen3 MoE GGUF architecture mismatch by @ctcanbol in #39976
  • Fix error on importing unavailable torch.distributed by @m-gallus in #40038
  • [Flash Attention] Fix flash attention integration by @vasqu in #40002
  • [trainer] ensure special tokens in model configs are aligned with tokenizer at train time by @gante in #38441
  • Fix Causality Handling in Flash Attention to Support Bidirectional Attention by @lucaswychan in #39707
  • [docs] Add reference to HF-maintained custom_generate collections by @gante in #39894
  • Add model card for MobileViT by @Shivamjan in #40033
  • remove sequence parallel in llama4 by @3outeille in #40084
  • 🌐 [i18n-KO] Translated tiny_agents.md to Korean by @AhnJoonSung in #39913
  • [bugfix] Fix tensor device in Idefics2, Idefics3, and SmolVLM by @qgallouedec in #39975
  • changed xLSTMRMSNorm to RMSNorm by @nikitazuevblago in #40113
  • Fix QuantoQuantizedCache import issues by @manueldeprada in #40109
  • [serve] allow array content inputs for LLMs by @gante in #39829
  • decoding_method argument in generate by @manueldeprada in #40085
  • Collated reports by @ivarflakstad in #40080
  • DOCS: Add missing space in SECURITY.md by @shivaheidari in #40087
  • [trainer] handle case where EOS token is None in generation_config by @gante in #40127
  • Fix hidden torchvision>=0.15 dependency issue by @yonigozlan in #39928
  • 🌐 [i18n-KO] Translated main_classes/processors.md to Korean by @TaskerJang in #39519
  • 🌐 [i18n-KO] Translated jamba.md to Korean by @skwh54 in #39890
  • 🌐 [i18n-KO] Translated main_classes/optimizer_schedules.md to Korean by @luckyvickyricky in #39713
  • 🌐 [i18n-KO] Translated gpt2.md to Korean by @taemincode in #39808
  • 🌐 [i18n-KO] Translated optimizers.md to Korean by @chelsseeey in #40011
  • 🌐 [i18n-KO] Translated grounding-dino.md to Korean by @TaskerJang in #39861
  • 🌐 [i18n-KO] Translated pipelines.md to Korean by @xhaktm00 in #39577
  • gpt oss is important by @ArthurZucker in #40139
  • Fix Janus by @Cyrilvallez in #40140
  • [docs] Fix ko toctree by @stevhliu in #40138
  • Remove an old badly designed test by @Cyrilvallez in #40142
  • updated visualBERT modelcard by @Anil-Red in #40057
  • 🌐 [i18n-KO] Translated gemma3.md to Korean by @seopp in #39865
  • Fix quantized cache with only cache_implementation in generate by @Cyrilvallez in #40144
  • Add pytest marker: torch_compile_test and torch_export_test by @ydshieh in #39950
  • Update Dockerfiles to install packages inside a virtual environment by @Sai-Suraj-27 in #39098
  • Create self-scheduled-amd-mi355-caller.yml by @glegendre01 in #40134
  • [Cohere2Vision] remove unused arg by @zucchini-nlp in #40103
  • [efficientloftr] fix bugs and follow original cross attn implementation strictly by @sbucaille in #40141
  • Fix CI: Use correct import in SAM for torchvision InterpolationMode by @manueldeprada in #40160
  • [Continous Batching] set head_dim when config.head_dim is None by @kashif in #40159
  • Replace self.tokenizer by self.processing_class by @qgallouedec in #40119
  • [FA2] Fix it finally - revert fa kwargs preparation by @Cyrilvallez in #40161
  • [bugfix] fix flash-attention2 unavailable error for Ascend NPU by @FightingZhen in #40151
  • build: Add fast image processor tvp by @adutchengineer in #39529
  • Add GptOssForSequenceClassification for GPT-OSS models by @zyfedward in #40043
  • Standardize BARTpho model card: badges, new examples, fixed broken im… by @eshwanthkartitr in #40051
  • Add dates to the model docs by @MHRDYN7 in #39320
  • Pin torch to 2.7.1 on CircleCI for now by @ydshieh in #40174
  • Update dynamic attnt setter for multimodals by @zucchini-nlp in #39908
  • [MINOR:TYPO] Update base.py by @cakiki in #40169
  • make model doc device agnostic by @yao-matrix in #40143
  • fix to avoid modifying a view in place by @3outeille in #40162
  • Fix fsdp for generic-task models by @Cyrilvallez in #40191
  • Add repr to EncoderDecoderCache by @Cyrilvallez in #40195
  • Fix typos by @cyyever in #40175
  • Remove _prepare_flash_attention_from_position_ids by @cyyever in #40069
  • Avoid CUDA stream sync by @cyyever in #40060
  • Fix various Pylint warnings by @cyyever in #40107
  • Update: add type hints to check_tokenizers.py by @ajeet214 in #40094
  • Benchmarking improvements by @ahadnagy in #39768
  • docs: Update LayoutLM model card according to new standardized format by @Jin-HoMLee in #40129
  • Revert "Pin torch to 2.7.1 on CircleCI for now" + Final fix for too long with no output by @ydshieh in #40201
  • Use correct model_input_names for PixtralImageProcessor by @rohitrango in #40226
  • fix error vocab_size at Qwen2_5_VLForConditionalGeneration loss_function by @killight98 in #40130
  • [SAM 2] Change checkpoints in docs and tests by @yonigozlan in #40213
  • Fix more typos by @cyyever in #40212
  • Fix ESM token_dropout crash when using inputs_embeds instead of input_ids by @notkisk in #40181
  • AMD scheduled CI ref env file by @ivarflakstad in #40243
  • Fix more pylint warnings by @cyyever in #40204
  • remove transpose_for_scores call in ESM-2 by @pstjohn in #40210
  • Add chat_template (jinja2) as an extra dependency by @tboerstad in #40128
  • [typing] fix type annotation error in DepthPro model image processor by @MengAiDev in #40238
  • [serve] guard imports by @gante in #39825
  • [CI] Fix repo consistency by @vasqu in #40249
  • Fixes for EncoderDecoderCache by @remi-or in #40008
  • fix: Catch correct ConnectionError for additional_chat_templates by @akug in #39874
  • Model card for NLLB by @sahil-kabir in #40074
  • Correct typo and update notes in docs Readme by @PavloFesenko in #40234
  • Fix benchmark workflow by @ahadnagy in #40254
  • docs: Update OLMo model card by @rafakatri in #40233
  • Skip broken tests by @zucchini-nlp in #40157
  • Remove MI300 CI by @ivarflakstad in #40270
  • set inputs_embeds to None while generate to avoid audio encoder forward in generation process by @BakerBunker in #40248
  • [detection] fix attention mask for RT-DETR-based models by @materight in #40269
  • Fix slow static cache export tests by @jackzhxng in #40261
  • Fix setting attention for multimodal models by @zucchini-nlp in #39984
  • [detection] fix correct k_proj weight and bias slicing in D-FINE by @notkisk in #40257
  • Skipping pytree registration in case fsdp is enabled by @romitjain in #40075
  • Update image_processing_perception_lm_fast.py to allow for proper override of vision_input_type by @tyleryzhu in #40252
  • fix which routing method by @ArthurZucker in #40283
  • Fix chat CLI GPU loading and request_id validation issues by @robin-ede in #40230
  • docs(layoutlm): add missing id=usage to <hfoptions> tag in LayoutLM model card by @Jin-HoMLee in #40273
  • Standardize RAG model card by @aayush226 in #40222
  • docs: Update TrOCR model card to new format by @AceHunterr in #40240
  • Update model card for gpt neox japanese by @ahnjj in #39862
  • SmolVLM and InternVL: Ensure pixel values are converted to the correct dtype for fp16/bf16 by @qgallouedec in #40121
  • Standardize BertGeneration model card by @nemitha2005 in #40250
  • Adjust ROCm test output expectations by @ahadnagy in #40279
  • SmolVLM test fixes by @ahadnagy in #40275
  • make model docs device agnostic (2) by @yao-matrix in #40256
  • [3/3] make docs device agnostic, all en docs for existing models done by @yao-matrix in #40298
  • Allow to be able to run torch.compile tests with fullgraph=True by @ydshieh in #40164
  • [FA] Fix dtype in varlen with position ids by @vasqu in #40295
  • [docs] delete more TF/Flax docs by @gante in #40289
  • Clean up X-Codec. by @ebezzam in #40271
  • Remove OTel SDK dependencies by @anuraaga in #40305
  • Fix GOT-OCR2 and Cohere2Vision image processor patches caculation by @Isotr0py in #40312
  • [fix] Pass adamw optimizer parameters to StableAdamW by @emapco in #40184
  • chore: fix typo in find_executable_batch_size to match new 0.9 ratio by @MilkClouds in #40206
  • :rotating_light: [Flash Attention] Fix sliding window size by @vasqu in #40163
  • Remove unnecessary contiguous calls for modern torch by @Rocketknight1 in #40315
  • Qwen2.5-Omni test fixes by @ahadnagy in #40307
  • Add back _tp_plan attribute by @rishub-tamirisa in #39944
  • byebye torch 2.1 by @Rocketknight1 in #40317
  • No more natten by @ydshieh in #40287
  • [GPT OSS] Refactor the tests as it was not properly checking the outputs by @ArthurZucker in #40288
  • Update CI with nightly torch workflow file by @ydshieh in #40306
  • Fix: Apply get_placeholder_mask in Ovis2 by @thisisiron in #40280
  • Update notification service amd_daily_ci_workflows definition by @ivarflakstad in #40314
  • One cache class to rule them all by @Cyrilvallez in #40276
  • Fix chunked attention mask with left-padding by @Cyrilvallez in #40324
  • [docs] remove flax references from /en/model_doc by @gante in #40311
  • Fix qwen-omni processor text only mode by @yuekaizhang in #40336
  • Change Qwen2RMSNorm to RMSNorm from PyTorch by @cyyever in #40066
  • Add DeepseekV3ForSequenceClassification for Deepseek V3 models by @abdokaseb in #40200
  • Fix deprecation warning version by @Cyrilvallez in #40343
  • Add missing arguments to class constructors by @cyyever in #40068
  • [docs] remove TF references from /en/model_doc by @gante in #40344
  • Fix: Only call Trainer.align_special_tokens if model has "config" attribute by @tomaarsen in #40322
  • add type hints by @wirthual in #40319
  • Fix an infinite loop bug in recursive search of relative imports by @eladsegal in #40326
  • Fix links in Glm4vMoe configuration classes to point to the correct H… by @vvvdwbvvv in #40310
  • T5 test and target device fixes by @ahadnagy in #40313
  • Update test_spm_converter_bytefallback_warning by @ydshieh in #40284
  • (small) fix conditional for input_ids and input_embeds in marian by @cyntqliu in #40045
  • Fix attention vizualizer by @molbap in #40285
  • [ModernBert] Prevent the attention mask from being None in ModernBertForSequenceClassification by @ashmikuz in #35991
  • Clean up XCodec and other codecs by @ebezzam in #40348
  • [serve] add cors warnings by @gante in #40112
  • [detection] use consistent dtype for Conditional and DAB DETR positional embeddings by @agkphysics in #40300
  • Remove more PyTorch 2.2 compatible code by @cyyever in #40337
  • [FA] Fix some model tests by @vasqu in #40350
  • Qwen2.5-VL test fixes for ROCm by @ahadnagy in #40308
  • [generate] handle support for cache classes when num enc layers != num dec layers by @gante in #40277
  • [4/N]more docs to device agnostic by @yao-matrix in #40355
  • DOCS: Clarification on the use of label_names as an argument to TrainingArguments by @huzaifa-jawad367 in #40353
  • Fix idefics3 vision embeddings indices dtype by @Isotr0py in #40360
  • wav2vec2 fixes by @remi-or in #40341
  • Change multimodal data links to HF hub by @zucchini-nlp in #40309
  • [pipelines] add support to skip_special_tokens in the main text generation pipelines by @gante in #40356
  • ⚠️⚠️ Use dtype instead of torch_dtype everywhere! by @Cyrilvallez in #39782
  • [processor] move commonalities to mixin by @zucchini-nlp in #40339
  • [configuration] allow to overwrite kwargs from subconfigs by @zucchini-nlp in #40241
  • fix(example): align parameter names with the latest function definition for gdino by @developer0hye in #40369
  • Add GptOssForTokenClassification for GPT-OSS models by @abdokaseb in #40190
  • Bug Fix: Dynamically set return_lse flag in FlexAttention by @amd-lalithnc in #40352
  • Chat Template Doc Fixes by @Rocketknight1 in #40173
  • Rework the Cache documentation by @Cyrilvallez in #40373
  • Update README_zh-hans.md by @TardC in #40380
  • HF papers in doc by @qgallouedec in #40381
  • Run FA2 tests in CI by @ydshieh in #40397
  • Reactivate a lot of tests skipped for no reason anymore by @Cyrilvallez in #40378
  • :broom: :broom: :broom: Get set decoder cleanup by @molbap in #39509
  • fix to accept cumulative_seqlens from TransformersKwargs in FA by @Kurt232 in #40194
  • [docs] flax/jax purge by @gante in #40372
  • Fix typo: 'casual' -> 'causal' in code and documentation by @akintunero in #40371
  • Fix CI (hunyuan moe does not support fullgraph) by @Cyrilvallez in #40423
  • Fix typo: 'seperator' to 'separator' in variable names by @Prawal-Sharma in #40389
  • Fix UnboundLocalError in WER metric computation by @prxshetty in #40402
  • Gpt oss optim by @jiqing-feng in #40304
  • Fix processing tests by @zucchini-nlp in #40379
  • Fix label smoothing incompatibility with multi-label classification by @avchauzov in #40296
  • Fix modular for modernbert-decoder by @Cyrilvallez in #40431
  • Update collated reports working directory and --path by @ivarflakstad in #40433
  • Add tokenizer_kwargs argument to the text generation pipeline by @Joshua-Chin in #40364
  • [docs] remove last references to transformers TF classes/methods by @gante in #40429
  • Remove working-dir from collated reports job by @ivarflakstad in #40435
  • 🌐 [i18n-KO] Translated models.md to Korean by @Judy-Choi in #39518
  • Gemma3 text fixes: Add expectations for MI325 by @ahadnagy in #40384
  • Fix collated reports model directory traversal by @ivarflakstad in #40437
  • Fix https://github.com/huggingface/transformers/issues/40292 by @id01 in #40439
  • Fix collated reports uploading by @ivarflakstad in #40440
  • InternVL MI325 test expectations by @ahadnagy in #40387
  • Fix collated reports model name entry by @ivarflakstad in #40441
  • Fix non FA2 tests after FA2 installed in CI docker image by @ydshieh in #40430
  • Refactor ViT-like models by @qubvel in #39816
  • [Trainer] accelerate contextparallel support in trainer by @kashif in #40205
  • fix qwen25-vl grad acc by @iMountTai in #40333
  • [video processors] decode only sampled videos -> less RAM and faster processing by @zucchini-nlp in #39600
  • rename get_cuda_warm_up_factor to get_accelerator_warm_up_factor by @yao-matrix in #40363
  • Make cache_config not mandatory by @remi-or in #40316
  • Continuous batching refactor by @remi-or in #40426
  • flash_paged: s_aux may not exist by @pcuenca in #40434
  • Fix extra template loading by @Rocketknight1 in #40455
  • deci gguf support by @ved1beta in #38669
  • [fast_image_processor] fix image normalization for resize by @audioXD in #40436
  • [RoPE] explicit factor > implicit factor in YaRN by @gante in #40320
  • [pipeline] Add Keypoint Matching pipeline by @sbucaille in #39970
  • Update SegFormer model card by @GSNCodes in #40417
  • Not to shock AMD team by the cancelled workflow run notification ❤️ 💖 by @ydshieh in #40467
  • Fix nightly torch CI by @ydshieh in #40469
  • CI when PR merged to main by @ydshieh in #40451
  • Validate GptOssConfig rope config after it's fully initialized by @zifeitong in #40474
  • [modular] Use multi-processing + fix model import issue by @Cyrilvallez in #40481
  • [modular] Remove ambiguity in all calls to parent class methods + fix dependency graph by @Cyrilvallez in #40456
  • [ESM] support attention API by @zucchini-nlp in #40370
  • [EfficientLoFTR] dynamic image size support by @sbucaille in #40329
  • Fix qwen2_moe tests by @ydshieh in #40494
  • [Whisper] Add rocm expected results to certain tests by @ivarflakstad in #40482
  • Collated reports: no need to upload artifact by @ivarflakstad in #40502
  • Fix the CI workflow of merge to main by @ydshieh in #40503
  • docs(pixtral): Update Pixtral model card to new format by @BryanBradfo in #40442
  • [modular] Classes can now be defined and referenced in arbitrary order (without bringing unwanted dependencies) by @Cyrilvallez in #40507
  • Include machine type in collated reports filename by @ivarflakstad in #40514

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @remi-or
    • Gemma3 fixes (#39960)
    • Various test fixes for AMD (#39978)
    • Fixes for EncoderDecoderCache (#40008)
    • wav2vec2 fixes (#40341)
    • Make cache_config not mandatory (#40316)
    • Continuous batching refactor (#40426)
  • @sbucaille
    • [superglue] Fixed the way batch mask was applied to the scores before match assignment computation (#39968)
    • [efficientloftr] fix bugs and follow original cross attn implementation strictly (#40141)
    • [pipeline] Add Keypoint Matching pipeline (#39970)
    • [EfficientLoFTR] dynamic image size support (#40329)
  • @ducviet00
    • Fix HGNetV2 Model Card and Image Classification Pipeline Usage Tips (#39965)
    • Add support for Florence-2 (#38188)
  • @cyyever
    • Fix default values of getenv (#39867)
    • Enable SIM rules (#39806)
    • Fix typos (#40175)
    • Remove _prepare_flash_attention_from_position_ids (#40069)
    • Avoid CUDA stream sync (#40060)
    • Fix various Pylint warnings (#40107)
    • Fix more typos (#40212)
    • Fix more pylint warnings (#40204)
    • Change Qwen2RMSNorm to RMSNorm from PyTorch (#40066)
    • Add missing arguments to class constructors (#40068)
    • Remove more PyTorch 2.2 compatible code (#40337)
  • @zRzRzRzRzRzRzR
    • GLM-4.5V Model Support (#39805)
  • @SangbumChoi
    • Add Segment Anything 2 (SAM2) (#32317)
  • @adutchengineer
    • build: Add fast image processor tvp (#39529)
  • @MHRDYN7
    • Add dates to the model docs (#39320)
  • @yao-matrix
    • make model doc device agnostic (#40143)
    • make model docs device agnostic (2) (#40256)
    • [3/3] make docs device agnostic, all en docs for existing models done (#40298)
    • [4/N]more docs to device agnostic (#40355)
    • rename get_cuda_warm_up_factor to get_accelerator_warm_up_factor (#40363)
  • @Manalelaidouni
    • Add X-Codec model (#38248)
  • @thisisiron
    • Add Ovis2 model and processor implementation (#37088)
    • Fix: Apply get_placeholder_mask in Ovis2 (#40280)
  • @tic-top
    • Add Kosmos-2.5 (#31711)
  • @yjc9696
    • HunYuan opensource (#39606)
  • @Fazziekey
    • Addiing ByteDance Seed Seed-OSS (#40272)
Aug 22, 2025
Patch v4.55.4

There was a mix-up on our side when cherry-picking commit #40197, which led to a wrong commit in the patch! Sorry everyone 😭

This patch is just the official fix for #40197!

Aug 21, 2025
Patch release v4.55.3

Patch release 4.55.3

Focused on stabilizing FlashAttention-2 on Ascend NPU, improving FSDP behavior for generic-task models, and fixing the MXFP4 integration for GPT-OSS.

Bug Fixes & Improvements

  • FlashAttention-2 / Ascend NPU – Fix “unavailable” runtime error (#40151) by @FightingZhen
  • FlashAttention kwargs – Revert FA kwargs preparation to resolve regression (#40161) by @Cyrilvallez
  • FSDP (generic-task models) – Fix sharding/runtime issues (#40191) by @Cyrilvallez
  • GPT-OSS / MXFP4 – Ensure swiglu_limit is correctly passed through (#40197) by @danielhanchen
  • Mamba – Fix cache handling to prevent stale/incorrect state (#40203) by @manueldeprada
  • Misc – Minor follow-up fix addressing #40262 by @ArthurZucker
Aug 13, 2025
Patch release 4.55.2: for FA2 users!

Patch release 4.55.2!

only affects FA2 generations!

😢 Well, sorry everyone, sometimes shit happens... 4.55.1 was broken because of a 🥁 git merge conflict. I cherry-picked https://github.com/huggingface/transformers/pull/40002 without https://github.com/huggingface/transformers/pull/40029, so from ..modeling_flash_attention_utils import prepare_fa_kwargs_from_position_ids was missing, and since this is only exercised by a slow test, nothing caught it.

We will work to remediate this and write a post-mortem when yanking the release.

Patch release 4.55.1

Patch release 4.55.1:

Mostly focused on stabilizing MXFP4 for the GPT-OSS model!

Bug Fixes & Improvements

  • Idefics2, Idefics3, SmolVLM – Fix tensor device issue (#39975) by @qgallouedec
  • Merge conflicts – Fix merge conflicts from previous changes by @vasqu
  • MXFP4 / CPU device_map – Default to dequantize when CPU is in device_map (#39993) by @MekkCyber
  • GPT Big Code – Fix attention scaling (#40041) by @vasqu
  • Windows compatibility – Resolve Triton version check compatibility (#39986) by @Tsumugii24 @MekkCyber
  • Gemma3n model – Add missing None default values for get_placeholder_mask (#39991, #40024) by @Znerual
  • Fuyu model – Fix broken image inference (#39915) by @Isotr0py
  • PerceptionLM – Fix missing video inputs (#39971) by @shuminghu
  • Idefics – Fix device mismatch (#39981) by @zucchini-nlp
  • Triton kernels – Remove triton_kernels dependency in favor of included kernels (#39926) by @SunMarc
  • GPT-OSS MXFP4 – Enable on older hardware (sm75+) (#39940) by @matthewdouglas @SunMarc
  • MXFP4 quantizer – Allow CPU inference with dequantize option (#39953) by @returnL

CI & Build

  • CI stability – Post-GPT-OSS fixes for green CI (#39929) by @gante @LysandreJik
Aug 11, 2025
GLM-4.5V preview based on 4.55.0

New model added by the Z.ai team to transformers! GLM-4.5V is a new multimodal reasoning model based on GLM-4.5-Air, which has 106B total and 12B active parameters.

It performs well across 42 benchmarks spanning various categories:

  • Image reasoning (scene understanding, complex multi-image analysis, spatial recognition)
  • Video understanding (long video segmentation and event recognition)
  • GUI tasks (screen reading, icon recognition, desktop operation assistance)
  • Complex chart & long document parsing (research report analysis, information extraction)
  • Grounding (precise visual element localization)
<img width="5188" height="5082" alt="image" src="https://github.com/user-attachments/assets/9dda0f5b-6de0-49ce-b02a-684106750353" />

To use it, install the transformers preview release:

pip install transformers-v4.55.0-GLM-4.5V-preview

Then you can run:

from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration
import torch

MODEL_PATH = "zai-org/GLM-4.5V"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png"
            },
            {
                "type": "text",
                "text": "describe this image"
            }
        ],
    }
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = Glm4vMoeForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(output_text)
Aug 5, 2025
v4.55.0: New openai GPT OSS model!

Welcome GPT OSS, the new open-source model family from OpenAI!

<img width="2320" height="1160" alt="image" src="https://github.com/user-attachments/assets/4a1cd2f6-dde9-445e-83d9-73f6551e2da2" />

For more detailed information about this model, we recommend reading the following blogpost: https://huggingface.co/blog/welcome-openai-gpt-oss

GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a big one with 117B parameters (gpt-oss-120b), and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoEs) and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16GB of memory and is perfect for consumer hardware and on-device applications.

Overview of Capabilities and Architecture

  • 21B and 117B total parameters, with 3.6B and 5.1B active parameters, respectively.
  • 4-bit quantization scheme using the mxfp4 format, applied only to the MoE weights. As stated, the 120B model fits on a single 80 GB GPU and the 20B fits on a single 16 GB GPU.
  • Reasoning, text-only models, with chain-of-thought and adjustable reasoning effort levels.
  • Instruction following and tool use support.
  • Inference implementations using transformers, vLLM, llama.cpp, and ollama.
  • The Responses API is recommended for inference.
  • License: Apache 2.0, with a small complementary use policy.
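As a rough sanity check on the memory figures above, here is a back-of-envelope estimate (a sketch only: it ignores activations, the KV cache, and the non-quantized non-MoE weights):

```python
# Back-of-envelope weight-memory estimate for the two GPT OSS checkpoints.
# MXFP4 stores the MoE weights in ~4 bits (0.5 bytes) per parameter.
def approx_weight_gb(num_params_billion: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GB (1 GB = 1e9 bytes)."""
    return num_params_billion * 1e9 * bytes_per_param / 1e9

# 117B parameters at ~4 bits/param: roughly 58 GB, which fits an 80 GB H100.
print(approx_weight_gb(117, 0.5))   # 58.5
# 21B parameters at ~4 bits/param: roughly 10 GB, within a 16 GB budget.
print(approx_weight_gb(21, 0.5))    # 10.5
# The same 21B model in bfloat16 (2 bytes/param) already needs ~42 GB for
# weights alone, consistent with the ~48 GB bfloat16 figure quoted for the
# 20B model once the remaining weights and runtime overhead are included.
print(approx_weight_gb(21, 2.0))    # 42.0
```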

Architecture

  • Token-choice MoE with SwiGLU activations.
  • When calculating the MoE weights, a softmax is taken over selected experts (softmax-after-topk).
  • Each attention layer uses RoPE with 128K context.
  • Alternating attention layers: full-context, and a sliding 128-token window.
  • Attention layers use a learned attention sink per head, where the denominator of the softmax includes an additional additive value.
  • It uses the same tokenizer as GPT-4o and other OpenAI API models.
  • Some new tokens have been incorporated to enable compatibility with the Responses API.
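Two of the points above can be illustrated numerically: softmax-after-topk routing (the softmax is computed only over the selected experts) and the per-head attention sink (an extra additive term in the softmax denominator). This is an illustrative NumPy sketch with made-up logit values, not the actual model code:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# --- Softmax-after-topk MoE routing ---
# Router logits for 8 experts; select the top 4, then softmax over those only.
logits = np.array([2.0, -1.0, 0.5, 3.0, 0.0, 1.5, -0.5, 2.5])
topk = np.argsort(logits)[-4:]         # indices of the 4 largest logits
weights = softmax(logits[topk])        # softmax restricted to selected experts
assert np.isclose(weights.sum(), 1.0)  # routing weights over chosen experts sum to 1

# --- Attention sink in the softmax denominator ---
# A learned per-head scalar contributes exp(sink) to the denominator, letting
# a head assign "no attention" mass without forcing it onto real tokens.
scores = np.array([1.0, 0.2, -0.3])    # attention logits for 3 keys (hypothetical)
sink = 0.5                             # learned sink logit (hypothetical value)
attn = np.exp(scores) / (np.exp(scores).sum() + np.exp(sink))
assert attn.sum() < 1.0                # probabilities over real tokens sum to < 1
```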

The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or ~48 GB in bfloat16.

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]  

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Flash Attention 3

The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to pip install --upgrade kernels and add the following line to your snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Flash Attention with Sinks
+    attn_implementation="kernels-community/vllm-flash-attn3",
)  

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with torchrun --nproc_per_node=4 generate.py on a system with 4 GPUs:

from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")

device_map = {
    "tp_plan": "auto",    # Enable Tensor Parallelism
}

model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)

messages = [
     {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())

Other optimizations

If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!

[!TIP] If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+    # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+    use_kernels=True,
)

messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]

inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))

[!TIP] MegaBlocks optimized MoE kernels require the model to run on bfloat16, so memory consumption will be higher than running on mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True.

transformers serve

You can use transformers serve to experiment locally with the models, without any other dependencies. You can launch the server with just: transformers serve

To which you can send requests using the Responses API.

# responses API
curl -X POST http://localhost:8000/v1/responses \
-H "Content-Type: application/json" \
-d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'

You can also send requests using the standard Completions API:

# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'

Command A Vision

<img width="1920" height="960" alt="image" src="https://github.com/user-attachments/assets/5502cc65-2fc9-49ac-8e15-262aa573b68d" />

Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.

The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.

Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.

  • [Model] Cohere2 Vision by @zucchini-nlp in #39810

MM Grounding DINO

<img width="838" height="266" alt="image" src="https://github.com/user-attachments/assets/4d1e153c-0586-4650-8e18-c9d08145ce49" />

MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, Haian Huang.

MM Grounding DINO improves upon Grounding DINO by enhancing the contrastive class head and removing parameter sharing in the decoder, improving zero-shot detection performance on both COCO (50.6 (+2.2) AP) and LVIS (31.9 (+11.8) val AP and 41.4 (+12.6) minival AP).

You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.

  • Add MM Grounding DINO by @rziga in #37925

Bugfixes and improvements

  • More robust tied weight test by @Cyrilvallez in #39681
  • fix missing model._tp_size from ep refactor by @winglian in #39688
  • Fix missing initialization of FastSpeech2Conformer by @bvantuan in #39689
  • fix(tokenization): check token.content for trie by @pjo256 in #39587
  • xpu optimization for generation case by @sywangyi in #39573
  • [processors] add tests for helper fn by @zucchini-nlp in #39629
  • update ernie model card by @jzhang533 in #39657
  • [configuration] remove redundant classmethod by @zucchini-nlp in #38812
  • Add self-hosted runner scale set workflow for mi325 CI by @jitesh-gupta in #39651
  • PATCH: add back n-dim device-mesh + fix tp trainer saving by @S1ro1 in #39693
  • [CI] Add Eric to comment slow ci by @vasqu in #39601
  • Remove all expired deprecation cycles by @Cyrilvallez in #39725
  • mllama outputs refactor by @itazap in #39643
  • Update QAPipelineTests::test_large_model_course after #39193 by @ydshieh in #39666
  • skip Glm4MoeModelTest::test_torch_compile_for_training by @ydshieh in #39670
  • Fix Qwen2AudioForConditionalGeneration.forward() and test_flash_attn_kernels_inference_equivalence by @ebezzam in #39503
  • Fix Layer device placement in Caches by @Cyrilvallez in #39732
  • Fix cache-related tests by @zucchini-nlp in #39676
  • Fix AMD dockerfile for audio models by @remi-or in #39669
  • Superpoint fast image processor by @arkhamHack in #37804
  • Add Fast Segformer Processor by @capnmav77 in #37024
  • BLIPs clean-up by @zucchini-nlp in #35560
  • extend more trainer test cases to XPU, all pass by @yao-matrix in #39652
  • fix cache inheritance by @ArthurZucker in #39748
  • [Fix] import two missing typos in models/__init__.py for typo checking by @hebangwen in #39745
  • Fix: add back base model plan by @S1ro1 in #39733
  • update GemmaIntegrationTest::test_model_2b_bf16_dola again by @ydshieh in #39731
  • Update IMPORTANT_MODELS list by @ivarflakstad in #39734
  • Fix mamba regression by @manueldeprada in #39728
  • Apply several ruff SIM rules by @cyyever in #37283
  • Use --gpus all in workflow files by @ydshieh in #39752
  • AMD disable torchcodec by @ivarflakstad in #39757
  • Avoid OOM when other tests are failing by @ydshieh in #39758
  • Fix GPT2 with cross attention by @zucchini-nlp in #39754
  • Support loading Qwen3 MoE GGUF by @ctcanbol in #39638
  • Enable xpu allocator on caching_allocator_warmup by @jiqing-feng in #39654
  • Fix version issue in modeling_utils.py by @Cyrilvallez in #39759
  • add libcst to extras["testing"] in setup.py by @ydshieh in #39761
  • [modenbert] fix regression by @zucchini-nlp in #39750
  • 🌐 [i18n-KO] Translated main_classes/peft.md by @luckyvickyricky in #39515
  • 🌐 [i18n-KO] Translated albert.md to Korean by @ahnjj in #39524
  • 🌐 [i18n-KO] Translated tvp.md to Korean by @Kim-Ju-won in #39578
  • 🌐 [i18n-KO] Translated tokenizer.md to Korean by @seopp in #39532
  • 🌐 [i18n-KO] Translated pipeline_gradio.md to Korean by @AhnJoonSung in #39520
  • 🌐 [i18n-KO] Translated perf_train_gpu_one.md to Korean by @D15M4S in #39552
  • 🌐 [i18n-KO] Translated how_to_hack_models.md to Korean by @skwh54 in #39536
  • fix(trainer): Correct loss scaling for incomplete gradient accumulation steps by @hutaiHang in #39659
  • Fix Cache.max_cache_len max value for Hybrid models by @manueldeprada in #39737
  • [docs] Ko doc fixes after toc update by @gante in #39660
  • Remove python3.7 reference from doc link by @st81 in #39706
  • Fix OmDet test after arg deprecation by @Cyrilvallez in #39766
  • docs: Update EfficientLoFTR documentation by @sbucaille in #39620
  • Standardize CLAP model card format by @yanamis in #39738
  • Don't set run_name when none by @qgallouedec in #39695
  • Fix Evolla and xLSTM tests by @Cyrilvallez in #39769
  • enable static cache on vision encoder decoder by @jiqing-feng in #39773
  • [ASR pipline] fix with datasets 4.0 by @eustlb in #39504
  • more info in model_results.json by @ydshieh in #39783
  • Super tiny update by @zucchini-nlp in #39727
  • fix chameleonvision UT failure by @yao-matrix in #39646
  • Fix an invalid condition by @cyyever in #39762
  • Simplify conditional code by @cyyever in #39781
  • Fix re-compilations for cross attention cache by @zucchini-nlp in #39788
  • standardized BARThez model card by @EthanV431 in #39701
  • Update model card for Cohere2 (Command R7B) by @arpon-kapuria in #39604
  • Update mT5 model card by @dross20 in #39702
  • Add callback to monitor progress in whisper transcription by @poke1024 in #37483
  • fix: providing a tensor to cache_position in model.generate kwargs always crashes because of boolean test by @gante in #39300
  • feat(tokenization): add encode_message to tokenize messages one by one by @pco111 in #39507
  • [docs] fix korean docs yet again by @gante in #39813
  • Update documentation for Cohere2Vision models by @kyle-cohere in #39817
  • [cohere2 vision] move doc to multimodal section by @zucchini-nlp in #39820
  • Fix broken links by @oToToT in #39809
  • Fix bad markdown links by @ebezzam in #39819
  • Fix tp cb by @ArthurZucker in #39838
  • [VLMs] split out "get placeholder mask" to helper by @zucchini-nlp in #39777
  • [attn_implementation] remove recursive, allows custom kernels with wrappers by @ArthurZucker in #39823
  • [typecheck] proper export of private symbols by @cyyever in #39729
  • Update ux cb by @ArthurZucker in #39845
  • Fix responses add tests by @LysandreJik in #39848
  • Add fast image processor Janus, Deepseek VL, Deepseek VL hybrid by @yonigozlan in #39739
  • [image-processing] deprecate plot_keypoint_matching, make visualize_keypoint_matching as a standard by @sbucaille in #39830
  • Allow TrackioCallback to work when pynvml is not installed by @qgallouedec in #39851
  • remove dtensors, not explicit by @ArthurZucker in #39840
  • Improve is_wandb_available function to verify WandB installation by @qgallouedec in #39875
  • Refactor label name handling for PEFT models in Trainer class by @qgallouedec in #39265
  • Use comment to build doc on PRs by @ydshieh in #39846
  • Add support for including in-memory videos (not just files/urls) in apply_chat_template by @akibjawad in #39494
  • [core] Fix attn_implementation setter with missing sub_configs by @qubvel in #39855
  • Fix quant docker for fp-quant by @SunMarc in #39641
  • Rework add-new-model-like with modular and make test filenames coherent by @Cyrilvallez in #39612
  • Replace Tokenizer with PreTrainedTokenizerFast in ContinuousBatchProcessor by @qgallouedec in #39858
  • Set torch.backends.cudnn.allow_tf32 = False for CI by @ydshieh in #39885
  • [typing] better return type hint for AutoModelForCausalLM and AutoModelForImageTextToText by @qubvel in #39881
  • Fix link to models in README by @qubvel in #39880
  • [DOCS] : Improved mimi model card by @rohitthewanderer in #39824
  • Update cohere2 vision test by @ydshieh in #39888
  • send some feedback when manually building doc via comment by @ydshieh in #39889
  • Add support for ModernBertForMultipleChoice by @netique in #39232
  • chore: update DETR model card by @arpon-kapuria in #39822
  • Reorder serving docs by @LysandreJik in #39634
  • [Exaone4] Fixes the attn implementation! by @ArthurZucker in #39906
  • fix test_working_of_tp failure of accelerate ut by @yao-matrix in #39828
  • [qwen] remove unnecessary CUDA sync in qwen2_5_vl by @cyyever in #39870
  • Avoid aliasing in cond's branches for torch 2.8 by @ydwu4 in #39488
  • Fix misleading WandB error when WANDB_DISABLED is set by @notkisk in #39891
  • Replace video_fps with fps in tests by @cyyever in #39898
  • Fix eval thread fork bomb by @JustinVanHeek in #39717
  • Fix aria tests by @zucchini-nlp in #39879

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @capnmav77
    • Add Fast Segformer Processor (#37024)
  • @cyyever
    • Apply several ruff SIM rules (#37283)
    • Fix an invalid condition (#39762)
    • Simplify conditional code (#39781)
    • [typecheck] proper export of private symbols (#39729)
    • [qwen] remove unnecessary CUDA sync in qwen2_5_vl (#39870)
    • Replace video_fps with fps in tests (#39898)
  • @rziga
    • Add MM Grounding DINO (#37925)
Jul 29, 2025
Patch release 4.54.1

We had quite a lot of bugs that got through! The release was a bit rushed, sorry everyone! 🤗 Mostly cache fixes, as we now have a layered cache, along with fixes to distributed training.

  • Fix Cache.max_cache_len max value for Hybrid models, @manueldeprada, @Cyrilvallez, #39737
  • [modenbert] fix regression, @zucchini-nlp, #39750
  • Fix version issue in modeling_utils.py, @Cyrilvallez, #39759
  • Fix GPT2 with cross attention, @zucchini-nlp, #39754
  • Fix mamba regression, @manueldeprada, #39728
  • Fix: add back base model plan, @S1ro1, #39733
  • fix cache inheritance, #39748
  • Fix cache-related tests, @zucchini-nlp, #39676
  • Fix Layer device placement in Caches, @Cyrilvallez, #39732
  • PATCH: add back n-dim device-mesh + fix tp trainer saving, @S1ro1, @SunMarc, #39693
  • fix missing model._tp_size from ep refactor, @winglian, #39688
Jul 25, 2025
v4.54.0: Kernels, Transformers Serve, Ernie, Voxtral, LFM2, DeepSeek v2, ModernBERT Decoder...

Important news!

In order to become the source of truth, we recognize that we need to address two common and long-standing critiques about transformers:

  1. transformers is bloated
  2. transformers is slow

Our team has focused on improving both aspects, and we are now ready to share the results. The modeling files for the standard Llama models are down to 500 LOC and should be much more readable, keeping just the core of the modeling and hiding the "powerful transformers features." <img width="1583" height="974" alt="image" src="https://github.com/user-attachments/assets/f1075598-d63e-4184-b3af-c0d4b31cdde5" />

MoE models are getting some kernel magic: they can now use the efficient megablocks kernels, setting a precedent that lets the community leverage the most powerful kernels, including those developed for quantization. It should also be much more convenient to use any attention implementation you want. This opens the door to optimizations such as leveraging flash-attention on Metal (the MPS Torch backend). <img width="2050" height="752" alt="image" src="https://github.com/user-attachments/assets/23ebfb20-7626-46a5-b264-76ffb8b8c811" />

This is but the tip of the iceberg: with the work on kernels that we're heavily pushing forward, expect speed-ups on several backends in the coming months!!

This release also includes the first steps to enabling efficient distributed training natively in transformers. Loading a 100B model takes ~3 seconds on our cluster — we hope this will be the norm for everyone! We are working on distributed checkpointing as well, and want to make sure our API can be easily used for any type of parallelism.

We want the community to benefit from all of these advances and, as always, we aim to support all hardware and platforms! We believe the kernels library will give everyone the tools to optimize everything, making a big difference for the industry!

New models

Ernie 4.5

The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family contains multiple architectures and model sizes. This model specifically targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.

Other models from the family can be found at Ernie 4.5 MoE.

<div class="flex justify-center"> <img src="https://ernie.baidu.com/blog/posts/ernie4.5/overview.png"/> </div>
  • [Ernie 4.5] Add ernie text models by @vasqu in #39228

Voxtral

Voxtral is an upgrade of Ministral 3B and Mistral Small 3B, extending their language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.

You can read more in Mistral's release blog post.

The model is available in two checkpoints:

Key Features

Voxtral builds on Ministral-3B by adding audio processing capabilities:

  • Transcription mode: Includes a dedicated mode for speech transcription. By default, Voxtral detects the spoken language and transcribes it accordingly.
  • Long-form context: With a 32k token context window, Voxtral can process up to 30 minutes of audio for transcription or 40 minutes for broader audio understanding.
  • Integrated Q&A and summarization: Supports querying audio directly and producing structured summaries without relying on separate ASR and language models.
  • Multilingual support: Automatically detects language and performs well across several widely spoken languages, including English, Spanish, French, Portuguese, Hindi, German, Dutch, and Italian.
  • Function calling via voice: Can trigger functions or workflows directly from spoken input based on detected user intent.
  • Text capabilities: Maintains the strong text processing performance of its Ministral-3B foundation.
  • Add voxtral by @eustlb in #39429

LFM2

LFM2 represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.

The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.

  • LFM2 by @paulpak58 in #39340

DeepSeek v2

The DeepSeek-V2 model was proposed in DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by DeepSeek-AI Team.

The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.

  • Add DeepSeek V2 Model into Transformers by @VladOS95-cyber in #36400

ModernBERT Decoder models

ModernBERT Decoder shares the architecture of ModernBERT but is trained from scratch with a causal language modeling (CLM) objective, making it possible to compare encoders and decoders built on the same architecture. This is the decoder implementation of ModernBERT, designed for autoregressive text generation tasks.

Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.

  • Add ModernBERT Decoder Models - ModernBERT, but trained with CLM! by @orionw in #38967

EoMT

The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.

<div style="text-align: center;"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/eomt_architecture.png" alt="drawing" width="500"/> </div>
  • ✨ Add EoMT Model || 🚨 Fix Mask2Former loss calculation by @yaswanth19 in #37610

Doge

Doge is a series of small language models based on the Doge architecture. It aims to combine the advantages of state-space and self-attention algorithms, computing dynamic masks from cached value states with the zero-order hold method to address the problem of mainstream language models getting lost in long contexts. It is pre-trained on the smollm-corpus with the wsd_scheduler, and from stable-stage checkpoints it can continue training on new datasets or add sparse-activation feedforward networks.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/model_doc/doge_architecture.png" alt="drawing" width="600"/>

  • Add Doge model by @LoserCheems in #35891

AIM v2

The AIMv2 model was proposed in Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.

The abstract from the paper is the following:

We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.

  • Add Aimv2 model by @yaswanth19 in #36625

PerceptionLM

The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of a vision encoder with a small scale (<8B parameters) LLM decoder.

  • PerceptionLM by @shuminghu in #37878

Efficient LoFTR

The EfficientLoFTR model was proposed in Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.

This model matches two images by finding pixel correspondences, which can be used to estimate the pose between them. It is useful for tasks such as image matching and homography estimation.

  • Add EfficientLoFTR model by @sbucaille in #36355

EVOLLA

<img width="2772" height="3276" alt="image" src="https://github.com/user-attachments/assets/8fb76e17-9bff-4edc-9ac8-205f4b58a898" />

Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.

  • Add evolla rebase main by @zhoubay in #36232

DeepSeek VL

<img width="824" height="1017" alt="image" src="https://github.com/user-attachments/assets/298074a0-c509-4e0b-9adb-c72bac206b18" />

Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.

  • Add support for DeepseekAI's DeepseekVL by @geetu040 in #36248

xLSTM

<img width="692" height="434" alt="image" src="https://github.com/user-attachments/assets/164243ba-1fd8-48b5-b565-3396d640ce0e" />

The xLSTM model was proposed in xLSTM: Extended Long Short-Term Memory by Maximilian Beck*, Korbinian Pöppel*, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter and Sepp Hochreiter. xLSTM updates the original LSTM architecture to be competitive with Transformer models by introducing exponential gating, matrix memory expansion, and parallelizable training and ingestion.

The 7B model variant was trained by the xLSTM team Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick Blies, Sebastian Böck and Sepp Hochreiter at NXAI.

  • Add xlstm model by @Cyrilvallez in #39665

EXAONE 4.0

<img width="3750" height="954" alt="image" src="https://github.com/user-attachments/assets/591040d2-d90a-4770-a4a7-0d91e21263fb" />

The EXAONE 4.0 model is a language model that integrates a Non-reasoning mode and a Reasoning mode, combining the excellent usability of EXAONE 3.5 with the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.

The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.

  • Add EXAONE 4.0 model by @lgai-exaone in #39129

Parallelisation

We've added Expert Parallel support for Llama4; the next release will include it for all models! You can simply set a distributed_config with enable_expert_parallel=True. This enables efficient training of sparse Mixture-of-Experts (MoE) models across multiple devices: each expert in the MoE layer runs in parallel (instead of the previous TP approach, which requires more communication), significantly improving scalability and memory efficiency.
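The routing idea behind expert parallelism can be pictured with a toy, single-process sketch (pure Python, not the transformers API): each token is assigned to one expert, and each device owns a contiguous slice of the experts, so a device only receives the tokens routed to its experts instead of sharding every expert across all devices.

```python
# Toy illustration of expert-parallel routing (not the transformers API):
# each "device" owns a subset of experts and receives only the tokens
# routed to those experts.

def route_tokens(tokens, num_experts):
    """Assign each token index to an expert (trivial hash-based router)."""
    return {i: hash(tok) % num_experts for i, tok in enumerate(tokens)}

def dispatch(assignments, num_devices, experts_per_device):
    """Group token indices by the device that owns their assigned expert."""
    per_device = {d: [] for d in range(num_devices)}
    for tok_idx, expert in assignments.items():
        per_device[expert // experts_per_device].append(tok_idx)
    return per_device

tokens = ["the", "cat", "sat", "on", "the", "mat"]
assignments = route_tokens(tokens, num_experts=8)
per_device = dispatch(assignments, num_devices=4, experts_per_device=2)

# Every token lands on exactly one device.
assert sorted(i for idxs in per_device.values() for i in idxs) == list(range(len(tokens)))
```

Because experts stay whole on a single device, the only communication needed is the all-to-all exchange of routed tokens, which is what makes this cheaper than tensor-parallel sharding of each expert.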

  • Add ep by @ArthurZucker in #39501

Quantization

FP Quant

FP-Quant is a quantization method optimized for Blackwell-generation Nvidia GPUs, supporting efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs using MXFP4 and NVFP4 formats.

Currently, only PTQ with MXFP4 is available. You can quantize models on-the-fly using transformers:

import torch
from transformers import AutoModelForCausalLM, FPQuantConfig

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen3-8B",
    quantization_config=FPQuantConfig(),
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)

FP-Quant requires a Blackwell GPU and runs via the QuTLASS library. No Blackwell GPU? Use FPQuantConfig(pseudoquant=True) to emulate quantization (no QuTLASS needed).
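To build intuition for what MXFP4 does, here is a simplified, pure-Python sketch of block-scaled 4-bit quantization: all values in a block share one power-of-two scale, and each value is rounded to the small grid of magnitudes a 4-bit E2M1 element can represent. This is an illustration of the idea only, not the QuTLASS implementation.

```python
import math

# Magnitudes representable by a 4-bit E2M1 element (plus a sign bit).
GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(values):
    """Quantize a block of floats with one shared power-of-two scale."""
    amax = max(abs(v) for v in values) or 1.0
    # Pick a scale so the largest magnitude maps near the grid maximum (6.0).
    scale = 2.0 ** math.ceil(math.log2(amax / GRID[-1]))
    quantized = []
    for v in values:
        mag = min(GRID, key=lambda g: abs(g - abs(v) / scale))
        quantized.append(math.copysign(mag, v))
    return scale, quantized

def dequantize_block(scale, quantized):
    return [q * scale for q in quantized]

scale, q = quantize_block([0.1, -0.7, 2.3, -5.9])
approx = dequantize_block(scale, q)
# Each reconstructed value lands on the nearest scaled grid point.
```

The pseudoquant mode mentioned above does essentially this round-trip in high precision, which is why it lets you evaluate quantization quality without a Blackwell GPU.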

The following results show the inference speedup of QuTLASS MXFP4 over PyTorch BF16 in Transformers. MXFP4 gives consistent speedups across all batch sizes, reaching up to 4× faster at larger scales.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/qwen3-8b-end-to-end-prefill-speedup-mxfp4-vs-bf16-on-rtx5090.svg" alt="drawing" width="600">
  • FP-Quant support by @BlackSamorez in #38696

Kernels

The kernels project aims to become the single trusted source for high-performance kernels in the Transformers ecosystem. We're working toward centralizing all kernels on the Hub, so updates, bug fixes, and improvements can happen in one place—no more scattered repos and no compilation headaches!

You can already try it out today by setting use_kernels=True in from_pretrained. Any contributor can build their kernel, register it, and use it right away with no extra setup; more on this here

Even better: want to use Flash Attention 3? No need to deal with tricky compilation and missing symbols issues! Just drop in:

model.set_attn_implementation("kernels-community/flash-attn3")

This automatically fetches the right build for your setup (e.g. CUDA and PyTorch versions).

We’re also teaming up with amazing kernel devs from Unsloth, Liger, vLLM, and more to bring their work directly to the Hub, making it easier than ever to get top performance with a single line of code.

  • Kernels flash attn by @ArthurZucker in #39474

Transformers Serve

https://github.com/user-attachments/assets/9928f62b-543c-4b8a-b81b-4a6e262c229e

Over the past few months, we have been putting more and more functionality in the transformers chat utility, which offers a CLI-based app to chat with chat models. We've chosen to push this further by splitting the backend of transformers chat into a new, separate utility called transformers serve.

This is ideal for experimentation purposes, or to run models locally for personal and private use. It does not aim to compete with dedicated inference engines such as vLLM or SGLang.

Models of diverse modalities supported by transformers may be served with the transformers serve CLI. It spawns a local server that offers compatibility with the OpenAI SDK, which is the de-facto standard for LLM conversations and other related tasks. This way, you can use the server from many third party applications, or test it using the transformers chat CLI (docs).

The server supports the following REST APIs:

  • /v1/chat/completions
  • /v1/responses
  • /v1/audio/transcriptions
  • /v1/models
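Since the server speaks the OpenAI wire format, any HTTP client works. A minimal stdlib sketch of a chat-completion request is below; the model name and port are placeholders, so adjust them to whatever your `transformers serve` instance is running.

```python
import json
import urllib.request

# OpenAI-style chat payload; model name and port are placeholders.
payload = {
    "model": "Qwen/Qwen3-8B",
    "messages": [{"role": "user", "content": "Hello!"}],
    "stream": False,
}
request = urllib.request.Request(
    "http://localhost:8000/v1/chat/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# With a running `transformers serve` instance, send it with:
# with urllib.request.urlopen(request) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

The same shape works through the official OpenAI SDK by pointing its `base_url` at the local server.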

Relevant commits:

  • Split transformers chat and transformers serve by @LysandreJik in #38443
  • [serve] Cursor support, move docs into separate page, add more examples by @gante in #39133
  • Fix continuous batching in transformers serve by @LysandreJik in #39149
  • [server] add tests and fix passing a custom generation_config by @gante in #39230
  • [serve] Model name or path should be required by @LysandreJik in #39178
  • Random serve fixes by @pcuenca in #39176
  • [tests] tag serve tests as slow by @gante in #39343
  • Responses API in transformers serve by @LysandreJik in #39155
  • [serve] Add speech to text (/v1/audio/transcriptions) by @gante in #39434
  • Transformers serve VLM by @LysandreJik in #39454

Refactors

Significant refactors have been underway in transformers, aiming to reduce the complexity of the code. One metric we track to see how the refactors impact our code is the number of lines in a given model; we try to reduce it as much as possible, while keeping everything related to the forward pass and model definition in that file.

See the evolution here:

<img width="1200" height="600" alt="image" src="https://github.com/user-attachments/assets/c232bc8d-7d7c-4192-baa8-a60efe5eb2ff" />

Some notable refactors:

KV caching

KV caches are now defined per layer, enabling new hybrid caches that mix different attention types. CacheProcessors also encapsulate cache quantization and offloading, making them easy to customize.
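The layered layout can be pictured with a toy sketch (pure Python, not the actual Cache classes): the cache is a list of per-layer objects, so a hybrid model can mix, say, full-attention and sliding-window layers, each with its own retention policy.

```python
# Toy per-layer KV cache: each layer owns its keys/values and its own
# retention policy, so hybrid models can mix layer types freely.

class FullAttentionLayerCache:
    def __init__(self):
        self.keys, self.values = [], []

    def update(self, k, v):
        self.keys.append(k)
        self.values.append(v)

class SlidingWindowLayerCache(FullAttentionLayerCache):
    def __init__(self, window):
        super().__init__()
        self.window = window

    def update(self, k, v):
        super().update(k, v)
        # Keep only the most recent `window` positions.
        self.keys = self.keys[-self.window:]
        self.values = self.values[-self.window:]

# A hybrid cache alternating full-attention and sliding-window layers.
cache = [FullAttentionLayerCache() if i % 2 == 0 else SlidingWindowLayerCache(window=4)
         for i in range(4)]
for step in range(10):  # simulate 10 decoding steps
    for layer in cache:
        layer.update(f"k{step}", f"v{step}")

assert len(cache[0].keys) == 10  # full-attention layer keeps everything
assert len(cache[1].keys) == 4   # sliding-window layer keeps the last 4
```

In the real library, cache quantization and offloading hook into this per-layer structure via CacheProcessors instead of being baked into one monolithic cache class.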

  • [cache refactor] Move all the caching logic to a per-layer approach by @manueldeprada in #39106

Handling specific attributes like output_attentions or output_hidden_states

Such attributes require very specific handling within the forward call, yet they're not important for understanding how the model works. We removed that code but kept the functionality by providing a better utility to handle it.

  • Refactor the way we handle outputs for new llamas and new models by @ArthurZucker in #39120

Setting the attention implementation

We refactored the way the attention implementation is explicitly set, so that it now has a dedicated method.

  • [refactor] set attention implementation by @zucchini-nlp in #38974

Breaking changes

  • [Whisper] 🚨 Fix pipeline word timestamp: timestamp token is end of token time !!! by @eustlb in #36632
  • 🚨 Don't use cache in non-generative models by @zucchini-nlp in #38751
  • 🚨🚨🚨 [eomt] make EoMT compatible with pipeline by @yaswanth19 in #39122
  • 🚨🚨 Fix and simplify attention implementation dispatch and subconfigs handling by @Cyrilvallez in #39423
  • 🚨🚨🚨 [Trainer] Enable average_tokens_across_devices by default in TrainingArguments by @Krish0909 in #39395
  • 🔴 Fix EnCodec internals and integration tests by @ebezzam in #39431

Bugfixes and improvements

  • Add StableAdamW Optimizer by @SunMarc in #39446
  • [Flex Attn] Fix torch 2.5.1 incompatibilities by @vasqu in #37406
  • fix test_compare_unprocessed_logit_scores by @ydshieh in #39053
  • fix t5gemma tests by @ydshieh in #39052
  • Update SuperPoint model card by @sbucaille in #38896
  • fix layoutlmv3 tests by @ydshieh in #39050
  • [docs] Model contribution by @stevhliu in #38995
  • Update PEGASUS-X model card by @dross20 in #38971
  • [docs] @auto_docstring by @stevhliu in #39011
  • [docs] Tensor parallelism by @stevhliu in #38241
  • [Whisper] fix shape mismatch in tests by @eustlb in #39074
  • Cleanup Attention class for Siglip and dependent models by @yaswanth19 in #39040
  • fix Gemma3nProcessorTest by @ydshieh in #39068
  • Fix initialization of OneFormer by @bvantuan in #38901
  • Uninstallling Flash attention from quantization docker by @MekkCyber in #39078
  • fix a bunch of XPU UT failures on stock PyTorch 2.7 and 2.8 by @yao-matrix in #39069
  • Pipeline: fix unnecessary warnings by @eustlb in #35753
  • fix mistral3 tests by @ydshieh in #38989
  • fixed typo for docstring in prepare_inputs method by @JINO-ROHIT in #39071
  • TST PEFT integration tests with pipeline generate by @BenjaminBossan in #39086
  • add fast image processor nougat by @NahieliV in #37661
  • Add Fast Image Processor for mobileViT by @MinJu-Ha in #37143
  • guard torch distributed check by @tvukovic-amd in #39057
  • fix dots1 tests by @ydshieh in #39088
  • Add Fast Image Processor for Chameleon by @farrosalferro in #37140
  • Fix: unprotected import of tp plugin by @S1ro1 in #39083
  • TST Fix PEFT integration test bitsandbytes config by @BenjaminBossan in #39082
  • [fix] Add FastSpeech2ConformerWithHifiGan by @stevhliu in #38207
  • Sandeepyadav1478/2025 06 19 deberta v2 model card update by @sandeepyadav1478 in #38895
  • Fixes the failing test test_is_split_into_words in test_pipelines_token_classification.py by @st81 in #39079
  • skip some test_sdpa_can_dispatch_on_flash by @ydshieh in #39092
  • fix UT failures on XPU w/ stock PyTorch 2.7 & 2.8 by @yao-matrix in #39116
  • Fix some bug for finetune and batch infer For GLM-4.1V by @zRzRzRzRzRzRzR in #39090
  • docs: Gemma 3n audio encoder by @RyanMullins in #39087
  • All CI jobs with A10 by @ydshieh in #39119
  • Licenses by @LysandreJik in #39127
  • Fix chat by @gante in #39128
  • Enable XPU doc by @jiqing-feng in #38929
  • docs: correct two typos in awesome-transformers.md by @VladimirGutuev in #39102
  • switch default xpu tp backend to pytorch built-in XCCL from pytorch 2.8 by @yao-matrix in #39024
  • Update BigBirdPegasus model card by @dross20 in #39104
  • [Whisper] update token timestamps tests by @eustlb in #39126
  • Fix key mapping for VLMs by @bvantuan in #39029
  • Several fixes for Gemma3n by @Cyrilvallez in #39135
  • fix caching_allocator_warmup with tie weights by @jiqing-feng in #39070
  • feat: support indivisible shards for TP model loading and TPlizing. by @kmehant in #37220
  • [qwen2-vl] fix FA2 inference by @zucchini-nlp in #39121
  • [typing] LlamaAttention return typehint by @ArkVex in #38998
  • [VLMs] support passing embeds along with pixels by @zucchini-nlp in #38467
  • [superglue] fix wrong concatenation which made batching results wrong by @sbucaille in #38850
  • Fix missing fsdp & trainer jobs in daily CI by @ydshieh in #39153
  • Fix: Ensure wandb logs config in offline mode by @DavidS2106 in #38992
  • Change @lru_cache() to @lru_cache to match styles from #38883. by @rasmi in #39093
  • fix: remove undefined variable by @ybkurt in #39146
  • update bnb ground truth by @jiqing-feng in #39117
  • Suggest jobs to use in run-slow by @ydshieh in #39100
  • Update expected values (after switching to A10) by @ydshieh in #39157
  • fix llama tests by @ydshieh in #39161
  • Add activation sparsity reference in gemma3n doc by @ChongYou in #39160
  • fix default value of config to match checkpionts in LLaVa-OV models by @ved1beta in #39163
  • [smolvlm] fix video inference by @zucchini-nlp in #39147
  • Fix multimodal processor get duplicate arguments when receive kwargs for initialization by @Isotr0py in #39125
  • Blip2 fixes by @remi-or in #39080
  • Fix missing initializations for models created in 2024 by @bvantuan in #38987
  • Reduce Glm4v model test size significantly by @Cyrilvallez in #39173
  • [docs] ViTPose by @stevhliu in #38630
  • [generate] document non-canonical beam search default behavior by @gante in #39000
  • Update expected values (after switching to A10) - part 2 by @ydshieh in #39165
  • Update expected values (after switching to A10) - part 3 by @ydshieh in #39179
  • Test fixes for Aria (and some Expectation for llava_next_video) by @remi-or in #39131
  • [glm4v] fix video inference by @zucchini-nlp in #39174
  • when delaying optimizer creation only prepare the model by @winglian in #39152
  • Decouple device_map='auto' and tp_plan='auto' by @SunMarc in #38942
  • Fix many HPU failures in the CI by @IlyasMoutawwakil in #39066
  • [Dia] Change ckpt path in docs by @vasqu in #39181
  • Update expected values (after switching to A10) - part 4 by @ydshieh in #39189
  • [typing] better return typehints for from_pretrained by @qubvel in #39184
  • Update expected values (after switching to A10) - part 5 by @ydshieh in #39205
  • Update expected values (after switching to A10) - part 6 by @ydshieh in #39207
  • Add packed tensor format support for flex/sdpa/eager through the mask! by @Cyrilvallez in #39194
  • Update expected values (after switching to A10) - part 7 by @ydshieh in #39218
  • Update expected values (after switching to A10) - part 8 - Final by @ydshieh in #39220
  • [video processors] Support float fps for precise frame sampling by @zrohyun in #39134
  • Expectations re-order and corrected FA3 skip by @remi-or in #39195
  • [vjepa2] replace einsum with unsqueeze by @xenova in #39234
  • Fix missing fast tokenizer/image_processor in whisper/qwen2.5-omni processor by @Isotr0py in #39244
  • [modular] Follow global indexing and attribute setting, and their dependencies by @Cyrilvallez in #39180
  • fix typo in Gemma3n notes by @davanstrien in #39196
  • Don't send new comment if the previous one is less than 30 minutes (unless the content is changed) by @ydshieh in #39170
  • fix bug using FSDP V1 will lead to model device not properly set by @kaixuanliu in #39177
  • Make _compute_dynamic_ntk_parameters exportable by @xadupre in #39171
  • [modular] Simplify logic and docstring handling by @Cyrilvallez in #39185
  • [bugfix] fix flash attention 2 unavailable error on Ascend NPU by @FightingZhen in #39166
  • fix fastspeech2_conformer tests by @ydshieh in #39229
  • RotaryEmbeddings change is not None -> isinstance(..., dict) by @qubvel in #39145
  • Fix patch helper by @Cyrilvallez in #39216
  • enable xpu on kv-cache and hqq doc by @jiqing-feng in #39246
  • adjust input and output texts for test_modeling_recurrent_gemma.py by @kaixuanliu in #39190
  • Update tiny-agents example by @Wauplin in #39245
  • Add Korean translation for glossary.md by @JoosunH in #38804
  • Clarify per_device_train_batch_size scaling in TrainingArguments by @Shohail-Ismail in #38…
  • Add segmentation_maps support to MobileNetV2ImageProcessor by @simonreise in #37312
  • Simplify Mixtral and its modular children by @Cyrilvallez in #39252
  • fix some flaky tests in tests/generation/test_utils.py by @ydshieh in #39254
  • Update LED model card by @dross20 in #39233
  • Glm 4 doc by @zRzRzRzRzRzRzR in #39247
  • fix xpu failures on PT 2.7 and 2.8 w/o IPEX and enable hqq cases on XPU by @yao-matrix in #39187
  • Fix license text, duplicate assignment, and typo in constant names by @gudwls215 in #39250
  • Skip test_eager_matches sdpa generate and update an integration test for blip-like models by @ydshieh in #39248
  • remove broken block by @molbap in #39255
  • fix(generation): stop beam search per-instance when heuristic satisfied by @guang-yng in #38778
  • fix recompiles due to instance key, and deepcopy issues by @ArthurZucker in #39270
  • Fix errors when use verl to train GLM4.1v model by @kaln27 in #39199
  • [CI] fix docs by @gante in #39273
  • [pagged-attention] fix off-by-1 error in pagged attention generation by @kashif in #39258
  • [smollm3] add tokenizer mapping for smollm3 by @gante in #39271
  • Refactor PretrainedConfig.__init__ method to make it more explicit by @qubvel in #39158
  • fix flaky test_generate_compile_model_forward by @ydshieh in #39276
  • [lightglue] add support for remote code DISK keypoint detector by @sbucaille in #39253
  • Add torchcodec in docstrings/tests for datasets 4.0 by @lhoestq in #39156
  • Update T5gemma by @bzhangGo in #39210
  • [Tests] Update model_id in AIMv2 Tests by @yaswanth19 in #39281
  • Fix SDPA attention precision issue in Qwen2.5-VL by @JJJYmmm in #37363
  • [flash attn 3] bring back flags by @zucchini-nlp in #39294
  • fix aria tests by @ydshieh in #39277
  • skip test_torchscript_* for now until the majority of the community ask for it by @ydshieh in #39307
  • [modular] Allow method with the same name in case of @property decorator by @Cyrilvallez in #39308
  • [sliding window] revert and deprecate by @zucchini-nlp in #39301
  • 🌐 [i18n-KO] Translated quark.md to Korean by @maximizemaxwell in #39268
  • Fix consistency and a few docstrings warnings by @Cyrilvallez in #39314
  • add stevhliu to the list in self-comment-ci.yml by @ydshieh in #39315
  • Updated the Model docs - for the MARIAN model by @emanrissha in #39138
  • skip files in src/ for doctest (for now) by @ydshieh in #39316
  • docs: update LLaVA-NeXT model card by @Bpriya42 in #38894
  • Fix typo: langauge -> language by @tomaarsen in #39317
  • Granite speech speedups by @avihu111 in #39197
  • Fix max_length_q and max_length_k types to flash_attn_varlen_func by @HollowMan6 in #37206
  • enable static cache on TP model by @jiqing-feng in #39164
  • Fix broken SAM after #39120 by @yonigozlan in #39289
  • Delete deprecated stuff by @zucchini-nlp in #38838
  • fix Glm4v batch videos forward by @Kuangdd01 in #39172
  • fix phi3 tests by @ydshieh in #39312
  • Handle DAC conversion when using weight_norm with newer PyTorch versions by @edwko in #36393
  • [modeling][lfm2] LFM2: Remove deprecated seen_tokens by @paulpak58 in #39342
  • [Core] [Offloading] Enable saving offloaded models with multiple shared tensor groups by @kylesayrs in #39263
  • Add a default value for position_ids in masking_utils by @Cyrilvallez in #39310
  • [modular] speedup check_modular_conversion with multiprocessing by @qubvel in #37456
  • Updated Switch Transformers model card with standardized format (Issue #36979) by @giuseppeCoccia in #39305
  • Fix link for testpypi by @Cyrilvallez in #39360
  • update cb TP by @ArthurZucker in #39361
  • fix failing test_sdpa_can_dispatch_on_flash by @ydshieh in #39259
  • Verbose error in fix mode for utils/check_docstrings.py by @manueldeprada in #38915
  • Remove device check in HQQ quantizer by @learning-chip in #39299
  • Add mistral common support by @juliendenize in #38906
  • Update Readme to Run Multiple Choice Script from Example Directory by @eromomon in #39323
  • Updated CamemBERT model card to new standardized format by @MShaheerMalik77 in #39227
  • fix gpt2 usage doc by @Xiang-cd in #39351
  • Update Model Card for Encoder Decoder Model by @ParagEkbote in #39272
  • update docker file to use latest timm (for perception_lm) by @ydshieh in #39380
  • Fix overriding Fast Image/Video Processors instance attributes affect other instances by @yonigozlan in #39363
  • [shieldgemma] fix checkpoint loading by @zucchini-nlp in #39348
  • [BLIP] remove cache from Qformer by @zucchini-nlp in #39335
  • [Qwen2.5-VL] Fix torch.finfo() TypeError for integer attention_mask_tensor by @dsnsabari in #39333
  • Deprecate AutoModelForVision2Seq by @zucchini-nlp in #38900
  • Fix Lfm2 and common tests by @Cyrilvallez in #39398
  • [examples] fix do_reduce_labels argument for run_semantic_segmentation_no_trainer by @eromomon in #39322
  • Totally rewrite how pipelines load preprocessors by @Rocketknight1 in #38947
  • Use np.pad instead of np.lib.pad. by @rasmi in #39346
  • [Docs] Fix typo in CustomTrainer compute_loss method and adjust loss reduction logic by @MilkClouds in #39391
  • Update phi4_multimodal.md by @tanuj-rai in #38830
  • [siglip] fix pooling comment by @sameerajashyam in #39378
  • Fix typo in /v1/models output payload by @alvarobartt in #39414
  • support loading qwen3 gguf by @44670 in #38645
  • Ignore extra position embeddings weights for ESM by @Rocketknight1 in #39063
  • set document_question_answering pipeline _load_tokenizer to True by @jiqing-feng in #39411
  • Fix invalid property by @cyyever in #39384
  • refactor: remove set_tracer_provider and set_meter_provider calls by @McPatate in #39422
  • Fix bugs from pipeline preprocessor overhaul by @Rocketknight1 in #39425
  • Fix bugs in pytorch example run_clm when streaming is enabled by @HRezaei in #39286
  • Remove deprecated audio utils functions by @jiangwangyi in #39330
  • Remove residual quantization attribute from dequantized models by @DWarez in #39373
  • handle training summary when creating modelcard but offline mode is set by @winglian in #37095
  • [vlm] fix loading of retrieval VLMs by @zucchini-nlp in #39242
  • docs: update SuperGlue docs by @sbucaille in #39406
  • docs: update LightGlue docs by @sbucaille in #39407
  • CI workflow for performed test regressions by @ahadnagy in #39198
  • [autodocstring] add video and audio inputs by @zucchini-nlp in #39420
  • [Core] [Offloading] Fix saving offloaded submodules by @kylesayrs in #39280
  • Remove double soft-max in load-balancing loss. Fixes #39055 . by @rudolfwilliam in #39056
  • Fixed a bug calculating cross entropy loss in JetMoeForCausalLM by @Phoenix-Shen in #37830
  • [chat template] add a testcase for kwargs by @zucchini-nlp in #39415
  • Fix L270 - hasattr("moe_args") returning False error by @wjdghks950 in #38715
  • Defaults to adamw_torch_fused for Pytorch>=2.8 by @cyyever in #37358
  • Change log level from warning to info for scheduled request logging in ContinuousBatchProcessor by @qgallouedec in #39372
  • Add cosine_with_min_lr_schedule_with_warmup_lr_rate scheduler in Trainer by @richardodliu in #31870
  • Fix missing definition of diff_file_url in notification service by @ahadnagy in #39445
  • add test scanner by @molbap in #39419
  • Remove runtime conditions for type checking by @cyyever in #37340
  • docs: add missing numpy import to minimal example by @IliasAarab in #39444
  • [cache] make all classes cache compatible finally by @zucchini-nlp in #38635
  • Fix typo in generation configuration for Janus model weight conversion by @thisisiron in #39432
  • Better typing for model.config by @qubvel in #39132
  • [Bugfix] [Quantization] Remove unused init arg by @kylesayrs in #39324
  • Fix processor tests by @zucchini-nlp in #39450
  • Remove something that should have never been there by @ArthurZucker in #38254
  • make the loss context manager easier to extend by @winglian in #39321
  • Fixes #39204: add fallback if get_base_model missing by @sebastianvlad1 in #39226
  • [CI] Fix partially red CI by @vasqu in #39448
  • Updated Megatron conversion script for gpt2 checkpoints by @LckyLke in #38969
  • Fix indentation bug in SmolVLM image processor causing KeyError by @Krish0909 in #39452
  • fix cached file error when repo type is dataset by @hiyouga in #36909
  • Improve grammar and clarity in perf_hardware.md by @ridima11 in #39428
  • create ijepa modelcard (ref : PR #36979 ). by @dhruvmalik007 in #39354
  • Corrections to PR #38642 and enhancements to Wav2Vec2Processor call and pad docstrings by @renet10 in #38822
  • fix(pipelines): QA pipeline returns fewer than top_k results in batch mode by @yushi2006 in #39193
  • fix max_length calculating using cu_seq_lens by @KKZ20 in #39341
  • Fix tests due to breaking change in accelerate by @SunMarc in #39451
  • Use newer typing notation by @cyyever in #38934
  • fix a comment typo in utils.py by @klimarissa17 in #39459
  • Update GemmaIntegrationTest::test_model_2b_bf16_dola by @ydshieh in #39362
  • Fix convert_and_export_with_cache failures for GPU models by @Stonepia in #38976
  • Enable some ruff checks for performance and readability by @cyyever in #39383
  • fix: ImageTextToTextPipeline handles user-defined generation_config by @peteryschneider in #39374
  • Update integration_utils.py by @zhaiji0727 in #39469
  • Add unified logits_to_keep support to LLMClass by @hellopahe in #39472
  • Fix typing order by @Tavish9 in #39467
  • [dependencies] temporary pyarrow pin by @gante in #39496
  • Slack CI bot: set default result for non-existing artifacts by @ahadnagy in #39499
  • [dependencies] Update datasets pin by @gante in #39500
  • [chat template] return assistant mask in processors by @zucchini-nlp in #38545
  • [gemma3] Fix do_convert_rgb in image processors. by @MohitIntel in #39438
  • Fix BatchEncoding.to() for nested elements by @eginhard in #38985
  • Add fast image processor SAM by @yonigozlan in #39385
  • Improve @auto_docstring doc and rename args_doc.py to auto_docstring.py by @yonigozlan in #39439
  • Update SAM/SAM HQ attention implementation + fix Cuda sync issues by @yonigozlan in #39386
  • Fix placeholders replacement logic in auto_docstring by @yonigozlan in #39433
  • [gemma3] support sequence classification task by @zucchini-nlp in #39465
  • [qwen2 vl] fix packing with all attentions by @zucchini-nlp in #39447
  • GLM-4 Update by @zRzRzRzRzRzRzR in #39393
  • Fix bad tensor shape in failing Hubert test. by @ebezzam in #39502
  • Fix the check in flex test by @Cyrilvallez in #39548
  • Rename _supports_flash_attn_2 in examples and tests by @zucchini-nlp in #39471
  • Fix Qwen Omni integration test by @Cyrilvallez in #39553
  • Fix pylint warnings by @cyyever in #39477
  • Raise TypeError instead of ValueError for invalid types by @Sai-Suraj-27 in #38660
  • Fix missing initializations for models created in 2023 by @bvantuan in #39239
  • use the enable_gqa param in torch.nn.functional.scaled_dot_product_at… by @sywangyi in #39412
  • Fix Docstring of BarkProcessor by @st81 in #39546
  • Refactor MambaCache to modeling_mamba.py by @manueldeprada in #38086
  • fix ndim check of device_mesh for TP by @winglian in #39538
  • [Fast image processor] refactor fast image processor glm4v by @yonigozlan in #39490
  • 🌐 [i18n-KO] Translated perf_infer_gpu_multi.md to Korean by @luckyvickyricky in #39441
  • Refactor embedding input/output getter/setter by @molbap in #39339
  • [Fast image processors] Improve handling of image-like inputs other than images (segmentation_maps) by @yonigozlan in #39489
  • [CI] Fix post merge ernie 4.5 by @vasqu in #39561
  • Update modernbertdecoder docs by @orionw in #39453
  • Update OLMoE model card by @nlhmnlhmnlhm in #39344
  • [gemma3] fix bidirectional image mask by @zucchini-nlp in #39396
  • Bump AMD container for 2.7.1 PyTorch by @ahadnagy in #39458
  • Fixes needed for n-d parallelism and TP by @winglian in #39562
  • [timm_wrapper] add support for gradient checkpointing by @Yozer in #39287
  • Add AMD test expectations to DETR model by @ahadnagy in #39539
  • [docs] update attention implementation and cache docs by @zucchini-nlp in #39547
  • [docs] Create page on inference servers with transformers backend by @zucchini-nlp in #39550
  • Add AMD expectations to Mistral3 tests by @ahadnagy in #39481
  • Add AMD GPU expectations for LLaVA tests by @ahadnagy in #39486
  • General weight initialization scheme by @Cyrilvallez in #39579
  • [cache refactor] Move all the caching logic to a per-layer approach by @manueldeprada in #39106
  • Update docs/source/ko/_toctree.yml by @jungnerd in #39516
  • updated mistral3 model card by @cassiasamp in #39531
  • [Paged-Attention] Handle continuous batching for repetition penalty by @kashif in #39457
  • Torchdec RuntimeError catch by @SunMarc in #39580
  • Fix link in "Inference server backends" doc by @hmellor in #39589
  • [WIP] Add OneformerFastImageProcessor by @Player256 in #38343
  • 🎯 Trackio integration by @qgallouedec in #38814
  • Mask2former & Maskformer Fast Image Processor by @SangbumChoi in #35685
  • Fix DynamicCache and simplify Cache classes a bit by @Cyrilvallez in #39590
  • Generic task-specific base classes by @Cyrilvallez in #39584
  • [Trackio] Allow single-gpu training and monitor power by @qgallouedec in #39595
  • Rename supports_static_cache to can_compile_fullgraph by @zucchini-nlp in #39505
  • FP-Quant support by @BlackSamorez in #38696
  • fix moe routing_weights by @llbdyiu66 in #39581
  • [idefics3] fix for vLLM by @zucchini-nlp in #39470
  • enable triton backend on awq xpu by @jiqing-feng in #39443
  • Allow device_mesh have multiple dim by @S1ro1 in #38949
  • Fix typos and grammar issues in documentation and code by @cluster2600 in #39598
  • Fix important models CI by @molbap in #39576
  • Move openai import by @ebezzam in #39613
  • Fix DAC integration tests and checkpoint conversion. by @ebezzam in #39313
  • Feature/standardize opt model card by @JoestarGagan in #39568
  • standardized YOLOS model card according to template in #36979 by @EthanV431 in #39528
  • [Docs] Translate audio_classification.md from English to Spanish by @weezymatt in #39513
  • Update recent processors for vLLM backend by @zucchini-nlp in #39583
  • [efficientloftr] fix model_id in tests by @sbucaille in #39621
  • [timm] new timm pin by @gante in #39640
  • [Voxtral] values for A10 runners by @eustlb in #39605
  • revert behavior of _prepare_from_posids by @winglian in #39622
  • Add owlv2 fast processor by @lmarshall12 in #39041
  • [attention] fix test for packed padfree masking by @zucchini-nlp in #39582
  • Fix: explicit not none check for tensors in flash attention by @jeffrey-dot-li in #39639
  • revert change to cu_seqlen_k and max_k when preparing from position_ids by @winglian in #39653
  • Make pytorch examples UV-compatible by @lhoestq in #39635
  • [docs] fix ko cache docs by @gante in #39644
  • make fixup by @gante in #39661
  • fix(voxtral): correct typo in apply_transcription_request by @rev2607 in #39572
  • Rename huggingface_cli to hf by @LysandreJik in #39630
  • 🚨[Fast Image Processor] Force Fast Image Processor for Qwen2_VL/2_5_VL + Refactor by @yonigozlan in #39591
  • Fix ModernBERT Decoder model by @qubvel in #39671
  • [CI] revert device in test_export_static_cache by @gante in #39662
  • [Ernie 4.5] Post merge adaptations by @vasqu in #39664
  • Delete bad rebasing functions by @Cyrilvallez in #39672
  • Fixes the BC by @ArthurZucker in #39636
  • fix kyutai tests by @ydshieh in #39416
  • update expected outputs for whisper after #38778 by @ydshieh in #39304
  • Add missing flag for CacheLayer by @Cyrilvallez in #39678
  • Fix auto_docstring crashing when dependencies are missing by @yonigozlan in #39564
  • fix: HWIO to OIHW by @RyanMullins in #39200
  • Use auto_docstring for perception_lm fast image processor by @yonigozlan in #39679
  • bad_words_ids no longer slow on mps by @DWarez in #39556
  • Support typing.Literal as type of tool parameters or return value by @grf53 in #39633
  • fix break for ckpt without _tp_plan by @MoyanZitto in #39658
  • Fix tied weight test by @Cyrilvallez in #39680
  • Add padding-free to Granite hybrid moe models by @garrett361 in #39677

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @sbucaille
    • Update SuperPoint model card (#38896)
    • [superglue] fix wrong concatenation which made batching results wrong (#38850)
    • [lightglue] add support for remote code DISK keypoint detector (#39253)
    • docs: update SuperGlue docs (#39406)
    • docs: update LightGlue docs (#39407)
    • Add EfficientLoFTR model (#36355)
  • @yaswanth19
    • Cleanup Attention class for Siglip and dependent models (#39040)
    • ✨ Add EoMT Model || 🚨 Fix Mask2Former loss calculation (#37610)
    • 🚨🚨🚨 [eomt] make EoMT compatible with pipeline (#39122)
    • Add Aimv2 model (#36625)
    • [Tests] Update model_id in AIMv2 Tests (#39281)
  • @bvantuan
    • Fix initialization of OneFormer (#38901)
    • Fix key mapping for VLMs (#39029)
    • Fix missing initializations for models created in 2024 (#38987)
    • Fix missing initializations for models created in 2023 (#39239)
  • @NahieliV
    • add fast image processor nougat (#37661)
  • @MinJu-Ha
    • Add Fast Image Processor for mobileViT (#37143)
  • @zRzRzRzRzRzRzR
    • Fix some bug for finetune and batch infer For GLM-4.1V (#39090)
    • Glm 4 doc (#39247)
    • GLM-4 Update (#39393)
  • @simonreise
    • Add segmentation_maps support to MobileNetV2ImageProcessor (#37312)
  • @LoserCheems
    • Add Doge model (#35891)
  • @VladOS95-cyber
    • Add DeepSeek V2 Model into Transformers (#36400)
  • @paulpak58
    • LFM2 (#39340)
    • [modeling][lfm2] LFM2: Remove deprecated seen_tokens (#39342)
  • @shuminghu
    • PerceptionLM (#37878)
  • @juliendenize
    • Add mistral common support (#38906)
  • @orionw
    • Add ModernBERT Decoder Models - ModernBERT, but trained with CLM! (#38967)
    • Update modernbertdecoder docs (#39453)
  • @cyyever
    • Fix invalid property (#39384)
    • Defaults to adamw_torch_fused for Pytorch>=2.8 (#37358)
    • Remove runtime conditions for type checking (#37340)
    • Use newer typing notation (#38934)
    • Enable some ruff checks for performance and readability (#39383)
    • Fix pylint warnings (#39477)
  • @jungnerd
    • Update docs/source/ko/_toctree.yml (#39516)
  • @Player256
    • [WIP] Add OneformerFastImageProcessor (#38343)
  • @SangbumChoi
    • Mask2former & Maskformer Fast Image Processor (#35685)
  • @BlackSamorez
    • FP-Quant support (#38696)
Jul 23, 2025
Ernie-4.5 and Ernie-4.5 MoE (based on v4.53.2)

Two new models are added to transformers: Ernie 4.5, and its MoE variant, Ernie 4.5 MoE. They are added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-Ernie-4.5-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.53.2-Ernie-4.5-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

As the tag implies, this tag is a preview of the Ernie-4.5 models. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.

Ernie-4.5 and its MoE variant

<img width="3812" height="1676" alt="image" src="https://github.com/user-attachments/assets/7ff55959-fb5c-4e77-a0a7-f4329e1ebb32" />

The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes.

The Dense

This model specifically targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.

The MoE

This model specifically targets the base text models with mixture of experts (MoE): one with 21B total and 3B active parameters, and another with 300B total and 47B active parameters. Both use the standard Llama architecture at their core, combined with a specialized MoE based on Mixtral with additional shared experts.
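As a rough illustration (not the actual Ernie implementation), the routing idea behind a shared-experts MoE, a few top-k routed experts weighted by renormalized gate probabilities plus shared experts that process every token, can be sketched in plain Python with toy one-dimensional "experts":

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, gate_logits, routed_experts, shared_experts, top_k=2):
    """Combine the top-k routed experts (weighted by renormalized gate
    probabilities) with always-active shared experts."""
    probs = softmax(gate_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in top)
    out = sum(probs[i] / norm * routed_experts[i](x) for i in top)
    # shared experts see every token, regardless of routing
    out += sum(expert(x) for expert in shared_experts)
    return out

# toy "experts": each just scales its scalar input
routed = [lambda x, s=s: s * x for s in (1.0, 2.0, 3.0, 4.0)]
shared = [lambda x: 0.5 * x]

y = moe_forward(1.0, gate_logits=[0.0, 0.0, 1.0, 2.0],
                routed_experts=routed, shared_experts=shared)
```

The expert functions, dimensions, and gating details here are purely illustrative; the real model routes full hidden states through MLP experts.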

Usage example

Ernie-4.5 can be found on the Hugging Face Hub.

Generating text with Ernie:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baidu/ERNIE-4.5-0.3B-PT"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# prepare the model input
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# decode the generated ids
generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(generated_text)

See below for an example leveraging the MoE variant:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baidu/ERNIE-4.5-21B-A3B-PT"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# prepare the model input
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# decode the generated ids
generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(generated_text)
Jul 22, 2025
Patch release v4.53.3

Small patch release 4.53.3!

A small patch for open telemetry fixes! Sorry for the delay!

  • refactor: remove set_tracer_provider and set_meter_provider calls (https://github.com/huggingface/transformers/pull/39422) from @McPatate

Jul 16, 2025
ModernBERT Decoder (based on v4.53.2)

A new model is added to transformers: ModernBERT Decoder. It is added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-modernbert-decoder-preview.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.53.2-modernbert-decoder-preview

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

As the tag implies, this tag is a preview of the ModernBERT Decoder model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.

ModernBERT Decoder

ModernBERT Decoder shares its architecture with ModernBERT but is trained from scratch with a causal language modeling (CLM) objective, which allows the same architecture to be used for comparing encoders and decoders. It is the decoder implementation of ModernBERT, designed for autoregressive text generation tasks.

Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
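To make the attention-pattern difference concrete, here is a minimal sketch (not the library's actual masking code) of a causal mask, optionally restricted to a sliding window as in the local layers of the alternating pattern:

```python
def causal_mask(seq_len, window=None):
    """Return a seq_len x seq_len boolean grid: True where query position i
    may attend to key position j. A full causal mask allows all j <= i;
    a local (sliding-window) causal mask additionally requires j to lie
    within `window` positions of i."""
    def allowed(i, j):
        if j > i:                               # causal: never attend to the future
            return False
        if window is not None and i - j >= window:
            return False                        # local: stay inside the window
        return True
    return [[allowed(i, j) for j in range(seq_len)] for i in range(seq_len)]
```

In the real model, alternating layers would use the full causal variant (global) and the windowed variant (local); the window size here is illustrative.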

Usage example

ModernBERT Decoder can be found on the Hugging Face Hub.

Using pipeline:

import torch
from transformers import pipeline

generator = pipeline(
    task="text-generation",
    model="blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device=0
)
generator("The future of artificial intelligence is", max_length=50, num_return_sequences=1)

# For sequence classification
classifier = pipeline(
    task="text-classification",
    model="blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device=0
)
classifier("This movie is really great!")

Using AutoModel:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("blab-jhu/test-32m-dec")
model = AutoModelForCausalLM.from_pretrained(
    "blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device_map="auto",
)

prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_length=50,
        num_return_sequences=1,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id
    )

generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")

# For sequence classification
from transformers import AutoModelForSequenceClassification

classifier_model = AutoModelForSequenceClassification.from_pretrained(
    "blab-jhu/test-32m-dec",
    torch_dtype=torch.float16,
    device_map="auto",
    num_labels=2
)

text = "This movie is really great!"
inputs = tokenizer(text, return_tensors="pt").to(classifier_model.device)

with torch.no_grad():
    outputs = classifier_model(**inputs)
    predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
    predicted_class = torch.argmax(predictions, dim=-1)

print(f"Predicted class: {predicted_class.item()}")
print(f"Prediction probabilities: {predictions}")

Using the transformers CLI:

echo "The future of artificial intelligence is" | transformers run --task text-generation --model your-username/modernbert-decoder-base --device 0
Jul 11, 2025
Patch Release v4.53.2

This patch contains the following bug fixes:

  • Fix some bug for finetune and batch infer For GLM-4.1V (#39090)
  • [bugfix] fix flash attention 2 unavailable error on Ascend NPU (#39166)
  • Fix errors when use verl to train GLM4.1v model (#39199)
  • [pagged-attention] fix off-by-1 error in pagged attention generation (#39258)
  • [smollm3] add tokenizer mapping for smollm3 (#39271)
  • [sliding window] revert and deprecate (#39301)
  • fix Glm4v batch videos forward (#39172)
  • Add a default value for position_ids in masking_utils (#39310)
Jul 4, 2025
Patch Release v4.53.1

This patch contains several bug fixes. The following commits are included:

  • Fix: unprotected import of tp plugin (#39083)
  • Fix key mapping for VLMs (#39029)
  • Several fixes for Gemma3n(#39135)
  • [qwen2-vl] fix FA2 inference (#39121)
  • [smolvlm] fix video inference (#39147)
  • Fix multimodal processor get duplicate arguments when receive kwargs for initialization (#39125)
  • when delaying optimizer creation only prepare the model (#39152)
  • Add packed tensor format support for flex/sdpa/eager through the mask! (#39194)
Jun 26, 2025
Release v4.53.0

Gemma3n

Gemma 3n models are designed for efficient execution on low-resource devices. They accept multimodal input, handling text, image, video, and audio, and generate text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.

Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.

from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    torch_dtype=torch.bfloat16,
    model="google/gemma-3n-e4b",
    device="cuda",
)
output = pipe(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
    text="<image_soft_token> in this image, there is"
)

print(output)

Dia

Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs. It can generate highly realistic dialogue from a transcript, including nonverbal communication such as laughter and coughing. Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning).

Model Architecture: Dia is an encoder-decoder transformer based on the original transformer architecture, with some more modern features such as rotary positional embeddings (RoPE) included. For its text portion (encoder), a byte tokenizer is used, while for the audio portion (decoder), a pretrained codec model, DAC, is used: DAC encodes speech into discrete codebook tokens and decodes them back into audio.

  • Add Dia model by @buttercrab in #38405

Kyutai Speech-to-Text

<img src="https://huggingface.co/datasets/eustlb/documentation-images/resolve/main/kyutai_stt.png"/>

Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, combined with a Moshi-like autoregressive decoder. The Kyutai lab has released two model checkpoints:

  • kyutai/stt-1b-en_fr: a 1B-parameter model capable of transcribing both English and French
  • kyutai/stt-2.6b-en: a 2.6B-parameter model focused solely on English, optimized for maximum transcription accuracy
  • Add kyutai stt by @eustlb in #38909

Read more about the model in the documentation

V-JEPA 2

<div class="flex justify-center"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/vjepa.gif" alt="drawing" width="600"/> </div>

V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.

  • Add V-JEPA 2 by @qubvel in #38746

Read more about the model in the documentation.

Arcee

Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.

The Arcee model is architecturally similar to Llama but uses x * relu(x) in its MLP layers for improved gradient flow, and is optimized for efficiency in both training and inference scenarios.
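The activation swap described above is small enough to write out directly. A minimal sketch in plain Python (the real model applies this elementwise to tensors inside the MLP):

```python
import math

def relu2(x):
    """ReLU-squared as used in Arcee's MLP blocks: x * relu(x),
    which equals x**2 for positive x and 0 otherwise."""
    return x * max(x, 0.0)

def silu(x):
    """SiLU, the activation ReLU-squared replaces: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))
```

Note that x * relu(x) and relu(x) ** 2 are the same function: both vanish for non-positive inputs and grow quadratically for positive ones.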

  • Add Arcee model support by @Crystalcareai in #38621

Read more about the model in the documentation.

ColQwen2

ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
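The late-interaction scoring mentioned above (often called MaxSim in the ColBERT/ColPali line of work) can be sketched in a few lines; this is an illustrative toy version operating on plain lists, not the library's implementation:

```python
def maxsim_score(query_vecs, doc_vecs):
    """Late-interaction score: for each query token embedding, take its
    maximum dot product over all document (page-patch) embeddings,
    then sum those maxima over the query tokens."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)

# toy 2-d embeddings: two query tokens, two document patches
score = maxsim_score([[1.0, 0.0], [0.0, 1.0]],
                     [[1.0, 0.0], [0.5, 0.5]])
```

Because each query token independently picks its best-matching patch, the score rewards documents that cover all aspects of the query rather than matching it only on average.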

  • Add ColQwen2 to 🤗 transformers by @tonywu71 in #35778

Read more about the model in the documentation.

MiniMax

MiniMax is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the long context capabilities of the model, MiniMax adopts a hybrid architecture that combines Lightning Attention, Softmax Attention and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), MiniMax's training context length is extended to 1 million tokens, and it can handle contexts of up to 4 million tokens during inference. On various academic benchmarks, MiniMax also demonstrates the performance of a top-tier model.

The architecture of MiniMax is briefly described as follows:

  • Total Parameters: 456B
  • Activated Parameters per Token: 45.9B
  • Number of Layers: 80
  • Hybrid Attention: a softmax attention layer is positioned after every 7 lightning attention layers.
    • Number of attention heads: 64
    • Attention head dimension: 128
  • Mixture of Experts:
    • Number of experts: 32
    • Expert hidden dimension: 9216
    • Top-2 routing strategy
  • Positional Encoding: Rotary Position Embedding (RoPE) applied to half of the attention head dimension with a base frequency of 10,000,000
  • Hidden Size: 6144
  • Vocab Size: 200,064
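The hybrid attention schedule above fully determines which of the 80 layers use which attention type. A small sketch of that schedule (the exact placement of the softmax layer within each block of 8 is an assumption here):

```python
def minimax_layer_schedule(num_layers=80, lightning_per_softmax=7):
    """One softmax-attention layer after every 7 lightning-attention
    layers, repeated across the whole stack."""
    block = lightning_per_softmax + 1  # 8 layers per repeating block
    return ["softmax" if (i + 1) % block == 0 else "lightning"
            for i in range(num_layers)]

schedule = minimax_layer_schedule()
```

With 80 layers this yields 10 softmax-attention layers and 70 lightning-attention layers, matching the 1:7 ratio in the bullet list.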

For more details refer to the release blog post.

  • Add support for MiniMax's MiniMax-Text-01 by @geetu040 in #35831

Read more about the model in the documentation.
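The hybrid attention layout above (one softmax-attention layer after every 7 lightning-attention layers) can be written out schematically. The function below is purely illustrative and is not MiniMax's actual configuration code:

```python
def hybrid_layer_types(num_layers: int = 80, period: int = 8) -> list[str]:
    """Schematic MiniMax-style layout: within each block of `period` layers,
    the last one uses softmax attention and the rest use lightning attention."""
    return ["softmax" if (i + 1) % period == 0 else "lightning"
            for i in range(num_layers)]

types = hybrid_layer_types()
print(types.count("lightning"), types.count("softmax"))  # 70 10
```

With 80 layers, this yields 10 softmax-attention layers interleaved among 70 lightning-attention layers.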

Encoder-Decoder Gemma

T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large language models, developed by adapting pretrained decoder-only models into encoder-decoder models. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on the Transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.

T5Gemma has two groups of model sizes: 1) Gemma 2 sizes (2B-2B, 9B-2B, and 9B-9B), which are based on the official Gemma 2 models (2B and 9B); and 2) T5 sizes (Small, Base, Large, and XL), which are pretrained under the Gemma 2 framework following the T5 configuration. In addition, we also provide a model at ML size (medium large, ~2B in total), which sits in between T5 Large and T5 XL.

The pretrained variants are trained with two objectives: prefix language modeling with knowledge distillation (PrefixLM) and UL2, separately. We release both variants for each model size. The instruction-tuned variants were post-trained with supervised fine-tuning and reinforcement learning.

  • Encoder-Decoder Gemma by @bzhangGo in #38332

Read more about the model in the documentation.

GLM-4.1V

The GLM-4.1V model architecture is added to transformers; no models have yet been released with that architecture. Stay tuned for the GLM team's upcoming releases!

  • GLM-4.1V Model support by @zRzRzRzRzRzRzR in #38431

Read more about the model in the documentation.

Falcon H1

The FalconH1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series on this website.

  • [MODEL] Add Falcon H1 by @younesbelkada in #38249

Read more about the model in the documentation.

LightGlue

The LightGlue model was proposed in LightGlue: Local Feature Matching at Light Speed by Philipp Lindenberger, Paul-Edouard Sarlin and Marc Pollefeys.

Similar to SuperGlue, this model matches two sets of local features extracted from two images, with the goal of being faster than SuperGlue. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.

The abstract from the paper is the following:

We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. Cumulatively, they make LightGlue more efficient - in terms of both memory and computation, more accurate, and much easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like 3D reconstruction. The code and trained models are publicly available at this https URL

  • Add LightGlue model by @sbucaille in #31718

Read more about the model in the documentation.
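For a sense of what a matcher does, here is a minimal mutual-nearest-neighbor baseline in PyTorch. LightGlue learns a far stronger, adaptive matcher, so treat this only as an illustration of the matching problem, not as the model's method:

```python
import torch

def mutual_nearest_neighbors(desc0: torch.Tensor, desc1: torch.Tensor):
    """Baseline matcher: keep only pairs of local descriptors that are each
    other's nearest neighbor under cosine similarity."""
    d0 = torch.nn.functional.normalize(desc0, dim=-1)
    d1 = torch.nn.functional.normalize(desc1, dim=-1)
    sim = d0 @ d1.T
    nn0 = sim.argmax(dim=1)      # best match in image 1 for each feature in image 0
    nn1 = sim.argmax(dim=0)      # best match in image 0 for each feature in image 1
    idx = torch.arange(sim.shape[0])
    mutual = nn1[nn0] == idx     # mutual-consistency check
    return idx[mutual], nn0[mutual]

a = torch.eye(4)                 # 4 orthogonal toy descriptors
i0, i1 = mutual_nearest_neighbors(a, a.flip(0))
print(list(zip(i0.tolist(), i1.tolist())))  # [(0, 3), (1, 2), (2, 1), (3, 0)]
```

Learned matchers like LightGlue replace this greedy rule with attention over both feature sets, which is what makes them robust to viewpoint and appearance changes.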

dots.llm1

The abstract from the report is the following:

Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.

  • [Model] add dots1 by @redmoe-moutain in #38143

Read more about the model in the documentation.

SmolLM3

SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a Transformer decoder architecture with Grouped Query Attention (GQA) to reduce the KV cache, and NoPE (rotary position embeddings removed in a subset of layers), enabling improved performance on long-context tasks. It is trained using a multi-stage training approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very large context lengths. The instruct variant is optimized for reasoning and tool use.

  • Add SmolLM3 by @anton-l in #38755

Read more about the model in the documentation.
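To see why GQA shrinks the KV cache, compare cache sizes for a model with one KV head per attention head versus a grouped variant. The shapes below are illustrative, not SmolLM3's real configuration:

```python
def kv_cache_bytes(num_layers: int, seq_len: int, num_kv_heads: int,
                   head_dim: int, bytes_per_elem: int = 2) -> int:
    # Keys and values (hence the factor of 2), per layer, per position,
    # assuming bf16 storage (2 bytes per element).
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_elem

# Hypothetical shapes: 32 layers, 64K context, head_dim 128
mha = kv_cache_bytes(num_layers=32, seq_len=65536, num_kv_heads=32, head_dim=128)
gqa = kv_cache_bytes(num_layers=32, seq_len=65536, num_kv_heads=4, head_dim=128)
print(f"{mha / 2**30:.1f} GiB vs {gqa / 2**30:.1f} GiB")  # 32.0 GiB vs 4.0 GiB
```

Grouping 32 query heads onto 4 KV heads cuts the cache by 8x at this (hypothetical) shape, which is exactly the kind of saving that matters at long context lengths.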

Performance optimizations

Kernels

In previous versions, installing the kernels library would automatically activate the custom kernels added to transformers, because the @use_kernel_forward_from_hub decorator directly swapped out the model's forward method. This implicit behavior caused several issues for users, including problems with torch.compile, non-determinism, and inconsistent outputs.

To address this, we've introduced a new opt-in mechanism called kernelize. You can now enable kernel usage explicitly by passing use_kernels=True to from_pretrained. The use_kernel_forward_from_hub decorator now simply stores the kernel name that the user wants to use, and kernelize handles the rest under the hood.

Example

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Kernels are opt-in: pass use_kernels=True to enable them explicitly
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    use_kernels=True,
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")

prompt = "Hello"
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device).input_ids
output = model.generate(input_ids, max_new_tokens=100)

print(tokenizer.decode(output[0], skip_special_tokens=True))

More kernels will be added over time — this will be a collaborative, community-driven effort to make transformers lighter and faster 🤗

  • Add kernelize to transformers by @MekkCyber in #38205

Flash Attention 3

Support for Flash Attention 3 is added across the most popular models.

  • Support for Flash Attention 3 by @EduardDurech in #38972

Notable repository maintenance & refactors

Several refactoring efforts are happening in parallel across the repository. The direction is to greatly simplify the library by removing unnecessary codepaths. While these efforts are spread across the library, they're particularly visible in individual models, where non-modeling-specific code will be simplified and eventually removed.

We work under the assumption that model-agnostic utilities shouldn't live in the modeling code. Outputs such as attentions, hidden states, and router logits are important for end-users but don't need to be explicitly displayed in the modeling code.

  • Apply GradientCheckpointingLayer to the whole repo by @qubvel in #38913
  • No more Tuple, List, Dict by @Rocketknight1 in #38797
  • Deprecate TF + JAX by @Rocketknight1 in #38758

Breaking changes

Several minor breaking changes have been merged, aiming to provide clearer defaults while greatly simplifying the library.

  • 🔴 Update default dtype for pipelines to auto by @Vaibhavs10 in #38882
  • 🚨🚨 Fix initialization of Mask2Former by @Cyrilvallez in #38864
  • 🚨🚨 Inherited CausalLM Tests by @Rocketknight1 in #37590
  • 🚨Early-error🚨 config will error out if output_attentions=True and the attn implementation is wrong by @ArthurZucker in #38288
  • 🔴 [VLM] modeling updates by @zucchini-nlp in #38317
  • 🚨🚨 Fix custom code saving by @Rocketknight1 in #37716
  • 🚨🚨[core] Completely rewrite the masking logic for all attentions by @Cyrilvallez in #37866
  • 🔴🔴🔴 [Attention] Refactor Attention Interface for Bart-based Models by @vasqu in #38108
  • 🔴[Attention] Attention refactor for Whisper-based models by @vasqu in #38235
  • Add CB by @ArthurZucker in #38085

Bugfixes and improvements

  • CI reporting improvements by @ydshieh in #38230
  • Revert parallelism temporarily by @LysandreJik in #38240
  • tp plan should not be NONE by @ArthurZucker in #38255
  • [Falcon H1] Fix Typo in Integration Test by @dhiaEddineRhaiem in #38256
  • [compile] re-enable for Qwen-VL models by @zucchini-nlp in #38127
  • fix multi-image case for llava-onevision by @cyr0930 in #38084
  • Add tearDown method to Quark to solve OOM issues by @MekkCyber in #38234
  • Clearer error on import failure by @LysandreJik in #38257
  • [whisper] small changes for faster tests by @gante in #38236
  • Simplify DTensor Check for modeling_utils.py by @amd-xiaoyu12 in #38245
  • Improve typing in TrainingArgument by @cyyever in #36944
  • Fix: missing else branch to handle "--load_best_model_at_end" in training_args.py by @danielyxyang in #38217
  • assign the correct torchao data layout for xpu by @jiqing-feng in #37781
  • Remove Japanese sequence_classification doc and update references by @ritsumei-aoi in #38246
  • Protect ParallelInterface by @ArthurZucker in #38262
  • Update Model Card for Mamba by @ParagEkbote in #37863
  • docs(swin): Update Swin model card to standard format by @BryanBradfo in #37628
  • add XPU info print in print_env by @yao-matrix in #38282
  • [whisper] move processor test into processor test file 🧹 by @gante in #38266
  • [Whisper] handle deprecation of forced_decoder_ids by @gante in #38232
  • add liger-kernel to docker file by @ydshieh in #38292
  • Fix tp error when torch distributed is already initialized by @SunMarc in #38294
  • More typing in src/transformers/training_args.py by @cyyever in #38106
  • refine transformers env output by @yao-matrix in #38274
  • Update CI Docker base image for AMD tests by @ahadnagy in #38261
  • Fix HybridChunedCache & Llama4 by @Cyrilvallez in #38299
  • Oups typo for HybridChunkedCache by @Cyrilvallez in #38303
  • [Tests] Cleanup Janus Testcase by @yaswanth19 in #38311
  • [emu3] fix conversion script by @zucchini-nlp in #38297
  • Fix run_slow by @cyyever in #38314
  • Fix typo: change 'env' to 'environment' in .circleci/config.yml by @AbdessamadEnabih in #38273
  • Adds use_repr to model_addition_debugger_context by @RyanMullins in #37984
  • [tf/flax] handle forced_decoder_ids deletion by @gante in #38316
  • [Whisper + beam search] fix usage of beam_indices by @gante in #38259
  • Expose AutoModelForTimeSeriesPrediction for import by @jinan-zhou in #38307
  • [custom_generate] don't forward custom_generate and trust_remote_code by @gante in #38304
  • add vasqu to self-comment-ci.yml by @ydshieh in #38324
  • Fix some tests (especially compile with fullgraph=True on Python<3.11) by @Cyrilvallez in #38319
  • [performance_optim] reduce frequency of declaring attention_mask in Ascend NPU flash attention by @FightingZhen in #38278
  • refactor can_save_slow_tokenizer by @itazap in #37722
  • [FlexAttention] Reenable flex for encoder-decoder and make the test more robust by @vasqu in #38321
  • Enhance Model Loading By Providing Parallelism, Uses Optional Env Flag by @inf3rnus in #36835
  • Use Gradient Checkpointing Layer in Jamba & Blip Related Models by @alex-jw-brooks in #38310
  • Never fallback to eager implicitly by @Cyrilvallez in #38327
  • Remove duplicate docstring: resample by @qqii in #38305
  • Update BioGPT model card by @Aguedoom in #38214
  • docs(swinv2): Update SwinV2 model card to new standard format by @BryanBradfo in #37942
  • [docs]: update roformer.md model card by @KsuParkhamchuk in #37946
  • new failure CI reports for all jobs by @ydshieh in #38298
  • Hot fix for AMD CI workflow by @ydshieh in #38349
  • Uninstall kernels for AMD docker images by @ydshieh in #38354
  • [VLMs] add helpers for get/set embedding by @zucchini-nlp in #38144
  • switch to device agnostic device calling for test cases by @yao-matrix in #38247
  • [OPT] Fix attention scaling by @vasqu in #38290
  • Fix all import errors based on older torch versions by @Cyrilvallez in #38370
  • Fix incorrect batching audio index calculation for Phi-4-Multimodal by @Isotr0py in #38103
  • Protect get_default_device for torch<2.3 by @Cyrilvallez in #38376
  • [Falcon H1] Fix slow path forward pass by @dhiaEddineRhaiem in #38320
  • Improved cache docs by @manueldeprada in #38060
  • for now disable compile by @ArthurZucker in #38383
  • Use one utils/notification_service.py by @ydshieh in #38379
  • Better check in initialize_weights by @Cyrilvallez in #38382
  • fix typos by @DeVikingMark in #38336
  • fix typo: tokenizer -> tokenize by @foldl in #38357
  • Stop TF weight rename reDOS by @Rocketknight1 in #38325
  • [cli] cli usable without torch by @gante in #38386
  • update gemma tests by @ydshieh in #38384
  • Stop autoconverting custom code checkpoints by @Rocketknight1 in #37751
  • Add AMD MI300 CI caller leveraging self-hosted runner scale set workflow in hf-workflows by @jitesh-gupta in #38132
  • Fix image token mask in Gemma3 by @Cyrilvallez in #38295
  • [transformers x vLLM] standardize processors by @zucchini-nlp in #37915
  • [paligemma] fix processor with suffix by @zucchini-nlp in #38365
  • [video utils] group and reorder by number of frames by @zucchini-nlp in #38374
  • [aya vision] fix processor for vLLM by @zucchini-nlp in #38371
  • guard size mismatch check to only quantized models by @SunMarc in #38397
  • [chat] improvements for thinking models and reduce default verbosity by @gante in #38322
  • Fix convert to original state dict for VLMs by @hiyouga in #38385
  • [chat] use the checkpoint's generation_config.json as base parameterization by @gante in #38330
  • Fix Qwen2.5-VL Video Processor by @yeliudev in #38366
  • [CSM] infer codec model with no_grad + audio eos label by @eustlb in #38215
  • Add report_repo_id to mi300 workflow by @ivarflakstad in #38401
  • [CSM] update model id by @eustlb in #38211
  • [cleanup] delete deprecated kwargs in qwen2_audio 🧹 by @gante in #38404
  • [tests] remove overload for deleted test (test_offloaded_cache_implementation) by @gante in #37896
  • [mllama] Allow pixel_values with inputs_embeds by @dxoigmn in #38334
  • Update Model Card for Mamba-2 by @ParagEkbote in #37951
  • Updated Zoedepth model card by @miniMaddy in #37898
  • Updated BigBird Model card as per #36979. by @RogerSinghChugh in #37959
  • Updated BERTweet model card. by @RogerSinghChugh in #37981
  • New bart model card by @RogerSinghChugh in #37858
  • Update granite.md by @Tanuj-rai in #37791
  • Falcon-H1 - Fix auto_docstring and add can_return_tuple decorator by @yonigozlan in #38260
  • Updated model card for OLMo2 by @andyvu923 in #38394
  • Add mi300 to amd daily ci workflows definition by @ivarflakstad in #38415
  • Change slack channel for mi250 CI by @ivarflakstad in #38410
  • Fix an error in verify_tp_plan for keys without '.' by @liwii in #38420
  • [qwen-vl] Look for vocab size in text config by @zucchini-nlp in #38372
  • Update CsmForConditionalGenerationIntegrationTest by @ydshieh in #38424
  • enable large_gpu and torchao cases on XPU by @yao-matrix in #38355
  • Disable mi210 scheduled CI by @ivarflakstad in #38411
  • Update error when using additional and/or masks by @Cyrilvallez in #38429
  • Fix CircleCI not triggered when PR is opened from a branch of huggingface/transformers by @ydshieh in #38413
  • make Llama4TextMoe forward more readable by @JJJYmmm in #37529
  • [core] support tensor-valued _extra_state values in from_pretrained by @pstjohn in #38155
  • Fix typo in tokenization_utils_base.py docstring by @cwngan in #38418
  • Fix convert weights for InternVL by @yonigozlan in #38233
  • Trigger doc-builder job after style bot by @ydshieh in #38398
  • Remove redundant test_sdpa_equivalence test by @Rocketknight1 in #38436
  • Fix MoE gradient test by @Rocketknight1 in #38438
  • Fix from_args_and_dict ProcessorMixin by @yonigozlan in #38296
  • Fix handling of slow/fast image processors in image_processing_auto.py by @yonigozlan in #38161
  • Updated the Model docs - for the ALIGN model by @1himan in #38072
  • Updated the model card for ViTMAE by @mreraser in #38302
  • Model card for mobilenet v1 and v2 by @yuanjua in #37948
  • Merge type hints from microsoft/python-type-stubs (post dropping support for Python 3.8) by @Avasam in #38335
  • Fix GLM4 checkpoints by @ydshieh in #38412
  • feat: add cache retention for requests by @McPatate in #38446
  • [Tests] Clean up test cases for few models by @yaswanth19 in #38315
  • Fix TypeError in save_pretrained error handling (fixes #38422) by @rahulrshetty45 in #38449
  • Cleanup BatchFeature and BatchEncoding by @lgeiger in #38459
  • Fix Gemma3IntegrationTest by @ydshieh in #38471
  • [Qwen2.5-Omni] Fix dtype of cos,sin when used with flash attention by @HarryHsing in #38453
  • fix: handle no scheduler passed by user by @McPatate in #38407
  • make it go brrrr by @ArthurZucker in #38409
  • Fix convert_internvl_weights_to_hf.py to support local paths by @xvyv99 in #38264
  • Fix incorrect bbox_embed initialization when decoder_bbox_embed_share=False in GroundingDINO by @islemyakoubi in #38238
  • [Tests] Reduced model size for albert-test model by @saqlain2204 in #38480
  • Align TP check by @SunMarc in #38328
  • protect dtensor import by @SunMarc in #38496
  • [docs] add xpu environment variable for gpu selection by @faaany in #38194
  • Remove deprecated use_flash_attention_2 parameter by @cyyever in #37131
  • Fix setting FLASH_ATTENTION_DETERMINISTIC after importing by @HollowMan6 in #37185
  • [seamless_m4t] Skip some tests when speech is not available by @remi-or in #38430
  • Update Loss Functions to Accept Tensor num_items_in_batch by @NEREUScode in #38029
  • [generate] add soft deprecations on custom generation methods by @gante in #38406
  • [generate] move SinkCache to a custom_generate repo by @gante in #38399
  • remove unhandled parameter by @itazap in #38145
  • Fix amp deprecation issue by @SunMarc in #38100
  • [flax/mistral] support sliding_window: null in config by @yiding in #37402
  • Num parameters in model.safetensors.index.json by @LysandreJik in #38531
  • Remove type annotation in Siglip Attention Module by @yaswanth19 in #38503
  • Fix Gemma2IntegrationTest by @ydshieh in #38492
  • Fix blip2 tests by @ydshieh in #38510
  • [tests] expand flex-attn test for vision models by @zucchini-nlp in #38434
  • Don't use default attn if pre-set in sub-config by @zucchini-nlp in #38526
  • update emu3 test by @jiqing-feng in #38543
  • Update docker image to use av by @ydshieh in #38548
  • [bugfix] [WIP] fix apply_rotary_emb error on Ascend NPU by @FightingZhen in #38491
  • [TP] Change command in tests to python3 by @S1ro1 in #38555
  • Explicitly setting encoding in tokenization_utils_base.py by @Muqi1029 in #38553
  • Fix utils/notification_service.py by @ydshieh in #38556
  • Name change AOPermod -> ModuleFqn by @drisspg in #38456
  • Fix hqq issue by @SunMarc in #38551
  • [docs] Format fix by @stevhliu in #38414
  • [janus] Fix failing tests on mi3XX by @remi-or in #38426
  • Fix chameleon tests by @ydshieh in #38565
  • update utils/notification_service.py for AMD vs Nvidia by @ydshieh in #38563
  • Fix deepseekv3 by @ydshieh in #38562
  • [FlexAttn] Fix models with unique characteristics by @vasqu in #38433
  • fix(attention_visualizer): add default value for image_seq_length by @IceGiraffe in #38577
  • allow custom head_dim for qwen2_moe by @bzantium in #37188
  • Docs: fix code formatting in torchao docs by @Manalelaidouni in #38504
  • feat: add repository field to benchmarks table by @McPatate in #38582
  • [Dinov2] Enable device_map="auto" support by @aryanchauhan31 in #38487
  • tests/roformer: fix couple roformer tests on gpus by @dvrogozh in #38570
  • New gpt neo model card by @RogerSinghChugh in #38505
  • Updated deprecated typing imports with equivalents for Python 3.9+ by @Sai-Suraj-27 in #38546
  • added fast image processor for ZoeDepth and expanded tests accordingly by @henrikm11 in #38515
  • [qwen-omni] fix sliding window by @zucchini-nlp in #38525
  • Remove custom pytest and pluggy by @ydshieh in #38589
  • pin pandas by @ydshieh in #38605
  • Allow mlm_probability to be set to None when mlm=False in DataCollatorForLanguageModeling by @KameniAlexNea in #38522
  • Avoid overwrite existing local implementation when loading remote custom model by @Isotr0py in #38474
  • fix spelling errors by @davidjsonn in #38608
  • Remove isort from dependencies by @Sai-Suraj-27 in #38616
  • Fix return_dict=False giving errors in a few VLM models by @ydshieh in #38519
  • docs: fix dark mode logo display. by @johncaged in #38586
  • Fix typo in LLaVa documentation by @mynameismon in #38618
  • [Nit] Add Note on SigOpt being in Public Archive Mode by @ParagEkbote in #38610
  • Updated Aria model card by @1himan in #38472
  • Fix MiniMax (docs and integration tests checkpoint) by @geetu040 in #38575
  • enable more test cases on xpu by @yao-matrix in #38572
  • Improve test_initialization by @ydshieh in #38607
  • Use torch 2.7.1 on CircleCI jobs by @ydshieh in #37856
  • [generation] bring back tests on vision models by @zucchini-nlp in #38603
  • update ColQwen2ModelIntegrationTest by @ydshieh in #38583
  • Improve test_initialization for SwiftFormer by @ydshieh in #38636
  • fix: support grad clipping for TP through replicating non-sharded modules by @kmehant in #36132
  • Don't run AriaForConditionalGenerationModelTest on CircleCI by @ydshieh in #38615
  • fix total batch size calculation in trainer by @inkcherry in #38286
  • fix torch_dtype on awq by @jiqing-feng in #38463
  • Better CI by @ydshieh in #38552
  • remove ipex_optimize_model usage by @yao-matrix in #38632
  • Skip torchscript tests for 2 models by @ydshieh in #38643
  • Fix InternVL integration test by @ydshieh in #38612
  • Use torch 2.7.1 on daily CI by @ydshieh in #38620
  • Fix qwen2-audio chat template audio placeholder insertion by @Isotr0py in #38640
  • Fixed modeling_auto.py MODEL_FOR_MASK_GENERATION_MAPPING_NAMES variable by @sbucaille in #38664
  • fix: "check out" as verb by @DePasqualeOrg in #38678
  • Fix attention mask expansion when converting to executorch by @pweglik in #38637
  • Fix some models import by @nicelulu in #38694
  • Fix retrieve function signature and remove faiss requirement by @Fiona-Waters in #38624
  • Fix TypeError: 'NoneType' object is not iterable for esm by @dbleyl in #38667
  • Docs: update bitsandbytes torch.compile compatibility by @matthewdouglas in #38651
  • Drop as_target_processor from the call and pad methods by @marcndo in #38642
  • Created model card for XLM model by @AshAnand34 in #38595
  • Update XLM-RoBERTa model documentation with enhanced usage examples and improved layout by @AshAnand34 in #38596
  • Created model card for xlm-roberta-xl by @AshAnand34 in #38597
  • Fix aya_vision test by @ydshieh in #38674
  • Standardize ByT5 model card format by @yanamis in #38699
  • Fix smart resize by @rdonggroq in #38706
  • Update some tests for torch 2.7.1 by @ydshieh in #38701
  • Logging message for is_bitsandbytes_available() by @ved1beta in #38528
  • Fix llava tests by @ydshieh in #38722
  • Use OSError by @cyyever in #38712
  • [add-new-model-like] Robust search & proper outer '),' in tokenizer mapping by @alexzms in #38703
  • Fix typo in Language Modeling example scripts and update TPU type by @framoncg in #38652
  • Add AGENTS.md by @Rocketknight1 in #38734
  • New canine model card by @RogerSinghChugh in #38631
  • Fixed a multiple-devices issue in SmolVLM model by @remi-or in #38736
  • [llava] fix integration tests with Siglip by @zucchini-nlp in #38732
  • fix: Add method to get image features in PaliGemmaForConditionalGeneration by @YushunXiang in #38730
  • from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim by @yao-matrix in #38689
  • fix: bf16 with TPU is allowed in configuration by @yevvonlim in #38670
  • [DeepSeek-V3] implement when q_lora_rank is None by @bzantium in #38743
  • Revert "Trigger doc-builder job after style bot" by @ydshieh in #38735
  • Add z-loss to Bamba for v2 by @daviswer in #37842
  • Better typing for num_items_in_batch by @SunMarc in #38728
  • Prepare for TF+Jax deprecation by @Rocketknight1 in #38760
  • Remove IPEX requirement for bitsandbytes on CPU by @matthewdouglas in #38594
  • Update repo consistency check by @Rocketknight1 in #38763
  • fix(qwen3_moe): pass kwargs to self_attn by @llllvvuu in #38691
  • Update pegasus model card by @dross20 in #38675
  • Make style bot trigger CI after push by @ydshieh in #38754
  • chore(pixtral): emit block attention mask when using flash attention by @starcatmeow in #38741
  • Update altCLIP model card by @EmileAydar in #38306
  • Add Qwen2 MoE model card by @rileyafox in #38649
  • [masking utils] check None instead of try/except by @zucchini-nlp in #38561
  • [Hotfix] Fix style bot by @ydshieh in #38779
  • Fix masking utils by @Cyrilvallez in #38783
  • [video processors] support frame sampling within processors by @zucchini-nlp in #38105
  • Skip some export tests on torch 2.7 by @ydshieh in #38677
  • Reduce verbosity for average_tokens_across_devices=True and world size = 1 by @qgallouedec in #38785
  • Update PULL_REQUEST_TEMPLATE.md by @qgallouedec in #38770
  • [docs] Add int4wo + 2:4 sparsity example to TorchAO README by @jcaip in #38592
  • Fix qwen_2_5 omni by @ydshieh in #38658
  • Fix llava_onevision tests by @ydshieh in #38791
  • Reword README in light of model definitions by @LysandreJik in #38762
  • Fix Typos in Comments: "quantitation" → "quantization", "averege" → "average" by @leopardracer in #38766
  • Initialize flash attn flag by @farnasirim in #38768
  • Fix mllama by @ydshieh in #38704
  • build: :pushpin: Remove upper bound on PyTorch by @KyleMylonakisProtopia in #38789
  • Remove all traces of low_cpu_mem_usage by @Cyrilvallez in #38792
  • [Docs] New DiT model card by @yushi2006 in #38721
  • Add missing div in Pegasus model card by @dross20 in #38773
  • Updated moonshine modelcard by @SohamPrabhu in #38711
  • refactor create_token_type_ids_from_sequences by @itazap in #37681
  • [docs] update cache docs with new info by @zucchini-nlp in #38775
  • Fix erroneous docstring for the ordering of SWA layers by @norpadon in #38794
  • Fix configs and doc for the Qwens by @Cyrilvallez in #38808
  • Unbreak optimum-executorch by @guangy10 in #38646
  • Disable custom MRA kernels for ROCm by @ahadnagy in #38738
  • Use HF papers by @qgallouedec in #38184
  • Simplify and update trl examples by @qgallouedec in #38772
  • Better pipeline type hints ✨ by @qubvel in #38049
  • Fix llava_next tests by @ydshieh in #38813
  • Expectation fixes and added AMD expectations by @remi-or in #38729
  • Use wandb.run.url instead of wandb.run.get_url() (deprecated) by @qgallouedec in #38817
  • Refactor DBRX tests to use CausalLMModelTest base classes by @Rocketknight1 in #38475
  • change fsdp_strategy to fsdp in TrainingArguments in accelerate doc by @PT-10 in #38807
  • Fix a minor security issue by @ydshieh in #38815
  • Fix trainer.py not showing signature columns by @nenesekai in #38465
  • Add V-JEPA for video classification model by @qubvel in #38788
  • fixed docstring in modular_qwen2_5_vl.py by @lawrencefeng17 in #38798
  • [docs] Update docs moved to the course by @stevhliu in #38800
  • [docs] updated roberta model card by @allmight05 in #38777
  • Updated Albert model Card by @souvikchand in #37753
  • [internvl] fix video inference by @zucchini-nlp in #38811
  • Fix redundant code in Janus by @yaswanth19 in #38826
  • bugfix: propage weight key_mapping to peft to fix 3.52 VLM renaming by @ManuelFay in #38627
  • Fix peft integration by @Cyrilvallez in #38841
  • Fix broken notebooks link in Italian training docs by @VolodymyrBg in #38834
  • Fix broken tag in Longformer model card by @dross20 in #38828
  • [BugFix] QA pipeline edge case: align_to_words=True in QuestionAnsweringPipeline can lead to duplicate answers by @yushi2006 in #38761
  • GraniteMoeHybrid: Allow for only shared expert case. by @shawntan in #38801
  • Updated aya_vision.md by @1himan in #38749
  • Remove merge conflict artifacts in Albert model doc by @druvdub in #38849
  • [video processor] fix BC when no video config if found by @zucchini-nlp in #38840
  • Fix incorrect width ratio calculation in Llama4 image processor by @Jingxiang-Zhang in #38842
  • Allow customization of sdpa in executorch.py by @kimishpatel in #38827
  • Fix qwen2_5_vl tests by @ydshieh in #38845
  • Improve auxiliary_in_channels default behavior in UperNet by @simonreise in #37540
  • Fix qwen3 tests by @ydshieh in #38862
  • Update CvT documentation with improved usage examples and additional … by @sezan92 in #38731
  • Update roc bert docs by @SohamPrabhu in #38835
  • Post-PR fixes! by @Rocketknight1 in #38868
  • enable misc test cases on XPU by @yao-matrix in #38852
  • Fix phi4_multimodal tests by @ydshieh in #38816
  • Fix qwen3_moe tests by @ydshieh in #38865
  • Fix HQQ model param device transfer issue by @HighCWu in #38466
  • Fixed markdown for BertTokenizer's '[CLS]' token. by @eu90h in #38506
  • null deepspeed_plugin in args for wandb callback fake trainer by @winglian in #38867
  • More PYUP fixes by @cyyever in #38883
  • Fix loop var naming by @Rocketknight1 in #38885
  • [bugfix] fix ATTN_MASK_NPU device mismatch error on multi-device NPU … by @qykong in #38876
  • log: Add logging when using split_batches and per_device_train_batch_size by @KeshavSingh29 in #38633
  • Docs: Add custom fine-tuning tutorial to TrOCR model page by @Ashutosh-4485 in #38847
  • 36978 | Fast image processor for DPT model by @samrae7 in #37481
  • [video processor] fix slow tests by @zucchini-nlp in #38881
  • Update bamba model card by @druvdub in #38853
  • Add support for specifying revisions when pushing to Hub via internal Trainer call by @IsaacBreen in #36852
  • Use raise from e in hub.py utility by @Wauplin in #37241
  • [phi-4] use mel filters from audio utils by @eustlb in #36966
  • Fix fsmt tests by @ydshieh in #38904
  • Fix unnecessary super calls by @cyyever in #38897
  • align xpu's autocast behavior w/ cuda by using device agnostic torch APIs by @yao-matrix in #38284
  • Fix FalconMambaIntegrationTests by @ydshieh in #38566
  • Skip sdpa tests if submodule does not support sdpa by @ivarflakstad in #38907
  • Fix ReDOS in tokenizer digit substitution by @Rocketknight1 in #38844
  • feat: Add granite architectures to auto tokenizer name mappings by @gabe-l-hart in #38802
  • Allow make-fixup on main branch, albeit slowly by @Rocketknight1 in #38892
  • feat: add flexible Liger Kernel configuration to TrainingArguments by @hamza-hcompany in #38911
  • Remove deprecated classes in modeling_utils.py by @Cyrilvallez in #38919
  • Skip some tests for now by @ydshieh in #38931
  • Modernbert fixes by @remi-or in #38912
  • add pytorch-xpu Dockerfile by @yao-matrix in #38875
  • Remove ALL_LAYERNORM_LAYERS by @Cyrilvallez in #38922
  • [static cache] fix device map per layer in VLMs by @zucchini-nlp in #38488
  • Add kwargs for timm.create_model in TimmWrapper by @qubvel in #38860
  • Pin PyTorch extras for AMD containers by @ahadnagy in #38941
  • Correctly raise error for awq quantization by @Cyrilvallez in #38945
  • Fix more flaky test_initialization by @ydshieh in #38932
  • Switch to use A10 progressively by @ydshieh in #38936
  • Fix custom generate from local directory by @manueldeprada in #38916
  • Update blip model card by @devkade in #38513
  • Gaudi3 CI by @IlyasMoutawwakil in #38790
  • Fix DTensor import compatibility for PyTorch < 2.5 by @Benoqtr in #38836
  • Fix(informer): Correct tensor shape for input_size=1 by @Flink-ddd in #38856
  • [modular] CLI allows positional arguments, and more defaults names for the optional arg by @Cyrilvallez in #38979
  • Remove dead protected imports by @Cyrilvallez in #38980
  • Break tie in Expectations and gemma3 fixes by @remi-or in #38943
  • Add Idefics2/3 and SmolVLM Fast image processors + improvements for fast image processors by @yonigozlan in #38157
  • fix: add bool operator to tokenizer to avoid bloated asserts by @kallewoof in #38899
  • Add support for auto_docstring with model outputs by @yonigozlan in #38242
  • fix mistral and mistral3 tests by @ydshieh in #38978
  • [Feature] Support is_split_into_words in the TokenClassificationPipeline. by @yushi2006 in #38818
  • Fix rag by @ydshieh in #38585
  • [docs] Typos - Single GPU efficient training features by @casinca in #38964
  • [qwen] refactor attentions for vision/audio by @zucchini-nlp in #38930
  • Removing extra space in large command for speech-pretraining example by @dggaytan in #38705
  • [Attention] Small fix on output attentions by @vasqu in #38948
  • Fixes for Arcee model by @Cyrilvallez in #39001
  • Added scikit-learn to the example image-classification requirements.txt by @mylonjones in #37506
  • Update attention_visualizer.py by @Tanuj-rai in #37860
  • Skip non-selected experts for qwen3_moe by @seven-mile in #38133
  • Fix undeterministic order in modular dependencies by @Cyrilvallez in #39005
  • Granite speech - minor fixes to support training with the HF trainer by @avihu111 in #38833
  • Fix bugs in DynamicCache by @tugsbayasgalan in #37880
  • Update self-comment-ci.yml user list by @ivarflakstad in #39014
  • Skip sdpa dispatch on flash test due to unsupported head dims by @ivarflakstad in #39010
  • [HPU][Critical Issue Fix] ThreadPool instead of Pool for parallel pre-processing by @dsmertin in #39002
  • Add Hugging Face authentication procedure for IDEs (PyCharm, VS Code,… by @marcndo in #38954
  • [LightGlue] Fixed attribute usage from descriptor_dim to keypoint_detector_descriptor_dim by @sbucaille in #39021
  • Add zero dim tensor check when using flash_attention by @ranzhejiang in #38280
  • Fix graph break in torch.compile when using FA2 with attention_mask=None and batch size > 1 by @efsotr in #37332
  • [AutoModelForMaskGeneration] Remove duplicate code by @NielsRogge in #38622
  • [video processor] support torchcodec and decrease cuda memory usage by @zucchini-nlp in #38880
  • Drop unnecessary tokens in GPT2Model generation by @null-pointer-access in #39016
  • Fix the seamless_m4t cannot work on Gaudi by @yuanwu2017 in #38363
  • fix: astronomical loss with ModernBERT when using gradient checkpointing by @umarbutler in #38982
  • fix gemma3 grad acc by @SunMarc in #37208
  • Remove script datasets in tests by @lhoestq in #38940
  • Fix grammatical error in models documentation by @marcndo in #39019
  • refactor: remove custom BarkLayerNorm by @eginhard in #39003
  • [Kyutai-STT] correct model type + model id by @eustlb in #39035
  • Two ReDOS fixes by @Rocketknight1 in #39013
  • [tests] remove TF tests (uses of require_tf) by @gante in #38944
  • Granite speech speedup + model saving bugfix by @avihu111 in #39028
  • Fix Bad Outputs in Fast Path for GraniteMoeHybrid by @alex-jw-brooks in #39033

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ydshieh
    • CI reporting improvements (#38230)
    • add liger-kernel to docker file (#38292)
    • add vasqu to self-comment-ci.yml (#38324)
    • new failure CI reports for all jobs (#38298)
    • Hot fix for AMD CI workflow (#38349)
    • Uninstall kernels for AMD docker images (#38354)
    • Use one utils/notification_service.py (#38379)
    • update gemma tests (#38384)
    • Update CsmForConditionalGenerationIntegrationTest (#38424)
    • Fix CircleCI not triggered when PR is opened from a branch of huggingface/transformers (#38413)
    • Trigger doc-builder job after style bot (#38398)
    • Fix GLM4 checkpoints (#38412)
    • Fix Gemma3IntegrationTest (#38471)
    • Fix Gemma2IntegrationTest (#38492)
    • Fix blip2 tests (#38510)
    • Update docker image to use av (#38548)
    • Fix utils/notification_service.py (#38556)
    • Fix chameleon tests (#38565)
    • update utils/notification_service.py for AMD vs Nvidia (#38563)
    • Fix deepseekv3 (#38562)
    • Remove custom pytest and pluggy (#38589)
    • pin pandas (#38605)
    • Fix return_dict=False giving errors in a few VLM models (#38519)
    • Improve test_initialization (#38607)
    • Use torch 2.7.1 on CircleCI jobs (#37856)
    • update ColQwen2ModelIntegrationTest (#38583)
    • Improve test_initialization for SwiftFormer (#38636)
    • Don't run AriaForConditionalGenerationModelTest on CircleCI (#38615)
    • Better CI (#38552)
    • Skip torchscript tests for 2 models (#38643)
    • Fix InternVL integration test (#38612)
    • Use torch 2.7.1 on daily CI (#38620)
    • Fix aya_vision test (#38674)
    • Update some tests for torch 2.7.1 (#38701)
    • Fix llava tests (#38722)
    • Revert "Trigger doc-builder job after style bot" (#38735)
    • Make style bot trigger CI after push (#38754)
    • [Hotfix] Fix style bot (#38779)
    • Skip some export tests on torch 2.7 (#38677)
    • Fix qwen_2_5 omni (#38658)
    • Fix llava_onevision tests (#38791)
    • Fix mllama (#38704)
    • Fix llava_next tests (#38813)
    • Fix a minor security issue (#38815)
    • Fix qwen2_5_vl tests (#38845)
    • Fix qwen3 tests (#38862)
    • Fix phi4_multimodal tests (#38816)
    • Fix qwen3_moe tests (#38865)
    • Fix fsmt tests (#38904)
    • Fix FalconMambaIntegrationTests (#38566)
    • Skip some tests for now (#38931)
    • Fix more flaky test_initialization (#38932)
    • Switch to use A10 progressively (#38936)
    • fix mistral and mistral3 tests (#38978)
    • Fix rag (#38585)
  • @ArthurZucker
    • tp plan should not be NONE (#38255)
    • Protect ParallelInterface (#38262)
    • Add CB (#38085)
    • 🚨Early-error🚨 config will error out if output_attentions=True and the attn implementation is wrong (#38288)
    • for now disable compile (#38383)
    • make it go brrrr (#38409)
  • @younesbelkada
    • [MODEL] Add Falcon H1 (#38249)
  • @cyr0930
    • fix multi-image case for llava-onevision (#38084)
  • @cyyever
    • Improve typing in TrainingArgument (#36944)
    • More typing in src/transformers/training_args.py (#38106)
    • Fix run_slow (#38314)
    • Remove deprecated use_flash_attention_2 parameter (#37131)
    • Use OSError (#38712)
    • More PYUP fixes (#38883)
    • Fix unnecessary super calls (#38897)
  • @ritsumei-aoi
    • Remove Japanese sequence_classification doc and update references (#38246)
  • @yao-matrix
    • add XPU info print in print_env (#38282)
    • refine transformers env output (#38274)
    • switch to device agnostic device calling for test cases (#38247)
    • enable large_gpu and torchao cases on XPU (#38355)
    • enable more test cases on xpu (#38572)
    • remove ipex_optimize_model usage (#38632)
    • from 1.11.0, torchao.prototype.low_bit_optim is promoted to torchao.optim (#38689)
    • enable misc test cases on XPU (#38852)
    • align xpu's autocast behavior w/ cuda by using device agnostic torch APIs (#38284)
    • add pytorch-xpu Dockerfile (#38875)
  • @vasqu
    • 🔴🔴🔴 [Attention] Refactor Attention Interface for Bart-based Models (#38108)
    • [FlexAttention] Reenable flex for encoder-decoder and make the test more robust (#38321)
    • [OPT] Fix attention scaling (#38290)
    • 🔴[Attention] Attention refactor for Whisper-based models (#38235)
    • [FlexAttn] Fix models with unique characteristics (#38433)
    • [Attention] Small fix on output attentions (#38948)
  • @itazap
    • refactor can_save_slow_tokenizer (#37722)
    • remove unhandled parameter (#38145)
    • refactor create_token_type_ids_from_sequences (#37681)
  • @eustlb
    • [CSM] infer codec model with no_grad + audio eos label (#38215)
    • [CSM] update model id (#38211)
    • [phi-4] use mel filters from audio utils (#36966)
    • Add kyutai stt (#38909)
    • [Kyutai-STT] correct model type + model id (#39035)
  • @RogerSinghChugh
    • Updated BigBird Model card as per #36979. (#37959)
    • Updated BERTweet model card. (#37981)
    • New bart model card (#37858)
    • New gpt neo model card (#38505)
    • New canine model card (#38631)
  • @1himan
    • Updated the Model docs - for the ALIGN model (#38072)
    • Updated Aria model card (#38472)
    • Updated aya_vision.md (#38749)
  • @Avasam
    • Merge type hints from microsoft/python-type-stubs (post dropping support for Python 3.8) (#38335)
  • @remi-or
    • [seamless_m4t] Skip some tests when speech is not available (#38430)
    • [janus] Fix failing tests on mi3XX (#38426)
    • Fixed a multiple-devices issue in SmolVLM model (#38736)
    • Expectation fixes and added AMD expectations (#38729)
    • Modernbert fixes (#38912)
    • Break tie in Expectations and gemma3 fixes (#38943)
  • @tonywu71
    • Add ColQwen2 to 🤗 transformers (#35778)
  • @geetu040
    • Add support for MiniMax's MiniMax-Text-01 (#35831)
    • Fix MiniMax (docs and integration tests checkpoint) (#38575)
  • @sbucaille
    • Fixed modeling_auto.py MODEL_FOR_MASK_GENERATION_MAPPING_NAMES variable (#38664)
    • Add LightGlue model (#31718)
    • [LightGlue] Fixed attribute usage from descriptor_dim to keypoint_detector_descriptor_dim (#39021)
  • @samrae7
    • 36978 | Fast image processor for DPT model (#37481)
  • @Crystalcareai
    • Add Arcee model support (#38621)
  • @zRzRzRzRzRzRzR
    • GLM-4.1V Model support (#38431)
  • @bzhangGo
    • Encoder-Decoder Gemma (#38332)
  • @redmoe-moutain
    • [Model] add dots1 (#38143)
  • @EduardDurech
    • Support for Flash Attention 3 (#38972)