Hugging Face/Transformers

Releases: 13 · Average: 4/mo · Versions: v4.57.4 → v5.5.3
Apr 13, 2026
Patch release v5.5.4

This release mostly contains fixes that are good to have as soon as possible, primarily for tokenizers:

  • Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex Attribute… (#45305) by ArthurZucker

For training:

  • Fix #45305 + add regression test GAS (#45349) by florian6973, SunMarc
  • Fix IndexError with DeepSpeed ZeRO-3 when kernels rotary is active (#…) by ArthurZucker

And for Qwen2.5-VL:

  • Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330) by Kash6, zucchini-nlp

Apr 9, 2026
Patch release: v5.5.3

Small patch release to fix device_map support for Gemma4! It contains the following commit:

  • [gemma4] Fix device map auto (#45347) by @Cyrilvallez
Patch release: v5.5.2

Small patch dedicated to optimizing Gemma4: it fixes inference with use_cache=False (broken due to k/v state sharing between layers), as well as conversion mappings for some models that would inconsistently serialize their weight names. It contains the following PRs:

  • Add MoE to Gemma4 TP plan (#45219) by @sywangyi and @Cyrilvallez
  • [gemma4] Dissociate kv states sharing from the Cache (#45312) by @Cyrilvallez
  • [gemma4] Remove all shared weights, and silently skip them during loading (#45336) by @Cyrilvallez
  • Fix conversion mappings for vlms (#45340) by @Cyrilvallez
Patch release v5.5.1

This patch is very small and focuses on vLLM and Gemma4!

  • Fix export for gemma4 and add Integration tests (#45285) by @Cyrilvallez
  • Fix vllm cis (#45139) by @ArthurZucker

Apr 2, 2026
Release v5.5.0
<img width="2786" height="1504" alt="image" src="https://github.com/user-attachments/assets/6c8c878f-042b-4858-9f64-73fd9ccd7e4b" />

New Model additions

Gemma4

Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameter sizes. The architecture is mostly the same as in previous Gemma versions. The key differences are a vision processor that can output images within a fixed token budget and a spatial 2D RoPE that encodes vision-specific information across the height and width axes.

<img width="1478" height="1374" alt="image" src="https://github.com/user-attachments/assets/9d88bd1b-02ea-4829-b7d0-fac0e347d436" />

You can find all the original Gemma 4 checkpoints under the Gemma 4 release.

The key difference from previous Gemma releases is the new design for processing images of different sizes with a fixed budget of tokens. Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 keeps the image's natural aspect ratio while resizing it to fit the budget. There are a couple of constraints to follow:

  • The total number of pixels must fit within a patch budget
  • Both height and width must be divisible by 48 (= patch size 16 × pooling kernel 3)
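As a sketch of how such a resize could work (the fit_to_budget helper is hypothetical, not the actual image processor code): scale the image down until it fits the pixel budget, then round each side down to a multiple of 48.

```python
import math

def fit_to_budget(height, width, max_pixels, multiple=48):
    """Scale (height, width) down to fit a pixel budget while keeping the
    aspect ratio, then round each side down to a multiple of 48
    (patch size 16 x pooling kernel 3)."""
    scale = min(1.0, math.sqrt(max_pixels / (height * width)))
    new_h = max(multiple, int(height * scale) // multiple * multiple)
    new_w = max(multiple, int(width * scale) // multiple * multiple)
    return new_h, new_w

# e.g. a 1920x1080 photo under a ~645K-pixel budget (280 soft tokens)
h, w = fit_to_budget(1080, 1920, 280 * 9 * 16 * 16)
assert h % 48 == 0 and w % 48 == 0
assert h * w <= 280 * 9 * 16 * 16
```

Rounding down after scaling guarantees both constraints hold at once: the aspect ratio is approximately preserved, and the result never exceeds the pixel budget.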

[!IMPORTANT] Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).
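To make the distinction concrete, here is an illustrative comparison of the two conventions (the constants and helper names are standard/illustrative, not Gemma 4's internals):

```python
# Standard ImageNet statistics used by many vision models (per RGB channel).
IMAGENET_MEAN = (0.485, 0.456, 0.406)
IMAGENET_STD = (0.229, 0.224, 0.225)

def imagenet_normalize(x01, c):
    # What many vision models do to a [0, 1] pixel value in channel c.
    return (x01 - IMAGENET_MEAN[c]) / IMAGENET_STD[c]

def to_unit_range(x01):
    # What Gemma 4's pipeline effectively produces: a plain shift of
    # [0, 1] pixel values into the [-1, 1] range, with no per-channel stats.
    return 2.0 * x01 - 1.0

assert to_unit_range(0.0) == -1.0 and to_unit_range(1.0) == 1.0
```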

The number of "soft tokens" (aka vision tokens) an image processor can produce is configurable. The supported options are outlined below and the default is 280 soft tokens per image.

Soft Tokens   Patches (before pooling)   Approx. Image Area
70            630                        ~161K pixels
140           1,260                      ~323K pixels
280           2,520                      ~645K pixels
560           5,040                      ~1.3M pixels
1,120         10,080                     ~2.6M pixels
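The rows above follow directly from the patch and pooling sizes: each soft token pools a 3×3 block of 16×16-pixel patches, so patches = soft tokens × 9 and the image area is roughly patches × 256 pixels. A quick check:

```python
# Each soft token is a 3x3 pool of 16x16-pixel patches, so:
#   patches = soft_tokens * 9
#   pixels ~= patches * 16 * 16
for soft_tokens in (70, 140, 280, 560, 1120):
    patches = soft_tokens * 3 * 3
    pixels = patches * 16 * 16
    print(f"{soft_tokens:>5} tokens -> {patches:>6} patches -> ~{pixels:,} pixels")
```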

To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The position table stores up to 10,240 positions per axis, which allows the model to handle very large images. Each position is a learned vector with the same dimension as the patch embedding. The 2D RoPE that Gemma 4 uses independently rotates half of the attention head dimensions for the x-axis and the other half for the y-axis. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
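A rough sketch of the idea (illustrative only, not the actual implementation): apply a standard 1D rotary rotation using the patch's x coordinate on one half of the head dimensions, and its y coordinate on the other half.

```python
import math

def rope_1d(vec, pos, base=10000.0):
    """Standard 1D RoPE over consecutive pairs of dimensions."""
    out = list(vec)
    for i in range(len(vec) // 2):
        theta = pos * base ** (-2 * i / len(vec))
        a, b = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = a * math.cos(theta) - b * math.sin(theta)
        out[2 * i + 1] = a * math.sin(theta) + b * math.cos(theta)
    return out

def rope_2d(vec, x, y):
    """Rotate the first half of the head dims by the x position and the
    second half by the y position."""
    half = len(vec) // 2
    return rope_1d(vec[:half], x) + rope_1d(vec[half:], y)

q = [1.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 0.0]
# Changing only the y position leaves the x half of the rotation untouched.
assert rope_2d(q, 3, 5)[:4] == rope_2d(q, 3, 9)[:4]
```

Because each half only sees one coordinate, relative offsets along x and y are encoded independently, which is what lets attention distinguish "left of" from "above."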

NomicBERT

NomicBERT is a BERT-inspired encoder model that applies Rotary Position Embeddings (RoPE) to create reproducible long context text embeddings. It is the first fully reproducible, open-source text embedding model with 8192 context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short-context MTEB and long context LoCo benchmarks. The model generates dense vector embeddings for various tasks including search, clustering, and classification using specific instruction prefixes.

Links: Documentation | Paper

  • Internalise the NomicBERT model (#43067) by @ed22699 in #43067

MusicFlamingo

Music Flamingo is a fully open large audio–language model designed for robust understanding and reasoning over music. It builds upon the Audio Flamingo 3 architecture by including Rotary Time Embeddings (RoTE), which injects temporal position information to enable the model to handle audio sequences up to 20 minutes. The model features a unified audio encoder across speech, sound, and music with special sound boundary tokens for improved audio sequence modeling.

Links: Documentation | Paper

  • Add Music Flamingo (#43538) by @lashahub in #43538

Breaking changes

Mamba and hybrid model caches are now first-class native citizens in the library, so users working with Mamba-based or hybrid (Mamba + attention) models should update their code to use the new native cache classes instead of any previous workarounds.

  • 🚨 [Cache] Native mamba & hybrid cache (#44950) by @Cyrilvallez

Remote code execution support has been removed from the native LightGlue integration, so users who were loading LightGlue with trust_remote_code=True must remove that argument and use the model directly through the standard native API.

  • 🚨 [LightGlue] Remove remote code execution (#45122) by @vasqu

Vision

Several vision-related bugs were fixed in this release, including correcting the Gemma vision mask to support video inputs, resolving a dependency issue that incorrectly required torchvision for PIL-based image processors, and patching bugs in the Janus image generation model and image loading. Local code resolution for tokenizers and image processors was also corrected.

  • Generalize gemma vision mask to videos (#45185) by @zucchini-nlp in [#45185]
  • Fix explicit local code resolution for tokenizers and image processors (#45169) by @hmellor in [#45169]
  • fix bug for janus model image generation (#45044) by @kaixuanliu in [#45044]
  • [Bugfix] Remove incorrect torchvision requirement from PIL backend image processors (#45045) by @Lidang-Jiang in [#45045]
  • Avoid Image.open failure (#44645) by @sywangyi in [#44645]

Cache

Improved the performance of repository checks (check-repo) by introducing file-level and AST-level disk caching, achieving up to a 27x speedup (from ~46s to ~1.6s with a warm cache), and fixed the mlinter cache location in .gitignore.

  • refactoring: speedup static checks with disk cache (#44992) by @tarekziade in [#44992]
  • refactor: added cache in check_repo (#45012) by @tarekziade in [#45012]
  • chore: Fix mlinter cache location (#45052) by @tarekziade in [#45052]

Bugfixes and improvements

  • Fix resized LM head weights being overwritten by post_init (#45079) by @javierdejesusda in [#45079]
  • [Qwen3.5 MoE] Add _tp_plan to ForConditionalGeneration (#45124) by @danielquintas8 in [#45124]
  • fix(models): Fix dtype mismatch in SwitchTransformers and TimmWrapperModel (#45074) by @harshaljanjani in [#45074]
  • [misc] fix qwen35 tests: correct the text model type and skip reverse_mapping (#45173) by @JJJYmmm in [#45173]
  • 🔒 Pin GitHub Actions to commit SHAs (#45180) by @paulinebm in [#45180]
  • Use doc-builder runnable example for GLM-ASR (#44277) by @tarekziade in [#44277]
  • [CI] Small T5 expectations updated (#45138) by @Abdennacer-Badaoui in [#45138]
  • fix: correct type annotations across config classes for @strict validation (#45007) by @Krishnachaitanyakc in [#45007]
  • Fix T5Attention shape mismatch under Tensor Parallelism (#45109) by @aws-zhanxun in [#45109]
  • [refactor] Serving into proper modules (#44796) by @SunMarc in [#44796]
  • Re-add regex substitutions to the response parsing spec (#45166) by @Rocketknight1 in [#45166]
  • Fix incorrect TrainingArguments example in training.md (#45150) by @maanas1234 in [#45150]
  • Add parse_response to Processor, make it a bit more official (#45143) by @Rocketknight1 in [#45143]
  • DeepGEMM (#44832) by @IlyasMoutawwakil in [#44832]
  • fix: prefer registered config over remote code in AutoConfig.from_pretrained (#45094) by @HanFa in [#45094]
  • [serving] Fix continuous batching JSON response serialization (#45057) by @NathanHB in [#45057]
  • Fix stupid test fetcher (#45140) by @ydshieh in [#45140]
  • [CB] Add warmup feature (#45112) by @remi-or in [#45112]
  • feature: added import complexity checker (#45013) by @tarekziade in [#45013]
  • Fix tests for janus model (#44739) by @kaixuanliu in [#44739]
  • CB improvements for serving (#45063) by @SunMarc in [#45063]
  • [docs] continuous batching (#44896) by @stevhliu in [#44896]
  • Fix few issues in Qwen_3_Omni_Moe (#44848) by @Sai-Suraj-27 in [#44848]
  • Fix TypeError in rope validation when ignore_keys is a list (#45069) by @Fr0do in [#45069]
  • Remove unused TensorFlow env var (#45065) by @Sai-Suraj-27 in [#45065]
  • fix: add identity reverse_op to dequantize ops for save_pretrained (#44983) by @Hyungkeun-Park-Nota in [#44983]
  • Fix when RoPE params are in kwargs (#45049) by @zucchini-nlp in [#45049]
  • chore: update update_metdata.yml (#45054) by @hf-security-analysis[bot] in [#45054]
  • [FA] Fix BC support for a few versions + add deprecation cycle (#45061) by @vasqu in [#45061]
  • fix(testing): Fix Parakeet, Evolla, Pi0, and Phi-3 test failures on main CI (#45004) by @harshaljanjani in [#45004]
  • Allow advanced users to override model_type in AutoConfig.from_pretrained (#45058) by @hmellor in [#45058]
  • Fix failing SmolLM3IntegrationTest (#45048) by @Sai-Suraj-27 in [#45048]
  • chore: remove old extras (#45024) by @tarekziade in [#45024]
  • Embedding VLMs don't need a head (#45000) by @zucchini-nlp in [#45000]
  • Fix GraniteConfig type hints to accept int for multiplier fields (#45019) by @javierdejesusda in [#45019]
  • fix: preserve rotary_pct across save/load cycle in GPTNeoX configs (#44985) by @Krishnachaitanyakc in [#44985]

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ed22699
    • Internalise the NomicBERT model (#43067)
  • @tarekziade
    • Use doc-builder runnable example for GLM-ASR (#44277)
    • refactoring: speedup static checks with disk cache (#44992)
    • feature: added import complexity checker (#45013)
    • refactor: added cache in check_repo (#45012)
    • chore: remove old extras (#45024)
    • chore: Fix mlinter cache location (#45052)
    • refactor: speed up docstring checker (#45009)
  • @Krishnachaitanyakc
    • fix: correct type annotations across config classes for @strict validation (#45007)
    • fix: preserve rotary_pct across save/load cycle in GPTNeoX configs (#44985)
  • @lashahub
    • Add Music Flamingo (#43538)
  • @Lidang-Jiang
    • [Bugfix] Remove incorrect torchvision requirement from PIL backend image processors (#45045)
Mar 27, 2026
Release v5.4.0: PaddlePaddle models 🙌, Mistral 4, PI0, VidEoMT, UVDoc, SLANeXt, Jina Embeddings v3

New Model additions

VidEoMT

<img width="1480" height="460" alt="image" src="https://github.com/user-attachments/assets/bec6fc25-b0ab-4227-8c2b-a838554f37f3" />

Video Encoder-only Mask Transformer (VidEoMT) is a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It eliminates the need for dedicated tracking modules by introducing a lightweight query propagation mechanism that carries information across frames and employs a query fusion strategy that combines propagated queries with temporally-agnostic learned queries. VidEoMT achieves competitive accuracy while being 5x-10x faster than existing approaches, running at up to 160 FPS with a ViT-L backbone.

Links: Documentation | Paper

  • Add VidEoMT (#44285) by @NielsRogge in #44285

UVDoc

<img width="1765" height="875" alt="image" src="https://github.com/user-attachments/assets/365e510e-8fb8-46cb-8f4b-e8b7082f0ae2" />

UVDoc is a machine learning model designed for document image rectification and correction. The main purpose of this model is to carry out geometric transformation on images to correct document distortion, inclination, perspective deformation and other problems in document images. It provides both single input and batched inference capabilities for processing distorted document images.

Links: Documentation

  • [Model] Add UVDoc Model Support (#43385) by @XingweiDeng in #43385

Jina Embeddings v3

<img width="595" height="513" alt="image" src="https://github.com/user-attachments/assets/2aee0692-8286-4c6b-98db-847b95ab2d40" />

Jina-Embeddings-v3 is a multilingual, multi-task text embedding model designed for a variety of NLP applications. Based on the XLM-RoBERTa architecture, the model uses Rotary Position Embeddings (RoPE) in place of absolute position embeddings to support long input sequences of up to 8192 tokens. Additionally, it features 5 built-in task-specific LoRA adapters that allow the model to generate task-specific embeddings (e.g., for retrieval vs. classification) without significantly increasing inference latency.

Links: Documentation | Paper

  • Add Jina-Embeddings-V3 Model (#44251) by @Sai-Suraj-27 in #44251

Mistral4

<img width="2429" height="1787" alt="image" src="https://github.com/user-attachments/assets/a6feb0da-8504-4eab-be65-22d6c676336f" />

Mistral 4 is a powerful hybrid model with the capability of acting as both a general instruction model and a reasoning model. It unifies the capabilities of three different model families - Instruct, Reasoning (previously called Magistral), and Devstral - into a single, unified model. The model features a MoE architecture with 128 experts and 4 active, 119B parameters with 6.5B activated per token, 256k context length, and supports multimodal input with both text and image processing capabilities.

Links: Documentation

  • Add Mistral 4 (#44760) by @juliendenize in #44760
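The "128 experts, 4 active per token" routing described above can be sketched with a toy top-k router (illustrative only; the route function and random logits are not Mistral 4's code):

```python
import math
import random

def route(router_logits, k=4):
    """Pick the k highest-scoring experts for a token and softmax-normalize
    their routing weights, so only k experts' parameters run per token."""
    topk = sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)[:k]
    exps = [math.exp(router_logits[i]) for i in topk]
    total = sum(exps)
    return [(i, e / total) for i, e in zip(topk, exps)]

random.seed(0)
logits = [random.gauss(0, 1) for _ in range(128)]  # one router score per expert
chosen = route(logits)
assert len(chosen) == 4                            # 4 of 128 experts are active
assert abs(sum(w for _, w in chosen) - 1.0) < 1e-9 # weights sum to 1
```

This sparsity is why the total parameter count (119B) is so much larger than the parameters activated per token (6.5B): only the routed experts, plus the shared layers, participate in each forward pass.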

PI0

PI0 is a vision-language-action model for robotics manipulation that jointly processes visual observations and language instructions to generate robot actions. It uses a novel flow matching architecture built on top of a pre-trained vision-language model to inherit Internet-scale semantic knowledge. The model can perform complex dexterous tasks like laundry folding, table cleaning, and assembling boxes across multiple robot platforms including single-arm robots, dual-arm robots, and mobile manipulators.

Links: Documentation | Paper

  • Add model lerobot PI0 to transformers (#44160) by @molbap in #44160

SLANeXt

SLANeXt is a series of dedicated lightweight models for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. The SLANeXt series is a new generation of table structure recognition models independently developed by the Baidu PaddlePaddle Vision Team, with dedicated weights trained separately for wired and wireless tables. The recognition ability for all types of tables has been significantly improved, especially for wired tables.

Links: Documentation

  • [Model] Add SLANeXt Model Support (#43707) by @liu-jiaxuan in #43707

PP-OCRv5_mobile_rec

PP-OCRv5_mobile_rec is a dedicated lightweight model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.

Links: Documentation

  • [Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808) by @zhang-prog in #44808

PP-OCRv5_server_rec

PP-OCRv5_server_rec is a dedicated lightweight model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.

Links: Documentation

  • [Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808) by @zhang-prog in #44808

PP-OCRv5_mobile_det

PP-OCRv5_mobile_det is a dedicated lightweight model for text detection, focusing specifically on efficient detection and understanding of text elements in multi-language documents and natural scenes. It is part of the latest generation of text detection models developed by the PaddleOCR team that efficiently and accurately supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.

Links: Documentation

  • [Model] Add PP-OCRV5_mobile_det Model Support (#43247) by @XingweiDeng in #43247

PPLCNet

PP-LCNet is a family of efficient, lightweight convolutional neural networks designed for real-world document understanding and OCR tasks. It balances accuracy, speed, and model size, making it ideal for both server-side and edge deployment. The model has three main variants optimized for specific tasks: document image orientation classification, table classification, and text line orientation classification.

Links: Documentation

  • [Model] Add PP-OCRV5_mobile_det Model Support (#43247) by @XingweiDeng in #43247

PPLCNetV3

PPLCNetV3 is a lightweight CPU-optimized convolutional backbone designed for efficient image classification and downstream vision tasks. It builds on the PP-LCNet architecture with improved training strategies and structural refinements for better accuracy-latency tradeoffs on CPU hardware.

Links: Documentation | Paper

  • [Model] Add PP-OCRV5_mobile_det Model Support (#43247) by @XingweiDeng in #43247

PP-OCRv5_server_det

PP-OCRv5_server_det is a high-performance text detection model optimized for server-side applications, focusing on accurate detection of multi-language text in documents and natural scenes. It supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.

Links: Documentation

  • [Model] Add PP-OCRV5_server_det Model Support (#43274) by @XingweiDeng in #43274

CHMv2

CHMv2 is a global, meter-resolution canopy height mapping model that uses DINOv3 to estimate forest canopy heights from high-resolution optical satellite imagery. Building on the original canopy height maps released in 2024, CHMv2 delivers substantial improvements in accuracy, detail, and global consistency by leveraging Meta's self-supervised vision model. The model is trained against airborne laser scanning data and provides essential information for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure.

Links: Documentation | Paper | Blog Post

  • Add CHMv2 (#44595) by @yonigozlan in #44595

Breaking changes

The dual BaseImageProcessor/BaseImageProcessorFast design has been replaced with a unified backend architecture, and the image_processing_utils_fast module has been removed — users should migrate to the new unified image_processing_utils module.

  • 🚨🚨 Refactor Image Processors to support different backends (#43514) by @yonigozlan

PreTrainedConfig and model config classes have been refactored to use @dataclass and no longer accept positional arguments — users must update any config instantiation calls to use keyword arguments only.

  • 🚨 Validate config attributes (#41250) by @zucchini-nlp

Flash Attention 2 (FA2) support now requires version 2.3.3 or newer, and initial Flash Attention 4 (FA4) support has been added — users on older FA2 versions must upgrade to at least 2.3.3.

  • 🚨 [FA4] Initial support (#42435) by @vasqu
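Before enabling FA2, one might gate on the installed version with a check along these lines (a naive sketch that ignores pre-release suffixes; real code should use packaging.version):

```python
def meets_minimum(version: str, minimum: str = "2.3.3") -> bool:
    """Compare dotted version strings numerically, left to right."""
    parse = lambda v: tuple(int(p) for p in v.split(".")[:3])
    return parse(version) >= parse(minimum)

assert meets_minimum("2.3.3")       # exactly the floor: OK
assert meets_minimum("2.4.0")       # newer: OK
assert not meets_minimum("2.3.2")   # older than 2.3.3: must upgrade
```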

Weight tying behavior has changed so that weights are now tied even when both keys are already present in a checkpoint — users relying on the previous behavior (e.g., with .bin checkpoints containing duplicate keys) should verify their models load as expected.

  • [tie weights] 🚨 If both weights are present with same weights, still tie them (#44497) by @Cyrilvallez
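A toy illustration of what tying means here (plain Python objects, not the transformers internals): after tying, both modules reference one shared weight, so a checkpoint that stored both keys ends up with a single parameter.

```python
class Module:
    def __init__(self, weight):
        self.weight = weight

embedding = Module([[0.1, 0.2], [0.3, 0.4]])
lm_head = Module([[9.0, 9.0], [9.0, 9.0]])  # duplicate key in an old .bin checkpoint

lm_head.weight = embedding.weight           # tying: both point at one object
embedding.weight[0][0] = 5.0
assert lm_head.weight[0][0] == 5.0          # updates propagate to both modules
```

Under the old behavior, the duplicated checkpoint keys could leave the two weights distinct; now they are tied regardless, which is why models with intentionally different embedding and head weights in the same checkpoint should be re-verified.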

The cache_position argument has been removed from the forward signatures of most major models — users passing cache_position directly to these models should remove it, as it is now handled internally by generate.

  • [core] 🚨 Completely remove cache positions (#44181) by @Cyrilvallez

Parallelization

Several bug fixes and improvements were made to pipeline parallel (PP) and tensor parallel (TP) support, including fixing supports_tp/pp_plan detection, resolving attribute errors in PP for Qwen2VL-based models, correcting FSDP loading with meta devices, and ensuring TP weight sharding properly updates parent module attributes (e.g., in_features/out_features) to improve compatibility with libraries like PEFT.

  • Fix several based models' pipeline parallel support (#44699) by @hmellor in [#44699]
  • [Model] Add PP-Chart2Table Model Support (#43767) by @XingweiDeng in [#43767]
  • enable tp for benchmark (#43750) by @sywangyi in [#43750]
  • Fix supports_{tp/pp}_plan (#44696) by @hmellor in [#44696]
  • Allow to disable stdout hiding for TP (#44608) by @michaelbenayoun in [#44608]
  • fix FSDP loading with meta devices (#44473) by @winglian in [#44473]
  • Fix: Conditionally import torch.distributed.fsdp in trainer_seq2seq.py (#44507) by @0xDELUXA in [#44507]
  • Supplement skip logic for XPU in the CPU-only tp tests (#44536) by @YangKai0616 in [#44536]
  • Update parent module attributes when sharding with TP (#44421) by @michaelbenayoun in [#44421]
  • trigger tensor parallel utils test in the CI (#44460) by @3outeille in [#44460]

Quantization

Quantization support was improved with up to 30x faster FP8 grouped and batched matmuls, static FP8 expert support for multi-GPU setups, and a torchao minimum version bump to 0.15.0. Additionally, MXFP4 dependency error messages were made more actionable, and AWQ tests were updated to align with the GPTQModel migration.

  • fix: split MXFP4 dependency checks for specific error messages (#44930) by @javierdejesusda in [#44930]
  • Add static FP8 expert support (#44895) by @SunMarc in [#44895]
  • Bump torchao >=0.15 and fix quantization CI (#44604) by @SunMarc in [#44604]
  • Fix AWQ tests for GPTQModel migration (#44654) by @jiqing-feng in [#44654]
  • [Performance] FP8 Grouped and Batched Matmuls (#44231) by @IlyasMoutawwakil in [#44231]
  • Fix PR comment CI for quantization job (#44579) by @ydshieh in [#44579]

Tokenization

Several performance improvements were made to tokenizer loading and saving, including eliminating redundant file parsing and unnecessary deep copies of large vocabularies that caused significant overhead. Additionally, bug fixes were applied for incorrect tokenizer class names on the Hub (DeepSeek V2/V3, ModernBERT), a clean_up_tokenization_spaces misconfiguration in Llama 3 tokenizer conversion, and a string replacement issue in AutoTokenizer class name resolution.

  • fix: improve processor loading performance by avoiding redundant tokenizer parsing (#44927) by @ydshieh in [#44927]
  • fix processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894) by @ydshieh in [#44894]
  • fix: set clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion (#44914) by @maxsloef-goodfire in [#44914]
  • deepseek_v2, deepseek_v3, and modernbert fix for having incorrect tokenizer class on the hub (#44801) by @itazap in [#44801]
  • Add XPU Expectations for vibe voice acoustic tokenizer tests (#44428) by @kaixuanliu in [#44428]
  • fix(tokenizer): Only strip Fast from class names in AutoTokenizer if used as a suffix (#44443) by @harshaljanjani in [#44443]

Kernels

Kernel support has been expanded with Flash Attention 4 fallback integration, a paged_attention kernel for continuous batching, and Neuron device support for custom kernels. Several stability fixes were also made, including bumping the kernels version dependency to prevent crashes and correcting the LFM2 kernel path.

  • [FA4] Add kernels fallback (#44797) by @vasqu in [#44797]
  • Bump kernels version dependency to avoid crashes (#44887) by @Cyrilvallez in [#44887]
  • Fix lfm2 kernel path (#44634) by @Cyrilvallez in [#44634]
  • [CB] Add paged_attention kernel (#44379) by @remi-or in [#44379]
  • Neuron kernels integration (#44417) by @michaelbenayoun in [#44417]

Cache

Several cache-related fixes and improvements were made, including aligning LFM2's cache implementation with other Mamba caches, fixing a tensor indexing crash in KV cache continuation for the transformers serve streaming endpoint, and resolving a generation bug in Idefics3 when using use_cache=False. A caching layer was also added to the model linter to skip unchanged valid files and improve build performance.

  • Align lfm2 cache to other mamba caches (#44866) by @Cyrilvallez in [#44866]
  • feat: added cache to the model linter (#44790) by @tarekziade in [#44790]
  • Fix tensor indexing crash in serve generate_response KV cache continuation (#44735) by @mango766 in [#44735]
  • Idefics3 without cache fix (#44607) by @gabe-l-hart in [#44607]

Vision

Fixed backward compatibility for full-path imports of Fast Image Processors and resolved a Llama4 vision rotary embedding initialization error where freqs_ci was not registered as a buffer, causing failures when loading models with device_map="auto".

  • Fix backward compatibility for full path imports of Fast Image Processors (#44926) by @yonigozlan in [#44926]
  • fix(models, testing): Fix Llama4 vision rotary meta tensor initialization and MyT5 get_tokenizer signature (#44581) by @harshaljanjani in [#44581]
  • Fix AMD Docker image build timeout by pinning Flash Attention commit (#44546) by @Abdennacer-Badaoui in [#44546]

Generation

The cache_position argument has been fully removed from the generation pipeline, as all models have been updated to no longer use it (with a backward-compatibility path retained for remote code models). Additionally, integration tests for LASR with chunked decoding were added, and outdated references to deprecated pipeline tasks were cleaned up.

  • [generate] Never use cache_position anymore in generation (#44816) by @Cyrilvallez in [#44816]
  • Add an integration test for LASR using pipe and chunked decoding (#42823) by @kho in [#42823]
  • Fix: Remove references to text2text-generation, summarization and translation pipeline tasks (#44510) by @math-hiyoko in [#44510]

Bugfixes and improvements

  • Dynamic weight conversion is recursive (#44300) by @zucchini-nlp in [#44300]
  • Don't run tests_hub if no tests found (#45014) by @ydshieh in [#45014]
  • Fix type hint for attention_chunk_size in Llama4TextConfig (#45002) by @hmellor in [#45002]
  • Fix AutoProcessor.from_pretrained silently dropping hub kwargs (#44710) by @he-yufeng in [#44710]
  • Fix maybe_autocast crashing on meta device tensors (#44984) by @Butanium in [#44984]
  • fix: remove Copied from comments between @torch.jit.script and def for Python 3.13 compat (#44986) by @Krishnachaitanyakc in [#44986]
  • More small vllm fixes (#44990) by @ArthurZucker in [#44990]
  • fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size (#44899) by @harshaljanjani in [#44899]
  • Allow mm_token_type be non-padded lists (#44563) by @zucchini-nlp in [#44563]
  • Fix CPU 16 bytes alignment issue using equivalent fallback (#44970) by @IlyasMoutawwakil in [#44970]
  • refactor: unify QA calls (#44879) by @tarekziade in [#44879]
  • Fix tie_word_embedding issues with Qwen2VL (#44976) by @hmellor in [#44976]
  • Support Modular (!!) + Configs in check_auto_docstrings (#44803) by @yonigozlan in [#44803]
  • [ vllm x v5] nit (#44971) by @ArthurZucker in [#44971]
  • LwDetrImageLoss: Fix dtype casting to prevent crash when using amp on cuda device (#44886) by @m-matthias in [#44886]
  • [AMD CI] Gemma3/Gemma3n Expectations (#44972) by @Abdennacer-Badaoui in [#44972]
  • Officially launch parse_response (#44674) by @Rocketknight1 in [#44674]
  • fix load_best_model_checkpoint_at_end do not load the best model chec… (#44583) by @wilnn in [#44583]
  • Fix failing T5ModelIntegrationTest (#44934) by @Sai-Suraj-27 in [#44934]
  • Config kwargs (#44953) by @zucchini-nlp in [#44953]
  • [CB] [Minor] Simplify test suite (#44858) by @remi-or in [#44858]
  • Allow arbitrary template kwargs in processors (#44881) by @zucchini-nlp in [#44881]
  • Fix missing post_processor in DebertaV2Tokenizer causing no special t… (#44570) by @umbilnm in [#44570]
  • incorrect model list update (#44880) by @itazap in [#44880]
  • refactor: mlinter as its own package (#44939) by @tarekziade in [#44939]
  • [CB] Add an option to return logprobs (#44835) by @remi-or in [#44835]
  • [docs] peft (#44804) by @stevhliu in [#44804]
  • Continuous batching thread safety (#44924) by @Qubitium in [#44924]
  • Fix variable shadowing in pipeline example and typo in BART docs (BERT → BART) (#44935) by @VanshikaSohal in [#44935]
  • Fix failing job Update Transformers metadata after #43514 (#44941) by @ydshieh in [#44941]
  • Clearer type hints and fix rope validation in configs (#44943) by @zucchini-nlp in [#44943]
  • Correct docstrings for from_pretrained (url input deprecated) (#44946) by @BSchilperoort in [#44946]
  • fix(i18n): replace broken relative links to awesome-transformers.md with absolute URLs (#44905) by @NicoleRobin in [#44905]
  • chore(typing): added rule 11 (#44865) by @tarekziade in [#44865]
  • fix(camembert): add tie_word_embeddings=True to CamembertConfig (#44931) by @r266-tech in [#44931]
  • Support SizeDict import in get_size_dict (#44903) by @yonigozlan in [#44903]
  • Add big angry code agent warnings! (#44890) by @Rocketknight1 in [#44890]
  • [docs] model cards (#44837) by @stevhliu in [#44837]
  • Add backward compatibility for direct imports from legacy image_processing_utils_fast (#44897) by @yonigozlan in [#44897]
  • Fix core dumped when NemotronH is torch compiled (#44854) by @ydshieh in [#44854]
  • fix(testing): Fix PaliGemma 2 and PaddleOCR-VL test failures on main (#44765) by @harshaljanjani in [#44765]
  • Fix dtype guessing from state dict (#44883) by @Cyrilvallez in [#44883]
  • Add missing dunder methods to SizeDict (#44884) by @hmellor in [#44884]
  • Fix VL model rope_deltas batch size mismatch in online RL training (#44873) by @sergiopaniego in [#44873]
  • Fix layer_types type hint for AFMoE and Llama4 (#44874) by @hmellor in [#44874]
  • Fix nemotron config docstrings (#44878) by @Cyrilvallez in [#44878]
  • Fix nemotron_h modular (#44876) by @Cyrilvallez in [#44876]
  • [Mistral] Fix query scaling for Mistral4 and Ministral3 (#44860) by @Cyrilvallez in [#44860]
  • Update some type hints (#44851) by @zucchini-nlp in [#44851]
  • Fix glm dsa (#44564) by @ArthurZucker in [#44564]
  • Update AFMoE architecture to use v5-style MoE impl (#44063) by @AutumnAurelium in [#44063]
  • Fix KeyError in convert_to_native_format for dict vocab (#44452) in [#44452]
  • fix: XLNet: relative_positional_encoding computes on CPU every forward (#44782) by @JiwaniZakir in [#44782]
  • Fix annotations reader for python 3.14 in PreTrainedModel (#44672) by @neo in [#44672]
  • [CB] Better parametrization for compile (#44578) by @remi-or in [#44578]
  • Fix KeyError when patching mistral regex (#43376) by @LeonardoEmili in [#43376]
  • Correct code block formatting in weightconverter.md (#44839) by @zhulinchng in [#44839]
  • feat(ci): added a network debug report (#44636) by @tarekziade in [#44636]
  • Add GreedyLR adaptive learning rate scheduler (#44271) by @balak4 in [#44271]
  • Fix unexpected position_ids keys when loading OwlViT models (#44508) by @KartikPawade in [#44508]
  • Update more modular examples (#44834) by @Cyrilvallez in [#44834]
  • Fix and re-run modular converter on examples (#44833) by @Cyrilvallez in [#44833]
  • Remove cache_position in more models (4 and last one) (#44828) by @Cyrilvallez in [#44828]
  • Fix loading issue in Sam3 (#44831) by @zucchini-nlp in [#44831]
  • feat(integration): Add KubeflowCallback to enable automatic progress … (#44487) by @abhijeet-dhumal in [#44487]
  • Add GGUF support for MiniMax-M2.1 model (#44526) by @JoursBleu in [#44526]
  • Centralize AI agent templates in .ai (#44489) by @tarekziade in [#44489]
  • support xxxFast alias in v5 tokenizers (#44766) by @itazap in [#44766]
  • Remove cache_position in more models (3) (#44759) by @Cyrilvallez in [#44759]
  • [CI] Temporarily skip Mistral4 tests as they almost all fail (#44825) by @Cyrilvallez in [#44825]
  • [Gemma] Update conversion scripts for Transformers v5 Compatibility (#44631) by @RyanMullins in [#44631]
  • fix bug embedding_size mismatch with hidden_size in electra model test (#44657) by @kaixuanliu in [#44657]
  • Fix pegasus conversion (#44571) by @ArthurZucker in [#44571]
  • Fix repo-check bot (#44812) by @ydshieh in [#44812]
  • [docs] is_causal feature (#44777) by @stevhliu in [#44777]
  • docs(tasks): remove references to removed question-answering pipeline (#44787) in [#44787]
  • Fix configs with @strict (#44770) by @zucchini-nlp in [#44770]
  • [AMD CI] Fix test failures across important models (#44632) by @Abdennacer-Badaoui in [#44632]
  • Move VLM conversions to the main mapping (#44627) by @zucchini-nlp in [#44627]
  • Fix config loading issues (type issues) (#44789) by @ydshieh in [#44789]
  • Remove is_causal from EuroBertConfig (#44774) by @ydshieh in [#44774]
  • model-linter: Added rule 10 (#44761) by @tarekziade in [#44761]
  • [fix] mistral 4 docs (#44776) by @stevhliu in [#44776]
  • Fix: Eurobert model was missing @strict decorator and invalid test kwargs (#44767) by @tarekziade in [#44767]
  • fix: sig lip import (#44764) by @tarekziade in [#44764]
  • Disable async loading when quantizing on the fly (#44576) by @SunMarc in [#44576]
  • [MistralCommonBackend] Upgrade mistral-common to v1.10.0 (#44656) by @juliendenize in [#44656]
  • Fix mlcd auto config/model/mapping issues (#44730) by @ydshieh in [#44730]
  • Fix bug and add XPU Expectations for qwen2 and jamba tests (#44733) by @kaixuanliu in [#44733]
  • [medasr] doc update (#44633) by @eustlb in [#44633]
  • Fix missing / incorrect config class in some model class definitions (#44715) by @ydshieh in [#44715]
  • Update Nvidia CI docker file to use torch 2.10 (#44712) by @ydshieh in [#44712]
  • [FA] Fix fa detection (#44703) by @vasqu in [#44703]
  • Fix set_encoder (#44698) by @hmellor in [#44698]
  • [docs] cb config (#44675) by @stevhliu in [#44675]
  • Fix more model tester missing parent issue (#44685) by @ydshieh in [#44685]
  • Add register method for ParallelInterface (#44640) by @michaelbenayoun in [#44640]
  • [CB] [Bug] Fix crashes when running without cuda (#44673) by @remi-or in [#44673]
  • Another (small) set of fixes required for tiny model creation (#44666) by @ydshieh in [#44666]
  • Fix CookieCutter (#44334) by @NielsRogge in [#44334]
  • pipelines do not have modelcard (#44621) by @KoichiYasuoka in [#44621]
  • [Chmv2] Fix conversion after capture refactor (#44665) by @vasqu in [#44665]
  • [CB] Add dedicated config (#44434) by @remi-or in [#44434]
  • fix(models): Forward timm model kwargs to timm.create_model for OmDet-Turbo (#44611) by @harshaljanjani in [#44611]
  • Ensure same dtype for subconfig when _from_config (#44629) by @zucchini-nlp in [#44629]
  • Remove cache_position in more models (2) (#44602) by @Cyrilvallez in [#44602]
  • fix: cast to proper dtype in EmbeddingParallel (#44612) by @michaelbenayoun in [#44612]
  • Remove many output_attentions and other traced outputs on 100+ models (#43590) by @molbap in [#43590]
  • fix: raise error if mm_token_type_ids not supplied (#44433) by @leopold-tzafon in [#44433]
  • Fix output capturing for Backbones (#44638) by @Cyrilvallez in [#44638]
  • Fix for VibeVoiceAcousticTokenizer (#44628) by @ydshieh in [#44628]
  • Fix off-by-one in decode_spans boundary check (#44584) by @mvanhorn in [#44584]
  • Fix more wrong HF hub checkpoint names (#44624) by @ydshieh in [#44624]
  • Update agentic contributions guidelines in AGENTS.md to force yielding. (#44411) by @burtenshaw in [#44411]
  • Expand model-structure lint rules with a fast AST-based, ruff-like framework (#44174) by @tarekziade in [#44174]
  • feat: add neuron in tensor parallelism initialization (#44498) by @michaelbenayoun in [#44498]
  • [WIP] FIX Make Mixtral LoRA loading work (#44478) by @BenjaminBossan in [#44478]
  • Fix Llava tests for torch too! (#44476) by @Rocketknight1 in [#44476]
  • Fix training ci and clean some tests (#44491) by @SunMarc in [#44491]
  • Remove useless identity assignment (#44600) by @Cyrilvallez in [#44600]
  • Add Yoni to run-slow workflow (#44598) by @vasqu in [#44598]
  • Add shared VLM tests (#42964) by @Rocketknight1 in [#42964]
  • Fix wrong (non-existing) checkpoints (#44549) by @ydshieh in [#44549]
  • Remove cache_position in more models (#44330) by @Cyrilvallez in [#44330]
  • Fix CircleCI summary report not showing due to missing dependency (#44597) by @ydshieh in [#44597]
  • Fix typos in add_new_model_like docstrings (#43544) by @Olexandr88 in [#43544]
  • Fix UnboundLocalError for tp_plan_alt when tp_plan is empty (#44540) by @YangKai0616 in [#44540]
  • FIX Multiple PEFT errors after v5 transition (#44592) by @BenjaminBossan in [#44592]
  • Fix missing BPE token conversion step in Chameleon (#44582) by @yonigozlan in [#44582]
  • Make paligemma embed tokens standard (#44432) by @zucchini-nlp in [#44432]
  • chore(typing): Add type checking to src/transformers/quantizers (#44412) by @tarekziade in [#44412]
  • Fix: AQLM quantizer to match updated replace_with_aqlm_linear signature (#44577) by @tarekziade in [#44577]
  • [device_map] Fix device_map computation by correctly adjusting memory available (#44565) by @Cyrilvallez in [#44565]
  • Fix error message label and docstring default in load_sharded_checkpoint (#44523) by @jnMetaCode in [#44523]
  • Correct Tapas initialization (#44575) by @Rocketknight1 in [#44575]
  • [fix] Prevent crash with Apertus without xielu installed (#44567) by @tomaarsen in [#44567]
  • Fix failing MusicgenStereo integration tests (#44527) by @Sai-Suraj-27 in [#44527]
  • Fix zamba2 rotary embedding call when use_mem_rope is False (#44551) by @echarlaix in [#44551]
  • [Bugfix] fix video inference of qwen3vl and qwen3.5 series (#44474) by @JJJYmmm in [#44474]
  • add XPU Expectations for higgs_audio_v2 tests (#44482) by @kaixuanliu in [#44482]
  • chameleon added to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS (#44475) by @itazap in [#44475]
  • Revert "test merge queue 1" (#44552) by @ydshieh in [#44552]
  • test merge queue 1 (#44529) by @ydshieh2 in [#44529]
  • fix(testing): Fix MoonshineEncoder UnboundLocalError and Florence2VisionBackbone dtype mismatch (#44503) by @harshaljanjani in [#44503]
  • Fix: Remove references to transformers run command (#44513) by @math-hiyoko in [#44513]
  • [LW-DETR] Fix training (#44441) by @NielsRogge in [#44441]
  • Make _prepare_input_fn and _prepare_output_fn instance methods (#44499) by @michaelbenayoun in [#44499]
  • Fix ShieldGemma2 non-reproducible outputs by adding _tied_weights_keys (#44358) by @hardikmeisheri in [#44358]
  • Tensor Parallelism and mps device (#44506) by @michaelbenayoun in [#44506]
  • Fix failing GPTNeoModelLanguageGenerationTest (#44515) by @Sai-Suraj-27 in [#44515]
  • Fix failing MarianIntegrationTests (#44519) by @Sai-Suraj-27 in [#44519]
  • fix pin_memory for contiguous batching (#44455) by @jiqing-feng in [#44455]
  • Fix continuous batching for multimodal models (#44436) by @jw9603 in [#44436]
  • Fix KeyError in _parse_type_hint when Union contains Any (#44525) by @jnMetaCode in [#44525]
  • Fix AssistantTracker.is_active() returning False after activation with empty lists (#44524) by @jnMetaCode in [#44524]
  • Fix and re-enable extra_state tests (#43510) by @pstjohn in [#43510]
  • Fix ansi codes in loading reports when not connected to terminal (#44544) by @Cyrilvallez in [#44544]
  • Follow-up typing checking fixes (#44500) by @tarekziade in [#44500]
  • Fix backend dependency (#44542) by @Cyrilvallez in [#44542]
  • Add a new job in build_pr_documentation.yml (will be the new required job) (#44538) by @ydshieh in [#44538]
  • Update build_pr_documentation workflow for merge_group event (#44532) by @ydshieh in [#44532]
  • Fixed typo in docs/source/en/kv_cache.md (#44501) by @frogNotToad in [#44501]
  • Docs: fix SigLIP2 usage examples (#43641) by @KOKOSde in [#43641]
  • Fix type checker (#44502) by @Cyrilvallez in [#44502]
  • Add MLU bf16 support to is_torch_bf16_gpu_available (#44381) by @carcel-yu in [#44381]
  • fix model parallelism bug for eurobert model (#44490) by @kaixuanliu in [#44490]
  • Update ty to 0.0.20 (#44494) by @tarekziade in [#44494]
  • Add auto-docstring on configs (#44296) by @zucchini-nlp in [#44296]
  • Fix failed unit tests for moonshine_streaming model (#43936) by @kaixuanliu in [#43936]
  • Update distributed tests (#44338) by @SunMarc in [#44338]
  • Add diffusers to CI docker file (#44480) by @ydshieh in [#44480]
  • Replace placeholder tokens as specified in added_tokens_decoder (#44468) by @itazap in [#44468]
  • [vLLM] Fix backward compatibility with hardcoded subprocessors classes in processors (#44447) by @yonigozlan in [#44447]
  • [remote code/vllm] Fix incorrect tied weights (#44469) by @Cyrilvallez in [#44469]
  • Integrate the Neuron device to TrainingArguments (#44302) by @michaelbenayoun in [#44302]
  • Fix failing DepthProModelIntegrationTest (#44456) by @Sai-Suraj-27 in [#44456]
  • [timesfm2_5] fix loss scaling (#44465) by @kashif in [#44465]
  • Fix failing ProphetNetModelIntegrationTest (#44439) by @Sai-Suraj-27 in [#44439]
  • [Trainer] fix SP loss (#44461) by @kashif in [#44461]
  • skip 1 invalid test case for higgs_audio_v2 (#44350) by @kaixuanliu in [#44350]
  • Fix position_ids typo in Qwen3_5TextModel forward pass (#44399) in [#44399]

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ydshieh
    • Don't run tests_hub if no tests found (#45014)
    • Fix failing job Update Transformers metadata after #43514 (#44941)
    • fix: improve processor loading performance by avoiding redundant tokenizer parsing (#44927)
    • fix processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894)
    • Fix core dumped when NemotronH is torch compiled (#44854)
    • Fix repo-check bot (#44812)
    • Fix config loading issues (type issues) (#44789)
    • Remove is_causal from EuroBertConfig (#44774)
    • Fix mlcd auto config/model/mapping issues (#44730)
    • Fix missing / incorrect config class in some model class definitions (#44715)
    • Update Nvidia CI docker file to use torch 2.10 (#44712)
    • Fix more model tester missing parent issue (#44685)
    • Another (small) set of fixes required for tiny model creation (#44666)
    • Fix for VibeVoiceAcousticTokenizer (#44628)
    • Fix more wrong HF hub checkpoint names (#44624)
    • Fix wrong (non-existing) checkpoints (#44549)
    • Fix CircleCI summary report not showing due to missing dependency (#44597)
    • Fix PR comment CI for quantization job (#44579)
    • Revert "test merge queue 1" (#44552)
    • Add a new job in build_pr_documentation.yml (will be the new required job) (#44538)
    • Update build_pr_documentation workflow for merge_group event (#44532)
    • Add diffusers to CI docker file (#44480)
  • @NielsRogge
    • Add VidEoMT (#44285)
    • Fix CookieCutter (#44334)
    • [LW-DETR] Fix training (#44441)
  • @tarekziade
    • refactor: unify QA calls (#44879)
    • refactor: mlinter as its own package (#44939)
    • chore(typing): added rule 11 (#44865)
    • feat: added cache to the model linter (#44790)
    • feat(ci): added a network debug report (#44636)
    • Centralize AI agent templates in .ai (#44489)
    • model-linter: Added rule 10 (#44761)
    • Fix: Eurobert model was missing @strict decorator and invalid test kwargs (#44767)
    • fix: sig lip import (#44764)
    • Expand model-structure lint rules with a fast AST-based, ruff-like framework (#44174)
    • chore(typing): Add type checking to src/transformers/quantizers (#44412)
    • Fix: AQLM quantizer to match updated replace_with_aqlm_linear signature (#44577)
    • Follow-up typing checking fixes (#44500)
    • Update ty to 0.0.20 (#44494)
  • @Sai-Suraj-27
    • Fix failing T5ModelIntegrationTest (#44934)
    • Add Jina-Embeddings-V3 Model (#44251)
    • Fix failing MusicgenStereo integration tests (#44527)
    • Fix failing GPTNeoModelLanguageGenerationTest (#44515)
    • Fix failing MarianIntegrationTests (#44519)
    • Fix failing DepthProModelIntegrationTest (#44456)
    • Fix failing ProphetNetModelIntegrationTest (#44439)
  • @remi-or
    • [CB] [Minor] Simplify test suite (#44858)
    • [CB] Add an option to return logprobs (#44835)
    • [CB] Better parametrization for compile (#44578)
    • [CB] [Bug] Fix crashes when running without cuda (#44673)
    • [CB] Add dedicated config (#44434)
    • [CB] Add paged_attention kernel (#44379)
  • @XingweiDeng
    • [Model] Add UVDoc Model Support (#43385)
    • [Model] Add PP-Chart2Table Model Support (#43767)
    • [Model] Add PP-OCRV5_mobile_det Model Support (#43247)
    • [Model] Add PP-OCRV5_server_det Model Support (#43274)
  • @vasqu
    • [FA4] Add kernels fallback (#44797)
    • [FA] Fix fa detection (#44703)
    • 🚨 [FA4] Initial support (#42435)
    • [Chmv2] Fix conversion after capture refactor (#44665)
    • Add Yoni to run-slow workflow (#44598)
  • @liu-jiaxuan
    • [Model] Add SLANeXt Model Support (#43707)
  • @zhang-prog
    • [Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808)
  • @balak4
    • Add GreedyLR adaptive learning rate scheduler (#44271)
  • @kaixuanliu
    • fix bug embedding_size mismatch with hidden_size in electra model test (#44657)
    • Fix bug and add XPU Expectations for qwen2 and jamba tests (#44733)
    • Add XPU Expectations for vibe voice acoustic tokenizer tests (#44428)
    • add XPU Expectations for higgs_audio_v2 tests (#44482)
    • fix model parallelism bug for eurobert model (#44490)
    • Fix failed unit tests for moonshine_streaming model (#43936)
    • skip 1 invalid test case for higgs_audio_v2 (#44350)
  • @juliendenize
    • Add Mistral 4 (#44760)
    • [MistralCommonBackend] Upgrade mistral-common to v1.10.0 (#44656)
  • @molbap
    • Add model lerobot PI0 to transformers (#44160)
    • Remove many output_attentions and other traced outputs on 100+ models (#43590)
  • @JJJYmmm
    • [Bugfix] fix video inference of qwen3vl and qwen3.5 series (#44474)
  • @math-hiyoko
    • Fix: Remove references to text2text-generation, summarization and translation pipeline tasks (#44510)
    • Fix: Remove references to transformers run command (#44513)
Mar 4, 2026
v5.3.0: EuroBERT, VibeVoice ASR, TimesFM2.5, PP-DocLayoutV2, OlmoHybrid, ModernVBert, Higgs Audio V2

New Model additions

EuroBERT

<img width="1080" height="1080" alt="image" src="https://github.com/user-attachments/assets/33603f42-5435-421a-9641-baf72faacb22" />

EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It supports a mixture of European and widely spoken languages, with sequences of up to 8192 tokens.

Links: Documentation | Paper | Blog Post

  • Add eurobert (#39455) by @ArthurZucker in #39455

VibeVoice ASR

<img width="673" height="464" alt="image" src="https://github.com/user-attachments/assets/e4093a6b-fc6e-4136-a15d-2fcd7b27a69e" />

VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.

Links: Documentation | Paper

  • Add VibeVoice ASR (#43625) by @ebezzam in #43625

TimesFM2.5

<img width="799" height="497" alt="image" src="https://github.com/user-attachments/assets/1e486803-1b68-496b-aa67-4c3f2055fbeb" />

TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.

Links: Documentation | Paper

  • Timesfm 2.5 (#41763) by @kashif in #41763
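
Among the enhancements listed above is rotary attention. As a refresher, rotary position embeddings (RoPE) encode position by rotating consecutive pairs of query/key dimensions by a position-dependent angle. The sketch below is a generic plain-Python illustration of that idea, not TimesFM's implementation; the function name and toy vectors are invented for the example.

```python
# Generic RoPE sketch: rotate (even, odd) pairs of a vector by an angle
# that grows with position, so dot products between rotated queries and
# keys depend only on their relative offset.
import math

def rope_rotate(vec, position, base=10000.0):
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)  # lower frequency for later pairs
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * cos_t - y * sin_t, x * sin_t + y * cos_t])
    return out
```

Because each pair is only rotated, the vector's norm is preserved and position 0 leaves it unchanged, which is easy to check by hand.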

PP-DocLayoutV2

<img width="1440" height="436" alt="image" src="https://github.com/user-attachments/assets/31d6609b-ef42-4f15-8c34-eeb2c0d679a9" />

PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.

Links: Documentation

  • [Model] Add PP-DocLayoutV2 Model Support (#43018) by @zhang-prog in #43018

OlmoHybrid

OLMo Hybrid is a hybrid architecture model from Ai2 that combines standard transformer attention layers with Gated DeltaNet linear attention layers. This hybrid approach aims to improve efficiency while maintaining model quality by interleaving full attention layers with linear attention layers. The model uses a custom cache system that handles both the KV cache for attention layers and the recurrent state for linear attention layers.

Links: Documentation

  • Add OLMo Hybrid model (#43358) by @yanhong-lbh in #43358
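
The hybrid cache described above can be pictured with a small sketch: attention layers append to a KV store that grows with sequence length, while linear-attention layers keep a fixed-size recurrent state that is overwritten each step. This is an illustrative toy, with invented class and attribute names, not the actual OLMo Hybrid cache implementation.

```python
# Toy hybrid cache: per-layer storage that is either a growing KV list
# (attention layers) or a single recurrent state slot (linear layers).
class HybridCache:
    def __init__(self, layer_types):
        # layer_types: e.g. ["attention", "linear", "attention"]
        self.layer_types = layer_types
        self.kv = {i: [] for i, t in enumerate(layer_types) if t == "attention"}
        self.state = {i: None for i, t in enumerate(layer_types) if t == "linear"}

    def update(self, layer_idx, value):
        if self.layer_types[layer_idx] == "attention":
            self.kv[layer_idx].append(value)   # memory grows with the sequence
            return self.kv[layer_idx]
        self.state[layer_idx] = value          # constant memory per layer
        return self.state[layer_idx]
```

The efficiency gain comes from the linear layers: their state stays constant-size no matter how long the generated sequence gets.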

ModernVBert

<img width="332" height="343" alt="image" src="https://github.com/user-attachments/assets/23e1e140-9ad2-4144-b5d6-8b8c1e3414c9" />

ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.

Links: Documentation | Paper

  • Add ModernVBERT models (#42504) by @paultltc in #42504

ColModernVBert

ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.

Links: Documentation | Paper

  • Add ModernVBERT models (#42504) by @paultltc in #42504

Higgs Audio V2

<img width="3065" height="1464" alt="image" src="https://github.com/user-attachments/assets/94ad4db1-3c10-43d1-b1f7-2ce01329c8a4" />

Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.

Links: Documentation

  • Add Higgs Audio V2 Model (#40294) by @szhengac in #40294

Higgs Audio V2 Tokenizer

The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.

Links: Documentation

  • Add Higgs Audio V2 Model (#40294) by @szhengac in #40294
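
The frame-rate numbers above translate directly into token budgets. A quick back-of-the-envelope check (the 50 fps baseline used for comparison is an assumed typical value, not a figure from these notes):

```python
# At a 24 kHz sample rate and 25 audio frames (tokens) per second,
# each token covers a fixed window of raw samples.
sample_rate = 24_000          # Hz, the tokenizer's unified training rate
frame_rate = 25               # frames per second

samples_per_frame = sample_rate // frame_rate
print(samples_per_frame)      # 960 samples, i.e. 40 ms of audio per frame

# Halving an assumed 50 fps baseline means half as many tokens per clip:
baseline_tokens = 50 * 10     # 10 s of audio at 50 fps
higgs_tokens = frame_rate * 10
print(baseline_tokens, higgs_tokens)  # 500 250
```

Fewer tokens per second of audio is what makes downstream audio language models cheaper to train and run on this tokenizer's output.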

Breaking changes

Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.

  • 🚨 fix + tests dense & MoE TP all reduce (decoder only) (#43722) by @3outeille
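
The all-reduce in question is the standard tensor-parallel pattern: split the weight along the reduction dimension, let each rank compute a partial matmul, then sum the partials. The plain-Python sketch below illustrates only that pattern (toy names, no real distributed runtime, and it assumes the inner dimension divides evenly across ranks).

```python
# Toy tensor-parallel forward: shard x and w on the inner (k) dimension,
# compute per-rank partial products, then "all-reduce" by summing.
def matmul_row(x, w):
    # x: vector of length k; w: k x n matrix (list of rows) -> length-n vector
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]

def tp_forward(x, w, num_ranks):
    shard = len(x) // num_ranks            # assumes k % num_ranks == 0
    partials = []
    for r in range(num_ranks):             # each rank sees only its slice
        xs = x[r * shard:(r + 1) * shard]
        ws = w[r * shard:(r + 1) * shard]
        partials.append(matmul_row(xs, ws))
    # all-reduce: elementwise sum of the partial outputs
    return [sum(p[j] for p in partials) for j in range(len(partials[0]))]
```

Summing the partials recovers exactly the unsharded matmul, which is the invariant the fix above restores for dense and MoE decoder layers.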

The Ernie4.5 VL MoE model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.

  • 🚨 [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299) by @vasqu

Several pipeline tasks have been removed or updated in the V5 cleanup (including question-answering, visual-question-answering, and image-to-image), requiring users to migrate to the replacement pipelines or updated task names.

  • 🚨 More V5 pipeline cleanup (#43325) by @Rocketknight1

3D position IDs for vision-language models have been unified under a common interface (sourced from qwen2-vl), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.

  • 🚨 Unify 3D position ids (#43972) by @zucchini-nlp
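
The 3D scheme assigns each token a (temporal, height, width) triple: vision tokens walk the patch grid, while text tokens carry one scalar position repeated on all three axes. The sketch below is an invented toy construction for a single still image followed by text, not the processors' actual code.

```python
# Toy 3D position ids for one image of h x w patches followed by text tokens.
def build_3d_position_ids(h, w, num_text_tokens):
    t_ids, h_ids, w_ids = [], [], []
    # Vision tokens: temporal axis fixed at 0 for a still image,
    # spatial axes walk the patch grid row by row.
    for row in range(h):
        for col in range(w):
            t_ids.append(0); h_ids.append(row); w_ids.append(col)
    # Text tokens: all three axes share one scalar position that
    # continues after the largest vision position.
    start = max(h, w)
    for i in range(num_text_tokens):
        p = start + i
        t_ids.append(p); h_ids.append(p); w_ids.append(p)
    return t_ids, h_ids, w_ids
```

Unifying this construction is what lets models like Ernie and GLM4V share one interface instead of each processor hand-rolling its own position ids.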

🚨 Tokenizer x vLLM fixes 🚨:

Unigram tokenizers were missing support for the SentencePiece precompiled charsmap. We ran an overall v4 vs v5 regression test and fixed what we had missed.

This was done in:

  • [vllm + v5 fix] handle TokenizersBackend fallback properly for v5 (#44255) by @itazap

Generation

Generation input preparation was significantly refactored to stop relying on cache_position and instead pass pre-sliced input_ids/inputs_embeds directly to prepare_inputs_for_generation, simplifying the generation loop and laying groundwork for broader cache_position removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.

  • [higgs-audio-v2] fix sampling (#44386) by @eustlb in [#44386]
  • fix(flaky): idefics generate cache flake (#44180) by @tarekziade in [#44180]
  • Fix generation integration tests (#44225) by @zucchini-nlp in [#44225]
  • [generate] Always pass full input_ids in prepare_inputs_for_generation (#44226) by @Cyrilvallez in [#44226]
  • fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201) by @tarekziade in [#44201]
  • [generate] Completely stop relying on cache_position to prepare inputs (#44130) by @Cyrilvallez in [#44130]
  • Simplify input preparation in generate (#44126) by @Cyrilvallez in [#44126]
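
The refactor can be pictured with a toy decoding loop: instead of handing the model the full sequence plus a cache_position to slice with, the loop slices input_ids itself and passes only the not-yet-cached suffix. This is an illustrative sketch with invented names (prepare_inputs, decode), not the transformers implementation.

```python
# Toy decoding loop: the caller pre-slices input_ids so the model only
# ever receives tokens it has not cached yet.
def prepare_inputs(input_ids, cache_len):
    """Return only the suffix of tokens the model has not cached."""
    return input_ids[cache_len:]

def decode(prompt_ids, steps, next_token):
    ids, cache_len = list(prompt_ids), 0
    for _ in range(steps):
        model_inputs = prepare_inputs(ids, cache_len)
        cache_len += len(model_inputs)   # the model "caches" what it just saw
        ids.append(next_token(ids))
    return ids
```

The first step sees the whole prompt; every later step sees exactly one new token, with no cache_position bookkeeping needed inside the model.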

Tokenization

Several tokenization bugs were fixed in this release, including resolving an AttributeError in MLukeTokenizer caused by the v5 rename of additional_special_tokens, correcting the Fuyu tokenizer class mapping, fixing LayoutXLM tokenization test failures from the slow tokenizer removal refactor, and adding olmo_hybrid to the auto-tokenizer mapping. The tokenizer documentation was also updated to reflect the new unified v5 backend architecture and reorganized for clarity.

  • [tiny] Add olmo_hybrid to tokenizer auto-mapping (#44416) by @tyler-romero in [#44416]
  • fix(tokenizer): Fix MLukeTokenizer AttributeError post-v5 refactor (#44362) by @harshaljanjani in [#44362]
  • update fuyu tokenizer class (#44235) by @itazap in [#44235]
  • fix(testing): Fix LayoutXLM tokenization test and LightOnOCR SDPA flash test failures on main CI (#43988) by @harshaljanjani in [#43988]
  • [docs] tokenizer summary (#43965) by @stevhliu in [#43965]
  • [docs] refactor tokenizer docs (#43900) by @stevhliu in [#43900]

Kernels

Fixed several kernel-related issues including a security vulnerability, corrected Mamba kernel loading to handle incompatible import structures, ensured Liger Kernel is properly enabled during hyperparameter search, and expanded Flash Attention to support multiple compatible implementations.

  • Fix kernels security issue (#44395) by @Cyrilvallez in [#44395]
  • Enable Liger Kernel when doing hyperparameter search. (#44329) by @linfeng-du in [#44329]
  • [Mamba] Fix kernel loading (#44176) by @vasqu in [#44176]
  • [Flash Attn] Enable compatible implementations (#44177) by @vasqu in [#44177]
  • Fix percentage formatting in help messages for gradient checkpointing, Liger Kernel, and empty cache steps (#44100) by @qgallouedec in [#44100]

Quantization

This release adds several new quantization backends and fixes, including Metal quantization support for MPS devices, Four Over Six (4/6) NVFP4 quantization integration for NVIDIA Blackwell GPUs, and CPU support for MXFP4 models, alongside a bug fix for MXFP4 model saving using reverse_op.

  • [Quantization] Fixing mxfp4 saving using reverse_op (#43148) by @MekkCyber in [#43148]
  • [Quantization] Add metal quantization for MPS devices! (#43934) by @MekkCyber in [#43934]
  • Enable mxfp4 model on CPU (#43512) by @jiqing-feng in [#43512]
  • Add Four Over Six quantization integration (#43970) by @jackcook in [#43970]
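
MXFP4 and NVFP4 are block-scaled 4-bit floating-point micro-formats; the toy below only shows the generic scale-quantize-dequantize round-trip common to low-bit schemes, using a symmetric integer grid for simplicity. It is a sketch, not any of the backends added above.

```python
# Toy symmetric 4-bit quantization round-trip.
def quantize_4bit(values):
    amax = max(abs(v) for v in values) or 1.0
    scale = amax / 7.0                       # symmetric int4 range: [-7, 7]
    q = [max(-7, min(7, round(v / scale))) for v in values]
    return q, scale

def dequantize_4bit(q, scale):
    return [v * scale for v in q]
```

The reconstruction error per value is bounded by half a quantization step (scale / 2), which is the trade-off all of these 4-bit formats tune with their block-wise scales.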

Vision

Fixed backward compatibility for image processors loaded from older remote code that lack valid_kwargs definitions, and resolved test failures in AMD ROCm CI by adding the missing timm dependency to the Docker image.

  • [AMD CI] Add missing timm dependency to ROCm Docker image (#44389) by @Abdennacer-Badaoui in [#44389]
  • update glm image model expected out for tests (#43907) by @kaixuanliu in [#43907]
  • Fix image processors from_dict backward compatibility with old remote code (#44245) by @yonigozlan in [#44245]

Bugfixes and improvements

  • Update PR template (#44415) by @SunMarc in [#44415]
  • Add Qwen3.5 support for sequence classification (#44406) by @medhakimbedhief in [#44406]
  • update the expected output for qwen2_5_vl w/ pytorch 2.10 XPU (#44426) by @kaixuanliu in [#44426]
  • add support for nemotron_3 (#44390) by @liding-nv in [#44390]
  • [ Dynamic weight loader] fix remote code when format matches (#44396) by @ArthurZucker in [#44396]
  • [timesfm2_5] fix timesfm2.5 loss (#44331) by @kashif in [#44331]
  • Fix peft conversion mappings (#44413) by @Cyrilvallez in [#44413]
  • Reduce tqdm verbosity during model loading (#44414) by @Cyrilvallez in [#44414]
  • docs: Add NeMo Automodel community integration docs (#44304) by @adil-a in [#44304]
  • [CB] Small fixes (#44227) by @remi-or in [#44227]
  • Support non-gated experts (#44319) by @IlyasMoutawwakil in [#44319]
  • [Bugfix] fix qwen3.5 no split module (#44382) by @JJJYmmm in [#44382]
  • Fix mutable default arguments and resource leaks (#44287) by @jashshah999 in [#44287]
  • skip 2 invalid test cases for voxtral_realtime model (#44321) by @kaixuanliu in [#44321]
  • Mamba-1/-2 init weights in mixer class (#43778) by @kevinli573 in [#43778]
  • add expectations for xpu for olmo_hybrid model (#44353) by @kaixuanliu in [#44353]
  • [VITS] Add speaking_rate as an optional forward argument (#43283) by @gau-nernst in [#43283]
  • Strict export cleanup (#44293) by @IlyasMoutawwakil in [#44293]
  • [docs] kernelconfig fix (#44337) by @stevhliu in [#44337]
  • Add ProcessingKwargs ImagesKwargs etc. to docs (#44269) by @yonigozlan in [#44269]
  • Fix typos in comments and docstrings (#44332) by @tysoncung in [#44332]
  • Add testing guide for agents for trainer tests (#44328) by @SunMarc in [#44328]
  • Update common tests Trainer (#44260) by @SunMarc in [#44260]
  • [timesfm2_5] fix timesfm mlp bias (#44325) by @kashif
  • fix zero3 init config (#44236) by @SunMarc
  • Update expected output for Jais2 model tests (#43910) by @kaixuanliu
  • Improve has_similar_generate_outputs assertions (#44166) by @tarekziade
  • Fix failed test case for exaone_moe model (#43938) by @kaixuanliu
  • fix(modeling_attn_mask_utils): remove FutureWarning from logger.warning_once() (#44307) by @imstevenpmwork
  • Remove remaining vestiges of the TranslationPipeline (#43869) by @Rocketknight1
  • XPU now supports backward for the FA2 fixed path (#43905) by @YangKai0616
  • Fix: use TokenizersBackend for Olmo3 to preserve custom pre_tokenizer (#44294) by @mario-sanz
  • Fix special token maps BC (#44281) by @ArthurZucker
  • [Modular] Fix file type regression (#44283) by @vasqu
  • [auto_docstring] Improve typing parsing and add tests (#43748) by @yonigozlan
  • Restore response_schema saving-loading (#44282) by @Rocketknight1
  • Use associative scan HOP mamba recurrentgemma (#43737) by @riccardofelluga
  • chore: fixes in Trainer class docs (compute_loss & hyperparameter_search) (#44268) by @ethanknights
  • fix(trainer): pass optim_args to SGD, Adagrad, and RMSprop optimizers (#44203) by @nightcityblade
  • fix(utils): Make torch_compilable_check compatible with torch.export strict mode (#44266) by @harshaljanjani
  • Fix TypeError in convert_rope_params_to_dict when ignore_keys is a list (#44272) by @hangjun-ezra
  • [docs] callbacks and collators (#44239) by @stevhliu
  • [docs] trainer part 1 (#44185) by @stevhliu
  • Remove refs to grouped_entities (#44182) by @Rocketknight1
  • [mimi] nit (#44237) by @eustlb
  • Fix local dataset loading priority in run_image_classification_no_tra… (#44199) by @gowthamr-tech
  • chore: added CLAUDE.md alias (#44232) by @tarekziade
  • fix: add missing return type annotations to type-checking utilities in generic.py (#44241) by @yushiran
  • Fix return value - fixes #44238 (#44240) by @tarekziade
  • fix regression report_to "all" (#44250) by @SunMarc
  • [fix] Set input_modalities on various architectures that aren't just text (#44078) by @tomaarsen
  • Add processing tests for phi4 multimodal (#44234) by @yonigozlan
  • fix: VersionComparison.from_string return type mismatch (#43709) by @tarekziade
  • refactor _inner_training_loop to smaller methods (#44041) by @winglian
  • [docs] fix broken chat_templating links in tasks docs (#44115) by @Deep-unlearning
  • Add missing backtick in AnyToAnyPipeline.__call__ docstring (#44229) by @alvarobartt
  • Docs(it): fix typo in sentencepiece install command (#44218) by @matisgagneux21
  • Docs(it): fix typo in docstring wording (#44219) by @matisgagneux21
  • fix bug with position_ids on qwen3-vl models, such that position_ids include text position (#44158) by @leopold-tzafon
  • Update 404ing BillSum dataset URL on Summarization Task guide (#44212) by @alexandercarruthers
  • fix(models): Fix LayoutLMv2 NER crash and broken batched truncation/padding (#44187) by @harshaljanjani
  • [CB] [Major] Asynchronous batching (#43960) by @remi-or
  • Fix LASR feature extractor regression from invalid center argument (#44207) by @ainergiz
  • Models with incorrect tokenizer_class in tokenization_config.json tha… (#44179) by @itazap
  • chore(typing): initial ty integration (#44167) by @tarekziade
  • fix(flaky): test_generate_with_and_without_position_ids in GLM ORC (#44173) by @tarekziade
  • [docs] Add Chinese translations for common NLP task tutorials (#44144) by @TinderZ
  • [Mimi] Calibrate to ensure encoder streaming performs correctly (#43971) by @caffeinism
  • ESM2 attention_mask and token_dropout fix (#44163) by @lhallee
  • bring back our demons: clean_up_tokenization_spaces (#44035) by @ArthurZucker
  • Fix Seq2SeqTrainingArguments documentation (#35258) by @qgallouedec
  • AutoGrad support for grouped_mm fallback (#44152) by @IlyasMoutawwakil
  • Patch __setitem__ on ModelOutput even if the parameter was previously None (#44080) by @tomaarsen
  • [simple] Fix up __repr__ whitespace/brackets (#44048) by @tomaarsen
  • [chore] Fix incorrect forward type hint for Gemma3n (#44051) by @tomaarsen
  • Raise informative error when loading video processors (#44125) by @zucchini-nlp
  • fix(flaky): Different approach to make sure loss exists (#43804) by @tarekziade
  • [voxtral] fix voxtral proc (#44132) by @eustlb
  • [docs] Fix typos in GenerationConfig docstring (#44143) by @nightcityblade
  • Fix gemma3n get_audio_features (#44040) by @zucchini-nlp
  • Fix UMT5EncoderModel embedding weights not being tied after loading (#43880) by @jiqing-feng
  • fix(testing): Update stale device override test in GraniteSpeech (#44113) by @harshaljanjani
  • [Misc][vlms] Use text_config when initializing the fine-grained FP8Expert (#44032) by @JJJYmmm
  • docs: fix typo 'AuoQuant' → 'AutoQuant' and clarify FINEGRAINED_FP8 library column (#44131) by @cluster2600
  • Update post proc (#44090) by @itazap
  • Fix: flaky Kosmos2ModelTest test (#44061) by @tarekziade
  • AutoTokenizer ignores config when model_type is None (#44127) by @itazap
  • Migrate GPT2 to standardized output capture decorators (#43983) by @Aki-07
  • grouped_mm fallback (#44043) by @IlyasMoutawwakil
  • Bump dev version (#44099) by @qgallouedec
  • Fix loading logic issue (#44095) by @Cyrilvallez
  • [docs] customizing tokenizers (#43929) by @stevhliu
  • Merge test_keep_in_fp32_modules and test_keep_in_fp32_modules_strict (#44097) by @Rocketknight1
  • [voxtral-realtime] update runner expected values (#44096) by @eustlb
  • Use torch.isfinite (#44069) by @cyyever
  • add default flash impl (#44081) by @ArthurZucker
  • Remove unused dependencies (#43904) by @cyyever
  • Fix patchtsmixer call to post_init (#44082) by @Cyrilvallez
  • Fix false positive right-padding warning for decoder-only models in pipeline (#44021)

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ArthurZucker
    • Add eurobert (#39455)
    • [ Dynamic weight loader] fix remote code when format matches (#44396)
    • Fix special token maps BC (#44281)
    • bring back our demons: clean_up_tokenization_spaces (#44035)
    • add default flash impl (#44081)
  • @liding-nv
    • add support for nemotron_3 (#44390)
  • @kashif
    • [timesfm2_5] fix timesfm2.5 loss (#44331)
    • [timesfm2_5] fix timesfm mlp bias (#44325)
    • Timesfm 2.5 (#41763)
  • @remi-or
    • [CB] Small fixes (#44227)
    • [CB] [Major] Asynchronous batching (#43960)
  • @ebezzam
    • [VibeVoice ASR] Use updated padding cache for ASR model. (#44392)
    • Add VibeVoice ASR (#43625)
  • @MekkCyber
    • [Quantization] Fixing mxfp4 saving using reverse_op (#43148)
    • [Quantization] Add metal quantization for MPS devices! (#43934)
  • @tarekziade
    • perf: Optimize SynthID logits processor batch index construction (#44172)
    • Improve has_similar_generate_outputs assertions (#44166)
    • fix(flaky): idefics generate cache flake (#44180)
    • chore: added CLAUDE.md alias (#44232)
    • Fix return value - fixes #44238 (#44240)
    • fix: VersionComparison.from_string return type mismatch (#43709)
    • fix: HiggsAudioV2 cached decode inputs in compiled generation (#44201)
    • chore(typing): initial ty integration (#44167)
    • fix(flaky): test_generate_with_and_without_position_ids in GLM ORC (#44173)
    • fix(flaky): Different approach to make sure loss exists (#43804)
    • Fix: flaky Kosmos2ModelTest test (#44061)
  • @zhang-prog
    • [Model] Add PP-DocLayoutV2 Model Support (#43018)
  • @yanhong-lbh
    • Add OLMo Hybrid model (#43358)
  • @vasqu
    • :rotating_light: [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299)
    • [Modular] Fix file type regression (#44283)
    • [Mamba] Fix kernel loading (#44176)
    • [Flash Attn] Enable compatible implementations (#44177)
  • @jackcook
    • Add Four Over Six quantization integration (#43970)
  • @winglian
    • refactor _inner_training_loop to smaller methods (#44041)
  • @paultltc
    • Add ModernVBERT models (#42504)
  • @TinderZ
    • [docs] Add Chinese translations for common NLP task tutorials (#44144)
  • @szhengac
    • Add Higgs Audio V2 Model (#40294)
Feb 16, 2026
v5.2.0: GLM-5, Qwen3.5, Voxtral Realtime, VibeVoice Acoustic Tokenizer

New Model additions

VoxtralRealtime

<img width="1920" height="1080" alt="image" src="https://github.com/user-attachments/assets/80e37670-6d70-402b-8c8e-ccfb8c32df2d" />

VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model, which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription by processing audio in chunks as they arrive.

The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.

  • Add Voxtral Realtime (#43769) by @eustlb
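The causal-convolution-plus-padding-cache trick described above can be illustrated with a small sketch (plain NumPy, not the actual VoxtralRealtime code): caching the last `kernel_size - 1` input samples of each chunk means no future samples are ever needed, and chunked inference matches a full-sequence pass exactly.

```python
import numpy as np

def causal_conv_chunk(chunk, weights, cache):
    """Causal 1-D convolution over one streaming chunk.

    `cache` holds the last (kernel_size - 1) input samples from the
    previous chunk, so chunked results are identical to running the
    convolution over the full sequence at once.
    """
    k = len(weights)
    padded = np.concatenate([cache, chunk])
    out = np.array([padded[i : i + k] @ weights for i in range(len(chunk))])
    return out, padded[-(k - 1):]  # new cache: trailing k-1 samples

rng = np.random.default_rng(0)
weights = rng.normal(size=4)       # kernel_size = 4
audio = rng.normal(size=32)        # stand-in waveform

# Full-sequence pass (cache starts as silence).
full, _ = causal_conv_chunk(audio, weights, np.zeros(3))

# Streaming pass: 4 chunks of 8 samples, carrying the cache along.
cache, outputs = np.zeros(3), []
for chunk in audio.reshape(4, 8):
    out, cache = causal_conv_chunk(chunk, weights, cache)
    outputs.append(out)

assert np.allclose(np.concatenate(outputs), full)
```

The same equivalence is what lets a streaming encoder emit partial transcriptions without recomputing past audio.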

GLM-5 - GlmMoeDsa

<img width="947" height="638" alt="image" src="https://github.com/user-attachments/assets/4c4fff37-7f40-4e86-b4a0-db718f45c93b" />

The Z.ai team launches GLM-5, and introduces it as follows:

GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.

Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.

  • Add GlmMoeDsa (#43858) by @Cyrilvallez
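The core idea behind DeepSeek Sparse Attention is that each query attends to only a small top-k subset of past keys chosen by a lightweight scorer, cutting compute for long contexts. The toy sketch below (plain NumPy, illustrative only, not the GlmMoeDsa implementation) shows that mechanism: score keys, keep the k best causal positions per query, and run softmax attention over just those.

```python
import numpy as np

def topk_sparse_attention(q, k, v, top_k):
    """Each query attends only to its top_k highest-scoring causal keys."""
    scores = q @ k.T                      # (T, T) raw attention logits
    T = scores.shape[0]
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    out = np.zeros_like(v)
    for t in range(T):
        n = min(top_k, t + 1)             # cannot select future positions
        idx = np.argsort(scores[t])[-n:]  # indices of the n best keys
        w = np.exp(scores[t, idx] - scores[t, idx].max())
        out[t] = (w / w.sum()) @ v[idx]   # attend over the sparse set only
    return out

rng = np.random.default_rng(0)
T, d = 16, 8
q, k, v = (rng.normal(size=(T, d)) for _ in range(3))
out = topk_sparse_attention(q, k, v, top_k=4)
```

With `top_k` fixed, per-query attention cost stops growing with context length, which is where the deployment-cost savings come from.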

Qwen3.5, Qwen3.5 Moe

<img width="1920" height="1080" alt="image" src="https://github.com/user-attachments/assets/b56dcaca-80e7-4b22-80a5-2f767bb65095" />

The Qwen team launches Qwen3.5, and introduces it as follows:

We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.

  • Adding Support for Qwen3.5 (#43830) by @bozheng-hit
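The "397B total / 17B active" figure comes from sparse mixture-of-experts routing: a router sends each token to only a few experts, so most expert weights are skipped on every forward pass. A minimal routing sketch (illustrative only, not Qwen3.5's actual router):

```python
import numpy as np

def route_tokens(hidden, router_w, top_k):
    """Softmax router: each token activates only its top_k experts."""
    logits = hidden @ router_w                     # (tokens, n_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]  # chosen expert ids
    gates = np.take_along_axis(logits, top, axis=-1)
    gates = np.exp(gates - gates.max(axis=-1, keepdims=True))
    gates /= gates.sum(axis=-1, keepdims=True)     # renormalized weights
    return top, gates

rng = np.random.default_rng(0)
n_experts, top_k = 64, 4
hidden = rng.normal(size=(10, 32))                 # 10 tokens, dim 32
router_w = rng.normal(size=(32, n_experts))
experts, gates = route_tokens(hidden, router_w, top_k)

# Only top_k / n_experts of the expert weights run per token.
active_fraction = top_k / n_experts
```

The expert outputs would then be combined per token with the `gates` weights; only the selected experts' parameters are touched, which is how total and active parameter counts diverge.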

VibeVoice Acoustic Tokenizer

<img width="821" height="349" alt="image" src="https://github.com/user-attachments/assets/b1433597-b43b-4d2d-a2c7-216d7792b8c9" />

VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It's designed to capture the authentic conversational "vibe" and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.

One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.

  • Add VibeVoice Acoustic Tokenizer (#43400) by @ebezzam

Breaking changes

  • :rotating_light: [Attn] New attn mask interface everywhere (#42848)
  • :rotating_light: Modify ModernBERT's default attention implementation to stop using FA (#43764)

:rotating_light: This one is quite breaking for very old models: :rotating_light: :rotating_light:

Bugfixes and improvements

  • [docs] deploying (#43241) by @stevhliu
  • [Trainer] Move NEFTune impl to standalone functions (#43714) by @SunMarc
  • Fix convert_rope_params_to_dict so it uses rope_theta from the config (#43766) by @hmellor
  • Bump dev version (#43777) by @qgallouedec
  • Improved AGENTS.md (#43763) by @tarekziade
  • Fix-release-ubild (#43773) by @ArthurZucker
  • unpin torch for CircleCI (#43790) by @ydshieh
  • [Modular Dependencies] Fixup qwen rms norms (#43772) by @vasqu
  • fix(testing): Fix BLOOM tokenizer, CLAP audio features, and CLVP text tester usage in tests (#43798) by @harshaljanjani
  • Remove unconditional train_batch_size assignment (#43770) by @lordaarush
  • [Repo Consistency] Fix rms norm (#43803) by @vasqu
  • fix: Prevent AutoTokenizer type mismatch from directory name substrin… (#43791) by @tarekziade
  • Refactor trainer data_collator and callbacks tests (#43776) by @SunMarc
  • [core] Faster and thread-safe check_model_inputs implementation (#43765) by @Cyrilvallez
  • [Trainer] use deepspeed SP process group when Accelerate doesn’t build a mesh (#43799) by @kashif
  • fix(flaky): enforce manual seed to reduce flakiness (#43794) by @tarekziade
  • Add TRL CI bot workflow to trigger tests on PR comments (#43809) by @qgallouedec
  • Fix DeepSpeed model preparation logic in Trainer class (#43780) by @qgallouedec
  • [docs] reveal more in toctree (#43808) by @stevhliu
  • Fix markdown documentation (#43076) by @cyyever
  • Fix slack-report workflow file (#43851) by @ydshieh
  • add do_sample=False to qwen2_5_vl model tests to stablize the output (#43728) by @kaixuanliu
  • Fix incorrect timestamp calculation in Qwen3VL Processor (#43659) by @jonathan-fulton
  • Remove GPU tracking from TrackioCallback and remove env var support (#43371) by @qgallouedec
  • Add id and resume support to SwanLab integration (#43719) by @i-pj
  • fix gptoss crash in tp (#43853) by @sywangyi
  • Delete batch_split from EncoderDecoderCache (#43814) by @cyyever
  • delete unnecessary code to make moe compatible to full graph compile (#43855) by @kaixuanliu
  • Update ModelType for Unigram tokenizer (#43860) by @pavel-esir
  • [docs] Remove pipeline() examples from summarization/translation tasks (#43831) by @Mr-Neutr0n
  • Fix video interpolation in pe_audio_video (#43811) by @Rocketknight1
  • Look for the pad_token_id in the right place for Llama4 (#43539) by @Rocketknight1
  • Fix cardinality error for DETR models without explicit background class (#43513) by @heathdutton
  • docs: Add Switch Transformers docstring notes and update spectrogram comment (#43336) by @harshaljanjani
  • [xLSTM] Fix bugs preventing small model training (#43209) by @Anri-Lombard
  • docs: correct typo 'neccessary' to 'necessary' (#43868) by @thecaptain789
  • Improve PR comment CI feedback (#43852) by @ydshieh
  • Fix init weights in remote code (#43768) by @zucchini-nlp
  • Fix GlmMoeDsaConfig default mlp_layer_types in modular conversion (#43876) by @OiPunk
  • [MistralCommonBackend] fix loading proc (#43887) by @eustlb
  • [Jamba] Fallback to slow path and warn instead of error out (#43889) by @vasqu
  • Fix SwanLab callback to forward resume init args (#43848) by @OiPunk
  • Fix old tech stack in doc (#43879) by @cyyever
  • Update TrainingArguments (#43806) by @SunMarc
  • Remove unnecessary code or checks for PT 2.4+ (#43787) by @cyyever
  • Make it possible to evaluate when using sequence parallel in HF Trainer (#43517) by @jp1924
  • [Trainer] Move optimizer cls init to trainer_optimizer.py (#43738) by @SunMarc
  • fix the error of tests/quantization/fbgemm_fp8/test_fbgemm_fp8.py::Fb… (#43547) by @sywangyi
  • fix fbgemm fp8 multi-device load failure. (#43581) by @sywangyi
  • Refactor trainer init (#43807) by @SunMarc
  • [fix] Use last_hidden_state key from get_image_features for llama4 (#43882) by @tomaarsen
  • [Docs] Add docs for GLM-OCR and fix EomT-DINOv3 (#43710) by @NielsRogge
  • Update hub metadata (#43892) by @zucchini-nlp
  • [fix] DAC model: Apply STE in Dac.from_latents to match the forward pass (#43820) by @harshaljanjani
  • Separate check_model_inputs into capture_outputs and merge_with_config_defaults + ensure correctness (#43862) by @Cyrilvallez
  • Remove mask slicing in all eager attentions (#42186) by @Cyrilvallez
  • Fix expected DAC outputs due to (old) change in CI settings. (#43896) by @ebezzam
  • Minor changes trainer (#43744) by @SunMarc
  • adding BC for custom toks accessing slow tok attrs deprecated in v5 (#43898) by @itazap
  • Fix typo in quantization_operations in PEFT integrations (#43821) by @redpanda1995
  • Update KERNELS_MIN_VERSION to 0.10.2 to be the same as setup.py (#43753) by @cyyever
  • Decorate cache updates with no_grad, just in case (#43897) by @Rocketknight1
  • revert place_model_on_device to property (#43895) by @SunMarc
  • Train sampler unification (#43138) by @jiosephlee
  • fix(moe): Handle dtype mismatch in torch._grouped_mm with autocast (#43839) by @Mr-Neutr0n
  • Fix missing fast image patch counter in Glm46V (#43877) by @OiPunk
  • Fix old tech stack in doc (#43902) by @cyyever
  • Move _keys_to_ignore_on_load_missing for now (#43893) by @ArthurZucker
  • Changes to cache_utils should trigger all tests all the time (#43920) by @Cyrilvallez
  • Ernie4 5 vl moe (#43755) by @kaixuanliu
  • Harmonize input_embeds to inputs_embeds everywhere (#43916) by @Cyrilvallez
  • fix: TextClassificationPipeline docs mentioning deprecated return_all_scores (#43903) by @math-hiyoko
  • Revert #43897 (#43923) by @Rocketknight1
  • Fix AttributeError in OwlViT conversion script for Python 3.10+ (#43922) by @DimiChatzipavlis
  • add openAI style image_url content support in apply_chat_template (#43786) by @kaixuanliu
  • Prepare and keep track of position ids in generate (#43734) by @zucchini-nlp
  • Fix lifted_tensor in Gemma3n export which dynamo can't reason about (#43801) by @robell
  • Fix bark test (#43942) by @Cyrilvallez
  • Fix docker files (#43946) by @ydshieh
  • Fix flaky test for multimodal LLMs (#43944) by @Rocketknight1
  • Add explicit utf-8 encoding to CircleCI scripts for Windows compatibility (#43925)
  • Modernize string formatting (f-strings) in conversion scripts (#43943)
  • Fix weight decay exclusions in run_*_no-trainer.py examples (#42769) by @casinca
  • fix: Better weight decay exclusion in run_*_no-trainer.py examples (#43947) by @casinca
  • Timm backbone saves and loads out_features (#43886) by @zucchini-nlp
  • Fix qwen-vl position ids when generating several times (#43952) by @zucchini-nlp
  • Fix get_number_of_image_tokens (#43948) by @zucchini-nlp
  • Fix typos in docstrings, comments, and error messages (#43949)
  • Fix LASR test layerdrop issue (#43954) by @Rocketknight1
  • [kernels] fix kernel versions (#43955) by @MekkCyber
  • [Doc tests] Fix bug (#43729) by @NielsRogge
  • fix(models): Preserve custom token IDs through DiaConfig save and load (#43928) by @harshaljanjani
  • update somes audio models (#43865) by @Deep-unlearning
  • Improve memory allocator during loading (#43945) by @Cyrilvallez
  • Inclusion of process_group in the gather_full_tensor function in tensor_parallel.py (#43932) by @quic-meetkuma
  • Fix sync gradient (#43919) by @SunMarc
  • Reorder Trainer methods (#43914) by @SunMarc
  • Fix TypeError in dot_natural_key when state_dict keys have mixed types at same position (#43966) by @shtse8
  • Enhance JSON schema generation to support instance, static, and class methods (#43968) by @qgallouedec
  • Remove unused squeeze from VJEPA2 embeddings rotation (#43984) by @materight
  • Improve new failing test analysis for PR comment CI (#44033) by @ydshieh
  • Remove other_workflow_run_ids for issue_comment in utils/notification_service.py (#44036) by @ydshieh
  • stable grouped_mm API (#43977) by @IlyasMoutawwakil
  • create .git-blame-ignore-revs file (#43982) by @SunMarc
  • docs: fix typos across documentation files (#43993) by @saurav0369
  • update python requirement to 3.10+ to match codebase (#44009) by @mariam851
  • Improve use of torch.is_autocast_enabled (#43930) by @cyyever
  • Use torch.xlogy (#44006) by @cyyever
  • [Deespeed] fix WeightConverter.convert() use (#43926) by @kashif
  • Reduce reduce CUDA sync (#44005) by @cyyever
  • split out accelerator args builder method (#43987) by @winglian
  • SINQ quantization strategy integration (adapted for Transformers V5) (#43112) by @ChiaraBoretti
  • fix(models): Unpack BitNet packed weights to fix CI failure (#43721) by @harshaljanjani

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ChiaraBoretti
    • SINQ quantization strategy integration (adapted for Transformers V5) (#43112)
  • @cyyever
    • Reduce reduce CUDA sync (#44005)
    • Use torch.xlogy (#44006)
    • Improve use of torch.is_autocast_enabled (#43930)
    • Fix old tech stack in doc (#43902)
    • Update KERNELS_MIN_VERSION to 0.10.2 to be the same as setup.py (#43753)
    • Remove unnecessary code or checks for PT 2.4+ (#43787)
    • Fix old tech stack in doc (#43879)
    • Delete batch_split from EncoderDecoderCache (#43814)
    • Fix markdown documentation (#43076)
  • @eustlb
    • Add Voxtral Realtime (#43769)
    • [MistralCommonBackend] fix loading proc (#43887)
  • @ebezzam
    • Fix expected DAC outputs due to (old) change in CI settings. (#43896)
    • Add VibeVoice Acoustic Tokenizer (#43400)
  • @vasqu
    • [Jamba] Fallback to slow path and warn instead of error out (#43889)
    • :rotating_light: [Attn] New attn mask interface everywhere (#42848)
    • [Repo Consistency] Fix rms norm (#43803)
    • [Modular Dependencies] Fixup qwen rms norms (#43772)
  • @bozheng-hit
    • Adding Support for Qwen3.5 (#43830)
Feb 5, 2026
v5.1.0: EXAONE-MoE, PP-DocLayoutV3, Youtu-LLM, GLM-OCR

New Model additions

EXAONE-MoE

<img width="2278" height="1142" alt="image" src="https://github.com/user-attachments/assets/0c3d5341-0483-49c3-8467-f9784ec94b37" />

K-EXAONE is a large-scale multilingual language model developed by LG AI Research. Built using a Mixture-of-Experts architecture, K-EXAONE features 236 billion total parameters, with 23 billion active during inference. Performance evaluations across various benchmarks demonstrate that K-EXAONE excels in reasoning, agentic capabilities, general knowledge, multilingual understanding, and long-context processing.

  • Add EXAONE-MoE implementations (#43080) by @nuxlear

PP-DocLayoutV3

<img width="6252" height="1892" alt="image" src="https://github.com/user-attachments/assets/b2e58244-8ed3-42c6-80d7-e32842977ddb" />

PP-DocLayoutV3 is a unified and high-efficiency model designed for comprehensive layout analysis. It addresses the challenges of complex physical distortions—such as skewing, curving, and adverse lighting—by integrating instance segmentation and reading order prediction into a single, end-to-end framework.

  • [Model] Add PP-DocLayoutV3 Model Support (#43098) by @zhang-prog

Youtu-LLM

<img width="564" height="352" alt="image" src="https://github.com/user-attachments/assets/864372be-4ecb-41fd-8c92-f3515be040d3" />

Youtu-LLM is a new, small yet powerful LLM: it contains only 1.96B parameters, supports a 128k long context, and has native agentic talents. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in Commonsense, STEM, Coding, and Long Context capabilities; in agent-related testing, Youtu-LLM surpasses larger-sized leaders and is truly capable of completing multiple end-to-end agent tasks.

  • Add Youtu-LLM model (#43166) by @LuJunru

GlmOcr

<img width="3972" height="2352" alt="image" src="https://github.com/user-attachments/assets/a7ddfb4f-42ea-4dc6-bc73-aefb0f750c4e" />

GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.

  • [GLM-OCR] GLM-OCR Support (#43391) by @zRzRzRzRzRzRzR
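The "efficient token downsampling" in GLM-OCR's connector refers to merging consecutive visual tokens before they reach the language decoder, shrinking the sequence the decoder must process. A toy sketch of the idea (simple average pooling over groups of tokens; the real connector's method may differ):

```python
import numpy as np

def downsample_tokens(tokens, factor):
    """Merge every `factor` consecutive visual tokens by averaging."""
    n, d = tokens.shape
    assert n % factor == 0, "token count must be divisible by factor"
    return tokens.reshape(n // factor, factor, d).mean(axis=1)

# 8 visual tokens of dimension 3, merged 4-to-1 → 2 tokens.
vision_tokens = np.arange(24, dtype=float).reshape(8, 3)
merged = downsample_tokens(vision_tokens, factor=4)
```

Fewer visual tokens means proportionally less decoder compute per page, which matters for dense document layouts.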

Breaking changes

  • 🚨 T5Gemma2 model structure (#43633) - Makes sure the attention implementation is set on all sub-configs. The config.encoder.text_config was not getting its attention implementation set because it is not passed to PreTrainedModel.__init__. Since the model structure cannot be changed without breaking, a call to self.adjust_attn_implementation was manually re-added in the modeling code.

  • 🚨 Generation cache preparation (#43679) - Refactors cache initialization in generation to ensure sliding window configurations are now properly respected. Previously, some models (like Afmoe) created caches without passing the model config, causing sliding window limits to be ignored. This is breaking because models with sliding window attention will now enforce their window size limits during generation, which may change generation behavior or require adjusting sequence lengths in existing code.

  • 🚨 Delete duplicate code in backbone utils (#43323) - This PR cleans up backbone utilities. Specifically, we currently have 5 different config attributes for deciding which backbone to load, most of which are redundant and can be merged into one. After this PR, there is only one config.backbone_config as a single source of truth. Models load the backbone from_config and load pretrained weights only if the checkpoint has any weights saved. The overall idea is the same as in other composite models. A few config arguments are removed as a result.

  • 🚨 Refactor DETR to updated standards (#41549) - standardizes the DETR model to be closer to other vision models in the library.

  • 🚨 Fix floating-point precision in JanusImageProcessor resize (#43187) - replaces an int() with round(); expect slight numerical differences.

  • 🚨 Remove deprecated AnnotionFormat (#42983) - removes a misnamed class in favour of AnnotationFormat.
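The generation-cache change above is easiest to understand through the sliding-window limit itself: a sliding-window cache keeps only the last `window` key/value positions and evicts older ones, and models that previously ignored this limit will now enforce it during generation. A minimal sketch of the capping behavior (illustrative toy class, not the library's actual Cache classes):

```python
from collections import deque

class SlidingWindowCache:
    """Keeps at most `window` past positions, evicting the oldest first."""

    def __init__(self, window):
        self.keys = deque(maxlen=window)  # deque drops old entries itself

    def update(self, new_keys):
        self.keys.extend(new_keys)
        return list(self.keys)            # positions still visible to attention

cache = SlidingWindowCache(window=4)
cache.update(range(6))                    # positions 0..5 arrive
visible = cache.update([6])               # only the last 4 remain visible
```

After seven positions have arrived, only positions 3..6 are still attendable, which is exactly the behavior change: generations that silently relied on an unbounded cache may now produce different outputs.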
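For the JanusImageProcessor resize fix, the behavioral difference is easy to see: `int()` truncates toward zero while `round()` picks the nearest integer, so a fractional resize target can shift by one pixel, which is where the slight numerical differences come from.

```python
# int() truncates toward zero; round() picks the nearest integer.
size = 2.7                     # e.g. a fractional resize target
assert int(size) == 2          # old behavior: positive values are floored
assert round(size) == 3        # new behavior: nearest integer

# Caveat: Python 3's round() uses banker's rounding at exact halves.
assert round(2.5) == 2 and round(3.5) == 4
```

A one-pixel change in a target dimension is enough to perturb interpolated pixel values downstream, hence "expect slight numerical differences" rather than identical outputs.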

Bugfixes and improvements

  • fix(models): Migrate legacy segmentation_indices to out_indices in BeitConfig (#43505) by @harshaljanjani
  • [docs] Update torch version (#42135) by @stevhliu
  • Remove SDPA workarounds for torch 2.4+ (#43754) by @cyyever
  • add use_deterministic to guarantee the consistency for youtu-llm model (#43759) by @kaixuanliu
  • fix: add compatible_model_types to suppress model type mismatch warnings (#43495) by @leoneperdigao
  • Fix T5 v1.1 detection (#43681) by @githubnemo
  • Add moonshine streaming (#43702) by @eustlb
  • Allow bi-directional attention for all models (#43705) by @Cyrilvallez
  • Docs: fix Training step by removing tokenizer from trainer initialization (#43733) by @nesjett
  • Fix scheduler initialization order (#43711) by @SunMarc
  • Fix accelerate integration import (#43732) by @SunMarc
  • Update torch minimum version to 2.4 (#41307) by @cyyever
  • Fix dtype in image-text-to-text pipe (#43731) by @zucchini-nlp
  • Preventing initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3 (#43574) by @jp1924
  • fix: AttributeError for Qwen3_omni_moe (#43593) by @Vallabh-1504
  • Improve typing/explanations for general model properties (#43712) by @Cyrilvallez
  • [Kernels] kernel migration updates for activation kernels (#43518) by @ariG23498
  • [feat] Allow loading T5Gemma2Encoder with AutoModel (#43559) by @tomaarsen
  • Added S110 - try-except-pass rule (#43687) by @tarekziade
  • [docs] benchmarks (#43694) by @stevhliu
  • fix norm_eps dtype (#43669) by @fschlatt
  • Llava onevision: output align for tests and add image_sizes input param (#43678) by @kaixuanliu
  • Fix CLIPOutput attentions not being returned (#43657) by @jonathan-fulton
  • [Attn] Fixup interface usage after refactor (#43706) by @vasqu
  • Fix model/processor mismatch in SigLIP2 quantization example (#43652) by @jonathan-fulton
  • Fix crash of custom models in Notebook or Repl (#43690) by @Cyrilvallez
  • Simplify TrainingArguments docstring (#43568) by @SunMarc
  • Composite model inherit automatically all important properties from their children (#43691) by @Cyrilvallez
  • Update configuration_qwen3.py (#43703) by @francesco-bertolotti
  • fix gptoss tp crash (#43695) by @sywangyi
  • [CB] Keep order of incoming requests (#43626) by @remi-or
  • Fix Apertus model loading (NotImplementedError: Cannot copy out of meta tensor; no data!) (#43473) by @xenova
  • Remove num_frames in ASR pipeline (#43546) by @jiqing-feng
  • remove ipex and ccl for xpu and cpu (#42852) by @yao-matrix
  • update guide with new attr name for toks (#43689) by @itazap
  • Docs: fix typos in Get started (index, quicktour) (#43666) by @CodeByKodi
  • the cache class is deprecated by @vasqu (direct commit on main)
  • custom tok init fix (#43591) by @itazap
  • More export friendly rewrites and skipping the failing ones (#43436) by @IlyasMoutawwakil
  • Cast byte_count to int in caching_allocator_warmup for MPS compatibility (#43608) by @tobyliu2004
  • [Docs] Complete missing Llama4 configuration docs (#43460) by @udaymehta
  • Fix t5 failures (#43374) by @Abdennacer-Badaoui
  • Add EoMT with DINOv3 backbone (#41212) by @NielsRogge
  • Update DBRX docs to reference re-uploaded checkpoint (#43196) by @qgallouedec
  • [loading] Fix forced upcasting to fp32 (#43683) by @Cyrilvallez
  • Fix FP8Expert for Qwen (#43670) by @yiliu30
  • Simplify loading structure (#43589) by @Cyrilvallez
  • [CB] Refactor logic for inputs and outputs outside of the main API (#43569) by @remi-or
  • Make sure hub errors are surfaced in PreTrainedTokenizerBase (#43675) by @tarekziade
  • Fix FP8Expert for DeepSeek R1 (#43616) by @yiliu30
  • Use correct sampling rate in chat template (#43674) by @zucchini-nlp
  • [HunYuan] Fix RoPE init (#43411) by @vasqu
  • XPU now supports MoE kernel(MegaBlocks) implementation (#43435) by @YangKai0616
  • [Sam] Fixup training flags (#43567) by @vasqu
  • remove torchao.autoquant from transformers (#43561) by @vkuzo
  • [DeepSpeed] properly handle MoE weight conversion (#43524) by @kashif
  • Tie zamba weights correctly (#43623) by @zucchini-nlp
  • [kernels] Centralize kernels tests (#42819) by @MekkCyber
  • Fix process_bad_commit_report.py: avoid items to appear in null author in the report (#43662) by @ydshieh
  • Fix KeyError in check_bad_commit.py (#43655) by @ydshieh
  • [Benchmark] Minor fix for benchmark: kernel is not correctly called (#43428) by @sywangyi
  • Add explicit commit info to PR comment CI feedback (#43635) by @ydshieh
  • Better new failures reporting for PR comment CI (#43629) by @ydshieh
  • [docs] serving (#42853) by @stevhliu
  • add XPU expected output for MixedInt8GPT2Test (#43615) by @kaixuanliu
  • Don't modify mappings in tests (#43634) by @Rocketknight1
  • Allow Attention and Experts to be used as standalone modules (#43622) by @Cyrilvallez
  • Don't modify tied_weight_keys in-place (#43619) by @zucchini-nlp
  • [Rope] Revert #43410 and make inheritance implicit again (#43620) by @vasqu
  • [vllm compat] Separate renaming from conversion ops (#43621) by @Cyrilvallez
  • refactor + robusts tests for Tensor Parallel (#42809) by @3outeille
  • add contiguous operation for diffllama model for xpu to enable compile mode. (#43614) by @kaixuanliu
  • add xpu expectation for lw_detr model (#43339) by @kaixuanliu
  • minimax_m2: fix failed test case for XPU (#43324) by @kaixuanliu
  • Improve new failures reporting (#43628) by @ydshieh
  • Fix extras on all supported Python versions (#43490) by @tarekziade
  • fix(models): Fix suno/bark-small CPU offload device mismatch causing CI failures (#43607) by @harshaljanjani
  • [CB] [Serve] Fix broken serve tests (#43594) by @remi-or
  • Docs: fix typo in weight converter guide (#43610) by @KOKOSde
  • [MoE] Use int input for histc on CUDA to support deterministic algorithms (#43583) by @YangKai0616
  • Fixes configuration default values (#43592) by @zucchini-nlp
  • Fix make_batched_video with 5D arrays (#43486) by @zucchini-nlp
  • Operation Green CI II (#43537) by @Rocketknight1
  • enable cpu paged cache (#42869) by @jiqing-feng
  • Qwen3 omni - fix get video features (#43588) by @zucchini-nlp
  • [GLM-Image] Add batch > 1 support and fix configuration defaults (#43342) by @JaredforReal
  • [Model] Refactor modernbert with the attention interface (#43030) by @YangKai0616
  • Regex post processing in loading (#43585) by @Cyrilvallez
  • simplify extra tokens logic in base (#43230) by @itazap
  • Add XPU support to the tests for solar_open (#43579) by @YangKai0616
  • remove FbgemmFp8LinearTest (#43545) by @sywangyi
  • Increase default ReadTimeout in tests (#43586) by @Wauplin
  • Fix mistral checkpoint loading in utils/fetch_hub_objects_for_ci.py: avoid too many requests and/or timeout (#43584) by @ydshieh
  • [CI][AMD] Fix Pipeline CI (#43178) by @Abdennacer-Badaoui
  • fix(converter): speed up MistralConverter.extract_vocab_merges_from_model (#43557) by @tarekziade
  • Improve GPU monitoring: switch to multiprocessing and use amdsmi for AMD GPUs (#43552) by @Abdennacer-Badaoui
  • Update test of Youtu-LLM to pr-aligned repos (#43578) by @LuJunru
  • Rework dependencies and extras + Remove outdated templates folder (#43536) by @Cyrilvallez
  • Fix repo. consistency bot (push permission issue) (#43570) by @ydshieh
  • Fix Wav2vec and a few others (#43566) by @Cyrilvallez
  • [Modular] Allow to add new bases that are not present in the inherited class (#43556) by @vasqu
  • add an option to disable Sam3VideoModel progress bar (#43564) by @ndeybach
  • check/fix repo. check bot workflow (#43565) by @ydshieh
  • Increase timeout when preparing CI (#43560) by @Rocketknight1
  • 43054: Add Siglip2Tokenizer to enforce training-time text preprocessing defaults (#43101) by @vaibhav-research
  • check PR bot permission - part 3 (try content attribute) (#43555) by @ydshieh
  • check PR bot permission - part 2 (style only) (#43554) by @ydshieh
  • check PR bot permission - part 1 (#43553) by @ydshieh
  • Fix failing tests due to no attribute pad_token_id (#43453) by @Sai-Suraj-27
  • fix: GPT OSS Conversion Script Enhancements (#42901) by @KyleMylonakisProtopia
  • [Quantization] Fix triton_kernels name after being renamed to gpt-oss-triton-kernels (#43528) by @MekkCyber
  • [Quantization] Add cutlass kernel for FP8 (#43304) by @MekkCyber
  • [CB] Minor perf improvements and ty compatibility (#43521) by @remi-or
  • Fix tiles mixing for batched input, add tie_word_embeddings to LFM2VL config (#43379) by @ankke
  • fix: return labels instead of label in reduce_label method in BeitImageProcessorFast (#43527) by @sbucaille
  • [RoPE] Make explicit inheritance (#43410) by @vasqu
  • Fix for #43530 (#43535) by @Rocketknight1
  • Operation Green CI (#43530) by @Rocketknight1
  • Tie the weights even if initializing from a config on meta device (#43523) by @Cyrilvallez
  • [kernels] Update cv_utils name (#43529) by @MekkCyber
  • add trackio to training notebooks (#43442) by @merveenoyan
  • Mark test_prompt_lookup_decoding as flaky (#42184) by @Rocketknight1
  • Fix some MoE routers (#43445) by @IlyasMoutawwakil
  • batched_mm is slow on cpu (#43438) by @IlyasMoutawwakil
  • fix: initialize BatchNorm2d buffers only when needed (#43520) by @tarekziade
  • Fix loading of Qwen3 FP8 (#43494) by @githubnemo
  • fix ShieldGemma2IntegrationTest::test_model (#43343) by @sywangyi
  • Update SamHQModelIntegrationTest::test_inference_mask_generation_batched_points_batched_images for XPU (#43511) by @sywangyi
  • Revert utils files changes from PR #42845 (#43507) by @ydshieh
  • Move hardcoded time_step params to config for Bamba, FalconH1, GraniteMoeHybrid (#43461) by @raimbekovm
  • Prepare inputs for generation is called from super() (#43280) by @zucchini-nlp
  • Enhance repo. consistency bot (#43503) by @ydshieh
  • Add pytest-random-order for reproducible test randomization (#43483) by @tarekziade
  • Add missing GPURawMetrics.from_dict() method in benchmark_v2 (#43499) by @Abdennacer-Badaoui
  • push dev version 5.0.1.dev0 by @ArthurZucker (direct commit on main)
  • Fix failing markuplm & perception_lm integration tests (#43464) by @Sai-Suraj-27
  • fix(Phi4Multimodal): Fix incorrect default vision/audio config initialization in Phi4MultimodalConfig (#43480) by @charlieJ107
  • handle 1D position_ids for modeling_flash_attention_utils as well (#43403) by @kaixuanliu
  • Remove stale TODO comments in UDOP tied weights (#43477) by @raimbekovm
  • Fix Mxfp4 dequantize (#43326) by @Cyrilvallez

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @cyyever
    • Remove SDPA workarounds for torch 2.4+ (#43754)
    • Update torch minimum version to 2.4 (#41307)
    • 🚨 Remove deprecated AnnotionFormat (#42983)
  • @eustlb
    • Add moonshine streaming (#43702)
  • @tarekziade
    • Added S110 - try-except-pass rule (#43687)
    • Make sure hub errors are surfaced in PreTrainedTokenizerBase (#43675)
    • Fix extras on all supported Python versions (#43490)
    • fix(converter): speed up MistralConverter.extract_vocab_merges_from_model (#43557)
    • fix: initialize BatchNorm2d buffers only when needed (#43520)
    • Add pytest-random-order for reproducible test randomization (#43483)
  • @nuxlear
    • Add EXAONE-MoE implementations (#43080)
  • @vasqu
    • [Attn] Fixup interface usage after refactor (#43706)
    • the cache class is deprecated
    • [HunYuan] Fix RoPE init (#43411)
    • [Sam] Fixup training flags (#43567)
    • [Rope] Revert #43410 and make inheritance implicit again (#43620)
    • [Modular] Allow to add new bases that are not present in the inherited class (#43556)
    • [RoPE] Make explicit inheritance (#43410)
  • @remi-or
    • [CB] Keep order of incoming requests (#43626)
    • [CB] Refactor logic for inputs and outputs outside of the main API (#43569)
    • [CB] [Serve] Fix broken serve tests (#43594)
    • [CB] Minor perf improvements and ty compatibility (#43521)
  • @NielsRogge
    • Add EoMT with DINOv3 backbone (#41212)
  • @YangKai0616
    • XPU now supports MoE kernel(MegaBlocks) implementation (#43435)
    • [MoE] Use int input for histc on CUDA to support deterministic algorithms (#43583)
    • [Model] Refactor modernbert with the attention interface (#43030)
    • Add XPU support to the tests for solar_open (#43579)
  • @ydshieh
    • Fix process_bad_commit_report.py: avoid items to appear in null author in the report (#43662)
    • Fix KeyError in check_bad_commit.py (#43655)
    • Add explicit commit info to PR comment CI feedback (#43635)
    • Better new failures reporting for PR comment CI (#43629)
    • Improve new failures reporting (#43628)
    • Fix mistral checkpoint loading in utils/fetch_hub_objects_for_ci.py: avoid too many requests and/or timeout (#43584)
    • Fix repo. consistency bot (push permission issue) (#43570)
    • check/fix repo. check bot workflow (#43565)
    • check PR bot permission - part 3 (try content attribute) (#43555)
    • check PR bot permission - part 2 (style only) (#43554)
    • check PR bot permission - part 1 (#43553)
    • Revert utils files changes from PR #42845 (#43507)
    • Enhance repo. consistency bot (#43503)
  • @JaredforReal
    • [GLM-Image] Add batch > 1 support and fix configuration defaults (#43342)
  • @zhang-prog
    • [Model] Add PP-DocLayoutV3 Model Support (#43098)
  • @LuJunru
    • Update test of Youtu-LLM to pr-aligned repos (#43578)
    • Add Youtu-LLM model (#43166)
  • @zRzRzRzRzRzRzR
    • [GLM-OCR] GLM-OCR Support (#43391)
Jan 26, 2026
Transformers v5

Transformers v5 release notes

<img width="1800" height="1013" alt="image" src="https://github.com/user-attachments/assets/7b5187d7-6945-4108-a546-6d1d7bfb55e3" />
  • Highlights
  • Significant API changes: dynamic weight loading, tokenization
  • Backwards Incompatible Changes
  • Bugfixes and improvements

We have a migration guide that will be continuously updated available on the main branch, please check it out in case you're facing issues: migration guide.

Highlights

We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and the release is significant: 1200 commits have been pushed to main since the latest minor release. This release removes a lot of long-due deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.

We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.

This is the full v5 release, and it sets something bigger in motion: starting with v5, we'll ship a minor release every week, rather than every five weeks. Expect v5.1 to follow next week, then v5.2 the week after, and so on.

We're moving forward with this change to ensure you have access to models as soon as they're supported in the library, rather than a few weeks after.

In order to install this release, please do so with the following:

pip install transformers

For us to deliver the best package possible, it is imperative that we get feedback on how the toolkit works for you. Please try it out, and open an issue if you run into an inconsistency or a bug.

Transformers version 5 is a community endeavor, and we couldn't have shipped such a massive release without the help of the entire community.

Significant API changes

Dynamic weight loading

We introduce a new weight loading API in transformers, which significantly improves on the previous API. This weight loading API is designed to apply operations to the checkpoints loaded by transformers.

Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge, and split the layers according to how they're defined in this new API. These operations are often a necessity when working with quantization or parallelism algorithms.

This new API is centered around the new WeightConverter class:

class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]

The weight converter is designed to apply a list of operations on the source keys, resulting in target keys. A common operation done on the attention layers is to fuse the query, key, values layers. Doing so with this API would amount to defining the following conversion:

conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)

In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single layer.

This allows us to define a mapping from each architecture to a list of weight conversions, which can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method and helped us remove a lot of technical debt accumulated over the past few years.
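To make the Concatenate example concrete, here is a toy, pure-Python sketch of fusing q/k/v projection weights along dim 0 and splitting them back. This is an illustration of the idea, not the transformers implementation; the helpers below are stand-ins for the real ConversionOps.

```python
def concatenate_dim0(matrices):
    """Stack weight matrices row-wise (dim 0), as Concatenate(dim=0) would."""
    fused = []
    for m in matrices:
        fused.extend(m)
    return fused

def split_dim0(fused, sizes):
    """Inverse operation: recover the original matrices from the fused one."""
    out, start = [], 0
    for size in sizes:
        out.append(fused[start:start + size])
        start += size
    return out

# Each "weight" is a list of rows; two rows per projection, hidden size 2.
q = [[1, 0], [0, 1]]
k = [[2, 0], [0, 2]]
v = [[3, 0], [0, 3]]

qkv = concatenate_dim0([q, k, v])                # 6 rows: the fused qkv_proj
assert split_dim0(qkv, [2, 2, 2]) == [q, k, v]   # the transform is reversible
```

The round trip at the end is the point: because each operation has a well-defined inverse, loading then saving a checkpoint can reproduce the original serialization.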

This results in several improvements:

  • Much cleaner definition of transformations applied to the checkpoint
  • Reversible transformations, so loading and saving a checkpoint should result in the same checkpoint
  • Faster model loading thanks to scheduling of tensor materialization
  • Enables complex mix of transformations that wouldn't otherwise be possible (such as quantization + MoEs, or TP + MoEs)

Linked PR: https://github.com/huggingface/transformers/pull/41580

Tokenization

Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object, to be a lot more intuitive. With v5, tokenizer definition is much simpler: one can now initialize an empty LlamaTokenizer and train it directly on a corpus.

Defining a new tokenizer object should be as simple as this:

from transformers import TokenizersBackend
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE

class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab
        self._merges = merges or []

        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme="first", split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )

Once the tokenizer is defined as above, you can instantiate it with Llama5Tokenizer(). This returns an empty, trainable tokenizer that follows the definition from the authors of Llama5 (which does not exist yet :wink:).

The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.

Backend Architecture Changes: moving away from the slow/fast tokenizer separation

Up to now, transformers maintained two parallel implementations for many tokenizers:

  • "Slow" tokenizers (tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend.
  • "Fast" tokenizers (tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.

In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:

  1. TokenizersBackend (preferred): Rust-based tokenizers from the 🤗 tokenizers library. It generally provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem:
  • handling additional tokens
  • a full Python API for setting and updating attributes
  • automatic parallelization
  • automatic offsets
  • customization
  • training
  2. SentencePieceBackend: for tokenizers requiring the sentencepiece library. It inherits from PythonBackend.
  3. PythonBackend: a Python implementation of the features provided by tokenizers; it essentially allows adding tokens.
  4. MistralCommonBackend: relies on the mistral-common tokenization library. (Previously known as the MistralCommonTokenizer.)

The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent, you continue to use AutoTokenizer.from_pretrained() as before. This allows transformers to be future-proof and modular to easily support future backends.
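Conceptually, the backend selection could be sketched like this. The file names and the selection order below are illustrative assumptions, not the actual AutoTokenizer logic:

```python
def pick_backend(files, has_tokenizers=True, has_sentencepiece=True):
    """Choose a tokenizer backend from the files present in a repo (sketch)."""
    if "tokenizer.json" in files and has_tokenizers:
        return "TokenizersBackend"        # preferred: Rust-based 🤗 tokenizers
    if "tokenizer.model" in files and has_sentencepiece:
        return "SentencePieceBackend"     # sentencepiece checkpoints
    if "tekken.json" in files:
        return "MistralCommonBackend"     # mistral-common tokenizers
    return "PythonBackend"                # pure-Python fallback

assert pick_backend({"tokenizer.json"}) == "TokenizersBackend"
assert pick_backend({"tokenizer.model"}) == "SentencePieceBackend"
```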

Defining a tokenizer outside of the existing backends

We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher-level, without relying on those backends.

To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.

If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:

  • encode
  • decode
  • vocab_size
  • get_vocab
  • convert_tokens_to_ids
  • convert_ids_to_tokens
  • from_pretrained
  • save_pretrained
  • among a few others
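As a rough illustration of that minimal API surface, here is a toy whitespace tokenizer (purely hypothetical, not a transformers class) implementing those methods:

```python
class ToyTokenizer:
    """Minimal sketch of the base tokenizer API: encode/decode/vocab helpers."""

    def __init__(self, vocab):
        self.vocab = vocab
        self.ids_to_tokens = {i: t for t, i in vocab.items()}

    @property
    def vocab_size(self):
        return len(self.vocab)

    def get_vocab(self):
        return dict(self.vocab)

    def convert_tokens_to_ids(self, tokens):
        return [self.vocab[t] for t in tokens]

    def convert_ids_to_tokens(self, ids):
        return [self.ids_to_tokens[i] for i in ids]

    def encode(self, text):
        return self.convert_tokens_to_ids(text.split())

    def decode(self, ids):
        return " ".join(self.convert_ids_to_tokens(ids))

tok = ToyTokenizer({"hello": 0, "world": 1})
assert tok.decode(tok.encode("hello world")) == "hello world"
```

A real subclass would also implement from_pretrained and save_pretrained on top of this surface.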

API Changes

1. Direct tokenizer initialization with vocab and merges

Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()

This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.

These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:

from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)

This tokenizer will behave as a Llama-like tokenizer with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab, enabling the comparison of different pre-tokenizers, normalizers, etc.

⚠️ The vocab_file (as in, a path towards a file containing the vocabulary) cannot be used to initialize the LlamaTokenizer as loading from files is reserved to the from_pretrained method.
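To see how a vocab and merges pair drives encoding, here is a simplified, pure-Python BPE sketch. The algorithm below is illustrative (and far less efficient than the real tokenizers backend); the vocab and merges are toy values:

```python
def bpe_encode(word, vocab, merges):
    """Apply ordered BPE merges to a word, then map symbols to ids."""
    symbols = list(word)
    for a, b in merges:                      # merges are applied in order
        i = 0
        while i < len(symbols) - 1:
            if symbols[i] == a and symbols[i + 1] == b:
                symbols[i:i + 2] = [a + b]   # fuse the adjacent pair
            else:
                i += 1
    return [vocab[s] for s in symbols if s in vocab]

vocab = {"h": 0, "e": 1, "l": 2, "o": 3, "he": 4, "ll": 5}
merges = [("h", "e"), ("l", "l")]
assert bpe_encode("hello", vocab, merges) == [4, 5, 3]  # he, ll, o
```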

2. Simplified decoding API

The batch_decode and decode methods have been unified to reflect behavior of the encode method. Both single and batch decoding now use the same decode method. See an example of the new behavior below:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small") 
inputs = ["hey how are you?", "fine"]
tokenizer.decode(tokenizer.encode(inputs))

Gives:

- 'hey how are you?</s> fine</s>'
+ ['hey how are you?</s>', 'fine</s>']

We expect encode and decode to behave as two sides of the same coin: an encode, process, decode round trip should just work.

[!NOTE] A common use-case is: encode, model.generate, decode. generate returns list[list[int]], which was previously incompatible with the single-sequence decode.

3. Unified encoding API

The encode_plus method is deprecated in favor of the single __call__ method.

4. apply_chat_template returns BatchEncoding

Previously, apply_chat_template returned input_ids for backward compatibility. Starting with v5, it now consistently returns a BatchEncoding dict like other tokenizer methods.

# v5
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"}
]

# Now returns BatchEncoding with input_ids, attention_mask, etc.
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys())  # dict_keys(['input_ids', 'attention_mask'])

5. Removed legacy configuration file saving:

We simplify the serialization of tokenization attributes:

  • special_tokens_map.json - special tokens are now stored in tokenizer_config.json.
  • added_tokens.json - added tokens are now stored in tokenizer.json.
  • added_tokens_decoder is only stored when there is no tokenizer.json.

When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format. We're gradually moving towards consolidating attributes to fewer files so that other libraries and implementations may depend on them more reliably.
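As a sketch of the consolidated layout (the field names below are illustrative, not an exhaustive schema), the special tokens now live directly in tokenizer_config.json:

```python
import json

# Illustrative tokenizer_config.json contents: special tokens are stored
# here instead of in a separate special_tokens_map.json file.
tokenizer_config = {
    "tokenizer_class": "LlamaTokenizer",
    "bos_token": "<s>",
    "eos_token": "</s>",
    "unk_token": "<unk>",
    "extra_special_tokens": ["<image>"],
}
serialized = json.dumps(tokenizer_config, indent=2)
assert "special_tokens_map" not in serialized  # no separate file needed
```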

6. Model-Specific Changes

Several models that had identical tokenizers now import from their base implementation:

  • LayoutLM → uses BertTokenizer
  • LED → uses BartTokenizer
  • Longformer → uses RobertaTokenizer
  • LXMert → uses BertTokenizer
  • MT5 → uses T5Tokenizer
  • MVP → uses BartTokenizer

These modules will eventually be removed altogether.

Removed T5-specific workarounds

The internal _eventually_correct_t5_max_length method has been removed. T5 tokenizers now handle max length consistently with other models.

Testing Changes

A few testing changes specific to tokenizers have been applied:

  • Model-specific tokenization test files now focus on integration tests.
  • Common tokenization API tests (e.g., add_tokens, encode, decode) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.

For legacy implementations, the original BERT Python tokenizer code (including WhitespaceTokenizer, BasicTokenizer, etc.) is preserved in bert_legacy.py for reference purposes.

7. Deprecated / Modified Features

Special Tokens Structure:

  • SpecialTokensMixin: Merged into PreTrainedTokenizerBase to simplify the tokenizer architecture.
  • special_tokens_map: Now only stores named special token attributes (e.g., bos_token, eos_token). Use extra_special_tokens for additional special tokens (formerly additional_special_tokens). all_special_tokens includes both named and extra tokens.
# v4
tokenizer.special_tokens_map  # Included 'additional_special_tokens'

# v5
tokenizer.special_tokens_map  # Only named tokens
tokenizer.extra_special_tokens  # Additional tokens
  • special_tokens_map_extended and all_special_tokens_extended: Removed. Access AddedToken objects directly from _special_tokens_map or _extra_special_tokens if needed.
  • additional_special_tokens: Still accepted for backward compatibility but is automatically converted to extra_special_tokens.

Deprecated Methods:

  • sanitize_special_tokens(): Already deprecated in v4, removed in v5.
  • prepare_seq2seq_batch(): Deprecated; use __call__() with text_target parameter instead.
# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)

# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
  • BatchEncoding.words(): Deprecated; use word_ids() instead.

Removed Methods:

  • create_token_type_ids_from_sequences(): Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
  • prepare_for_model(), build_inputs_with_special_tokens(), truncate_sequences(): Moved from tokenization_utils_base.py to tokenization_python.py for PythonBackend tokenizers. TokenizersBackend provides model-ready input via tokenize() and encode(), so these methods are no longer needed in the base class.
  • _switch_to_input_mode(), _switch_to_target_mode(), as_target_tokenizer(): Removed from base class. Use __call__() with text_target parameter instead.
# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)

# v5
labels = tokenizer(text_target=tgt_texts, ...)
  • parse_response(): Removed from base class.

Performance

MoE Performance

The v5 release significantly improves the performance of the MoE models, as can be seen in the graphs below. We improve and optimize MoE performance through batched and grouped experts implementations, and we optimize them for decoding using batched_mm.
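The grouping idea can be sketched in plain Python: tokens are bucketed by their routed expert so each expert runs once over its whole group, instead of once per token. The scalar "experts" below are toy stand-ins for real weight matrices, and this is a sketch of the principle, not the transformers kernels:

```python
def grouped_moe(tokens, routes, experts):
    """Route tokens to experts, then process each expert's group at once."""
    # Bucket token indices per expert (one gather per expert, not per token).
    groups = {e: [] for e in range(len(experts))}
    for idx, expert_id in enumerate(routes):
        groups[expert_id].append(idx)

    out = [None] * len(tokens)
    for expert_id, indices in groups.items():
        scale = experts[expert_id]           # stand-in for an expert's weights
        for idx in indices:                  # one "batched" pass per expert
            out[idx] = tokens[idx] * scale
    return out

tokens = [1.0, 2.0, 3.0, 4.0]
routes = [0, 1, 0, 1]                        # router's expert choice per token
experts = [10.0, 100.0]
assert grouped_moe(tokens, routes, experts) == [10.0, 200.0, 30.0, 400.0]
```

In the real implementation, the inner loop becomes a single batched matrix multiply (batched_mm) per expert group, which is where the decoding speedup comes from.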

<img width="2048" height="1451" alt="image" src="https://github.com/user-attachments/assets/c3f2e59f-3026-4f56-9a56-36e4eb0fcf73" />

Core performance

We focus on improving the performance of loading weights onto device (which gives speedups of up to 6x in tensor parallel situations); this is preliminary work that we'll continue in the coming weeks. Some notable improvements:

Library-wide changes with lesser impact

Default dtype update

We have updated the default dtype for all models loaded with from_pretrained to be auto. This means model instantiation respects the dtype in which the model was saved, rather than forcing it to load in float32.

You can, of course, still specify the dtype in which you want to load your model by specifying it as an argument to the from_pretrained method.

Shard size

The Hugging Face Hub infrastructure has gradually moved to a Xet backend. This significantly simplifies uploads and downloads, with higher download and upload speeds, partial uploads, and, most notably, a higher threshold for accepted file sizes on the Hugging Face Hub.

To reflect this, we're increasing the default shard size of models serialized on the Hub to 50GB (up from 5GB).

use_auth_token

The use_auth_token argument/parameter is deprecated in favor of token everywhere. You should be able to search and replace use_auth_token with token and get the same logic.

Linked PR: https://github.com/huggingface/transformers/pull/41666

Attention-related features

We decided to remove some features in v5, as they are only supported by a few older models and are no longer integrated into current model additions. It's recommended to stick with v4.x in case you need them. The following features are affected:

  • No more head masking, see #41076. This feature allowed turning off certain heads during the attention calculation, and only worked with eager.
  • No more relative positional biases in Bert-like models, see #41170. This feature was introduced to allow relative position scores within attention calculations (similar to T5). However, it is barely used in official models and added a lot of complexity. It also only worked with eager.
  • No more head pruning, see #41417 by @gante. As the name suggests, it allowed pruning heads within your attention layers.

Updates to supported torch APIs

We dropped support for two torch APIs:

  • torch.fx symbolic tracing
  • TorchScript (torch.jit)

Those APIs were deprecated by the PyTorch team, and we're instead focusing on the supported APIs: dynamo and export.

Quantization changes

We clean up the quantization API in transformers, and significantly refactor the weight loading as highlighted above.

We drop support for two quantization arguments that have been deprecated for some time:

  • load_in_4bit
  • load_in_8bit

We remove them in favor of the quantization_config argument which is much more complete. As an example, here is how you would load a 4-bit bitsandbytes model using this argument:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    quantization_config=quantization_config
)

Configuration

  • Methods to init a nested config such as from_xxx_config are deleted. Configs can be initialized from the __init__ method in the same way. See #41314.
  • It is no longer possible to load a config class from a URL file. Configs must be loaded from either a local path or a repo on the Hub. See #42383.
  • All parameters configuring a model's rotary embedding are now stored under config.rope_parameters, including rope_theta and rope_type. config.rope_parameters is a simple dictionary in most cases, and can also be a nested dict in special cases (i.e. Gemma3 and ModernBert) with a different RoPE parameterization for each layer type. Trying to get config.rope_theta will throw an attribute error from now on. See #39847 and #42255.
  • Qwen-VL family configuration is in a nested format and trying to access keys directly will throw an error (e.g. config.vocab_size). Users are expected to access keys from their respective sub-configs (config.text_config.vocab_size).
  • Configurations of non-generative models (any model that doesn't call model.generate()) will no longer have a generation_config and model.config.generation_config will throw an attribute error.
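For illustration, the flat and nested shapes of rope_parameters could look like the following; the values and layer-type keys below are made up, not taken from a real config:

```python
# Flat form: one RoPE parameterization for the whole model.
flat_rope = {"rope_type": "default", "rope_theta": 10000.0}

# Nested form: one parameterization per layer type (e.g. Gemma3-style
# alternating local/global attention layers); keys here are illustrative.
nested_rope = {
    "full_attention": {"rope_type": "default", "rope_theta": 1000000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 10000.0},
}

# config.rope_theta no longer exists; read theta through rope_parameters.
assert flat_rope["rope_theta"] == 10000.0
assert nested_rope["sliding_attention"]["rope_theta"] == 10000.0
```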

Processing

Tokenization

  • Slow tokenizer files (aka tokenization_<model>.py) are removed in favor of the fast tokenizer files (tokenization_<model>_fast.py), which are renamed to tokenization_<model>.py. As fast tokenizers are backed by 🤗 tokenizers, they include a wider range of features that are maintainable and reliable.
  • Other backends (sentencepiece, etc.) are supported with a light layer if loading a fast tokenizer fails
  • Remove legacy files like special_tokens_map.json and added_tokens.json
  • Remove _eventually_correct_t5_max_length
  • encode_plus --> __call__
  • batch_decode --> decode

apply_chat_template previously returned naked input_ids rather than a BatchEncoding dict. This was inconvenient: it should return a BatchEncoding dict like tokenizer.__call__(), but we were stuck with it for backward compatibility. The method now returns a BatchEncoding.

Linked PRs:

Processing classes

Modeling

  • Some RotaryEmbeddings layers will start returning a dict of tuples, in case the model uses several RoPE configurations (Gemma2, ModernBert). Each value will be a tuple of "cos, sin" per RoPE type.
  • Config attribute for RotaryEmbeddings layer will be unified and accessed via config.rope_parameters. Config attr for rope_theta might not be accessible anymore for some models, and instead will be in config.rope_parameters['rope_theta']. BC will be supported for a while as much as possible, and in the near future we'll gradually move to the new RoPE format (https://github.com/huggingface/transformers/pull/39847)
  • Vision-language models no longer have shortcut access to their language and vision components from the generative model via model.language_model. It is recommended to either access the module with model.model.language_model or use model.get_decoder(). See #42156
  • All models now accept kwargs in their forward methods

Generate

  • Old, deprecated output type aliases were removed (e.g. GreedySearchEncoderDecoderOutput). We now only have 4 output classes built from the following matrix: decoder-only vs encoder-decoder, uses beams vs doesn't use beams (https://github.com/huggingface/transformers/pull/40998)
  • Removed deprecated classes regarding decoding methods that were moved to the Hub due to low usage (constraints and beam scores) (https://github.com/huggingface/transformers/pull/41223)
  • If generate doesn't receive any KV Cache argument, the default cache class used is now defined by the model (as opposed to always being DynamicCache) (https://github.com/huggingface/transformers/pull/41505)
  • Generation parameters are no longer accessible via the model's config. If generation parameters are serialized in config.json for an old model, they will be loaded back into the model's generation config. Users are expected to access or modify generation parameters only via the generation config, e.g. model.generation_config.do_sample = True.

Trainer

New Features

  • ALST/Ulysses Sequence Parallelism Integration
    • Added sequence parallelism support via HF Accelerate for training with longer sequences. Enables splitting sequences across devices using ALST (All-to-All Long Sequence Training) and Ulysses algorithms with DeepSpeed.
  • Improved compute_loss_func Handling
    • compute_loss_func now always takes priority over the model's built-in loss computation, giving users consistent control over custom loss functions.
  • num_items_in_batch in Prediction Step
    • The num_items_in_batch argument is now passed to compute_loss during prediction_step, enabling proper loss scaling during evaluation.
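A toy calculation showing why this matters: averaging per micro-batch and then averaging those averages over-weights short batches, while dividing the summed loss by num_items_in_batch weights every token equally. This is a sketch of the principle, not Trainer code:

```python
# Two micro-batches of per-token losses: one with 3 tokens, one with 1.
micro_batches = [[2.0, 2.0, 2.0], [4.0]]
num_items_in_batch = sum(len(b) for b in micro_batches)  # 4 tokens total

# Naive: mean per micro-batch, then mean of means.
per_batch_mean = sum(sum(b) / len(b) for b in micro_batches) / len(micro_batches)

# Correct: summed loss divided by the global token count.
global_mean = sum(sum(b) for b in micro_batches) / num_items_in_batch

assert per_batch_mean == 3.0  # the lone token is over-weighted
assert global_mean == 2.5     # every token counts equally
```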

Breaking Changes

  • report_to now defaults to "none"
    • Logging integrations are no longer auto-detected by default; users must explicitly specify which reporting backends to use.

Removing arguments without deprecation cycle in TrainingArguments due to low usage

  • mp_parameters -> legacy parameter that was later added to the SageMaker trainer
  • _n_gpu -> not intended to be set by users; we will initialize it correctly instead of exposing it in TrainingArguments
  • overwrite_output_dir -> replaced by resume_from_checkpoint; it was only used in the example scripts, so there is no impact on Trainer
  • logging_dir -> only used for TensorBoard; set the TENSORBOARD_LOGGING_DIR env var instead
  • jit_mode_eval -> use use_torch_compile instead, as TorchScript is no longer recommended
  • tpu_num_cores -> setting the number of cores is not recommended; by default, all TPU cores are used. Set the TPU_NUM_CORES env var instead
  • past_index -> only used by a very small number of models with special architectures like Transformer-XL, and training those models was never documented
  • ray_scope -> a minor argument for the Ray integration. Set the RAY_SCOPE env var instead
  • warmup_ratio -> use warmup_steps instead; both arguments were merged by allowing float values in warmup_steps
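
The merged warmup semantics can be sketched as follows. This is a hypothetical helper, assuming floats in [0, 1) are treated as a ratio of total training steps (the old warmup_ratio behaviour) and other values as an absolute step count:

```python
def resolve_warmup_steps(warmup_steps, max_steps):
    # Sketch (assumption, not the real Trainer code): a float in [0, 1)
    # is a warmup ratio; anything else is an absolute number of steps.
    if isinstance(warmup_steps, float) and 0.0 <= warmup_steps < 1.0:
        return int(max_steps * warmup_steps)
    return int(warmup_steps)
```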

Removing deprecated arguments in TrainingArguments

  • fsdp_min_num_params and fsdp_transformer_layer_cls_to_wrap -> use fsdp_config
  • tpu_metrics_debug -> debug
  • push_to_hub_token -> hub_token
  • push_to_hub_model_id and push_to_hub_organization -> hub_model_id
  • include_inputs_for_metrics -> include_for_metrics
  • per_gpu_train_batch_size -> per_device_train_batch_size
  • per_gpu_eval_batch_size -> per_device_eval_batch_size
  • use_mps_device -> mps will be used by default if detected
  • fp16_backend and half_precision_backend -> we will only rely on torch.amp as everything has been upstreamed to torch
  • no_cuda -> use_cpu
  • include_tokens_per_second -> include_num_input_tokens_seen
  • use_legacy_prediction_loop -> we only use the evaluation_loop function from now on
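
The renames above are mechanical; a hypothetical migration helper (names taken from the table, but this is not a transformers API) could look like:

```python
# Old TrainingArguments keyword -> v5 replacement (subset of the table above).
RENAMED = {
    "tpu_metrics_debug": "debug",
    "push_to_hub_token": "hub_token",
    "include_inputs_for_metrics": "include_for_metrics",
    "per_gpu_train_batch_size": "per_device_train_batch_size",
    "per_gpu_eval_batch_size": "per_device_eval_batch_size",
    "include_tokens_per_second": "include_num_input_tokens_seen",
}

def migrate_training_args(kwargs):
    # Map old argument names to their v5 replacements; leave others untouched.
    return {RENAMED.get(k, k): v for k, v in kwargs.items()}
```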

Removing deprecated arguments in Trainer

  • tokenizer in initialization -> processing_class
  • model_path in train() -> resume_from_checkpoint

Removed features for Trainer

  • the SigOpt integration for hyperparameter search was removed, as the library was archived and its API stopped working
  • dropped support for SageMaker API <1.10
  • bump accelerate minimum version to 1.1.0
  • bump peft minimum version to 0.18.0
  • bump bitsandbytes minimum version to 0.46.1

New defaults for Trainer

  • use_cache in the model config will be set to False. You can still change the cache value through the TrainingArguments use_cache argument if needed.

Pipeline

  • Image-text-to-text pipelines will no longer accept images as a separate argument alongside conversation chats. Image data has to be embedded in the chat's "content" field. See #42359
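
A chat message in the new format embeds the image in "content" (the image URL below is illustrative):

```python
# v5-style chat input for an image-text-to-text pipeline: the image lives
# inside the message "content" instead of a separate `images=` argument.
chat = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://example.com/school_bus.png"},
            {"type": "text", "text": "What is shown in this image?"},
        ],
    }
]
# pipe = pipeline("image-text-to-text", model="...")  # then: pipe(text=chat)
```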

PushToHubMixin

  • removed the deprecated organization and repo_url arguments from PushToHubMixin. You must pass a repo_id instead.
  • removed ignore_metadata_errors from PushToHubMixin. In practice, if we ignore errors while loading the model card, we won't be able to push the card back to the Hub, so it's better to fail early than to provide an option that fails later.
  • push_to_hub no longer accepts **kwargs. All accepted parameters are explicitly documented.
  • arguments of push_to_hub are now keyword-only to avoid confusion. Only repo_id can be positional, since it's the main argument.
  • removed the use_temp_dir argument from push_to_hub. We now use a temporary directory in all cases.
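
The keyword-only convention can be illustrated with a minimal stand-in (this is not the real signature; parameter names beyond repo_id are illustrative):

```python
def push_to_hub(repo_id, *, private=False, commit_message=None):
    # Stand-in showing the v5 calling convention: only repo_id may be
    # positional; everything after the `*` is keyword-only.
    return {"repo_id": repo_id, "private": private}

push_to_hub("user/model", private=True)   # OK
# push_to_hub("user/model", True)         # raises TypeError under this convention
```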

Linked PR: https://github.com/huggingface/transformers/pull/42391.

CLI

The deprecated transformers-cli ... entry point has been removed; transformers ... is now the only CLI entry point.

The transformers CLI has been migrated to Typer, making it easier to maintain and adding some nice features out of the box (improved --help sections, autocompletion).

The biggest breaking change is in transformers chat. This command starts a terminal UI to interact with a chat model. It used to also be able to start a Chat Completion server powered by transformers and chat with it; in this revamped version, that feature has been removed in favor of transformers serve. Splitting transformers chat and transformers serve defines clear boundaries between client and server code, which helps with maintenance and makes each command less bloated. The new signature of transformers chat is:

Usage: transformers chat [OPTIONS] BASE_URL MODEL_ID [GENERATE_FLAGS]...

Chat with a model from the command line.

It works hand in hand with transformers serve, which means that if transformers serve is running on its default endpoint, transformers chat can be launched as follows:

transformers chat HuggingFaceTB/SmolLM3-3B

It can however use any OpenAI API compatible HTTP endpoint:

transformers chat HuggingFaceTB/SmolLM3-3B https://router.huggingface.co/v1

Linked PRs:

Removal of the run method

The transformers run command (previously transformers-cli run) is an artefact of the past: it was neither documented nor tested, and isn't part of any public documentation. We're removing it for now; please let us know if you are using it, in which case we'll bring it back with better support.

Linked PR: https://github.com/huggingface/transformers/pull/42447

Environment variables

  • Legacy environment variables like TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, and PYTORCH_PRETRAINED_BERT_CACHE have been removed. Please use HF_HOME instead.
  • Constants HUGGINGFACE_CO_EXAMPLES_TELEMETRY, HUGGINGFACE_CO_PREFIX, and HUGGINGFACE_CO_RESOLVE_ENDPOINT have been removed. Please use huggingface_hub.constants.ENDPOINT instead.

Linked PR: https://github.com/huggingface/transformers/pull/42391.

Requirements update

transformers v5 requires huggingface_hub>=1.0.0. See this migration guide to learn more about this major release. Here are the main aspects to know about:

  • The HTTP backend switched from requests to httpx. This change was made to improve performance and to support synchronous and asynchronous requests the same way. If you are currently catching requests.HTTPError in your codebase, you'll need to switch to httpx.HTTPError.
  • Related to the above, it is no longer possible to set proxies from your script. To handle proxies, you must set the HTTP_PROXY / HTTPS_PROXY environment variables.
  • hf_transfer, and therefore HF_HUB_ENABLE_HF_TRANSFER, has been completely dropped in favor of hf_xet. This should be transparent for most users. Please let us know if you notice any downside!

typer-slim has been added as a required dependency; it is used to implement both the hf and transformers CLIs.

New model additions in v5

CWM

<img width="809" height="471" alt="image" src="https://github.com/user-attachments/assets/58bb9c70-d481-48ed-ab8f-6553be7c240f" />

The Code World Model (CWM) model was proposed in CWM: An Open-Weights LLM for Research on Code Generation with World Models by Meta FAIR CodeGen Team. CWM is an LLM for code generation and reasoning about code that has, in particular, been trained to better represent and reason about how code and commands affect the state of a program or system. Specifically, we mid-trained CWM on a large number of observation-action trajectories from Python execution traces and agentic interactions in containerized environments. We post-trained with extensive multi-task RL in verifiable coding, math, and multi-turn software engineering environments.

  • Add Code World Model (CWM) by @jacobkahn in #41199

SAM3

<img width="1505" height="915" alt="image" src="https://github.com/user-attachments/assets/eec48633-f02b-464a-ae5c-c65473387e53" />

SAM3 (Segment Anything Model 3) was introduced in SAM 3: Segment Anything with Concepts.

The SAM3 addition adds four new architectures:

  • Sam3
  • Sam3Tracker
  • Sam3TrackerVideo
  • Sam3Video

SAM3 performs Promptable Concept Segmentation (PCS) on images. PCS takes text and/or image exemplars as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept.

Sam3Tracker and Sam3TrackerVideo perform Promptable Visual Segmentation (PVS) on images and videos, respectively. PVS takes interactive visual prompts (points, boxes, masks) or text inputs to segment a specific object instance per prompt. This is the task that SAM 1 and SAM 2 focused on, and SAM 3 improves upon it. Sam3Tracker and Sam3TrackerVideo are updated versions of SAM2 Video that maintain the same API while providing improved performance and capabilities.

SAM3 Video performs Promptable Concept Segmentation (PCS) on videos. PCS takes text as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames. The model combines a detection module (SAM3) with a tracking module (SAM2-style tracker) to enable robust object tracking across video frames using text prompts.

  • Add SAM3 to 🤗 Transformers by @yonigozlan in #42285

LFM2 MoE

<img width="1080" height="849" alt="image" src="https://github.com/user-attachments/assets/a9fa1b81-114d-4054-9699-5083ac69d830" />

LFM2-MoE is a Mixture-of-Experts (MoE) variant of LFM2. The LFM2 family is optimized for on-device inference by combining short‑range, input‑aware gated convolutions with grouped‑query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.

LFM2‑MoE keeps this fast backbone and introduces sparse MoE feed‑forward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).

  • [Model] Lfm2Moe by @paulpak58 in #41401

VideoLlama 3

<img width="812" height="366" alt="image" src="https://github.com/user-attachments/assets/21c82c6e-cf0a-4d6c-a707-b9e57663ca85" />

The VideoLLaMA3 model is a major update to VideoLLaMA2 from Alibaba DAMO Academy.

  • [model] Add VideoLLaMA3 implementation by @lkhl in #40499

AudioFlamingo 3

<img width="621" height="475" alt="image" src="https://github.com/user-attachments/assets/c9616758-b3aa-41d0-bd58-695966ba146d" />

Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio–text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.

The model checkpoint is available at: nvidia/audio-flamingo-3-hf

Highlights:

  • Unified audio encoder across speech, sound, and music.
  • Long-audio support via windowing and post-pool alignment: audio is processed in 30-second windows with a hard limit of 20 windows (10 minutes total); anything longer is truncated.
  • Deterministic fusion that preserves sequence length by replacing audio placeholder tokens with audio embeddings.
  • [models] Add AudioFlamingo3 integration by @lashahub in #40290
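
The windowing arithmetic above (30-second windows, capped at 20) can be sketched as follows; plan_windows is an illustrative helper, not part of the AudioFlamingo3 processor API:

```python
import math

def plan_windows(duration_s, window_s=30, max_windows=20):
    # 30-second windows, hard limit of 20 windows (10 minutes total);
    # report whether the audio would be truncated.
    n_windows = math.ceil(duration_s / window_s)
    truncated = n_windows > max_windows
    return min(n_windows, max_windows), truncated
```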

Nanochat

NanoChat is a compact decoder-only transformer model designed for educational purposes and efficient training. The model features several architectural components that are common in modern transformer models, which makes it a good starting point for understanding their principles. NanoChat is a variant of the Llama architecture with a simplified attention mechanism and normalization layers.

  • [MODEL] Nanochat implementation by @burtenshaw in #41634

FastVLM

<img width="868" height="331" alt="image" src="https://github.com/user-attachments/assets/cd8b82cf-10de-49b0-af2a-28ffbeac6fa7" />

FastVLM is an open-source vision-language model featuring a novel hybrid vision encoder, FastViTHD. Leveraging reparameterizable convolutional layers, scaled input resolution, and a reduced number of visual tokens, FastVLM delivers high accuracy with exceptional efficiency. Its optimized architecture enables deployment even on edge devices, achieving ultra-low TTFT (time to first token) without sacrificing performance.

PaddleOCR-VL

<img width="3840" height="2160" alt="image" src="https://github.com/user-attachments/assets/712fbe57-a4b6-4bf1-acf6-3ef803f75f0e" />

PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.

Perception Encoder Audiovisual

<img width="719" height="541" alt="image" src="https://github.com/user-attachments/assets/2fc5ee26-3c15-451f-bc7f-c779c2a78919" />

PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (joint) embedding space. The model enables cross-modal retrieval and understanding between audio and text.

Text input

  • Produces a single embedding representing the full text.

Audio input

  • PeAudioFrameLevelModel
    • Produces a sequence of embeddings, one every 40 ms of audio.
    • Suitable for audio event localization and fine-grained temporal analysis.
  • PeAudioModel
    • Produces a single embedding for the entire audio clip.
    • Suitable for global audio-text retrieval tasks.

The resulting embeddings can be used for:

  • Audio event localization
  • Cross-modal (audio–text) retrieval and matching
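
The frame-level granularity works out to one embedding per 40 ms of audio; the helper below is illustrative arithmetic, not a PeAudio API:

```python
def num_frame_embeddings(duration_s, hop_ms=40):
    # PeAudioFrameLevelModel emits one embedding every 40 ms of audio,
    # so a clip of `duration_s` seconds yields duration_s * 1000 / 40 frames.
    return int(duration_s * 1000) // hop_ms
```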

Jais2

<img width="2100" height="1154" alt="image" src="https://github.com/user-attachments/assets/a9343e81-903b-4445-ba0c-61c87830776a" />

Jais2 is a next-generation Arabic open-weight LLM trained on the richest Arabic-first dataset to date. Built from the ground up in 8B and 70B parameter sizes, Jais 2 understands Arabic the way it's truly spoken across dialects, culture, and modern expression. It is developed by MBZUAI, Inception, and Cerebras Systems, and is based on the transformer architecture with modifications including:

  • LayerNorm instead of RMSNorm
  • ReLU² activation function
  • Rotary Position Embeddings (RoPE)

Pixio

<img width="5478" height="2102" alt="image" src="https://github.com/user-attachments/assets/abbd93d4-8c4c-4fd9-8fab-58d969d8b296" />

Pixio is a vision foundation model that uses ViT as a feature extractor for multiple downstream tasks like depth estimation, semantic segmentation, feed-forward 3D reconstruction, robotics, and image classification. It is built on the Masked Autoencoder (MAE) pre-training framework, with four minimal yet critical updates: 1) deeper decoder, 2) larger masking granularity, 3) more class tokens, and 4) web-scale curated training data.

Ernie 4.5 VL MoE

<img width="848" height="853" alt="image" src="https://github.com/user-attachments/assets/02817ead-7560-4eea-8f2b-0b959553a3cd" />

The Ernie 4.5 VL MoE model was released as part of the Ernie 4.5 model family by Baidu. This family of models contains multiple different architectures and model sizes. The Vision-Language series in particular is composed of a novel multimodal heterogeneous structure, sharing parameters across modalities while also dedicating parameters to specific modalities. This becomes especially apparent in the Mixture of Experts (MoE), which is composed of

  • Dedicated Text Experts
  • Dedicated Vision Experts
  • Shared Experts

This architecture has the advantage of enhancing multimodal understanding without compromising, and even improving, performance on text-related tasks. A more detailed breakdown is given in the Technical Report.

GLM-ASR

<img width="1600" height="1029" alt="image" src="https://github.com/user-attachments/assets/d630a900-9ef5-467c-93ab-34f064501b8d" />

GLM-ASR-Nano-2512 is a robust, open-source speech recognition model with 1.5B parameters. Designed for real-world complexity, it outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact size.

Key capabilities include:

  • Exceptional Dialect Support: Beyond standard Mandarin and English, the model is highly optimized for Cantonese (粤语) and other dialects, effectively bridging the gap in dialectal speech recognition.

  • Low-Volume Speech Robustness: Specifically trained for "whisper/quiet speech" scenarios, it captures and accurately transcribes extremely low-volume audio that traditional models often miss.

  • SOTA Performance: Achieves the lowest average error rate (4.10) among comparable open-source models, with significant advantages on Chinese benchmarks (Wenet Meeting, Aishell-1, etc.).

This model was contributed by Eustache Le Bihan and Yuxuan Zhang. You can check the model card for more details, as well as our GitHub repo.

GLM 4.7 Flash

<img width="5038" height="5860" alt="image" src="https://github.com/user-attachments/assets/af0f2821-3831-4afd-a7dc-26a91f17afd0" />

GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.

GLM Image

<img width="10101" height="7371" alt="image" src="https://github.com/user-attachments/assets/e2de76ab-35a9-4e42-9252-6b3dbf75e991" />

We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive-achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. Code, models and more information are released at https://github.com/zai-org/GLM-V

LWDetr

<img width="706" height="242" alt="image" src="https://github.com/user-attachments/assets/2c4770f4-5d48-4d87-a454-b574b648e5ae" />

LW-DETR proposes a light-weight Detection Transformer (DETR) architecture designed to compete with and surpass the dominant YOLO series for real-time object detection. It achieves a new state-of-the-art balance between speed (latency) and accuracy (mAP) by combining recent transformer advances with efficient design choices.

The LW-DETR architecture is characterized by its simple and efficient structure: a plain ViT Encoder, a Projector, and a shallow DETR Decoder. It enhances the DETR architecture for efficiency and speed using the following core modifications:

  • Efficient ViT Encoder: Uses a plain ViT with interleaved window/global attention and a window-major organization to drastically reduce attention complexity and latency.
  • Richer Input: Aggregates multi-level features from the encoder and uses a C2f Projector (YOLOv8) to pass two-scale features (1/8 and 1/32).
  • Faster Decoder: Employs a shallow 3-layer DETR decoder with deformable cross-attention for lower latency and faster convergence.
  • Optimized Queries: Uses a mixed-query scheme combining learnable content queries and generated spatial queries.

LightOnOCR

<img width="1172" height="661" alt="image" src="https://github.com/user-attachments/assets/cee98e82-b3d0-42a6-b820-3061752ad4a8" />

LightOnOcr combines a Vision Transformer encoder (Pixtral-based) with a lightweight text decoder (Qwen3-based) distilled from high-quality open VLMs. It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages.

Bugfixes and improvements

  • JetMoe Fix jetmoe after #40132 by @ArthurZucker in #41324
  • Fixed tiny incorrect import in gemma3 by @Sai-Suraj-27 in #41354
  • Rope for Qwen2--5-vl by @zucchini-nlp in #41173
  • 🚨 Bump to Python 3.10 and rework how we check 3rd-party libraries existence by @Cyrilvallez in #41268
  • Standardize PretrainedConfig to PreTrainedConfig by @Cyrilvallez in #41300
  • Fix trainer for py3.9 by @SunMarc in #41359
  • Check model inputs - hidden states by @zucchini-nlp in #40994
  • [ModularChecker] QOL for the modular checker by @ArthurZucker in #41361
  • Fixing a typo for BLT model by @Narsil in #41325
  • :rotating_light: [v5] Remove relative position embeddings (for bert like models) by @vasqu in #41170
  • Fix typo in model proposal template by @Ombucha in #41352
  • Better typehints for apply_chat_template by @Samoed in #41355
  • 🚨 Remove BetterTransformer by @Cyrilvallez in #41367
  • [testing] update test_longcat_generation_cpu by @ydshieh in #41368
  • Fix flash_attention.py: wrong argument passing for attn_implementation by @TKONIY in #41347
  • Use canonical get_size_with_aspect_ratio (with max_size) from transformers.image_transforms to fix #37939 by @sonianuj287 in #41284
  • Fixes in check_model_inputs, GPTBigCodeModel and ImageGPTModel by @IlyasMoutawwakil in #40811
  • Remove unnecessary list comprehension by @cyyever in #41305
  • make some ut cases pass on xpu w/ latest torch by @yao-matrix in #41337
  • Remove unused function patameters by @cyyever in #41358
  • [CB] Refactors the way we access paged by @ArthurZucker in #41370
  • serve: add non-streaming mode to /v1/responses; stream event parity; remove placeholder logprobs by @antznette1 in #41353
  • Update from pretrained error when loading by @ArthurZucker in #33380
  • [v5] Sync Bert and Bart eager attention by @vasqu in #41248
  • fix asr ut failures by @yao-matrix in #41332
  • fix resample in asr pipeline by @yhzx233 in #41298
  • Correct numerical regression in vision embeddings by @i3hz in #41374
  • [kernels] Kernel Config by @MekkCyber in #41232
  • [Cache] lfm2 cache: allocate empty kv layers during init by @paulpak58 in #41396
  • Fix test for model with dotted name and relative imports by @st81 in #41343
  • Prefer raising TypeError exception for invalid type by @Sai-Suraj-27 in #41346
  • [v5] Bump accelerate to 1.1.0 by @SunMarc in #41234
  • Fix incorrect assignment in update_device_map for GPTQ quantizer by @Sai-Suraj-27 in #41328
  • [v5] Delete left traces of feature extractor by @zucchini-nlp in #41321
  • Remove deprecation warning by @Cyrilvallez in #41425
  • Fix overriding common_kwargs defaults in processor calls by @yonigozlan in #41381
  • v5 dev version by @LysandreJik in #41436
  • Tiny Cleanup - Removed duplicate class field definition's by @Sai-Suraj-27 in #41293
  • 🚨🚨 Remove all traces of legacy cache format by @Cyrilvallez in #41378
  • 🚨 [v5] Prune prune_heads by @gante in #41417
  • [v5] Bump min version of bitsandbytes to 0.46.1 by @SunMarc in #41283
  • Fixing comments in init file by @MekkCyber in #41414
  • Use accelerator API to free device memory by @cyyever in #41195
  • enable new model uts to xpu and fix some failures on xpu by @yao-matrix in #41386
  • [torchao] Add regex support for ModuleFqnToConfig by @jerryzh168 in #41242
  • :facepalm: CB nit! by @ArthurZucker in #41413
  • Remove Python 3.9 classifier by @cyyever in #41410
  • [JetMoe] Fix KV head repetition and padding free by @vasqu in #41423
  • [testing] Fix JetMoeIntegrationTest by @ydshieh in #41377
  • Add Top-H decoding (entropy-bounded truncation) as a LogitsWarper for text generation by @ErfanBaghaei in #40837
  • Validate processing kwargs with @strict from huggingface_hub by @zucchini-nlp in #40793
  • Update hqq.md by @prathamesh-chavan-22 in #41452
  • enable some falcon-mamba uts on xpu by @yao-matrix in #41428
  • Fix generate outputs and simplify cache tests by @Cyrilvallez in #41440
  • Fix doc by @Cyrilvallez in #41457
  • 🚨 [v5] Rename left traces of past_key_value in BERT-like models by @zucchini-nlp in #41448
  • Subconfig is a class attribute by @zucchini-nlp in #41308
  • [v5] rm utils/tf_ops/ by @gante in #41402
  • Update GLM-4.1V MMRope implementation by @zRzRzRzRzRzRzR in #41182
  • [kernels] Cleanup deta kernel by @MekkCyber in #41470
  • 🚨 [v5] Rendundant code in nested configs by @zucchini-nlp in #41314
  • Remove KERAS_NLP_IMPORT_ERROR by @cyyever in #41468
  • Fix auto model configuration for encoder of perceptionlm by @fschlatt in #41464
  • Fix tests fsdp by @SunMarc in #41422
  • Import Callable from collections.abc by @cyyever in #41130
  • Pickle - part 2 by @ydshieh in #41476
  • Remove infer_device by @cyyever in #41088
  • Change RT-Detr docs to reflect fixed 640x640 input size by @konstantinos-p in #41364
  • Cleaning hub kernels by @MekkCyber in #41477
  • [v5] remove load_in_4bit and load_in_8bit by @SunMarc in #41287
  • :rotating_light: [Attention Masks] Bidirectional masks for encoder and encoder-decoder models by @vasqu in #41265
  • [Fix] Fix test file error by @YangKai0616 in #40973
  • enhance patched_tearDown to support python 3.11+ by @yao-matrix in #41429
  • RT-Detr correct 2d positional embeddings for non-square images by @konstantinos-p in #41380
  • Fix bnb fsdp loading for pre-quantized checkpoint by @SunMarc in #41415
  • Remove SigOpt by @SunMarc in #41479
  • Remove past_index by @SunMarc in #41384
  • Remove deprecated args in Trainer for v5 by @SunMarc in #41404
  • Update GLM-4.6 doc by @zRzRzRzRzRzRzR in #41471
  • report_to default changed to "none" + cleaning deprecated env var by @SunMarc in #41375
  • deprecate overwrite_output_dir by @SunMarc in #41323
  • [CI] Fix copies on main by @vasqu in #41486
  • [Trainer] deprecate ray scope by @SunMarc in #41403
  • deprecate jit_mode_eval by @SunMarc in #41376
  • Remove local_rank arg from TrainingArguments by @SunMarc in #41382
  • Update philosophy by @molbap in #41438
  • Remove DISABLE_KERNEL_MAPPING flag by @MekkCyber in #41475
  • Streaming should be handled at the request-level rather than at the istance level by @LysandreJik in #41444
  • fix bnb model loading by @jiqing-feng in #41499
  • [kernels] Remove RWKV kernel finally ! by @MekkCyber in #41493
  • [kernels] rm yoso kernel by @MekkCyber in #41495
  • Try to remove pickle - BloomTokenizerFast by @ydshieh in #41466
  • Fixed tiny incorrect imports in glm4v by @Sai-Suraj-27 in #41483
  • [Parakeet] unnecessary warning & auto mapping by @eustlb in #41412
  • [causallm tester] automate pipeline mappings + bloom tests by @gante in #41318
  • Fix some tests by @Cyrilvallez in #41503
  • fix gemma3n case failure by @yao-matrix in #41426
  • [voxtral] language detection + skipping lang:xx by @eustlb in #41225
  • Set truncation to False in Qwen3Omni to avoid default truncation by @BakerBunker in #41473
  • [QoL] modular conversion shows LoC saved by @molbap in #41500
  • More trainer cleaning by @SunMarc in #41489
  • Bump to hfh 1.0.0.rc5 to fix test by @Wauplin in #41508
  • Revert local_rank deletion and some cleaning by @SunMarc in #41504
  • Fix detectron2 import by @Cyrilvallez in #41510
  • add Trainer import to .md in appropriate cell block for training.ipynb transformers_doc by @benkeene in #41484
  • Remove outdated flags by @Cyrilvallez in #41512
  • remove tpu_num_cores by @SunMarc in #41383
  • Allow optuna's catch kwargs passthrough by @nicha-api in #41496
  • Fix Latex typesetting in documentation by @cyyever in #41177
  • [testing] reduce runtime of HunYuanMoEV1IntegrationTest:test_model_generation by @ydshieh in #41373
  • [Qwen3VL] fix: hidden_states in place modification error by @HollowMan6 in #41535
  • Add MLlama fast image processor by @yonigozlan in #41391
  • Fixed Type-hints in function defintions by @Sai-Suraj-27 in #41525
  • [SAM] Fix typing hints by @zucchini-nlp in #41506
  • Restore cuda graphs to continuous batching by @remi-or in #41421
  • Add AMD developer cloud support by @fan-amd in #41126
  • Enable modular files from other libraries by @regisss in #41372
  • 🚨 [v5] generate delegates default cache initialization to the model by @gante in #41505
  • Fixed typos and formatting by @julian-st in #34215
  • Add VideoMAE video processor by @Aki-07 in #41534
  • [from_pretrained] Small refactor from_pretrained: move around unrelated stuff by @ArthurZucker in #41445
  • Remove references to AutoModelForVision2Seq by @Rocketknight1 in #41513
  • [Qwen3VL] fix device mismatch error for FSDP2 training by @HollowMan6 in #41536
  • Patch MistralCommonTokenizer by @juliendenize in #41439
  • Fix an import error with PreTrainModel by @remi-or in #41571
  • [Qwen3VLMoe] Fixed: Expected self.dtype to be equal to src.dtype - routing_weights casting by @danielquintas8 in #41420
  • [kernels] rm mra kernels by @MekkCyber in #41507
  • delete some tokenizer tests using pickle by @ydshieh in #41514
  • Add DINOv3Backbone for ConvNext variant by @merveenoyan in #40651
  • Add conditional checks to _check_and_adjust_attn_implementation() by @zheliuyu in #41542
  • add rmsnorm kernels support for Intel XPU by @kaixuanliu in #41563
  • Revert "add rmsnorm kernels support for Intel XPU" by @MekkCyber in #41579
  • [VisionEncoderDecoderModel] Update loss function by @NielsRogge in #40863
  • Add iter to DynamicCache by @remi-or in #41569
  • Revert some breaking changes bnb by @SunMarc in #41581
  • Fix typsetting and content of llm_tutorial_optimization.md by @cyyever in #41172
  • Gemma3 fixes by @remi-or in #41572
  • Benchmark overhaul by @remi-or in #41408
  • Enable non-streaming mode in transformers serve by @LysandreJik in #41446
  • [device_map] Accelerate loading by computing device_map much faster by @Cyrilvallez in #41548
  • Add logits_to_keep to many older CausalLM models by @philiproeleveld in #41335
  • fix some case failures lead by "torch.compile recompiled part of th… by @sywangyi in #41558
  • remove ray_scope and check_quantized_param by @SunMarc in #41587
  • Update issue template by @SunMarc in #41573
  • [Docs] Fix changed references by @vasqu in #41614
  • Import expand_device_map instead of redefining it by @Cyrilvallez in #41608
  • Fix trainer simple tests by @SunMarc in #41449
  • More markdown file fixes by @cyyever in #41599
  • torch 2.9 don't ❤️ torchcodec 💔 by @ydshieh in #41610
  • Update a dataset reop link by @ydshieh in #41618
  • Add fast path for bidirectional mask creation to fix regression by @i3hz in #41586
  • enable sdpa enable gqa logic for Ascend NPU by @FightingZhen in #41601
  • Fix video processing channel format by @zucchini-nlp in #41603
  • [chat template] update when "push_to_hub" by @zucchini-nlp in #39815
  • Remove the head masking block in some vision models by @ydshieh in #41620
  • Remove deprecated code by @SunMarc in #41616
  • Fix quantization base class by @SunMarc in #41613
  • [docs] Duplicate entry by @stevhliu in #41591
  • Update executorch.md by @jackzhxng in #41582
  • Add Backbone API fine-tuning tutorial by @merveenoyan in #41590
  • 🚨 [v5] Toggle the serialization format in processors by @zucchini-nlp in #41474
  • Add aux loss for GLM-4.5V by @zRzRzRzRzRzRzR in #41564
  • Allow passing tp_plan in from_pretrained directly by @Cyrilvallez in #41435
  • Fix tokenization test by @Cyrilvallez in #41649
  • Remove randomly added script by @Cyrilvallez in #41650
  • Add missing dates to docs by @yonigozlan in #41576
  • Migrate transformers cli to Typer by @Wauplin in #41487
  • Fix FP-Quant quantization fallback CPU dispatch. by @BlackSamorez in #41619
  • fix check inputs for text2text pipeline by @jiqing-feng in #41556
  • [Executorch] Simplify for encoder models by @vasqu in #41627
  • [Ernie 4.5 Moe] Fix Moe and offloading by @vasqu in #41385
  • [CI] Build translated docs by @stevhliu in #41632
  • Fix fp32_ln for various models by @remi-or in #41605
  • Adjust device logging level and add minor fixes by @mario-koddenbrock in #41636
  • Fix EncoderDecoder cache by @remi-or in #41612
  • Format MarkDown documentation and tiny fixes by @cyyever in #41638
  • Fix typos in documentation by @cyyever in #41641
  • Fix confusing cls assignment by @cyyever in #41642
  • Double router compute? by @molbap in #41653
  • [kernels] refactor function kernel calling by @MekkCyber in #41577
  • [Fix] Deepseek V3 expert bias routing by @fjosw in #41647
  • purge HF_HUB_ENABLE_HF_TRANSFER; promote Xet by @Vaibhavs10 in #41656
  • [Masks] Fix mask handling in eager for vision models by @vasqu in #41625
  • Use | for Optional and Union typing by @cyyever in #41646
  • Switch to CB if cache_implementation == paged by @remi-or in #41655
  • Add in-out modalities as class attribute per model by @zucchini-nlp in #41366
  • Fix dtype casting with quantization by @Cyrilvallez in #41665
  • Fix serving continuous batching by @SunMarc in #41624
  • Small changes to benchmarking script by @remi-or in #41662
  • Improve package version check by @Cyrilvallez in #41661
  • improve utils/check_bad_commit.py by @ydshieh in #41658
  • Erroring when KernelConfig is passed without use_kernels = True by @MekkCyber in #41657
  • [Trainer] [Breaking change] use_cache default to False by @SunMarc in #41585
  • 🌐 [i18n-KO] Translated chat_extras.md to Korean by @Judy-Choi in #39863
  • 🌐 [i18n-KO] Translated sam_hq.md to Korean by @HyunZ118 in #41340
  • [i18n-KO] Translated big_bird.md to Korean by @ssum21 in #40445
  • 🌐 [i18n-KO] Translated code_llama.md to Korean by @Judy-Choi in #40558
  • 🌐 [i18n-KO] Translated llama4.md to Korean by @TaskerJang in #40396
  • 🌐 [i18n-KO] Translated ko-LFM2.md to Korean by @ssum21 in #41502
  • Adding superglue fast image processing by @AlphaOrOmega in #41394
  • Fix ckpt in docs by @zucchini-nlp in #41659
  • torch 2.9 still don't ❤️ torchcodec 0.8 💔 by @ydshieh in #41686
  • Remove deprecated use_auth_token parameter by @Wauplin in #41666
  • Remove require_torch_bf16_gpu by @cyyever in #40979
  • path validation for security reason by @ydshieh in #41256
  • 🚨 Remove torchscript support by @Cyrilvallez in #41688
  • Fix MarkDown syntax by @cyyever in #41676
  • Use | for Optional and Union typing by @cyyever in #41675
  • 🚨 [v5] Refactor RoPE for layer types by @zucchini-nlp in #39847
  • Enable faiss-cpu on Windows by @cyyever in #41678
  • Fix Pylint warnings by @cyyever in #41644
  • 🚨 Remove torch.fx support by @Cyrilvallez in #41683
  • Remove skipped tests without parents by @Cyrilvallez in #41691
  • Enable FURB rules in ruff by @cyyever in #41395
  • Remove upper version bound of pandas by @cyyever in #41677
  • [Attn] Allow dynamic causality in SDPA via Kwargs by @vasqu in #41692
  • Simplify GQA conditions in sdpa_attention.py by @justinchuby in #41699
  • [docs] Manual tp-plan by @stevhliu in #41674
  • 🌐 [i18n-KO] Translated gemma3n.md to Korean by @HyunZ118 in #40873
  • pin torchcodec on CI docker image by @ydshieh in #41703
  • Update run_name docs in TrainingArguments by @tobiasofsn in #41705
  • further improve utils/check_bad_commit.py (#41658) by @ydshieh in #41690
  • feat: add benchmark v2 ci with results pushed to dataset by @McPatate in #41672
  • Gemma3 conversion script maintenance by @RyanMullins in #41704
  • Fix Qwen3-Omni inference when mixing video and image inputs in one batch by @BakerBunker in #41741
  • Fix typo in LFM-VL by @zucchini-nlp in #41742
  • Revert "Remove upper version bound of pandas" by @ydshieh in #41744
  • [doc] remove broken notebooks on AMD Dev Cloud by @pagezyhf in #41743
  • Update type hints in tokenization_utils.py to use | syntax by @faizan842 in #41713
  • Fix documentation issues by @cyyever in #41726
  • Apply RUFF PIE rules by @cyyever in #41727
  • Small Fix for imports by @MekkCyber in #41411
  • Docs(zh-hans): Refine wording for professionalism in README by @Ri-Nai in #40943
  • Add vision contribution guide by @molbap in #41456
  • upgrade xpu docker file to torch 2.8 by @yao-matrix in #41551
  • [v5] Delete videos from image processing classes by @zucchini-nlp in #41607
  • Fixed incorrect model_type for qwen2vl and qwen2.5vl when config is saved and loaded again by @i3hz in #41758
  • [kernels] Add version to function mapping by @MekkCyber in #41685
  • Reduce warning noise caused by Tensor.new_tensor by @st81 in #41748
  • Fix graphormer model compilation with Cython 3.1.4 by @alexmalyshev in #41671
  • Update type hints in modeling_rope_utils.py to use | syntax by @faizan842 in #41714
  • [v5] Remove deprecated tranformers.onnx by @echarlaix in #41700
  • Modernize CLIP modeling code by @molbap in #41546
  • Simplify pipeline padding logic by @Rocketknight1 in #41667
  • Chat response parsing by @Rocketknight1 in #40894
  • Add LightGlue fast image processor by @yonigozlan in #41670
  • Fix bark after #41445 by @ydshieh in #41645
  • Remove invalid @staticmethod from module-level get_device_and_memory_breakdown by @albertvillanova in #41747
  • Fix CUDA index out of bounds for q_idx in VLM token type masking for Gemma3, PaliGemma, and example modular by @albertvillanova in #41757
  • fix: Gemma 3 weights conversion vision and multimodal projector paths by @RyanMullins in #41767
  • [v5] Delete legacy chat template saving by @zucchini-nlp in #41648
  • [quantization] fix compressed_tensors tests by @MekkCyber in #41780
  • [quantization] Skip Fp8 tests when hardware capability < 8.9 by @MekkCyber in #41785
  • Swap columns and rows of the grid layout in LFM2-VL by @ankke in #41755
  • fix type annotation typo in docstring by @johntheprime in #41788
  • Fix chat schema tests by @Rocketknight1 in #41793
  • Fix attention mask in mamba layers by @zucchini-nlp in #41790
  • [quantization] fix torchao tests after 0.14.0 release by @MekkCyber in #41777
  • [Onnx docs] Remove some traces by @vasqu in #41791
  • flash attn pytest marker by @ydshieh in #41781
  • Bump AMD docker by @remi-or in #41792
  • make apollo test case pass by @yao-matrix in #41805
  • Add a safeguard around a flaky test in gemma2 by @remi-or in #41811
  • Fix Qwen3Next dtype API usage by @SrijanUpadhyay in #41735
  • [Trainer] remove env vars by @SunMarc in #41697
  • Fixed grammar mistakes by @FrogWarlord in #41799
  • Fixed some grammar mistakes by @FrogWarlord in #41802
  • transformers cli default flag fix by @ArjunPimpale in #41761
  • Deprecate warmup_ratio by @SunMarc in #41326
  • transformers serve quantization docs + some api fixes for bitsandbytes by @SunMarc in #41253
  • [Parakeet] add output_attention_mask by @eustlb in #41694
  • unpin torch/torchcodec for CircleCI by @ydshieh in #41839
  • extend bitnet cases to xpu, all 8 cases pass by @yao-matrix in #41831
  • extend 2 trainer test cases to xpu by @yao-matrix in #41829
  • extend 2 blip2 and falcon_h1 test cases to xpu by @yao-matrix in #41825
  • further reducing flakiness in utils/check_bad_commit.py (#41658) by @ydshieh in #41815
  • Remove redundant code from Qwen3VLProcessor by @Xqle in #41836
  • Fix MXFP4 quantizer to support variable num_local_experts and hidden_size by @marksverdhei in #41795
  • Fix Qwen2Audio flash attention mask format for generation by @Abdennacer-Badaoui in #41843
  • Fix const parsing for dict inputs in chat schemas by @Rocketknight1 in #41824
  • Share embedding modules in BART, not only weights by @githubnemo in #41821
  • Fix TypeError: find_adapter_config_file() got an unexpected keyword argument '_adapter_model_path' by @albertvillanova in #41604
  • 🚨 [Clip] Fix masking and enable flash attention on all model types by @vasqu in #41750
  • CI workflow for Flash Attn by @ydshieh in #41857
  • Fix torch.no_grad decorator in VLMS by @yaswanth19 in #41888
  • Fix installation cmds in docs by @yaswanth19 in #41887
  • revert changes in _is_package_available by @MekkCyber in #41891
  • make lfm2_moe integration test pass on XPU by @yao-matrix in #41796
  • Fix: avoid duplicate token in maybe_load_adapters by @luaenrique in #41903
  • speed up loading checkpoints for zero stage 3 by @ri938 in #41850
  • evaluate>=0.4.6 is needed by @stas00 in #41920
  • Add 6 huggingface notebooks on AMD dev cloud by @fan-amd in #41883
  • Fix invalid examples in QwenVL model docstrings and add Qwen3VL example by @Xqle in #41812
  • Allow parse_response to accept token IDs by @Rocketknight1 in #41849
  • Fix Florence2 conversion script model_type KeyError by @i3hz in #41866
  • Update some workflow files by @ydshieh in #41892
  • fix some ut failures on XPU w/ torch 2.9 by @yao-matrix in #41923
  • Cache latest pytorch amd image locally on mi325 CI runner cluster by @jitesh-gupta in #41926
  • Minor fix in docker image build workflow by @ydshieh in #41949
  • fix some ut failures on XPU w/ torch 2.9 by @yao-matrix in #41941
  • Fix rope_parameters for gemma3 weights conversion script by @douglas-reid in #41922
  • Fix: Gemma3TextConfig rope scaling assignments by @RyanMullins in #41934
  • fix prepare_config_and_inputs_for_common bug in llava test by @yao-matrix in #41942
  • Fix: prevent .gitignore truncation in run_clm_no_trainer.py by @luaenrique in #41957
  • V4.57.1 training ci: Refactor test_tensor_parallel.py by @3outeille in #41918
  • [v5] Return a BatchEncoding dict from apply_chat_template by default by @Rocketknight1 in #41626
  • make recurrent_gemma and voxtral cases pass on xpu by @yao-matrix in #41958
  • Fix typo in image_processing_lfm2_vl_fast by @yonigozlan in #41940
  • Run slow v2 by @ydshieh in #41914
  • Fix detectron2 installation in docker files by @ydshieh in #41975
  • Fix autoawq[kernels] installation in quantization docker file by @ydshieh in #41978
  • add support for saving encoder only so any parakeet model can be loaded for inference by @nithinraok in #41969
  • Use indices as position_ids in modernebert by @remi-or in #41789
  • test tensor parallel: make tests for dense model more robust by @3outeille in #41968
  • fix: dict[RopeParameters] to dict[str, RopeParameters] by @RyanMullins in #41963
  • docs: add continuous batching page by @McPatate in #41847
  • Fix torchcodec version in quantization docker file by @ydshieh in #41988
  • [kernels] Add Tests & CI for kernels by @MekkCyber in #41765
  • Move the Mi355 to regular docker by @remi-or in #41989
  • More data in benchmarking by @remi-or in #41848
  • fix (CI): Refactor SSH runners by @glegendre01 in #41991
  • fix 3 failed test cases for video_llama_3 model on Intel XPU by @kaixuanliu in #41931
  • Integrate colqwen2.5 using colqwen2 modelling code by @sahil-kabir in #40600
  • Fixed wrong padding value in OWLv2 by @gjamesgoenawan in #41938
  • Fix run slow v2: empty report when there is only one model by @ydshieh in #42002
  • [kernels] change import time in KernelConfig by @MekkCyber in #42004
  • DOC Fix typo in argument name: pseudoquant by @BenjaminBossan in #41994
  • Fix torch+deepspeed docker file by @ydshieh in #41985
  • Correct syntax error in trainer.md by @Yacklin in #42001
  • Reduce the number of benchmark in the CI by @remi-or in #42008
  • Fix continuous batching tests by @Rocketknight1 in #42012
  • add back logging_dir by @SunMarc in #42013
  • Fix issue with from pretrained and kwargs in image processors by @yonigozlan in #41997
  • Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors by @MilkClouds in #41871
  • Add GLPNImageProcessorFast by @Aravind-11 in #41725
  • add fuyu fast image processors by @DeXtAr47-oss in #41817
  • [kernels] Fix XPU layernorm kernel by @MekkCyber in #41583
  • [v5] Deprecate Text2Text and related pipelines by @Rocketknight1 in #41996
  • [FPQuant] MXFP8 and MXFP4 backwards support by @BlackSamorez in #41897
  • fix deeepspeed in AMD docker file by @ydshieh in #42025
  • CodeQL workflow for security analysis by @paulinebm in #42015
  • [tests] Add Context-parallel CI tests by @kashif in #41860
  • extend fp_quant cases to xpu by @yao-matrix in #41833
  • Change trigger time for AMD CI by @ydshieh in #42034
  • Fix the order of methods in processor loading by @zucchini-nlp in #42031
  • 🔴 Isolate prefill from generation loops by @manueldeprada in #40652
  • update huggingface_hub dependency version by @hanouticelina in #42033
  • Remove some custom datasets defined in codebase by @ydshieh in #41511
  • Cleanup workflow - part 1 by @ydshieh in #42023
  • Fix pr_slow_ci_suggestion.yml after #42023 by @ydshieh in #42049
  • Fix AutoImageProcessor.register and documentation in auto processing modules by @MilkClouds in #41864
  • Fix Qwen3-Omni RoPE by @zucchini-nlp in #41778
  • Avoid explicit checkout in workflow by @ydshieh in #42057
  • Annoying typo in attention error message by @manueldeprada in #42037
  • Be careful at explicit checkout actions by @ydshieh in #42060
  • Fix another Argument list too long in pr_slow_ci_suggestion.yml by @ydshieh in #42061
  • Fix KeyError in GPT-OSS weight conversion script by @Aznix07 in #42007
  • Fix KeyError in _is_package_available for packages with dotted names by @yashwantbezawada in #42050
  • Revert back to use GitHub context by @ydshieh in #42066
  • Fix missing arg in check_docstring by @yonigozlan in #42054
  • [deepspeed tests fixes] by @stas00 in #41925
  • Fix logic in setting self.fsdp when it is False by @roychan in #41974
  • fix tensor device placement issue of 2 UT cases by @yao-matrix in #41921
  • add workflow to check permissions and advise a set of permissions req… by @paulinebm in #42071
  • Fix security issue 5 by @paulinebm in #42072
  • Fix inconsistency of commit sha during the workflow run by @ydshieh in #42074
  • QwenVL: add skipped keys in setattr as well by @zucchini-nlp in #41808
  • permissions worflows fix by @paulinebm in #42080
  • 4.1V Model and GLM-4.5V Model Conversion Code Updates by @zRzRzRzRzRzRzR in #41784
  • feat(ci): add continuous batching to benchmarks by @McPatate in #41916
  • Fix modular docstring for Mixtral by @diegoakel in #42041
  • Fix Auto classes to support dynamically registered processors by @MilkClouds in #41865
  • Reinstate self.scaling in Gemma3nTextAttention by @RyanMullins in #41751
  • [v5] 🚨Refactor subprocessors handling in processors by @yonigozlan in #41633
  • add xpu support in test_modeling_janus.py::JanusIntegrationTest::test… by @sywangyi in #41986
  • Revert "permissions worflows fix" by @ydshieh in #42110
  • Fix return metadata checking logic by @Xqle in #42108
  • Correctly handle unbatched audio inputs in Gemma3nAudioFeatureExtractor by @kho in #42076
  • [Bugfix] fix qwen3vl expand generation with video by @JJJYmmm in #42089
  • Fix base model prefix in VLMs by @zucchini-nlp in #42059
  • fix continuous batching issues, extend ut cases to xpu by @yao-matrix in #41830
  • 📝 docs(smolvlm): fix variable name in batch inference example by @gorkachea in #42123
  • fix qwen2vl/qwen3vl video processor temporal padding when num_frames%temporal_patch_size!=1 by @yaogang2060 in #42083
  • [Attn Masks] Non-vmap default for attention masks by @vasqu in #41852
  • Fix GPT-2 Flash Attention 2 generation with left-padding by @Abdennacer-Badaoui in #41966
  • Fix model name test for compressed tensors by @SunMarc in #42128
  • Fix MaskFormer/Mask2Former fast image processors by @yonigozlan in #41393
  • Remove unused functions in image_transforms.py by @yaswanth19 in #42044
  • update deps table by @ArthurZucker in #42120
  • fix: improve video processing fps assignment logic by @Xqle in #42009
  • Fix T5Gemma module structure by @Cyrilvallez in #42145
  • DataCollatorForLanguageModeling warning error fixed by @mjaliz in #42144
  • Bugfix/remove emojis from print by @7amim in #42091
  • Avoid mutating user-provided arguments in preprocessing utils by @LeonardoEmili in #42126
  • Enforce check_auto_docstring by @yonigozlan in #41635
  • Add dinov3 autobackbone by @vijayabhaskar-ev in #41276
  • Fix logic error in prepare_inputs_for_generation cache slicing condition by @albertvillanova in #41764
  • 🚨 Fix gradient checkpointing for several models and improve test robustness by @githubnemo in #41818
  • [T5Gemma] Fix cross attention cache by @vasqu in #41890
  • T5 migration to new masking interface by @Aravind-11 in #41804
  • fix: improve visibility of ValueError root causes in model config loading by @scottzh8 in #41972
  • add xpu to valid hardware for torch.compile by @sywangyi in #42079
  • extend test_beam_search_early_stop_heuristic case to other device by @sywangyi in #42078
  • fix failure of tests/models/shieldgemma2/test_modeling_shieldgemma2.p… by @sywangyi in #42022
  • Fixes Flash Attention implementation for models by @i3hz in #42149
  • fix test failure of speculative_generation on xpu by @sywangyi in #42052
  • add rmsnorm kernels support for npu by @zheliuyu in #42106
  • update torchao doc by @jiqing-feng in #42139
  • feat(kernels): add opt-out flag to disable kernels hub usage through the lib by @mfuntowicz in #41990
  • handle inputs from Siglip/Siglip2 non-automapped encoder layers by @molbap in #41930
  • Add slow to some examples tests by @SunMarc in #42164
  • fix(ci): unexpected keyword argument streaming by @McPatate in #42102
  • pin pytest<9 for now by @ydshieh in #42162
  • Docs/i18n updates by @lilin-1 in #42006
  • Fix in-place modification of user-input in SAM2 embed boxes by @xenova in #42173
  • [Pop2Piano] Fix cache usage by @vasqu in #42170
  • Fix helper fn for new processor config format by @zucchini-nlp in #42085
  • Remove unnecessary slicing in sdpa_attention_forward by @justinchuby in #41900
  • [PEFT] Fix prefix tuning by @vasqu in #41696
  • [typo] fix mrope-interleave annotation to avoid ambiguity by @JJJYmmm in #42177
  • Update transformers to support FqnToConfig by @jcaip in #41894
  • [PEFT] Fix the general test for prefix tuning by @vasqu in #42185
  • [TP] Fix parameter detection issue and some invalid TP-plans by @Cyrilvallez in #42129
  • Refactor weight loading by @ArthurZucker in #41580
  • 🚨 Delete deprecations with end-cycle in v4.xx and v5.0 by @zucchini-nlp in #41681
  • Add AutoTokenizer mapping for mistral3 and ministral by @patrickvonplaten in #42198
  • Fix checkpoint loading with DeepSpeed ZeRO3 by @tohtana in #42201
  • [Pop2Piano] Fix tied weights by @vasqu in #42193
  • New docker from AMD by @remi-or in #42208
  • Add cross links for model contribution by @zucchini-nlp in #42207
  • Stop inheriting tests! by @Rocketknight1 in #42192
  • Refactor check_auto_docstring using AST by @yonigozlan in #41432
  • [BLT] Fix cache usage by @vasqu in #42188
  • Update test_dynamic_cache_exportability_multiple_run (failing on torch 2.10 nightly) by @ydshieh in #42212
  • Much more efficient and clear weight initialization and tie weights by @Cyrilvallez in #42191
  • GLM-V update with new processor by @zRzRzRzRzRzRzR in #42122
  • Fix initialization guard for pytest by @Cyrilvallez in #42234
  • Fix TP plans for MoE models by @Cyrilvallez in #42236
  • Add prefix sharing to continuous batching by @remi-or in #42094
  • Loading optimization by @Cyrilvallez in #42239
  • calls AttentionMaskConverter._unmask_unattended for xpu device before by @kaixuanliu in #42230
  • FIX Broken PEFT adapter loading by @BenjaminBossan in #42187
  • Fix processor test for glm by @molbap in #42233
  • Fix UnboundLocalError in RT-DETR loss computation by @yashwantbezawada in #42224
  • Stop inheriting tests (again) by @Rocketknight1 in #42247
  • [loading] Fix device when source and target are different by @Cyrilvallez in #42246
  • Reduce timing on CircleCI - part 1 (Use @slow for IntegrationTests) by @ydshieh in #42206
  • 🚨 Delete generation params from model config by @zucchini-nlp in #41695
  • Allow VLMs to have a correct base_model by @zucchini-nlp in #41589
  • Make tests run in less time by reducing batch_size by @ydshieh in #42213
  • Revert "Make tests run in less time by reducing batch_size" by @ydshieh in #42258
  • Cleanup reference to TFBertTokenizer and TFGPT2Tokenizer by @Rocketknight1 in #42182
  • delete already deprecated models by @ydshieh in #42235
  • Fix bnb for the weights refactor by @SunMarc in #42043
  • Fix looping in torch guard decorator by @Cyrilvallez in #42260
  • 🚨 Generalize get_decoder() for multimodal and delete redundant code 🔪 by @zucchini-nlp in #42156
  • Audio Flamingo3 - fix attention masking by @zucchini-nlp in #42278
  • Add support for torch device objects in device validator by @yonigozlan in #42267
  • Remove doc files of other langs for deleted models by @ydshieh in #42276
  • [testing] fix cwm by @ydshieh in #42261
  • fix a typo: pbd -> pdb by @jaeminoh in #42268
  • Enable glm46v UTs on XPU by @YangKai0616 in #42274
  • [testing] fix some cases in xpu by @sywangyi in #42273
  • Remove random flag by @Cyrilvallez in #42282
  • Fix accelerate integration by @Cyrilvallez in #42264
  • Fix validation checks order in benchmark_v2 by @Abdennacer-Badaoui in #42280
  • Update torchcodec to match torchaudio version by @remi-or in #42288
  • Use torch.get_autocast_dtype instead of torch.get_autocast_gpu_dtype by @qgallouedec in #42055
  • perf: Optimization for Min-p sampling implementation by @casinca in #42248
  • Fix device_map computation part 2 by @Cyrilvallez in #42290
  • Fixed the docstring for WhisperFeatureExtractor by @TopCoder2K in #42286
  • avoiding conditional indexing in positionalencoding to avoid possibil… by @ppadjinTT in #42090
  • ENH: Add support for LoRA hotswapping by @BenjaminBossan in #41297
  • Fix Break change of AWQ FusedModules due to Attention Refactor by @fanqiNO1 in #41909
  • Remove error string test that was failing by @Rocketknight1 in #42301
  • Properly protect the is_compiling checks by @Cyrilvallez in #42304
  • Remove outdated methods in modeling_utils.py by @Cyrilvallez in #42302
  • Fix Mac mps dataloader_num_workers > 1 causes RuntimeError: share_filename: only available on CPU by @AmitMY in #38819
  • Fix the init_weights for the MoE models by @Cyrilvallez in #42306
  • Update link to generation strategies documentation by @omkar-334 in #42252
  • Update conversion mapping to separate renaming from converting by @ArthurZucker in #42254
  • fix(granitemoe*): Only create block_sparse_moe if num_local_experts > 0 by @gabe-l-hart in #42036
  • [SAM3 Video] Add support for multi prompts by @yonigozlan in #42293
  • Add Pix2Struct fast image processor by @yonigozlan in #42020
  • Fix post processing methods in keypoints matching models by @yonigozlan in #42018
  • fix tests/models/xcodec/test_modeling_xcodec.py::XcodecIntegrationTest by @sywangyi in #42272
  • [loading] Fix device detection by @Cyrilvallez in #42323
  • Fix typo from side_dict to size_dict by @nihui in #42319
  • HF Trainer: ALST/Ulysses sequence parallelism integration via HF Accelerate by @stas00 in #41832
  • Fix gpt2 modeling tests by @Abdennacer-Badaoui in #42321
  • [loading] Use fewer threads by default for much better performances by @Cyrilvallez in #42324
  • Allow LayoutLMV3Processor to accept rescale_factor by @Rocketknight1 in #42305
  • Correctly create tied key mapping in post_init, and dynamic tie weight by @Cyrilvallez in #42270
  • [CI] Skip EfficientLoFTR test by @vasqu in #42327
  • [XPU] Add flash_attn2 support for XPU by @YangKai0616 in #41956
  • [Attn Masks] Lift bidirectional mask restriction on eager by @vasqu in #42325
  • fix bug when gemma3n model run on multiple device by @kaixuanliu in #42303
  • Fix ChineseCLIPModel.get_text_features by @JiangJQ2000 in #42351
  • Gemma3 hybrid fix by @remi-or in #42287
  • fix(benchmarks): correct sdpa_backend inconsistency and attn_implementation for continuous batching by @engmohamedsalah in #42339
  • Auto convert tekken.json by @ArthurZucker in #42299
  • [loading] Re-add and improve disk offloading support by @Cyrilvallez in #42242
  • Fix typo - indentation in JSON dump example by @anthropikos in #42332
  • Fix tied weight for Bart (for BC) by @Cyrilvallez in #42355
  • Fix reference to yelp dataset by @JuanFKurucz in #42349
  • Fix documentation reference to pytorch max memory allocated by @JuanFKurucz in #42350
  • Fix reference to imagenet 1k dataset by @JuanFKurucz in #42348
  • Fix typos by @omahs in #42354
  • Protect torch.distributed imports by @Cyrilvallez in #42361
  • Expand npu device for KernelConfig by @zheliuyu in #42358
  • Replace Optional and Union typing with | in some source files by @cyyever in #42294
  • Fix code examples to load gpt 1 openai community model by @JuanFKurucz in #42347
  • fix tekken pattern matching by @ArthurZucker in #42363
  • Fixed-wrong-ZeRO3-json-snippet-found-in-deepspeed-markdown-file by @Yacklin in #42346
  • Make benchmarking lighter: clean-up result files and remove non-needed arguments by @remi-or in #42357
  • Add image processor fast vitpose by @yonigozlan in #42021
  • Small tp fix by @ArthurZucker in #42366
  • Remove test inheritance for EfficientLoftr, rename KeypointMatchingOutput to model specific name by @yonigozlan in #42365
  • Tiny doc fix by @molbap in #42296
  • Fix TimesFM patch normalization instability by @AnMakc in #42099
  • [core] Fix torchao by @MekkCyber in #42289
  • Fix tp by @ArthurZucker in #42368
  • [Attn Masks] Add skip option for non-packed sequences by @vasqu in #42367
  • 📚 docs(granite-speech): add comprehensive usage examples by @gorkachea in #42125
  • Xcodec fix by @eustlb in #42095
  • Replace Optional and Union typing with | in some source files by @cyyever in #42372
  • [Mistral Tokenizers] Fix tokenizer detection by @vasqu in #42389
  • misc don't recreate it by @ArthurZucker in #42394
  • [SAM3] Fix precompute vision_embeds or text_embeds for inference by @yonigozlan in #42407
  • 🚨 Image-text pipeline expects correctly formatted chat by @zucchini-nlp in #42359
  • Many small fixes for the CI by @remi-or in #42364
  • [core] fix mxfp4 by @MekkCyber in #42382
  • fixed json syntax error for zero2 configuration file found in deepspeed.md by @Yacklin in #42406
  • GLM4V - delete duplicate config attribute by @zucchini-nlp in #42416
  • 🚨 Remove generic output_attentions warning by @Aravind-11 in #42334
  • Bart config doesn't need generation parameters by @zucchini-nlp in #42337
  • Simplify and standardize processor tests by @yonigozlan in #41773
  • Clean bnb integration using weight converter by @SunMarc in #42426
  • Any to any pipeline and auto-mapping by @zucchini-nlp in #40884
  • Fix processor usage + add chat_template support to TTS pipeline, and shift common chat template logic to base class. by @ebezzam in #42326
  • [fp8] fix scales param name by @MekkCyber in #42434
  • Fix an edge case for get_encoder() by @zucchini-nlp in #42295
  • Disable loss rounding in training stats log by @AnMakc in #42104
  • Benchmark simplification by @remi-or in #42408
  • Future annotations break FastAPI by @LysandreJik in #42450
  • [cleanup] Don't use Repository in create_dummy_models.py script by @Wauplin in #42380
  • [cleanup] Remove deprecated load config from file by @Wauplin in #42383
  • [FA] Cleanup loading logic by @vasqu in #41427
  • tiny fix for deepseekocr support [vllm] by @molbap in #42423
  • fix: Restore explicit .keys() calls for TensorDict compatibility by @pankajbaid567 in #42373
  • Transformers serve -> list all generative models from the cache by @LysandreJik in #42146
  • 🚨 [v5][PEFT] Bump min version requirement of PEFT to 0.18.0 by @BenjaminBossan in #41889
  • [cleanup] Offline mode and cache dir from huggingface_hub constants + cleanup in PushToHubMixin by @Wauplin in #42391
  • Correctly return finish reason length when finished by @LysandreJik in #42157
  • FIX: Minimal fix for loading PEFT weights by @BenjaminBossan in #42387
  • Let's break Qwen-VL 🚨 by @zucchini-nlp in #42420
  • [CI] Add to run slow by @vasqu in #42459
  • Fix the "test_offline" test by @LysandreJik in #42458
  • transformers chat launched without base_url has a direct tie to localhost:8000 by @LysandreJik in #42463
  • update with more recent tts models by @Deep-unlearning in #42328
  • rm slow tokenizers by @itazap in #40936
  • [loading/saving] Reverse all loading operations when saving by @Cyrilvallez in #42396
  • Fix T5 tests: use generation_config for generation parameters by @Abdennacer-Badaoui in #42419
  • remove reference to TF models from docs by @zucchini-nlp in #42443
  • [Trainer] use output.loss when using liger-kernel by @kashif in #42444
  • replace source_keys and target_keys by @SunMarc in #42471
  • Update migration guide - generation config by @zucchini-nlp in #42470
  • 🚨 Move rotary_partial_emb to RopeParams and delete unnecessary code 🔪 by @zucchini-nlp in #42255
  • Fix doc builds by @Rocketknight1 in #42478
  • extend CwmIntegrationTest to xpu by @sywangyi in #42314
  • add require_deterministic_for_xpu to make the case pass in xpu by @sywangyi in #42439
  • Skip failing irrelevant test for ColQwen2 by @Rocketknight1 in #42480
  • [quantization] make torchao tests slow by @MekkCyber in #42482
  • Fix gpt2 tokenizer add_prefix_space default value by @SunMarc in #42481

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ArthurZucker
    • JetMoe Fix jetmoe after #40132 (#41324)
    • [ModularChecker] QOL for the modular checker (#41361)
    • [CB] Refactors the way we access paged (#41370)
    • Update from pretrained error when loading (#33380)
    • 🤦 CB nit! (#41413)
    • [from_pretrained] Small refactor from_pretrained: move around unrelated stuff (#41445)
    • update deps table (#42120)
    • Refactor weight loading (#41580)
    • Update conversion mapping to separate renaming from converting (#42254)
    • Auto convert tekken.json (#42299)
    • fix tekken pattern matching (#42363)
    • Small tp fix (#42366)
    • Fix tp (#42368)
    • misc don't recreate it (#42394)
  • @vasqu
    • 🚨 [v5] Remove relative position embeddings (for bert like models) (#41170)
    • [v5] Sync Bert and Bart eager attention (#41248)
    • [JetMoe] Fix KV head repetition and padding free (#41423)
    • 🚨 [Attention Masks] Bidirectional masks for encoder and encoder-decoder models (#41265)
    • [CI] Fix copies on main (#41486)
    • [Docs] Fix changed references (#41614)
    • [Executorch] Simplify for encoder models (#41627)
    • [Ernie 4.5 Moe] Fix Moe and offloading (#41385)
    • [Masks] Fix mask handling in eager for vision models (#41625)
    • [Attn] Allow dynamic causality in SDPA via Kwargs (#41692)
    • [Onnx docs] Remove some traces (#41791)
    • 🚨 [Clip] Fix masking and enable flash attention on all model types (#41750)
    • [Attn Masks] Non-vmap default for attention masks (#41852)
    • [T5Gemma] Fix cross attention cache (#41890)
    • [Pop2Piano] Fix cache usage (#42170)
    • [PEFT] Fix prefix tuning (#41696)
    • [PEFT] Fix the general test for prefix tuning (#42185)
    • [Pop2Piano] Fix tied weights (#42193)
    • [BLT] Fix cache usage (#42188)
    • [CI] Skip EfficientLoFTR test (#42327)
    • [Attn Masks] Lift bidirectional mask restriction on eager (#42325)
    • [Attn Masks] Add skip option for non-packed sequences (#42367)
    • [Mistral Tokenizers] Fix tokenizer detection (#42389)
    • [FA] Cleanup loading logic (#41427)
    • [CI] Add to run slow (#42459)
  • @ydshieh
    • [testing] update test_longcat_generation_cpu (#41368)
    • [testing] Fix JetMoeIntegrationTest (#41377)
    • Pickle - part 2 (#41476)
    • Try to remove pickle - BloomTokenizerFast (#41466)
    • [testing] reduce runtime of HunYuanMoEV1IntegrationTest:test_model_generation (#41373)
    • delete some tokenizer tests using pickle (#41514)
    • torch 2.9 don't ❤️ torchcodec 💔 (#41610)
    • Update a dataset reop link (#41618)
    • Remove the head masking block in some vision models (#41620)
    • improve utils/check_bad_commit.py (#41658)
    • torch 2.9 still don't ❤️ torchcodec 0.8 💔 (#41686)
    • path validation for security reason (#41256)
    • pin torchcodec on CI docker image (#41703)
    • further improve utils/check_bad_commit.py (#41658) (#41690)
    • Revert "Remove upper version bound of pandas" (#41744)
    • Fix bark after #41445 (#41645)
    • flash attn pytest marker (#41781)
    • unpin torch/torchcodec for CircleCI (#41839)
    • further reducing flakiness in utils/check_bad_commit.py (#41658) (#41815)
    • CI workflow for Flash Attn (#41857)
    • Update some workflow files (#41892)
    • Minor fix in docker image build workflow (#41949)
    • Run slow v2 (#41914)
    • Fix detectron2 installation in docker files (#41975)
    • Fix autoawq[kernels] installation in quantization docker file (#41978)
    • Fix torchcodec version in quantization docker file (#41988)
    • Fix run slow v2: empty report when there is only one model (#42002)
    • Fix torch+deepspeed docker file (#41985)
    • fix deeepspeed in AMD docker file (#42025)
    • Change trigger time for AMD CI (#42034)
    • Remove some custom datasets defined in codebase (#41511)
    • Cleanup workflow - part 1 (#42023)
    • Fix pr_slow_ci_suggestion.yml after #42023 (#42049)
    • Avoid explicit checkout in workflow (#42057)
    • Be careful at explicit checkout actions (#42060)
    • Fix another Argument list too long in pr_slow_ci_suggestion.yml (#42061)
    • Revert back to use GitHub context (#42066)
    • Fix inconsistency of commit sha during the workflow run (#42074)
    • Revert "permissions worflows fix" (#42110)
    • pin pytest<9 for now (#42162)
    • Update test_dynamic_cache_exportability_multiple_run (failing on torch 2.10 nightly) (#42212)
    • Reduce timing on CircleCI - part 1 (Use @slow for IntegrationTests) (#42206)
    • Make tests run in less time by reducing batch_size (#42213)
    • Revert "Make tests run in less time by reducing batch_size" (#42258)
    • delete already deprecated models (#42235)
    • Remove doc files of other langs for deleted models (#42276)
    • [testing] fix cwm (#42261)
  • @cyyever
    • Remove unnecessary list comprehension (#41305)
    • Remove unused function patameters (#41358)
    • Use accelerator API to free device memory (#41195)
    • Remove Python 3.9 classifier (#41410)
    • Remove KERAS_NLP_IMPORT_ERROR (#41468)
    • Import Callable from collections.abc (#41130)
    • Remove infer_device (#41088)
    • Fix Latex typesetting in documentation (#41177)
    • Fix typsetting and content of llm_tutorial_optimization.md (#41172)
    • More markdown file fixes (#41599)
    • Format MarkDown documentation and tiny fixes (#41638)
    • Fix typos in documentation (#41641)
    • Fix confusing cls assignment (#41642)
    • Use | for Optional and Union typing (#41646)
    • Remove require_torch_bf16_gpu (#40979)
    • Fix MarkDown syntax (#41676)
    • Use | for Optional and Union typing (#41675)
    • Enable faiss-cpu on Windows (#41678)
    • Fix Pylint warnings (#41644)
    • Enable FURB rules in ruff (#41395)
    • Remove upper version bound of pandas (#41677)
    • Fix documentation issues (#41726)
    • Apply RUFF PIE rules (#41727)
    • Replace Optional and Union typing with | in some source files (#42294)
    • Replace Optional and Union typing with | in some source files (#42372)
  • @yao-matrix
    • make some ut cases pass on xpu w/ latest torch (#41337)
    • fix asr ut failures (#41332)
    • enable new model uts to xpu and fix some failures on xpu (#41386)
    • enable some falcon-mamba uts on xpu (#41428)
    • enhance patched_tearDown to support python 3.11+ (#41429)
    • fix gemma3n case failure (#41426)
    • upgrade xpu docker file to torch 2.8 (#41551)
    • make apollo test case pass (#41805)
    • extend bitnet cases to xpu, all 8 cases pass (#41831)
    • extend 2 trainer test cases to xpu (#41829)
    • extend 2 blip2 and falcon_h1 test cases to xpu (#41825)
    • make lfm2_moe integration test pass on XPU (#41796)
    • fix some ut failures on XPU w/ torch 2.9 (#41923)
    • fix some ut failures on XPU w/ torch 2.9 (#41941)
    • fix prepare_config_and_inputs_for_common bug in llava test (#41942)
    • make recurrent_gemma and voxtral cases pass on xpu (#41958)
    • extend fp_quant cases to xpu (#41833)
    • fix tensor device placement issue of 2 UT cases (#41921)
    • fix continuous batching issues, extend ut cases to xpu (#41830)
  • @MekkCyber
    • [kernels] Kernel Config (#41232)
    • Fixing comments in init file (#41414)
    • [kernels] Cleanup deta kernel (#41470)
    • Cleaning hub kernels (#41477)
    • Remove DISABLE_KERNEL_MAPPING flag (#41475)
    • [kernels] Remove RWKV kernel finally ! (#41493)
    • [kernels] rm yoso kernel (#41495)
    • [kernels] rm mra kernels (#41507)
    • Revert "add rmsnorm kernels support for Intel XPU" (#41579)
    • [kernels] refactor function kernel calling (#41577)
    • Erroring when KernelConfig is passed without use_kernels = True (#41657)
    • Small Fix for imports (#41411)
    • [kernels] Add version to function mapping (#41685)
    • [quantization] fix compressed_tensors tests (#41780)
    • [quantization] Skip Fp8 tests when hardware capability < 8.9 (#41785)
    • [quantization] fix torchao tests after 0.14.0 release (#41777)
    • revert changes in _is_package_available (#41891)
    • [kernels] Add Tests & CI for kernels (#41765)
    • [kernels] change import time in KernelConfig (#42004)
    • [kernels] Fix XPU layernorm kernel (#41583)
    • [core] Fix torchao (#42289)
    • [core] fix mxfp4 (#42382)
    • [fp8] fix scales param name (#42434)
    • [quantization] make torchao tests slow (#42482)
  • @paulpak58
    • [Cache] lfm2 cache: allocate empty kv layers during init (#41396)
    • [Model] Lfm2Moe (#41401)
  • @gante
    • 🚨 [v5] Prune prune_heads (#41417)
    • [v5] rm utils/tf_ops/ (#41402)
    • [causallm tester] automate pipeline mappings + bloom tests (#41318)
    • 🚨 [v5] generate delegates default cache initialization to the model (#41505)
  • @zRzRzRzRzRzRzR
    • Update GLM-4.1V MMRope implementation (#41182)
    • Update GLM-4.6 doc (#41471)
    • Add aux loss for GLM-4.5V (#41564)
    • 4.1V Model and GLM-4.5V Model Conversion Code Updates (#41784)
    • GLM-V update with new processor (#42122)
  • @jacobkahn
    • Add Code World Model (CWM) (#41199)
  • @molbap
    • Update philosophy (#41438)
    • [QoL] modular conversion shows LoC saved (#41500)
    • Double router compute? (#41653)
    • Add vision contribution guide (#41456)
    • Modernize CLIP modeling code (#41546)
    • handle inputs from Siglip/Siglip2 non-automapped encoder layers (#41930)
    • Fix processor test for glm (#42233)
    • Tiny doc fix (#42296)
    • tiny fix for deepseekocr support [vllm] (#42423)
  • @Wauplin
    • Bump to hfh 1.0.0.rc5 to fix test (#41508)
    • Migrate transformers cli to Typer (#41487)
    • Remove deprecated use_auth_token parameter (#41666)
    • added more breaking changes
    • [cleanup] Don't use Repository in create_dummy_models.py script (#42380)
    • [cleanup] Remove deprecated load config from file (#42383)
    • [cleanup] Offline mode and cache dir from huggingface_hub constants + cleanup in PushToHubMixin (#42391)
  • @remi-or
    • Restore cuda graphs to continuous batching (#41421)
    • Fix an import error with PreTrainModel (#41571)
    • Add iter to DynamicCache (#41569)
    • Gemma3 fixes (#41572)
    • Benchmark overhaul (#41408)
    • Fix fp32_ln for various models (#41605)
    • Fix EncoderDecoder cache (#41612)
    • Switch to CB if cache_implementation == paged (#41655)
    • Small changes to benchmarking script (#41662)
    • Bump AMD docker (#41792)
    • Add a safeguard around a flaky test in gemma2 (#41811)
    • Use indices as position_ids in modernebert (#41789)
    • Move the Mi355 to regular docker (#41989)
    • More data in benchmarking (#41848)
    • Reduce the number of benchmark in the CI (#42008)
    • New docker from AMD (#42208)
    • Add prefix sharing to continuous batching (#42094)
    • Update torchcodec to match torchaudio version (#42288)
    • Gemma3 hybrid fix (#42287)
    • Make benchmarking lighter: clean-up result files and remove non-needed arguments (#42357)
    • Many small fixes for the CI (#42364)
    • Benchmark simplification (#42408)
  • @lkhl
    • [model] Add VideoLLaMA3 implementation (#40499)
  • @philiproeleveld
    • Add logits_to_keep to many older CausalLM models (#41335)
  • @AlphaOrOmega
    • Adding superglue fast image processing (#41394)
  • @echarlaix
    • [v5] Remove deprecated tranformers.onnx (#41700)
  • @Aravind-11
    • Add GLPNImageProcessorFast (#41725)
    • T5 migration to new masking interface (#41804)
    • 🚨 Remove generic output_attentions warning (#42334)
  • @DeXtAr47-oss
    • add fuyu fast image processors (#41817)
  • @lashahub
    • [models] Add AudioFlamingo3 integration (#40290)
  • @lilin-1
    • Docs/i18n updates (#42006)
  • @burtenshaw
    • [MODEL] Nanochat implementation (#41634)
  • @itazap
    • rm slow tokenizers (#40936)
Release candidate v5.0.0rc3

New models:

What's Changed

We are getting closer and closer to the official release! This RC focuses on removing more deprecated code, fixing some minor issues, and updating the docs.

New Contributors

Full Changelog: https://github.com/huggingface/transformers/compare/v5.0.0rc2...v5.0.0rc3

Jan 16, 2026
Patch release v4.57.6

What's Changed

Another fix for Qwen VL models, which could not correctly load the associated model type. This works together with https://github.com/huggingface/transformers/pull/41808 from the previous patch release.

Full Changelog: https://github.com/huggingface/transformers/compare/v4.57.5...v4.57.6

Jan 13, 2026
Patch release v4.57.5

What's Changed

Should not have said last patch :wink: These should be the last remaining fixes that got lost in between patches and the transition to v5.

Full Changelog: https://github.com/huggingface/transformers/compare/v4.57.4...v4.57.5

Patch release v4.57.4

What's Changed

Last patch release for v4: We have a few small fixes for remote generation methods (e.g. group beam search), vLLM, and an offline tokenizer fix (if it's already been cached).

New Contributors

Full Changelog: https://github.com/huggingface/transformers/compare/v4.57.3...v4.57.4

Jan 8, 2026
Release candidate 5.0.0rc2

What's Changed

This release candidate is focused on fixing AutoTokenizer, expanding the dynamic weight loading support, and improving performance with MoEs!

MoEs and performances:

<img width="2048" height="1451" alt="image" src="https://github.com/user-attachments/assets/3ed2508e-3eb1-4f13-8717-cd9027d12a39" />

Tokenization:

The main issue with the tokenization refactor was that tokenizer_class is now "enforced", even though in most cases it is wrong. This took a while to properly isolate, and we now try to use TokenizersBackend whenever we can. #42894 has a much more detailed description of the big changes!

Core

Here we focused on boosting the performance of loading weights onto the device!

New models

Quantization

Breaking changes

Mostly around processors!

Thanks again to everyone!

New Contributors

Full Changelog: https://github.com/huggingface/transformers/compare/v5.0.0rc1...v5.0.0rc2

Release candidate 5.0.0rc1

What's Changed

This release candidate was focused mostly on quantization support with the new dynamic weight loader, and a few notable 🚨 breaking changes 🚨:

  1. The default dtype for any model when using from_pretrained is now auto!
  2. The default shard size when saving a model is now 50GB.
  3. Kwargs: they are fundamental to enabling integration with vLLM and other tools.

Dynamic weight loader updates:

Mostly QoL improvements and fixes, plus restored support for CPU offloading.

New models:

Some notable quantization fixes:

Mostly added support for fbgemm and quanto.

Peft:

The dynamic weight loader broke a few small things; this adds glue for all models except MoEs.

Misc

Tokenization needed more refactoring; this time it's a lot cleaner!

We omitted a lot of other commits for clarity, but thanks to everyone and the new contributors!

New Contributors

Full Changelog: https://github.com/huggingface/transformers/compare/v5.0.0rc0...v5.0.0rc1

Dec 1, 2025
Transformers v5.0.0rc0

Transformers v5 release notes

<img width="1800" height="1013" alt="image" src="https://github.com/user-attachments/assets/7b5187d7-6945-4108-a546-6d1d7bfb55e3" />
  • Highlights
  • Significant API changes: dynamic weight loading, tokenization
  • Backwards Incompatible Changes
  • Bugfixes and improvements

Highlights

We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and the release is significant: 800 commits have been pushed to main since the latest minor release. This release removes a lot of long-due deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.

We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.

This release is a release candidate (RC). It is not the final v5 release, and we will publish it on PyPI as a pre-release. This means that the current release is purely opt-in: installing transformers without specifying this exact release will install the latest stable version instead (v4.57.3 as of writing).

To install this release candidate, run the following:

pip install transformers --pre

For us to deliver the best package possible, it is imperative that we get feedback on how the toolkit is currently working for you. Please try it out, and open an issue if you run into an inconsistency or a bug.

Transformers version 5 is a community endeavor, and this is the last mile. Let's ship this together!

Significant API changes

[!NOTE] 👀 Nothing is final and things are still in flux. We have a section dedicated to what is planned for future release candidates but is known not to work in RC0. Look for "Disclaimers for the RC0".

We'll be eagerly awaiting your feedback in our GitHub issues!

Dynamic weight loading

We introduce a new weight loading API in transformers, which significantly improves on the previous API. This weight loading API is designed to apply operations to the checkpoints loaded by transformers.

Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge, and split the layers according to how they're defined in this new API. These operations are often a necessity when working with quantization or parallelism algorithms.

This new API is centered around the new WeightConverter class:

class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]

The weight converter is designed to apply a list of operations on the source keys, resulting in the target keys. A common operation on attention layers is fusing the query, key, and value layers. Doing so with this API amounts to defining the following conversion:

conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)

In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single layer.
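
To make the mechanics concrete, here is a minimal sketch of what a Concatenate-style operation does to the underlying weights. It is illustrative only and does not use the real transformers classes: plain nested lists stand in for tensors, and this `Concatenate` is a hypothetical stand-in for the real operation.

```python
# Illustrative sketch only: nested lists stand in for weight tensors.

class Concatenate:
    """Fuse a list of 2D weight matrices along a dimension (dim=0: stack rows)."""

    def __init__(self, dim=0):
        self.dim = dim

    def __call__(self, matrices):
        if self.dim == 0:
            return [row for matrix in matrices for row in matrix]
        raise NotImplementedError("sketch only supports dim=0")


# Tiny 2x2 "projection weights" standing in for q_proj, k_proj, v_proj.
q = [[1, 0], [0, 1]]
k = [[2, 0], [0, 2]]
v = [[3, 0], [0, 3]]

# Fusing along dim 0 yields a single 6x2 "qkv_proj" weight.
qkv = Concatenate(dim=0)([q, k, v])
print(len(qkv), len(qkv[0]))  # 6 2
```

Since the operation is a pure function of the source tensors, the converse (splitting a fused qkv back into q/k/v) can be expressed as its reverse, which is what makes the conversions reversible on save.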

This allows us to define, for each architecture, a mapping to a list of weight conversions, which can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method and helped us remove a lot of technical debt accumulated over the past few years.

This results in several improvements:

  • Much cleaner definition of transformations applied to the checkpoint
  • Reversible transformations, so loading and saving a checkpoint should result in the same checkpoint
  • Faster model loading thanks to scheduling of tensor materialization
  • Enables complex mix of transformations that wouldn't otherwise be possible (such as quantization + MoEs, or TP + MoEs)

While this is being implemented, expect varying levels of support across different release candidates.

Linked PR: https://github.com/huggingface/transformers/pull/41580

Tokenization

Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object, to be a lot more intuitive. With v5, tokenizer definition is much simpler: you can now initialize an empty LlamaTokenizer and train it directly on your corpus.

Defining a new tokenizer object should be as simple as this:

from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE

class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab

        if merges is not None:
            self._merges = merges
        else:
            self._merges = generate_merges(self._vocab)

        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, unk_token=str(unk_token), fuse_unk=True)
        )
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme="first", split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )

Once the tokenizer is defined as above, you can instantiate it with Llama5Tokenizer(). Doing this returns an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet :wink:).

The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.

Backend Architecture Changes: moving away from the slow/fast tokenizer separation

Up to now, transformers maintained two parallel implementations for many tokenizers:

  • "Slow" tokenizers (tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend.
  • "Fast" tokenizers (tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.

In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:

  1. TokenizersBackend (preferred): Rust-based tokenizers from the 🤗 tokenizers library. In general it provides optimal performance, and it also offers many features that are commonly adopted across the ecosystem:
  • handling additional tokens
  • a full Python API for setting and updating
  • automatic parallelization
  • automatic offsets
  • customization
  • training
  2. SentencePieceBackend: for tokenizers requiring the sentencepiece library. It inherits from PythonBackend.
  3. PythonBackend: a Python implementation of the features provided by tokenizers. It basically allows adding tokens.
  4. MistralCommonBackend: relies on the mistral-common tokenization library. (Previously known as the MistralCommonTokenizer.)

The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use AutoTokenizer.from_pretrained() as before. This keeps transformers future-proof and modular, making it easy to support future backends.
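
As a rough mental model (not the actual AutoTokenizer code), the backend choice can be thought of as a lookup over the files present in the checkpoint; the function below is a hypothetical sketch, and the real logic also considers installed dependencies and tokenizer_config.json.

```python
# Hypothetical sketch of backend selection; not the real AutoTokenizer logic.

def select_backend(available_files):
    if "tokenizer.json" in available_files:
        return "TokenizersBackend"    # preferred: Rust-based tokenizers
    if any(f.endswith(".model") for f in available_files):
        return "SentencePieceBackend" # a sentencepiece model file is present
    return "PythonBackend"            # pure-Python fallback


print(select_backend(["tokenizer.json", "tokenizer_config.json"]))  # TokenizersBackend
print(select_backend(["spiece.model"]))                             # SentencePieceBackend
```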

Defining a tokenizer outside of the existing backends

We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher-level, without relying on those backends.

To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.

If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:

  • encode
  • decode
  • vocab_size
  • get_vocab
  • convert_tokens_to_ids
  • convert_ids_to_tokens
  • from_pretrained
  • save_pretrained
  • among a few others

API Changes

1. Direct tokenizer initialization with vocab and merges

Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:

from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()

This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.

These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:

from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)

This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab; therefore enabling the comparison of different pre-tokenizers, normalizers, etc.

⚠️ The vocab_file (as in, a path towards a file containing the vocabulary) cannot be used to initialize the LlamaTokenizer as loading from files is reserved to the from_pretrained method.

2. Simplified decoding API

The batch_decode and decode methods have been unified to reflect behavior of the encode method. Both single and batch decoding now use the same decode method. See an example of the new behavior below:

from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small") 
inputs = ["hey how are you?", "fine"]
tokenizer.decode(tokenizer.encode(inputs))

Gives:

- 'hey how are you?</s> fine</s>'
+ ['hey how are you?</s>', 'fine</s>']

We expect encode and decode to behave as two sides of the same coin: encode, process, decode should just work.

[!NOTE] A common use-case is: encode, model.generate, decode. Previously, generate returned list[list[int]], which was incompatible with the single-sequence decode; the unified decode now handles it directly.
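
A minimal sketch of the unified dispatch, using a toy vocabulary (illustrative only, not the real implementation):

```python
# Toy sketch: one decode that mirrors encode by accepting either a single
# sequence (list[int]) or a batch (list[list[int]]), as v5 does.

def decode(ids, id_to_token):
    if ids and isinstance(ids[0], list):  # batch input: decode each sequence
        return [decode(sequence, id_to_token) for sequence in ids]
    return " ".join(id_to_token[i] for i in ids)


vocab = {0: "hey", 1: "fine", 2: "</s>"}
print(decode([0, 2], vocab))            # hey </s>
print(decode([[0, 2], [1, 2]], vocab))  # ['hey </s>', 'fine </s>']
```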

3. Unified encoding API

The encode_plus method is deprecated in favor of the single __call__ method.

4. apply_chat_template returns BatchEncoding

Previously, apply_chat_template returned input_ids for backward compatibility. Starting with v5, it now consistently returns a BatchEncoding dict like other tokenizer methods.

# v5
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"}
]

# Now returns BatchEncoding with input_ids, attention_mask, etc.
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys())  # dict_keys(['input_ids', 'attention_mask'])

5. Removed legacy configuration file saving:

We simplify the serialization of tokenization attributes:

  • special_tokens_map.json - special tokens are now stored in tokenizer_config.json.
  • added_tokens.json - added tokens are now stored in tokenizer.json.
  • added_tokens_decoder is only stored when there is no tokenizer.json.

When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format. We're gradually moving towards consolidating attributes to fewer files so that other libraries and implementations may depend on them more reliably.
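
Conceptually, the consolidation on save amounts to folding the legacy files into the main ones. The sketch below is schematic only: plain dicts stand in for the JSON payloads, and this is not the real serialization code.

```python
# Schematic only: dicts stand in for the JSON payloads of the legacy files.
legacy_special_tokens_map = {"bos_token": "<s>", "eos_token": "</s>"}
tokenizer_config = {"model_max_length": 2048}

# v5 stores special tokens directly inside tokenizer_config.json ...
tokenizer_config.update(legacy_special_tokens_map)

# ... so a separate special_tokens_map.json is no longer written out.
print(sorted(tokenizer_config))  # ['bos_token', 'eos_token', 'model_max_length']
```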

6. Model-Specific Changes

Several models that had identical tokenizers now import from their base implementation:

  • LayoutLM → uses BertTokenizer
  • LED → uses BartTokenizer
  • Longformer → uses RobertaTokenizer
  • LXMert → uses BertTokenizer
  • MT5 → uses T5Tokenizer
  • MVP → uses BartTokenizer

These modules will eventually be removed altogether.

Removed T5-specific workarounds

The internal _eventually_correct_t5_max_length method has been removed. T5 tokenizers now handle max length consistently with other models.

Testing Changes

A few testing changes specific to tokenizers have been applied:

  • Model-specific tokenization test files now focus on integration tests.
  • Common tokenization API tests (e.g., add_tokens, encode, decode) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.

For legacy implementations, the original BERT Python tokenizer code (including WhitespaceTokenizer, BasicTokenizer, etc.) is preserved in bert_legacy.py for reference purposes.

7. Deprecated / Modified Features

Special Tokens Structure:

  • SpecialTokensMixin: Merged into PreTrainedTokenizerBase to simplify the tokenizer architecture.
  • special_tokens_map: Now only stores named special token attributes (e.g., bos_token, eos_token). Use extra_special_tokens for additional special tokens (formerly additional_special_tokens). all_special_tokens includes both named and extra tokens.
# v4
tokenizer.special_tokens_map  # Included 'additional_special_tokens'

# v5
tokenizer.special_tokens_map  # Only named tokens
tokenizer.extra_special_tokens  # Additional tokens
  • special_tokens_map_extended and all_special_tokens_extended: Removed. Access AddedToken objects directly from _special_tokens_map or _extra_special_tokens if needed.
  • additional_special_tokens: Still accepted for backward compatibility but is automatically converted to extra_special_tokens.

Deprecated Methods:

  • sanitize_special_tokens(): Already deprecated in v4, removed in v5.
  • prepare_seq2seq_batch(): Deprecated; use __call__() with text_target parameter instead.
# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)

# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
  • BatchEncoding.words(): Deprecated; use word_ids() instead.

Removed Methods:

  • create_token_type_ids_from_sequences(): Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
  • clean_up_tokenization(): Removed from base class. Now defined at model class level for models that need it (e.g., PLBart, CLVP, Wav2Vec2).
  • prepare_for_model(), build_inputs_with_special_tokens(), truncate_sequences(): Moved from tokenization_utils_base.py to tokenization_python.py for PythonBackend tokenizers. TokenizersBackend provides model-ready input via tokenize() and encode(), so these methods are no longer needed in the base class.
  • _switch_to_input_mode(), _switch_to_target_mode(), as_target_tokenizer(): Removed from base class. Use __call__() with text_target parameter instead.
# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)

# v5
labels = tokenizer(text_target=tgt_texts, ...)
  • parse_response(): Removed from base class.

Disclaimers for the RC0

PEFT + MoE:

Because we are switching away from the naive MoE implementation (an nn.ModuleList of experts), we currently have an issue with MoEs that have adapters. For more details see https://github.com/huggingface/transformers/issues/42491#issuecomment-3591485649.

We aim for this to be fixed and released in a following release candidate in the week that follows RC0.

Tensor parallel and Expert parallel + MoE

We are streamlining the MoE support with vLLM; while this is being implemented, tensor parallelism and expert parallelism aren't working as expected. This is known and actively being worked on.

We aim for this to be fixed and released in a following release candidate in the week that follows RC0.

Custom pretrained models:

For anyone inheriting from a transformers PreTrainedModel, the weights are automatically initialized with the common scheme:


    @torch.no_grad()
    def _init_weights(self, module):
        """
        Initialize the weights. This is quite general on purpose, in the spirit of what we usually do. For more complex
        initialization scheme, it should be overridden by the derived `PreTrainedModel` class. In case a model adds an explicit
        `nn.Parameter`, this method should also be overridden in order to initialize it correctly.
        """
        if hasattr(self.config, "initializer_range"):
            std = self.config.initializer_range or 0.02
        elif hasattr(self.config, "init_std"):
            std = self.config.init_std
        elif hasattr(self.config, "initializer_factor"):
            std = self.config.initializer_factor
        else:
            # 0.02 is the standard default value across the library
            std = getattr(self.config.get_text_config(), "initializer_range", 0.02)

        if isinstance(module, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.ConvTranspose1d, nn.ConvTranspose2d)):
            if getattr(module, "weight", None) is not None:
                init.normal_(module.weight, mean=0.0, std=std)
            if getattr(module, "bias", None) is not None:
                init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            if getattr(module, "weight", None) is not None:
                init.normal_(module.weight, mean=0.0, std=std)
                # Here we need the check explicitly, as we slice the weight in the `zeros_` call, so it loses the flag
                if module.padding_idx is not None and not getattr(module.weight, "_is_hf_initialized", False):
                    init.zeros_(module.weight[module.padding_idx])
        elif isinstance(module, nn.MultiheadAttention):
            # This uses torch's original init
            module._reset_parameters()
        # We cannot use `isinstance` on the RMSNorms or LayerNorms, as they usually are custom modules which change names
        # between modelings (because they are prefixed with the model name)
        elif (
            isinstance(module, (nn.GroupNorm, nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
            or "LayerNorm" in module.__class__.__name__
            or "RMSNorm" in module.__class__.__name__
        ):
            # Norms can exist without weights (in which case they are None from torch primitives)
            if hasattr(module, "weight") and module.weight is not None:
                init.ones_(module.weight)
            if hasattr(module, "bias") and module.bias is not None:
                init.zeros_(module.bias)

If you want to avoid that, for now you should just do:

class CustomModel(Qwen3VLForConditionalGeneration):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.action_head = nn.Linear(1024, 7)
        self.positional_embedding = nn.Parameter(torch.randn(16, 1152))
        self.post_init()
    
    def _init_weights(self, module):
        pass 

There is a tracker for that here: https://github.com/huggingface/transformers/issues/42418.

Library-wide changes with lesser impact

use_auth_token

The use_auth_token argument/parameter is deprecated in favor of token everywhere. You should be able to search and replace use_auth_token with token and get the same logic.

Linked PR: https://github.com/huggingface/transformers/pull/41666
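
If you maintain code that still passes use_auth_token, a small shim like the helper below captures the mechanical nature of the migration. The helper is hypothetical and not part of transformers:

```python
# Hypothetical migration helper: rename the removed use_auth_token kwarg
# to token before forwarding the call to transformers.

def migrate_auth_kwarg(kwargs):
    if "use_auth_token" in kwargs and "token" not in kwargs:
        kwargs["token"] = kwargs.pop("use_auth_token")
    return kwargs


kwargs = migrate_auth_kwarg({"use_auth_token": "hf_xxx", "revision": "main"})
print(kwargs)  # {'revision': 'main', 'token': 'hf_xxx'}
```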

Attention-related features

We decided to remove some features for the upcoming v5 as they are currently only supported in a few old models and no longer integrated in current model additions. It's recommended to stick to v4.x in case you need them. The following features are affected:

  • No more head masking, see #41076. This feature allowed turning off certain heads during the attention calculation and only worked with eager attention.
  • No more relative positional biases in BERT-like models, see #41170. This feature was introduced to allow relative position scores within attention calculations (similar to T5). However, it is barely used in official models while adding a lot of complexity. It also only worked with eager attention.
  • No more head pruning, see #41417 by @gante. As the name suggests, it allowed pruning heads within your attention layers.

Updates to supported torch APIs

We dropped support for two torch APIs:

  • TorchScript, see #41688
  • torch.fx symbolic tracing, see #41683

Those APIs were deprecated by the PyTorch team, and we're instead focusing on the supported APIs, dynamo and export.

Quantization changes

We cleaned up the quantization API in transformers and significantly refactored the weight loading, as highlighted above.

We dropped support for two quantization arguments that had been deprecated for some time:

  • load_in_4bit
  • load_in_8bit

We removed them in favor of the quantization_config argument, which is much more complete. As an example, here is how you would load a 4-bit bitsandbytes model using this argument:

from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)

model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    quantization_config=quantization_config
)

Configuration

  • Methods to init a nested config, such as from_xxx_config, are deleted. Nested configs can be initialized through the __init__ method in the same way. See #41314.
  • It is no longer possible to load a config class from a URL file. Configs must be loaded from either a local path or a repo on the Hub. See #42383.
  • All parameters configuring a model's rotary embedding are now stored under config.rope_parameters, including rope_theta and rope_type. A model's config.rope_parameters is a simple dictionary in most cases, but can also be a nested dict in special cases (e.g. Gemma3 and ModernBert) with a different RoPE parameterization for each layer type. Trying to get config.rope_theta will throw an attribute error from now on. See #39847 and #42255.
  • Qwen-VL family configuration is in a nested format and trying to access keys directly will throw an error (e.g. config.vocab_size). Users are expected to access keys from their respective sub-configs (config.text_config.vocab_size).
  • Configurations of non-generative models (any model that doesn't call model.generate()) will no longer have a generation_config and model.config.generation_config will throw an attribute error.

Processing

Tokenization

  • Slow tokenizer files (aka tokenization_<model>.py) are removed in favor of fast tokenizer files: tokenization_<model>_fast.py is renamed to tokenization_<model>.py. As fast tokenizers are backed by 🤗 tokenizers, they include a wider range of features that are maintainable and reliable.
  • Other backends (sentencepiece, etc.) will be supported with a light layer if loading a fast tokenizer fails
  • Remove legacy files like special_tokens_map.json and added_tokens.json
  • Remove _eventually_correct_t5_max_length
  • encode_plus --> __call__
  • batch_decode --> decode
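
As a toy illustration of the renamed entry points, here is a minimal whitespace "tokenizer" (not a real backend): `__call__` replaces `encode_plus`, and `decode` replaces `batch_decode`:

```python
class ToyTokenizer:
    """Toy whitespace tokenizer, only illustrating the v5 method names."""

    def __init__(self):
        self.vocab: dict[str, int] = {}
        self.inverse: dict[int, str] = {}

    def __call__(self, text: str) -> dict:
        # v5: tokenizer(text) replaces tokenizer.encode_plus(text)
        ids = []
        for word in text.split():
            idx = self.vocab.setdefault(word, len(self.vocab))
            self.inverse[idx] = word
            ids.append(idx)
        return {"input_ids": ids}

    def decode(self, ids: list[int]) -> str:
        # v5: tokenizer.decode(ids) replaces tokenizer.batch_decode(ids)
        return " ".join(self.inverse[i] for i in ids)
```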

apply_chat_template used to return naked input_ids by default rather than a BatchEncoding dict. This was inconvenient: it should return a BatchEncoding dict like tokenizer.__call__(), but we were stuck with the old behavior for backward compatibility. The method now returns a BatchEncoding.

Processing classes

Modeling

  • Some RotaryEmbeddings layers will start returning a dict of tuples, in case the model uses several RoPE configurations (Gemma2, ModernBert). Each value will be a tuple of "cos, sin" per RoPE type.
  • Config attribute for RotaryEmbeddings layer will be unified and accessed via config.rope_parameters. Config attr for rope_theta might not be accessible anymore for some models, and instead will be in config.rope_parameters['rope_theta']. BC will be supported for a while as much as possible, and in the near future we'll gradually move to the new RoPE format (https://github.com/huggingface/transformers/pull/39847)
  • Vision-language models no longer have shortcut access to their language and vision components from the generative model via model.language_model. It is recommended to access the module with model.model.language_model or model.get_decoder(). See #42156.

Generate

  • Old, deprecated output type aliases were removed (e.g. GreedySearchEncoderDecoderOutput). We now only have 4 output classes built from the following matrix: decoder-only vs encoder-decoder, uses beams vs doesn't use beams (https://github.com/huggingface/transformers/pull/40998)
  • Removed deprecated classes regarding decoding methods that were moved to the Hub due to low usage (constraints and beam scores) (https://github.com/huggingface/transformers/pull/41223)
  • If generate doesn't receive any KV Cache argument, the default cache class used is now defined by the model (as opposed to always being DynamicCache) (https://github.com/huggingface/transformers/pull/41505)
  • Generation parameters are no longer accessible via the model's config. If generation parameters are serialized in config.json for an old model, they will be loaded back into the model's generation config. Users are expected to access or modify generation parameters only through model.generation_config, e.g. model.generation_config.do_sample = True.
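
As a minimal sketch of this rule, using stand-in namespaces rather than a real model:

```python
from types import SimpleNamespace

# Stand-ins for a model's two configs: architecture parameters stay on
# `config`, generation parameters live exclusively on `generation_config`.
model = SimpleNamespace(
    config=SimpleNamespace(vocab_size=32000),
    generation_config=SimpleNamespace(do_sample=False, temperature=1.0),
)

# v5: the supported way to change generation behavior.
model.generation_config.do_sample = True
model.generation_config.temperature = 0.7
```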

Trainer

New Features

  • ALST/Ulysses Sequence Parallelism Integration
    • Added sequence parallelism support via HF Accelerate for training with longer sequences. Enables splitting sequences across devices using ALST (All-to-All Long Sequence Training) and Ulysses algorithms with DeepSpeed.
  • Improved compute_loss_func Handling
    • compute_loss_func now always takes priority over the model's built-in loss computation, giving users consistent control over custom loss functions.
  • num_items_in_batch in Prediction Step
    • The num_items_in_batch argument is now passed to compute_loss during prediction_step, enabling proper loss scaling during evaluation.

Breaking Changes

  • report_to now defaults to "none"
    • Logging integrations are no longer auto-detected by default; users must explicitly specify which reporting backends to use.

Removing arguments without deprecation cycle in TrainingArguments due to low usage

  • mp_parameters -> legacy param that was later on added to the Sagemaker trainer
  • _n_gpu -> not intended for users to set, we will initialize it correctly instead of putting it in the TrainingArguments
  • overwrite_output_dir -> replaced by resume_from_checkpoint; it was only used in the example scripts, no impact on Trainer.
  • logging_dir -> only used for tensorboard, set TENSORBOARD_LOGGING_DIR env var instead
  • jit_mode_eval -> use use_torch_compile instead, as torchscript is not recommended anymore
  • tpu_num_cores -> it is not recommended to set the number of cores; by default, all TPU cores are used. Set the TPU_NUM_CORES env var instead
  • past_index -> it was only used for a very small number of models with special architectures like Transformer-XL, and it was never documented how to train those models
  • ray_scope -> a minor arg for the Ray integration. Set the RAY_SCOPE env var instead
  • warmup_ratio -> use warmup_steps instead. We combined both args by allowing float values in warmup_steps.
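
A hedged sketch of how a combined warmup argument could be resolved; `resolve_warmup_steps` is a hypothetical helper, and the actual Trainer logic may differ:

```python
def resolve_warmup_steps(warmup_steps: float, total_steps: int) -> int:
    """If warmup_steps is a float in [0, 1), treat it as a ratio of total
    training steps; otherwise treat it as an absolute step count."""
    if isinstance(warmup_steps, float) and 0 <= warmup_steps < 1:
        return int(total_steps * warmup_steps)
    return int(warmup_steps)
```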

Removing deprecated arguments in TrainingArguments

  • fsdp_min_num_params and fsdp_transformer_layer_cls_to_wrap -> use fsdp_config
  • tpu_metrics_debug -> debug
  • push_to_hub_token -> hub_token
  • push_to_hub_model_id and push_to_hub_organization -> hub_model_id
  • include_inputs_for_metrics -> include_for_metrics
  • per_gpu_train_batch_size -> per_device_train_batch_size
  • per_gpu_eval_batch_size -> per_device_eval_batch_size
  • use_mps_device -> mps will be used by default if detected
  • fp16_backend and half_precision_backend -> we will only rely on torch.amp as everything has been upstreamed to torch
  • no_cuda -> use_cpu
  • include_tokens_per_second -> include_num_input_tokens_seen
  • use_legacy_prediction_loop -> we only use evaluation_loop function from now on

Removing deprecated arguments in Trainer

  • tokenizer in initialization -> processing_class
  • model_path in train() -> resume_from_checkpoint

Removed features for Trainer

  • SigOpt integration for hp search was removed, as the library was archived and the API stopped working
  • drop support for sagemaker API <1.10
  • bump accelerate minimum version to 1.1.0
  • bump peft minimum version to 0.18.0
  • bump bitsandbytes minimum version to 0.46.1

New defaults for Trainer

  • use_cache in the model config will be set to False by default during training. You can still change it through the TrainingArguments use_cache argument if needed.

Pipeline

  • Image text to text pipelines will no longer accept images as a separate argument along with conversation chats. Image data has to be embedded in the chat's "content" field. See #42359
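
For example, a chat message with the image embedded in its "content" field might look like the following (the URL is illustrative; the key names follow the common transformers chat format, but double-check them against the pipeline docs):

```python
messages = [
    {
        "role": "user",
        "content": [
            # Image data is embedded directly in the message content...
            {"type": "image", "url": "https://example.com/cat.png"},
            # ...alongside the text of the turn.
            {"type": "text", "text": "What animal is in this picture?"},
        ],
    }
]
```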

PushToHubMixin

  • removed deprecated organization and repo_url from PushToHubMixin. You must pass a repo_id instead.
  • removed ignore_metadata_errors from PushToHubMixin. In practice, if we ignore errors while loading the model card, we won't be able to push the card back to the Hub, so it's better to fail early rather than provide the option to fail later.
  • push_to_hub no longer accepts **kwargs. All accepted parameters are explicitly documented.
  • arguments of push_to_hub are now keyword-only to avoid confusion. Only repo_id can be positional since it's the main arg.
  • removed use_temp_dir argument from push_to_hub. We now use a tmp dir in all cases.

Linked PR: https://github.com/huggingface/transformers/pull/42391.

CLI

The deprecated transformers-cli ... command has been removed; transformers ... is now the only CLI entry point.

The transformers CLI has been migrated to Typer, making it easier to maintain and adding some nice features out of the box (improved --help section, autocompletion).

The biggest breaking change is in transformers chat. This command starts a terminal UI to interact with a chat model. It used to also be able to start a Chat Completion server powered by transformers and chat with it. In this revamped version, that feature has been removed in favor of transformers serve. The goal of splitting transformers chat and transformers serve is to draw clear boundaries between client and server code. This helps with maintenance and also makes the commands less bloated. The new signature of transformers chat is:

Usage: transformers chat [OPTIONS] BASE_URL MODEL_ID [GENERATE_FLAGS]...

Chat with a model from the command line.

It works hand in hand with transformers serve, which means that if transformers serve is running on its default endpoint, transformers chat can be launched as follows:

transformers chat HuggingFaceTB/SmolLM3-3B

It can however use any OpenAI API compatible HTTP endpoint:

transformers chat https://router.huggingface.co/v1 HuggingFaceTB/SmolLM3-3B

Removal of the run method

The transformers run command (previously transformers-cli run) is an artefact of the past: it was neither documented nor tested, and isn't part of any public documentation. We're removing it for now; please let us know if you are using it, in which case we will bring it back with better support.

Linked PR: https://github.com/huggingface/transformers/pull/42447

Environment variables

  • Legacy environment variables like TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, and PYTORCH_PRETRAINED_BERT_CACHE have been removed. Please use HF_HOME instead.
  • Constants HUGGINGFACE_CO_EXAMPLES_TELEMETRY, HUGGINGFACE_CO_PREFIX, and HUGGINGFACE_CO_RESOLVE_ENDPOINT have been removed. Please use huggingface_hub.constants.ENDPOINT instead.

Linked PR: https://github.com/huggingface/transformers/pull/42391.

Requirements update

transformers v5 pins the huggingface_hub version to >=1.0.0. See this migration guide to learn more about this major release. Here are the main aspects to know about:

  • The HTTP backend switched from requests to httpx. This change was made to improve performance and to support synchronous and asynchronous requests the same way. If you are currently catching requests.HTTPError in your codebase, you'll need to switch to httpx.HTTPError.
  • Related to the above, it is no longer possible to set proxies from your script. To handle proxies, you must set the HTTP_PROXY / HTTPS_PROXY environment variables.
  • hf_transfer, and therefore HF_HUB_ENABLE_HF_TRANSFER, has been completely dropped in favor of hf_xet. This should be transparent for most users. Please let us know if you notice any downside!

typer-slim has been added as a required dependency, used to implement both the hf and transformers CLIs.

New model additions in v5

CWM

<img width="809" height="471" alt="image" src="https://github.com/user-attachments/assets/58bb9c70-d481-48ed-ab8f-6553be7c240f" />

The Code World Model (CWM) model was proposed in CWM: An Open-Weights LLM for Research on Code Generation with World Models by Meta FAIR CodeGen Team. CWM is an LLM for code generation and reasoning about code that has, in particular, been trained to better represent and reason about how code and commands affect the state of a program or system. Specifically, we mid-trained CWM on a large number of observation-action trajectories from Python execution traces and agentic interactions in containerized environments. We post-trained with extensive multi-task RL in verifiable coding, math, and multi-turn software engineering environments.

  • Add Code World Model (CWM) by @jacobkahn in #41199

SAM3

<img width="1505" height="915" alt="image" src="https://github.com/user-attachments/assets/eec48633-f02b-464a-ae5c-c65473387e53" />

SAM3 (Segment Anything Model 3) was introduced in SAM 3: Segment Anything with Concepts.

The SAM3 addition adds four new architectures:

  • Sam3
  • Sam3Tracker
  • Sam3TrackerVideo
  • Sam3Video

SAM3 performs Promptable Concept Segmentation (PCS) on images. PCS takes text and/or image exemplars as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept.

Sam3Tracker and Sam3TrackerVideo perform Promptable Visual Segmentation (PVS) on images. PVS takes interactive visual prompts (points, boxes, masks) or text inputs to segment a specific object instance per prompt. This is the task that SAM 1 and SAM 2 focused on, and SAM 3 improves upon it. Sam3Tracker and Sam3TrackerVideo are updated versions of SAM2 Video that maintain the same API while providing improved performance and capabilities.

SAM3 Video performs Promptable Concept Segmentation (PCS) on videos. PCS takes text as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames. The model combines a detection module (SAM3) with a tracking module (SAM2-style tracker) to enable robust object tracking across video frames using text prompts.

  • Add SAM3 to 🤗 Transformers by @yonigozlan in #42285

LFM2 MoE

<img width="1080" height="849" alt="image" src="https://github.com/user-attachments/assets/a9fa1b81-114d-4054-9699-5083ac69d830" />

LFM2-MoE is a Mixture-of-Experts (MoE) variant of LFM2. The LFM2 family is optimized for on-device inference by combining short‑range, input‑aware gated convolutions with grouped‑query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.

LFM2‑MoE keeps this fast backbone and introduces sparse MoE feed‑forward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).

  • [Model] Lfm2Moe by @paulpak58 in #41401

VideoLlama 3

<img width="812" height="366" alt="image" src="https://github.com/user-attachments/assets/21c82c6e-cf0a-4d6c-a707-b9e57663ca85" />

The VideoLLaMA3 model is a major update to VideoLLaMA2 from Alibaba DAMO Academy.

  • [model] Add VideoLLaMA3 implementation by @lkhl in #40499

AudioFlamingo 3

<img width="621" height="475" alt="image" src="https://github.com/user-attachments/assets/c9616758-b3aa-41d0-bd58-695966ba146d" />

Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio–text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.

The model checkpoint is available at: nvidia/audio-flamingo-3-hf

Highlights:

  • Unified audio encoder across speech, sound, and music.
  • Long-audio support via windowing and post-pool alignment. The model processes audio in 30-second windows with a hard limit of 20 windows (10 minutes total); audio longer than 10 minutes is truncated.
  • Deterministic fusion that preserves sequence length by replacing audio placeholder tokens with audio embeddings.
  • [models] Add AudioFlamingo3 integration by @lashahub in #40290

Nanochat

NanoChat is a compact decoder-only transformer model designed for educational purposes and efficient training. It features several fundamental architectural choices common in modern transformer models, making it a good starting point for understanding their principles. NanoChat is a variant of the Llama architecture with a simplified attention mechanism and simplified normalization layers.

  • [MODEL] Nanochat implementation by @burtenshaw in #41634

Bugfixes and improvements

  • JetMoe Fix jetmoe after #40132 by @ArthurZucker in #41324
  • Fixed tiny incorrect import in gemma3 by @Sai-Suraj-27 in #41354
  • Rope for Qwen2--5-vl by @zucchini-nlp in #41173
  • 🚨 Bump to Python 3.10 and rework how we check 3rd-party libraries existence by @Cyrilvallez in #41268
  • Standardize PretrainedConfig to PreTrainedConfig by @Cyrilvallez in #41300
  • Fix trainer for py3.9 by @SunMarc in #41359
  • Check model inputs - hidden states by @zucchini-nlp in #40994
  • [ModularChecker] QOL for the modular checker by @ArthurZucker in #41361
  • Fixing a typo for BLT model by @Narsil in #41325
  • :rotating_light: [v5] Remove relative position embeddings (for bert like models) by @vasqu in #41170
  • Fix typo in model proposal template by @Ombucha in #41352
  • Better typehints for apply_chat_template by @Samoed in #41355
  • 🚨 Remove BetterTransformer by @Cyrilvallez in #41367
  • [testing] update test_longcat_generation_cpu by @ydshieh in #41368
  • Fix flash_attention.py: wrong argument passing for attn_implementation by @TKONIY in #41347
  • Use canonical get_size_with_aspect_ratio (with max_size) from transformers.image_transforms to fix #37939 by @sonianuj287 in #41284
  • Fixes in check_model_inputs, GPTBigCodeModel and ImageGPTModel by @IlyasMoutawwakil in #40811
  • Remove unnecessary list comprehension by @cyyever in #41305
  • make some ut cases pass on xpu w/ latest torch by @yao-matrix in #41337
  • Remove unused function patameters by @cyyever in #41358
  • [CB] Refactors the way we access paged by @ArthurZucker in #41370
  • serve: add non-streaming mode to /v1/responses; stream event parity; remove placeholder logprobs by @antznette1 in #41353
  • Update from pretrained error when loading by @ArthurZucker in #33380
  • [v5] Sync Bert and Bart eager attention by @vasqu in #41248
  • fix asr ut failures by @yao-matrix in #41332
  • fix resample in asr pipeline by @yhzx233 in #41298
  • Correct numerical regression in vision embeddings by @i3hz in #41374
  • [kernels] Kernel Config by @MekkCyber in #41232
  • [Cache] lfm2 cache: allocate empty kv layers during init by @paulpak58 in #41396
  • Fix test for model with dotted name and relative imports by @st81 in #41343
  • Prefer raising TypeError exception for invalid type by @Sai-Suraj-27 in #41346
  • [v5] Bump accelerate to 1.1.0 by @SunMarc in #41234
  • Fix incorrect assignment in update_device_map for GPTQ quantizer by @Sai-Suraj-27 in #41328
  • [v5] Delete left traces of feature extractor by @zucchini-nlp in #41321
  • Remove deprecation warning by @Cyrilvallez in #41425
  • Fix overriding common_kwargs defaults in processor calls by @yonigozlan in #41381
  • v5 dev version by @LysandreJik in #41436
  • Tiny Cleanup - Removed duplicate class field definition's by @Sai-Suraj-27 in #41293
  • 🚨🚨 Remove all traces of legacy cache format by @Cyrilvallez in #41378
  • 🚨 [v5] Prune prune_heads by @gante in #41417
  • [v5] Bump min version of bitsandbytes to 0.46.1 by @SunMarc in #41283
  • Fixing comments in init file by @MekkCyber in #41414
  • Use accelerator API to free device memory by @cyyever in #41195
  • enable new model uts to xpu and fix some failures on xpu by @yao-matrix in #41386
  • [torchao] Add regex support for ModuleFqnToConfig by @jerryzh168 in #41242
  • :facepalm: CB nit! by @ArthurZucker in #41413
  • Remove Python 3.9 classifier by @cyyever in #41410
  • [JetMoe] Fix KV head repetition and padding free by @vasqu in #41423
  • [testing] Fix JetMoeIntegrationTest by @ydshieh in #41377
  • Add Top-H decoding (entropy-bounded truncation) as a LogitsWarper for text generation by @ErfanBaghaei in #40837
  • Validate processing kwargs with @strict from huggingface_hub by @zucchini-nlp in #40793
  • Update hqq.md by @prathamesh-chavan-22 in #41452
  • enable some falcon-mamba uts on xpu by @yao-matrix in #41428
  • Fix generate outputs and simplify cache tests by @Cyrilvallez in #41440
  • Fix doc by @Cyrilvallez in #41457
  • 🚨 [v5] Rename left traces of past_key_value in BERT-like models by @zucchini-nlp in #41448
  • Subconfig is a class attribute by @zucchini-nlp in #41308
  • [v5] rm utils/tf_ops/ by @gante in #41402
  • Update GLM-4.1V MMRope implementation by @zRzRzRzRzRzRzR in #41182
  • [kernels] Cleanup deta kernel by @MekkCyber in #41470
  • 🚨 [v5] Rendundant code in nested configs by @zucchini-nlp in #41314
  • Remove KERAS_NLP_IMPORT_ERROR by @cyyever in #41468
  • Fix auto model configuration for encoder of perceptionlm by @fschlatt in #41464
  • Fix tests fsdp by @SunMarc in #41422
  • Import Callable from collections.abc by @cyyever in #41130
  • Pickle - part 2 by @ydshieh in #41476
  • Remove infer_device by @cyyever in #41088
  • Change RT-Detr docs to reflect fixed 640x640 input size by @konstantinos-p in #41364
  • Cleaning hub kernels by @MekkCyber in #41477
  • [v5] remove load_in_4bit and load_in_8bit by @SunMarc in #41287
  • :rotating_light: [Attention Masks] Bidirectional masks for encoder and encoder-decoder models by @vasqu in #41265
  • [Fix] Fix test file error by @YangKai0616 in #40973
  • enhance patched_tearDown to support python 3.11+ by @yao-matrix in #41429
  • RT-Detr correct 2d positional embeddings for non-square images by @konstantinos-p in #41380
  • Fix bnb fsdp loading for pre-quantized checkpoint by @SunMarc in #41415
  • Remove SigOpt by @SunMarc in #41479
  • Remove past_index by @SunMarc in #41384
  • Remove deprecated args in Trainer for v5 by @SunMarc in #41404
  • Update GLM-4.6 doc by @zRzRzRzRzRzRzR in #41471
  • report_to default changed to "none" + cleaning deprecated env var by @SunMarc in #41375
  • deprecate overwrite_output_dir by @SunMarc in #41323
  • [CI] Fix copies on main by @vasqu in #41486
  • [Trainer] deprecate ray scope by @SunMarc in #41403
  • deprecate jit_mode_eval by @SunMarc in #41376
  • Remove local_rank arg from TrainingArguments by @SunMarc in #41382
  • Update philosophy by @molbap in #41438
  • Remove DISABLE_KERNEL_MAPPING flag by @MekkCyber in #41475
  • Streaming should be handled at the request-level rather than at the istance level by @LysandreJik in #41444
  • fix bnb model loading by @jiqing-feng in #41499
  • [kernels] Remove RWKV kernel finally ! by @MekkCyber in #41493
  • [kernels] rm yoso kernel by @MekkCyber in #41495
  • Try to remove pickle - BloomTokenizerFast by @ydshieh in #41466
  • Fixed tiny incorrect imports in glm4v by @Sai-Suraj-27 in #41483
  • [Parakeet] unnecessary warning & auto mapping by @eustlb in #41412
  • [causallm tester] automate pipeline mappings + bloom tests by @gante in #41318
  • Fix some tests by @Cyrilvallez in #41503
  • fix gemma3n case failure by @yao-matrix in #41426
  • [voxtral] language detection + skipping lang:xx by @eustlb in #41225
  • Set truncation to False in Qwen3Omni to avoid default truncation by @BakerBunker in #41473
  • [QoL] modular conversion shows LoC saved by @molbap in #41500
  • More trainer cleaning by @SunMarc in #41489
  • Bump to hfh 1.0.0.rc5 to fix test by @Wauplin in #41508
  • Revert local_rank deletion and some cleaning by @SunMarc in #41504
  • Fix detectron2 import by @Cyrilvallez in #41510
  • add Trainer import to .md in appropriate cell block for training.ipynb transformers_doc by @benkeene in #41484
  • Remove outdated flags by @Cyrilvallez in #41512
  • remove tpu_num_cores by @SunMarc in #41383
  • Allow optuna's catch kwargs passthrough by @nicha-api in #41496
  • Fix Latex typesetting in documentation by @cyyever in #41177
  • [testing] reduce runtime of HunYuanMoEV1IntegrationTest:test_model_generation by @ydshieh in #41373
  • [Qwen3VL] fix: hidden_states in place modification error by @HollowMan6 in #41535
  • Add MLlama fast image processor by @yonigozlan in #41391
  • Fixed Type-hints in function defintions by @Sai-Suraj-27 in #41525
  • [SAM] Fix typing hints by @zucchini-nlp in #41506
  • Restore cuda graphs to continuous batching by @remi-or in #41421
  • Add AMD developer cloud support by @fan-amd in #41126
  • Enable modular files from other libraries by @regisss in #41372
  • 🚨 [v5] generate delegates default cache initialization to the model by @gante in #41505
  • Fixed typos and formatting by @julian-st in #34215
  • Add VideoMAE video processor by @Aki-07 in #41534
  • [from_pretrained] Small refactor from_pretrained: move around unrelated stuff by @ArthurZucker in #41445
  • Remove references to AutoModelForVision2Seq by @Rocketknight1 in #41513
  • [Qwen3VL] fix device mismatch error for FSDP2 training by @HollowMan6 in #41536
  • Patch MistralCommonTokenizer by @juliendenize in #41439
  • Fix an import error with PreTrainModel by @remi-or in #41571
  • [Qwen3VLMoe] Fixed: Expected self.dtype to be equal to src.dtype - routing_weights casting by @danielquintas8 in #41420
  • [kernels] rm mra kernels by @MekkCyber in #41507
  • delete some tokenizer tests using pickle by @ydshieh in #41514
  • Add DINOv3Backbone for ConvNext variant by @merveenoyan in #40651
  • Add conditional checks to _check_and_adjust_attn_implementation() by @zheliuyu in #41542
  • add rmsnorm kernels support for Intel XPU by @kaixuanliu in #41563
  • Revert "add rmsnorm kernels support for Intel XPU" by @MekkCyber in #41579
  • [VisionEncoderDecoderModel] Update loss function by @NielsRogge in #40863
  • Add iter to DynamicCache by @remi-or in #41569
  • Revert some breaking changes bnb by @SunMarc in #41581
  • Fix typsetting and content of llm_tutorial_optimization.md by @cyyever in #41172
  • Gemma3 fixes by @remi-or in #41572
  • Benchmark overhaul by @remi-or in #41408
  • Enable non-streaming mode in transformers serve by @LysandreJik in #41446
  • [device_map] Accelerate loading by computing device_map much faster by @Cyrilvallez in #41548
  • Add logits_to_keep to many older CausalLM models by @philiproeleveld in #41335
  • fix some case failures lead by "torch.compile recompiled part of th… by @sywangyi in #41558
  • remove ray_scope and check_quantized_param by @SunMarc in #41587
  • Update issue template by @SunMarc in #41573
  • [Docs] Fix changed references by @vasqu in #41614
  • Import expand_device_map instead of redefining it by @Cyrilvallez in #41608
  • Fix trainer simple tests by @SunMarc in #41449
  • More markdown file fixes by @cyyever in #41599
  • torch 2.9 don't ❤️ torchcodec 💔 by @ydshieh in #41610
  • Update a dataset reop link by @ydshieh in #41618
  • Add fast path for bidirectional mask creation to fix regression by @i3hz in #41586
  • enable sdpa enable gqa logic for Ascend NPU by @FightingZhen in #41601
  • Fix video processing channel format by @zucchini-nlp in #41603
  • [chat template] update when "push_to_hub" by @zucchini-nlp in #39815
  • Remove the head masking block in some vision models by @ydshieh in #41620
  • Remove deprecated code by @SunMarc in #41616
  • Fix quantization base class by @SunMarc in #41613
  • [docs] Duplicate entry by @stevhliu in #41591
  • Update executorch.md by @jackzhxng in #41582
  • Add Backbone API fine-tuning tutorial by @merveenoyan in #41590
  • 🚨 [v5] Toggle the serialization format in processors by @zucchini-nlp in #41474
  • Add aux loss for GLM-4.5V by @zRzRzRzRzRzRzR in #41564
  • Allow passing tp_plan in from_pretrained directly by @Cyrilvallez in #41435
  • Fix tokenization test by @Cyrilvallez in #41649
  • Remove randomly added script by @Cyrilvallez in #41650
  • Add missing dates to docs by @yonigozlan in #41576
  • Migrate transformers cli to Typer by @Wauplin in #41487
  • Fix FP-Quant quantization fallback CPU dispatch. by @BlackSamorez in #41619
  • fix check inputs for text2text pipeline by @jiqing-feng in #41556
  • [Executorch] Simplify for encoder models by @vasqu in #41627
  • [Ernie 4.5 Moe] Fix Moe and offloading by @vasqu in #41385
  • [CI] Build translated docs by @stevhliu in #41632
  • Fix fp32_ln for various models by @remi-or in #41605
  • Adjust device logging level and add minor fixes by @mario-koddenbrock in #41636
  • Fix EncoderDecoder cache by @remi-or in #41612
  • Format MarkDown documentation and tiny fixes by @cyyever in #41638
  • Fix typos in documentation by @cyyever in #41641
  • Fix confusing cls assignment by @cyyever in #41642
  • Double router compute? by @molbap in #41653
  • [kernels] refactor function kernel calling by @MekkCyber in #41577
  • [Fix] Deepseek V3 expert bias routing by @fjosw in #41647
  • purge HF_HUB_ENABLE_HF_TRANSFER; promote Xet by @Vaibhavs10 in #41656
  • [Masks] Fix mask handling in eager for vision models by @vasqu in #41625
  • Use | for Optional and Union typing by @cyyever in #41646
  • Switch to CB if cache_implementation == paged by @remi-or in #41655
  • Add in-out modalities as class attribute per model by @zucchini-nlp in #41366
  • Fix dtype casting with quantization by @Cyrilvallez in #41665
  • Fix serving continuous batching by @SunMarc in #41624
  • Small changes to benchmarking script by @remi-or in #41662
  • Improve package version check by @Cyrilvallez in #41661
  • improve utils/check_bad_commit.py by @ydshieh in #41658
  • Erroring when KernelConfig is passed without use_kernels = True by @MekkCyber in #41657
  • [Trainer] [Breaking change] use_cache default to False by @SunMarc in #41585
  • 🌐 [i18n-KO] Translated chat_extras.md to Korean by @Judy-Choi in #39863
  • 🌐 [i18n-KO] Translated sam_hq.md to Korean by @HyunZ118 in #41340
  • [i18n-KO] Translated big_bird.md to Korean by @ssum21 in #40445
  • 🌐 [i18n-KO] Translated code_llama.md to Korean by @Judy-Choi in #40558
  • 🌐 [i18n-KO] Translated llama4.md to Korean by @TaskerJang in #40396
  • :globe_with_meridians: [i18n-KO] Translated ko-LFM2.md to Korean by @ssum21 in #41502
  • Adding superglue fast image processing by @AlphaOrOmega in #41394
  • Fix ckpt in docs by @zucchini-nlp in #41659
  • torch 2.9 still don't ❤️ torchcodec 0.8 💔 by @ydshieh in #41686
  • Remove deprecated use_auth_token parameter by @Wauplin in #41666
  • Remove require_torch_bf16_gpu by @cyyever in #40979
  • path validation for security reason by @ydshieh in #41256
  • 🚨 Remove torchscript support by @Cyrilvallez in #41688
  • Fix MarkDown syntax by @cyyever in #41676
  • Use | for Optional and Union typing by @cyyever in #41675
  • 🚨 [v5] Refactor RoPE for layer types by @zucchini-nlp in #39847
  • Enable faiss-cpu on Windows by @cyyever in #41678
  • Fix Pylint warnings by @cyyever in #41644
  • 🚨 Remove torch.fx support by @Cyrilvallez in #41683
  • Remove skipped tests without parents by @Cyrilvallez in #41691
  • Enable FURB rules in ruff by @cyyever in #41395
  • Remove upper version bound of pandas by @cyyever in #41677
  • [Attn] Allow dynamic causality in SDPA via Kwargs by @vasqu in #41692
  • Simplify GQA conditions in sdpa_attention.py by @justinchuby in #41699
  • [docs] Manual tp-plan by @stevhliu in #41674
  • 🌐 [i18n-KO] Translated gemma3n.md to Korean by @HyunZ118 in #40873
  • pin torchcodec on CI docker image by @ydshieh in #41703
  • Update run_name docs in TrainingArguments by @tobiasofsn in #41705
  • further improve utils/check_bad_commit.py (#41658) by @ydshieh in #41690
  • feat: add benchmark v2 ci with results pushed to dataset by @McPatate in #41672
  • Gemma3 conversion script maintenance by @RyanMullins in #41704
  • Fix Qwen3-Omni inference when mixing video and image inputs in one batch by @BakerBunker in #41741
  • Fix typo in LFM-VL by @zucchini-nlp in #41742
  • Revert "Remove upper version bound of pandas" by @ydshieh in #41744
  • [doc] remove broken notebooks on AMD Dev Cloud by @pagezyhf in #41743
  • Update type hints in tokenization_utils.py to use | syntax by @faizan842 in #41713
  • Fix documentation issues by @cyyever in #41726
  • Apply RUFF PIE rules by @cyyever in #41727
  • Small Fix for imports by @MekkCyber in #41411
  • Docs(zh-hans): Refine wording for professionalism in README by @Ri-Nai in #40943
  • Add vision contribution guide by @molbap in #41456
  • upgrade xpu docker file to torch 2.8 by @yao-matrix in #41551
  • [v5] Delete videos from image processing classes by @zucchini-nlp in #41607
  • Fixed incorrect model_type for qwen2vl and qwen2.5vl when config is saved and loaded again by @i3hz in #41758
  • [kernels] Add version to function mapping by @MekkCyber in #41685
  • Reduce warning noise caused by Tensor.new_tensor by @st81 in #41748
  • Fix graphormer model compilation with Cython 3.1.4 by @alexmalyshev in #41671
  • Update type hints in modeling_rope_utils.py to use | syntax by @faizan842 in #41714
  • [v5] Remove deprecated tranformers.onnx by @echarlaix in #41700
  • Modernize CLIP modeling code by @molbap in #41546
  • Simplify pipeline padding logic by @Rocketknight1 in #41667
  • Chat response parsing by @Rocketknight1 in #40894
  • Add LightGlue fast image processor by @yonigozlan in #41670
  • Fix bark after #41445 by @ydshieh in #41645
  • Remove invalid @staticmethod from module-level get_device_and_memory_breakdown by @albertvillanova in #41747
  • Fix CUDA index out of bounds for q_idx in VLM token type masking for Gemma3, PaliGemma, and example modular by @albertvillanova in #41757
  • fix: Gemma 3 weights conversion vision and multimodal projector paths by @RyanMullins in #41767
  • [v5] Delete legacy chat template saving by @zucchini-nlp in #41648
  • [quantization] fix compressed_tensors tests by @MekkCyber in #41780
  • [quantization] Skip Fp8 tests when hardware capability < 8.9 by @MekkCyber in #41785
  • Swap columns and rows of the grid layout in LFM2-VL by @ankke in #41755
  • fix type annotation typo in docstring by @johntheprime in #41788
  • Fix chat schema tests by @Rocketknight1 in #41793
  • Fix attention mask in mamba layers by @zucchini-nlp in #41790
  • [quantization] fix torchao tests after 0.14.0 release by @MekkCyber in #41777
  • [Onnx docs] Remove some traces by @vasqu in #41791
  • flash attn pytest marker by @ydshieh in #41781
  • Bump AMD docker by @remi-or in #41792
  • make apollo test case pass by @yao-matrix in #41805
  • Add a safeguard around a flaky test in gemma2 by @remi-or in #41811
  • Fix Qwen3Next dtype API usage by @SrijanUpadhyay in #41735
  • [Trainer] remove env vars by @SunMarc in #41697
  • Fixed grammar mistakes by @FrogWarlord in #41799
  • Fixed some grammar mistakes by @FrogWarlord in #41802
  • transformers cli default flag fix by @ArjunPimpale in #41761
  • Deprecate warmup_ratio by @SunMarc in #41326
  • transformers serve quantization docs + some api fixes for bitsandbytes by @SunMarc in #41253
  • [Parakeet] add output_attention_mask by @eustlb in #41694
  • unpin torch/torchcodec for CircleCI by @ydshieh in #41839
  • extend bitnet cases to xpu, all 8 cases pass by @yao-matrix in #41831
  • extend 2 trainer test cases to xpu by @yao-matrix in #41829
  • extend 2 blip2 and falcon_h1 test cases to xpu by @yao-matrix in #41825
  • further reducing flakiness in utils/check_bad_commit.py (#41658) by @ydshieh in #41815
  • Remove redundant code from Qwen3VLProcessor by @Xqle in #41836
  • Fix MXFP4 quantizer to support variable num_local_experts and hidden_size by @marksverdhei in #41795
  • Fix Qwen2Audio flash attention mask format for generation by @Abdennacer-Badaoui in #41843
  • Fix const parsing for dict inputs in chat schemas by @Rocketknight1 in #41824
  • Share embedding modules in BART, not only weights by @githubnemo in #41821
  • Fix TypeError: find_adapter_config_file() got an unexpected keyword argument '_adapter_model_path' by @albertvillanova in #41604
  • 🚨 [Clip] Fix masking and enable flash attention on all model types by @vasqu in #41750
  • CI workflow for Flash Attn by @ydshieh in #41857
  • Fix torch.no_grad decorator in VLMS by @yaswanth19 in #41888
  • Fix installation cmds in docs by @yaswanth19 in #41887
  • revert changes in _is_package_available by @MekkCyber in #41891
  • make lfm2_moe integration test pass on XPU by @yao-matrix in #41796
  • Fix: avoid duplicate token in maybe_load_adapters by @luaenrique in #41903
  • speed up loading checkpoints for zero stage 3 by @ri938 in #41850
  • evaluate>=0.4.6 is needed by @stas00 in #41920
  • Add 6 huggingface notebooks on AMD dev cloud by @fan-amd in #41883
  • Fix invalid examples in QwenVL model docstrings and add Qwen3VL example by @Xqle in #41812
  • Allow parse_response to accept token IDs by @Rocketknight1 in #41849
  • Fix Florence2 conversion script model_type KeyError by @i3hz in #41866
  • Update some workflow files by @ydshieh in #41892
  • fix some ut failures on XPU w/ torch 2.9 by @yao-matrix in #41923
  • Cache latest pytorch amd image locally on mi325 CI runner cluster by @jitesh-gupta in #41926
  • Minor fix in docker image build workflow by @ydshieh in #41949
  • fix some ut failures on XPU w/ torch 2.9 by @yao-matrix in #41941
  • Fix rope_parameters for gemma3 weights conversion script by @douglas-reid in #41922
  • Fix: Gemma3TextConfig rope scaling assignments by @RyanMullins in #41934
  • fix prepare_config_and_inputs_for_common bug in llava test by @yao-matrix in #41942
  • Fix: prevent .gitignore truncation in run_clm_no_trainer.py by @luaenrique in #41957
  • V4.57.1 training ci: Refactor test_tensor_parallel.py by @3outeille in #41918
  • [v5] Return a BatchEncoding dict from apply_chat_template by default by @Rocketknight1 in #41626
  • make recurrent_gemma and voxtral cases pass on xpu by @yao-matrix in #41958
  • Fix typo in image_processing_lfm2_vl_fast by @yonigozlan in #41940
  • Run slow v2 by @ydshieh in #41914
  • Fix detectron2 installation in docker files by @ydshieh in #41975
  • Fix autoawq[kernels] installation in quantization docker file by @ydshieh in #41978
  • add support for saving encoder only so any parakeet model can be loaded for inference by @nithinraok in #41969
  • Use indices as position_ids in modernebert by @remi-or in #41789
  • test tensor parallel: make tests for dense model more robust by @3outeille in #41968
  • fix: dict[RopeParameters] to dict[str, RopeParameters] by @RyanMullins in #41963
  • docs: add continuous batching page by @McPatate in #41847
  • Fix torchcodec version in quantization docker file by @ydshieh in #41988
  • [kernels] Add Tests & CI for kernels by @MekkCyber in #41765
  • Move the Mi355 to regular docker by @remi-or in #41989
  • More data in benchmarking by @remi-or in #41848
  • fix (CI): Refactor SSH runners by @glegendre01 in #41991
  • fix 3 failed test cases for video_llama_3 model on Intel XPU by @kaixuanliu in #41931
  • Integrate colqwen2.5 using colqwen2 modelling code by @sahil-kabir in #40600
  • Fixed wrong padding value in OWLv2 by @gjamesgoenawan in #41938
  • Fix run slow v2: empty report when there is only one model by @ydshieh in #42002
  • [kernels] change import time in KernelConfig by @MekkCyber in #42004
  • DOC Fix typo in argument name: pseudoquant by @BenjaminBossan in #41994
  • Fix torch+deepspeed docker file by @ydshieh in #41985
  • Correct syntax error in trainer.md by @Yacklin in #42001
  • Reduce the number of benchmark in the CI by @remi-or in #42008
  • Fix continuous batching tests by @Rocketknight1 in #42012
  • add back logging_dir by @SunMarc in #42013
  • Fix issue with from pretrained and kwargs in image processors by @yonigozlan in #41997
  • Fix default image_rows and image_cols initialization in Idefics3 and SmolVLM processors by @MilkClouds in #41871
  • Add GLPNImageProcessorFast by @Aravind-11 in #41725
  • add fuyu fast image processors by @DeXtAr47-oss in #41817
  • [kernels] Fix XPU layernorm kernel by @MekkCyber in #41583
  • [v5] Deprecate Text2Text and related pipelines by @Rocketknight1 in #41996
  • [FPQuant] MXFP8 and MXFP4 backwards support by @BlackSamorez in #41897
  • fix deepspeed in AMD docker file by @ydshieh in #42025
  • CodeQL workflow for security analysis by @paulinebm in #42015
  • [tests] Add Context-parallel CI tests by @kashif in #41860
  • extend fp_quant cases to xpu by @yao-matrix in #41833
  • Change trigger time for AMD CI by @ydshieh in #42034
  • Fix the order of methods in processor loading by @zucchini-nlp in #42031
  • 🔴 Isolate prefill from generation loops by @manueldeprada in #40652
  • update huggingface_hub dependency version by @hanouticelina in #42033
  • Remove some custom datasets defined in codebase by @ydshieh in #41511
  • Cleanup workflow - part 1 by @ydshieh in #42023
  • Fix pr_slow_ci_suggestion.yml after #42023 by @ydshieh in #42049
  • Fix AutoImageProcessor.register and documentation in auto processing modules by @MilkClouds in #41864
  • Fix Qwen3-Omni RoPE by @zucchini-nlp in #41778
  • Avoid explicit checkout in workflow by @ydshieh in #42057
  • Annoying typo in attention error message by @manueldeprada in #42037
  • Be careful at explicit checkout actions by @ydshieh in #42060
  • Fix another Argument list too long in pr_slow_ci_suggestion.yml by @ydshieh in #42061
  • Fix KeyError in GPT-OSS weight conversion script by @Aznix07 in #42007
  • Fix KeyError in _is_package_available for packages with dotted names by @yashwantbezawada in #42050
  • Revert back to use GitHub context by @ydshieh in #42066
  • Fix missing arg in check_docstring by @yonigozlan in #42054
  • [deepspeed tests fixes] by @stas00 in #41925
  • Fix logic in setting self.fsdp when it is False by @roychan in #41974
  • fix tensor device placement issue of 2 UT cases by @yao-matrix in #41921
  • add workflow to check permissions and advise a set of permissions req… by @paulinebm in #42071
  • Fix security issue 5 by @paulinebm in #42072
  • Fix inconsistency of commit sha during the workflow run by @ydshieh in #42074
  • QwenVL: add skipped keys in setattr as well by @zucchini-nlp in #41808
  • permissions worflows fix by @paulinebm in #42080
  • 4.1V Model and GLM-4.5V Model Conversion Code Updates by @zRzRzRzRzRzRzR in #41784
  • feat(ci): add continuous batching to benchmarks by @McPatate in #41916
  • Fix modular docstring for Mixtral by @diegoakel in #42041
  • Fix Auto classes to support dynamically registered processors by @MilkClouds in #41865
  • Reinstate self.scaling in Gemma3nTextAttention by @RyanMullins in #41751
  • [v5] 🚨Refactor subprocessors handling in processors by @yonigozlan in #41633
  • add xpu support in test_modeling_janus.py::JanusIntegrationTest::test… by @sywangyi in #41986
  • Revert "permissions worflows fix" by @ydshieh in #42110
  • Fix return metadata checking logic by @Xqle in #42108
  • Correctly handle unbatched audio inputs in Gemma3nAudioFeatureExtractor by @kho in #42076
  • [Bugfix] fix qwen3vl expand generation with video by @JJJYmmm in #42089
  • Fix base model prefix in VLMs by @zucchini-nlp in #42059
  • fix continuous batching issues, extend ut cases to xpu by @yao-matrix in #41830
  • 📝 docs(smolvlm): fix variable name in batch inference example by @gorkachea in #42123
  • fix qwen2vl/qwen3vl video processor temporal padding when num_frames%temporal_patch_size!=1 by @yaogang2060 in #42083
  • [Attn Masks] Non-vmap default for attention masks by @vasqu in #41852
  • Fix GPT-2 Flash Attention 2 generation with left-padding by @Abdennacer-Badaoui in #41966
  • Fix model name test for compressed tensors by @SunMarc in #42128
  • Fix MaskFormer/Mask2Former fast image processors by @yonigozlan in #41393
  • Remove unused functions in image_transforms.py by @yaswanth19 in #42044
  • update deps table by @ArthurZucker in #42120
  • fix: improve video processing fps assignment logic by @Xqle in #42009
  • Fix T5Gemma module structure by @Cyrilvallez in #42145
  • DataCollatorForLanguageModeling warning error fixed by @mjaliz in #42144
  • Bugfix/remove emojis from print by @7amim in #42091
  • Avoid mutating user-provided arguments in preprocessing utils by @LeonardoEmili in #42126
  • Enforce check_auto_docstring by @yonigozlan in #41635
  • Add dinov3 autobackbone by @vijayabhaskar-ev in #41276
  • Fix logic error in prepare_inputs_for_generation cache slicing condition by @albertvillanova in #41764
  • 🚨 Fix gradient checkpointing for several models and improve test robustness by @githubnemo in #41818
  • [T5Gemma] Fix cross attention cache by @vasqu in #41890
  • T5 migration to new masking interface by @Aravind-11 in #41804
  • fix: improve visibility of ValueError root causes in model config loading by @scottzh8 in #41972
  • add xpu to valid hardware for torch.compile by @sywangyi in #42079
  • extend test_beam_search_early_stop_heuristic case to other device by @sywangyi in #42078
  • fix failure of tests/models/shieldgemma2/test_modeling_shieldgemma2.p… by @sywangyi in #42022
  • Fixes Flash Attention implementation for models by @i3hz in #42149
  • fix test failure of speculative_generation on xpu by @sywangyi in #42052
  • add rmsnorm kernels support for npu by @zheliuyu in #42106
  • update torchao doc by @jiqing-feng in #42139
  • feat(kernels): add opt-out flag to disable kernels hub usage through the lib by @mfuntowicz in #41990
  • handle inputs from Siglip/Siglip2 non-automapped encoder layers by @molbap in #41930
  • Add slow to some examples tests by @SunMarc in #42164
  • fix(ci): unexpected keyword argument streaming by @McPatate in #42102
  • pin pytest<9 for now by @ydshieh in #42162
  • Docs/i18n updates by @lilin-1 in #42006
  • Fix in-place modification of user-input in SAM2 embed boxes by @xenova in #42173
  • [Pop2Piano] Fix cache usage by @vasqu in #42170
  • Fix helper fn for new processor config format by @zucchini-nlp in #42085
  • Remove unnecessary slicing in sdpa_attention_forward by @justinchuby in #41900
  • [PEFT] Fix prefix tuning by @vasqu in #41696
  • [typo] fix mrope-interleave annotation to avoid ambiguity by @JJJYmmm in #42177
  • Update transformers to support FqnToConfig by @jcaip in #41894
  • [PEFT] Fix the general test for prefix tuning by @vasqu in #42185
  • [TP] Fix parameter detection issue and some invalid TP-plans by @Cyrilvallez in #42129
  • Refactor weight loading by @ArthurZucker in #41580
  • 🚨 Delete deprecations with end-cycle in v4.xx and v5.0 by @zucchini-nlp in #41681
  • Add AutoTokenizer mapping for mistral3 and ministral by @patrickvonplaten in #42198
  • Fix checkpoint loading with DeepSpeed ZeRO3 by @tohtana in #42201
  • [Pop2Piano] Fix tied weights by @vasqu in #42193
  • New docker from AMD by @remi-or in #42208
  • Add cross links for model contribution by @zucchini-nlp in #42207
  • Stop inheriting tests! by @Rocketknight1 in #42192
  • Refactor check_auto_docstring using AST by @yonigozlan in #41432
  • [BLT] Fix cache usage by @vasqu in #42188
  • Update test_dynamic_cache_exportability_multiple_run (failing on torch 2.10 nightly) by @ydshieh in #42212
  • Much more efficient and clear weight initialization and tie weights by @Cyrilvallez in #42191
  • GLM-V update with new processor by @zRzRzRzRzRzRzR in #42122
  • Fix initialization guard for pytest by @Cyrilvallez in #42234
  • Fix TP plans for MoE models by @Cyrilvallez in #42236
  • Add prefix sharing to continuous batching by @remi-or in #42094
  • Loading optimization by @Cyrilvallez in #42239
  • calls AttentionMaskConverter._unmask_unattended for xpu device before by @kaixuanliu in #42230
  • FIX Broken PEFT adapter loading by @BenjaminBossan in #42187
  • Fix processor test for glm by @molbap in #42233
  • Fix UnboundLocalError in RT-DETR loss computation by @yashwantbezawada in #42224
  • Stop inheriting tests (again) by @Rocketknight1 in #42247
  • [loading] Fix device when source and target are different by @Cyrilvallez in #42246
  • Reduce timing on CircleCI - part 1 (Use @slow for IntegrationTests) by @ydshieh in #42206
  • 🚨 Delete generation params from model config by @zucchini-nlp in #41695
  • Allow VLMs to have a correct base_model by @zucchini-nlp in #41589
  • Make tests run in less time by reducing batch_size by @ydshieh in #42213
  • Revert "Make tests run in less time by reducing batch_size" by @ydshieh in #42258
  • Cleanup reference to TFBertTokenizer and TFGPT2Tokenizer by @Rocketknight1 in #42182
  • delete already deprecated models by @ydshieh in #42235
  • Fix bnb for the weights refactor by @SunMarc in #42043
  • Fix looping in torch guard decorator by @Cyrilvallez in #42260
  • 🚨 Generalize get_decoder() for multimodal and delete redundant code 🔪 by @zucchini-nlp in #42156
  • Audio Flamingo3 - fix attention masking by @zucchini-nlp in #42278
  • Add support for torch device objects in device validator by @yonigozlan in #42267
  • Remove doc files of other langs for deleted models by @ydshieh in #42276
  • [testing] fix cwm by @ydshieh in #42261
  • fix a typo: pbd -> pdb by @jaeminoh in #42268
  • Enable glm46v UTs on XPU by @YangKai0616 in #42274
  • [testing] fix some cases in xpu by @sywangyi in #42273
  • Remove random flag by @Cyrilvallez in #42282
  • Fix accelerate integration by @Cyrilvallez in #42264
  • Fix validation checks order in benchmark_v2 by @Abdennacer-Badaoui in #42280
  • Update torchcodec to match torchaudio version by @remi-or in #42288
  • Use torch.get_autocast_dtype instead of torch.get_autocast_gpu_dtype by @qgallouedec in #42055
  • perf: Optimization for Min-p sampling implementation by @casinca in #42248
  • Fix device_map computation part 2 by @Cyrilvallez in #42290
  • Fixed the docstring for WhisperFeatureExtractor by @TopCoder2K in #42286
  • avoiding conditional indexing in positionalencoding to avoid possibil… by @ppadjinTT in #42090
  • ENH: Add support for LoRA hotswapping by @BenjaminBossan in #41297
  • Fix Break change of AWQ FusedModules due to Attention Refactor by @fanqiNO1 in #41909
  • Remove error string test that was failing by @Rocketknight1 in #42301
  • Properly protect the is_compiling checks by @Cyrilvallez in #42304
  • Remove outdated methods in modeling_utils.py by @Cyrilvallez in #42302
  • Fix Mac mps dataloader_num_workers > 1 causes RuntimeError: share_filename: only available on CPU by @AmitMY in #38819
  • Fix the init_weights for the MoE models by @Cyrilvallez in #42306
  • Update link to generation strategies documentation by @omkar-334 in #42252
  • Update conversion mapping to separate renaming from converting by @ArthurZucker in #42254
  • fix(granitemoe*): Only create block_sparse_moe if num_local_experts > 0 by @gabe-l-hart in #42036
  • [SAM3 Video] Add support for multi prompts by @yonigozlan in #42293
  • Add Pix2Struct fast image processor by @yonigozlan in #42020
  • Fix post processing methods in keypoints matching models by @yonigozlan in #42018
  • fix tests/models/xcodec/test_modeling_xcodec.py::XcodecIntegrationTest by @sywangyi in #42272
  • [loading] Fix device detection by @Cyrilvallez in #42323
  • Fix typo from side_dict to size_dict by @nihui in #42319
  • HF Trainer: ALST/Ulysses sequence parallelism integration via HF Accelerate by @stas00 in #41832
  • Fix gpt2 modeling tests by @Abdennacer-Badaoui in #42321
  • [loading] Use fewer threads by default for much better performances by @Cyrilvallez in #42324
  • Allow LayoutLMV3Processor to accept rescale_factor by @Rocketknight1 in #42305
  • Correctly create tied key mapping in post_init, and dynamic tie weight by @Cyrilvallez in #42270
  • [CI] Skip EfficientLoFTR test by @vasqu in #42327
  • [XPU] Add flash_attn2 support for XPU by @YangKai0616 in #41956
  • [Attn Masks] Lift bidirectional mask restriction on eager by @vasqu in #42325
  • fix bug when gemma3n model run on multiple device by @kaixuanliu in #42303
  • Fix ChineseCLIPModel.get_text_features by @JiangJQ2000 in #42351
  • Gemma3 hybrid fix by @remi-or in #42287
  • fix(benchmarks): correct sdpa_backend inconsistency and attn_implementation for continuous batching by @engmohamedsalah in #42339
  • Auto convert tekken.json by @ArthurZucker in #42299
  • [loading] Re-add and improve disk offloading support by @Cyrilvallez in #42242
  • Fix typo - indentation in JSON dump example by @anthropikos in #42332
  • Fix tied weight for Bart (for BC) by @Cyrilvallez in #42355
  • Fix reference to yelp dataset by @JuanFKurucz in #42349
  • Fix documentation reference to pytorch max memory allocated by @JuanFKurucz in #42350
  • Fix reference to imagenet 1k dataset by @JuanFKurucz in #42348
  • Fix typos by @omahs in #42354
  • Protect torch.distributed imports by @Cyrilvallez in #42361
  • Expand npu device for KernelConfig by @zheliuyu in #42358
  • Replace Optional and Union typing with | in some source files by @cyyever in #42294
  • Fix code examples to load gpt 1 openai community model by @JuanFKurucz in #42347
  • fix tekken pattern matching by @ArthurZucker in #42363
  • Fixed-wrong-ZeRO3-json-snippet-found-in-deepspeed-markdown-file by @Yacklin in #42346
  • Make benchmarking lighter: clean-up result files and remove non-needed arguments by @remi-or in #42357
  • Add image processor fast vitpose by @yonigozlan in #42021
  • Small tp fix by @ArthurZucker in #42366
  • Remove test inheritance for EfficientLoftr, rename KeypointMatchingOutput to model specific name by @yonigozlan in #42365
  • Tiny doc fix by @molbap in #42296
  • Fix TimesFM patch normalization instability by @AnMakc in #42099
  • [core] Fix torchao by @MekkCyber in #42289
  • Fix tp by @ArthurZucker in #42368
  • [Attn Masks] Add skip option for non-packed sequences by @vasqu in #42367
  • 📚 docs(granite-speech): add comprehensive usage examples by @gorkachea in #42125
  • Xcodec fix by @eustlb in #42095
  • Replace Optional and Union typing with | in some source files by @cyyever in #42372
  • [Mistral Tokenizers] Fix tokenizer detection by @vasqu in #42389
  • misc don't recreate it by @ArthurZucker in #42394
  • [SAM3] Fix precompute vision_embeds or text_embeds for inference by @yonigozlan in #42407
  • 🚨 Image-text pipeline expects correctly formatted chat by @zucchini-nlp in #42359
  • Many small fixes for the CI by @remi-or in #42364
  • [core] fix mxfp4 by @MekkCyber in #42382
  • fixed json syntax error for zero2 configuration file found in deepspeed.md by @Yacklin in #42406
  • GLM4V - delete duplicate config attribute by @zucchini-nlp in #42416
  • 🚨 Remove generic output_attentions warning by @Aravind-11 in #42334
  • Bart config doesn't need generation parameters by @zucchini-nlp in #42337
  • Simplify and standardize processor tests by @yonigozlan in #41773
  • Clean bnb integration using weight converter by @SunMarc in #42426
  • Any to any pipeline and auto-mapping by @zucchini-nlp in #40884
  • Fix processor usage + add chat_template support to TTS pipeline, and shift common chat template logic to base class. by @ebezzam in #42326
  • [fp8] fix scales param name by @MekkCyber in #42434
  • Fix an edge case for get_encoder() by @zucchini-nlp in #42295
  • Disable loss rounding in training stats log by @AnMakc in #42104
  • Benchmark simplification by @remi-or in #42408
  • Future annotations break FastAPI by @LysandreJik in #42450
  • [cleanup] Don't use Repository in create_dummy_models.py script by @Wauplin in #42380
  • [cleanup] Remove deprecated load config from file by @Wauplin in #42383
  • [FA] Cleanup loading logic by @vasqu in #41427
  • tiny fix for deepseekocr support [vllm] by @molbap in #42423
  • fix: Restore explicit .keys() calls for TensorDict compatibility by @pankajbaid567 in #42373
  • Transformers serve -> list all generative models from the cache by @LysandreJik in #42146
  • 🚨 [v5][PEFT] Bump min version requirement of PEFT to 0.18.0 by @BenjaminBossan in #41889
  • [cleanup] Offline mode and cache dir from huggingface_hub constants + cleanup in PushToHubMixin by @Wauplin in #42391
  • Correctly return finish reason length when finished by @LysandreJik in #42157
  • FIX: Minimal fix for loading PEFT weights by @BenjaminBossan in #42387
  • Let's break Qwen-VL 🚨 by @zucchini-nlp in #42420
  • [CI] Add to run slow by @vasqu in #42459
  • Fix the "test_offline" test by @LysandreJik in #42458
  • transformers chat launched without base_url has a direct tie to localhost:8000 by @LysandreJik in #42463
  • update with more recent tts models by @Deep-unlearning in #42328
  • rm slow tokenizers by @itazap in #40936
  • [loading/saving] Reverse all loading operations when saving by @Cyrilvallez in #42396
  • Fix T5 tests: use generation_config for generation parameters by @Abdennacer-Badaoui in #42419
  • remove reference to TF models from docs by @zucchini-nlp in #42443
  • [Trainer] use output.loss when using liger-kernel by @kashif in #42444
  • replace source_keys and target_keys by @SunMarc in #42471
  • Update migration guide - generation config by @zucchini-nlp in #42470
  • 🚨 Move rotary_partial_emb to RopeParams and delete unnecessary code 🔪 by @zucchini-nlp in #42255
  • Fix doc builds by @Rocketknight1 in #42478
  • extend CwmIntegrationTest to xpu by @sywangyi in #42314
  • add require_deterministic_for_xpu to make the case pass in xpu by @sywangyi in #42439
  • Skip failing irrelevant test for ColQwen2 by @Rocketknight1 in #42480
  • [quantization] make torchao tests slow by @MekkCyber in #42482
  • Fix gpt2 tokenizer add_prefix_space default value by @SunMarc in #42481

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ArthurZucker
    • JetMoe Fix jetmoe after #40132 (#41324)
    • [ModularChecker] QOL for the modular checker (#41361)
    • [CB] Refactors the way we access paged (#41370)
    • Update from pretrained error when loading (#33380)
    • 🤦 CB nit! (#41413)
    • [from_pretrained] Small refactor from_pretrained: move around unrelated stuff (#41445)
    • update deps table (#42120)
    • Refactor weight loading (#41580)
    • Update conversion mapping to separate renaming from converting (#42254)
    • Auto convert tekken.json (#42299)
    • fix tekken pattern matching (#42363)
    • Small tp fix (#42366)
    • Fix tp (#42368)
    • misc don't recreate it (#42394)
  • @vasqu
    • 🚨 [v5] Remove relative position embeddings (for bert like models) (#41170)
    • [v5] Sync Bert and Bart eager attention (#41248)
    • [JetMoe] Fix KV head repetition and padding free (#41423)
    • 🚨 [Attention Masks] Bidirectional masks for encoder and encoder-decoder models (#41265)
    • [CI] Fix copies on main (#41486)
    • [Docs] Fix changed references (#41614)
    • [Executorch] Simplify for encoder models (#41627)
    • [Ernie 4.5 Moe] Fix Moe and offloading (#41385)
    • [Masks] Fix mask handling in eager for vision models (#41625)
    • [Attn] Allow dynamic causality in SDPA via Kwargs (#41692)
    • [Onnx docs] Remove some traces (#41791)
    • 🚨 [Clip] Fix masking and enable flash attention on all model types (#41750)
    • [Attn Masks] Non-vmap default for attention masks (#41852)
    • [T5Gemma] Fix cross attention cache (#41890)
    • [Pop2Piano] Fix cache usage (#42170)
    • [PEFT] Fix prefix tuning (#41696)
    • [PEFT] Fix the general test for prefix tuning (#42185)
    • [Pop2Piano] Fix tied weights (#42193)
    • [BLT] Fix cache usage (#42188)
    • [CI] Skip EfficientLoFTR test (#42327)
    • [Attn Masks] Lift bidirectional mask restriction on eager (#42325)
    • [Attn Masks] Add skip option for non-packed sequences (#42367)
    • [Mistral Tokenizers] Fix tokenizer detection (#42389)
    • [FA] Cleanup loading logic (#41427)
    • [CI] Add to run slow (#42459)
  • @ydshieh
    • [testing] update test_longcat_generation_cpu (#41368)
    • [testing] Fix JetMoeIntegrationTest (#41377)
    • Pickle - part 2 (#41476)
    • Try to remove pickle - BloomTokenizerFast (#41466)
    • [testing] reduce runtime of HunYuanMoEV1IntegrationTest:test_model_generation (#41373)
    • delete some tokenizer tests using pickle (#41514)
    • torch 2.9 don't ❤️ torchcodec 💔 (#41610)
    • Update a dataset repo link (#41618)
    • Remove the head masking block in some vision models (#41620)
    • improve utils/check_bad_commit.py (#41658)
    • torch 2.9 still don't ❤️ torchcodec 0.8 💔 (#41686)
    • path validation for security reason (#41256)
    • pin torchcodec on CI docker image (#41703)
    • further improve utils/check_bad_commit.py (#41658) (#41690)
    • Revert "Remove upper version bound of pandas" (#41744)
    • Fix bark after #41445 (#41645)
    • flash attn pytest marker (#41781)
    • unpin torch/torchcodec for CircleCI (#41839)
    • further reducing flakiness in utils/check_bad_commit.py (#41658) (#41815)
    • CI workflow for Flash Attn (#41857)
    • Update some workflow files (#41892)
    • Minor fix in docker image build workflow (#41949)
    • Run slow v2 (#41914)
    • Fix detectron2 installation in docker files (#41975)
    • Fix autoawq[kernels] installation in quantization docker file (#41978)
    • Fix torchcodec version in quantization docker file (#41988)
    • Fix run slow v2: empty report when there is only one model (#42002)
    • Fix torch+deepspeed docker file (#41985)
    • fix deepspeed in AMD docker file (#42025)
    • Change trigger time for AMD CI (#42034)
    • Remove some custom datasets defined in codebase (#41511)
    • Cleanup workflow - part 1 (#42023)
    • Fix pr_slow_ci_suggestion.yml after #42023 (#42049)
    • Avoid explicit checkout in workflow (#42057)
    • Be careful at explicit checkout actions (#42060)
    • Fix another Argument list too long in pr_slow_ci_suggestion.yml (#42061)
    • Revert back to use GitHub context (#42066)
    • Fix inconsistency of commit sha during the workflow run (#42074)
    • Revert "permissions worflows fix" (#42110)
    • pin pytest<9 for now (#42162)
    • Update test_dynamic_cache_exportability_multiple_run (failing on torch 2.10 nightly) (#42212)
    • Reduce timing on CircleCI - part 1 (Use @slow for IntegrationTests) (#42206)
    • Make tests run in less time by reducing batch_size (#42213)
    • Revert "Make tests run in less time by reducing batch_size" (#42258)
    • delete already deprecated models (#42235)
    • Remove doc files of other langs for deleted models (#42276)
    • [testing] fix cwm (#42261)
  • @cyyever
    • Remove unnecessary list comprehension (#41305)
    • Remove unused function patameters (#41358)
    • Use accelerator API to free device memory (#41195)
    • Remove Python 3.9 classifier (#41410)
    • Remove KERAS_NLP_IMPORT_ERROR (#41468)
    • Import Callable from collections.abc (#41130)
    • Remove infer_device (#41088)
    • Fix Latex typesetting in documentation (#41177)
    • Fix typsetting and content of llm_tutorial_optimization.md (#41172)
    • More markdown file fixes (#41599)
    • Format MarkDown documentation and tiny fixes (#41638)
    • Fix typos in documentation (#41641)
    • Fix confusing cls assignment (#41642)
    • Use | for Optional and Union typing (#41646)
    • Remove require_torch_bf16_gpu (#40979)
    • Fix MarkDown syntax (#41676)
    • Use | for Optional and Union typing (#41675)
    • Enable faiss-cpu on Windows (#41678)
    • Fix Pylint warnings (#41644)
    • Enable FURB rules in ruff (#41395)
    • Remove upper version bound of pandas (#41677)
    • Fix documentation issues (#41726)
    • Apply RUFF PIE rules (#41727)
    • Replace Optional and Union typing with | in some source files (#42294)
    • Replace Optional and Union typing with | in some source files (#42372)
  • @yao-matrix
    • make some ut cases pass on xpu w/ latest torch (#41337)
    • fix asr ut failures (#41332)
    • enable new model uts to xpu and fix some failures on xpu (#41386)
    • enable some falcon-mamba uts on xpu (#41428)
    • enhance patched_tearDown to support python 3.11+ (#41429)
    • fix gemma3n case failure (#41426)
    • upgrade xpu docker file to torch 2.8 (#41551)
    • make apollo test case pass (#41805)
    • extend bitnet cases to xpu, all 8 cases pass (#41831)
    • extend 2 trainer test cases to xpu (#41829)
    • extend 2 blip2 and falcon_h1 test cases to xpu (#41825)
    • make lfm2_moe integration test pass on XPU (#41796)
    • fix some ut failures on XPU w/ torch 2.9 (#41923)
    • fix some ut failures on XPU w/ torch 2.9 (#41941)
    • fix prepare_config_and_inputs_for_common bug in llava test (#41942)
    • make recurrent_gemma and voxtral cases pass on xpu (#41958)
    • extend fp_quant cases to xpu (#41833)
    • fix tensor device placement issue of 2 UT cases (#41921)
    • fix continuous batching issues, extend ut cases to xpu (#41830)
  • @MekkCyber
    • [kernels] Kernel Config (#41232)
    • Fixing comments in init file (#41414)
    • [kernels] Cleanup deta kernel (#41470)
    • Cleaning hub kernels (#41477)
    • Remove DISABLE_KERNEL_MAPPING flag (#41475)
    • [kernels] Remove RWKV kernel finally ! (#41493)
    • [kernels] rm yoso kernel (#41495)
    • [kernels] rm mra kernels (#41507)
    • Revert "add rmsnorm kernels support for Intel XPU" (#41579)
    • [kernels] refactor function kernel calling (#41577)
    • Erroring when KernelConfig is passed without use_kernels = True (#41657)
    • Small Fix for imports (#41411)
    • [kernels] Add version to function mapping (#41685)
    • [quantization] fix compressed_tensors tests (#41780)
    • [quantization] Skip Fp8 tests when hardware capability < 8.9 (#41785)
    • [quantization] fix torchao tests after 0.14.0 release (#41777)
    • revert changes in _is_package_available (#41891)
    • [kernels] Add Tests & CI for kernels (#41765)
    • [kernels] change import time in KernelConfig (#42004)
    • [kernels] Fix XPU layernorm kernel (#41583)
    • [core] Fix torchao (#42289)
    • [core] fix mxfp4 (#42382)
    • [fp8] fix scales param name (#42434)
    • [quantization] make torchao tests slow (#42482)
  • @paulpak58
    • [Cache] lfm2 cache: allocate empty kv layers during init (#41396)
    • [Model] Lfm2Moe (#41401)
  • @gante
    • 🚨 [v5] Prune prune_heads (#41417)
    • [v5] rm utils/tf_ops/ (#41402)
    • [causallm tester] automate pipeline mappings + bloom tests (#41318)
    • 🚨 [v5] generate delegates default cache initialization to the model (#41505)
  • @zRzRzRzRzRzRzR
    • Update GLM-4.1V MMRope implementation (#41182)
    • Update GLM-4.6 doc (#41471)
    • Add aux loss for GLM-4.5V (#41564)
    • 4.1V Model and GLM-4.5V Model Conversion Code Updates (#41784)
    • GLM-V update with new processor (#42122)
  • @jacobkahn
    • Add Code World Model (CWM) (#41199)
  • @molbap
    • Update philosophy (#41438)
    • [QoL] modular conversion shows LoC saved (#41500)
    • Double router compute? (#41653)
    • Add vision contribution guide (#41456)
    • Modernize CLIP modeling code (#41546)
    • handle inputs from Siglip/Siglip2 non-automapped encoder layers (#41930)
    • Fix processor test for glm (#42233)
    • Tiny doc fix (#42296)
    • tiny fix for deepseekocr support [vllm] (#42423)
  • @Wauplin
    • Bump to hfh 1.0.0.rc5 to fix test (#41508)
    • Migrate transformers cli to Typer (#41487)
    • Remove deprecated use_auth_token parameter (#41666)
    • added more breaking changes
    • [cleanup] Don't use Repository in create_dummy_models.py script (#42380)
    • [cleanup] Remove deprecated load config from file (#42383)
    • [cleanup] Offline mode and cache dir from huggingface_hub constants + cleanup in PushToHubMixin (#42391)
  • @remi-or
    • Restore cuda graphs to continuous batching (#41421)
    • Fix an import error with PreTrainModel (#41571)
    • Add iter to DynamicCache (#41569)
    • Gemma3 fixes (#41572)
    • Benchmark overhaul (#41408)
    • Fix fp32_ln for various models (#41605)
    • Fix EncoderDecoder cache (#41612)
    • Switch to CB if cache_implementation == paged (#41655)
    • Small changes to benchmarking script (#41662)
    • Bump AMD docker (#41792)
    • Add a safeguard around a flaky test in gemma2 (#41811)
    • Use indices as position_ids in modernebert (#41789)
    • Move the Mi355 to regular docker (#41989)
    • More data in benchmarking (#41848)
    • Reduce the number of benchmark in the CI (#42008)
    • New docker from AMD (#42208)
    • Add prefix sharing to continuous batching (#42094)
    • Update torchcodec to match torchaudio version (#42288)
    • Gemma3 hybrid fix (#42287)
    • Make benchmarking lighter: clean-up result files and remove non-needed arguments (#42357)
    • Many small fixes for the CI (#42364)
    • Benchmark simplification (#42408)
  • @lkhl
    • [model] Add VideoLLaMA3 implementation (#40499)
  • @philiproeleveld
    • Add logits_to_keep to many older CausalLM models (#41335)
  • @AlphaOrOmega
    • Adding superglue fast image processing (#41394)
  • @echarlaix
    • [v5] Remove deprecated tranformers.onnx (#41700)
  • @Aravind-11
    • Add GLPNImageProcessorFast (#41725)
    • T5 migration to new masking interface (#41804)
    • 🚨 Remove generic output_attentions warning (#42334)
  • @DeXtAr47-oss
    • add fuyu fast image processors (#41817)
  • @lashahub
    • [models] Add AudioFlamingo3 integration (#40290)
  • @lilin-1
    • Docs/i18n updates (#42006)
  • @burtenshaw
    • [MODEL] Nanochat implementation (#41634)
  • @itazap
    • rm slow tokenizers (#40936)
Nov 25, 2025
Patch release v4.57.3

This release fixes a hidden bug when loading models with local_files_only=True, as well as a typo introduced in the recent patch.

The main fix is: https://github.com/huggingface/transformers/commit/b6055550a15a8fab367cf983b743ff68cc58d81a.
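For reference, `local_files_only=True` tells `from_pretrained` to resolve a checkpoint entirely from the local Hugging Face cache without any network requests. A minimal sketch of the affected code path (the model id is illustrative, and the checkpoint must already be cached for the load to succeed):

```python
from transformers import AutoModel

try:
    # Resolve the checkpoint purely from the local HF cache;
    # no network requests are made.
    model = AutoModel.from_pretrained(
        "bert-base-uncased",  # illustrative; any locally cached model id
        local_files_only=True,
    )
except OSError:
    # from_pretrained raises OSError when the files are not available locally.
    print("model not in local cache")
```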

We are really sorry that this slipped through; our CIs simply did not catch it.

As it affects a lot of users, we are going to yank the previous release.

Nov 24, 2025
Patch release v4.57.2

This patch most notably fixes an issue with some Mistral tokenizers. It contains the following commits:

  • Add AutoTokenizer mapping for mistral3 and ministral (#42198)
  • Auto convert tekken.json (#42299)
  • fix tekken pattern matching (#42363)
  • Check model inputs - hidden states (#40994)
  • Remove invalid @staticmethod from module-level get_device_and_memory_breakdown (#41747)
Oct 14, 2025
Patch release v4.57.1

This patch most notably fixes an issue with an optional dependency (optax) that resulted in parsing errors with Poetry. It contains the following fixes:
