
Transformers

Core NLP/ML library for state-of-the-art models

Releases: 30 · Avg interval: 3d · Avg cadence: 9/mo

Three new models added; SAM3 text embeddings API changes; generation bugs fixed

This release: 3 features (new capabilities), 18 enhancements (improvements to existing features), 8 fixes (bug fixes). AI-tallied from the release notes.
Transformers · v5.9.0

New Model additions

Cohere2Moe

Command A+ is a Mixture-of-Experts (MoE) language model from Cohere that features a hybrid attention pattern combining sliding window and full attention layers. The model incorporates both shared and routed experts and supports a very large context window for processing extensive text sequences.

Links: Documentation

Parakeet tdt (#44171)

HRM-Text

HRM-Text is an improved autoregressive language-modeling variant of the Hierarchical Reasoning Model (HRM) that uses a hierarchical recurrent forward pass with two transformer stacks - one for slow, abstract planning (H) and one for fast, detailed computation (L) - reused inside a nested recurrence. It features PrefixLM attention where instruction tokens attend bidirectionally while response tokens attend causally, per-head sigmoid output gates, and parameterless RMSNorm. The model is designed as a base language model without instruction tuning or chat templates.

Links: Documentation | Paper
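
The PrefixLM attention pattern described for HRM-Text can be pictured with a small mask builder. This is a minimal sketch in plain Python, not the model's implementation: the function name `prefix_lm_mask` and its arguments are hypothetical, and it assumes the instruction tokens form a contiguous prefix of the sequence.

```python
def prefix_lm_mask(prefix_len: int, total_len: int) -> list[list[bool]]:
    """Return mask[q][k] == True when query position q may attend to key position k."""
    if not 0 <= prefix_len <= total_len:
        raise ValueError("prefix_len must be between 0 and total_len")
    mask = []
    for q in range(total_len):
        row = []
        for k in range(total_len):
            in_prefix = q < prefix_len and k < prefix_len  # instruction block: bidirectional
            causal = k <= q                                 # response tokens: causal
            row.append(in_prefix or causal)
        mask.append(row)
    return mask


# Two instruction tokens followed by two response tokens.
mask = prefix_lm_mask(prefix_len=2, total_len=4)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The first two rows see both instruction tokens, while the last two rows see only positions up to and including their own.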

Breaking changes

The text_embeds input for SAM3, EdgeTAM, and SAM3-LiteText models now expects full text embeddings instead of just pooler outputs, aligning with other models in the library. Users must update their inputs accordingly.

  • 🚨Fix memory leaks caused by lru decorators in vision models (#45922) by @yonigozlan
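
For readers unfamiliar with this class of leak, the general Python mechanism can be sketched as follows. This is an illustration only and not the code touched by #45922: the class names and the `grid` method are hypothetical. `functools.lru_cache` on a method stores `self` in its cache keys, so the cache keeps every instance alive.

```python
import functools
import gc
import weakref


class LeakyEncoder:
    @functools.lru_cache(maxsize=None)  # the cache lives on the class-level function
    def grid(self, size: int) -> tuple:
        return tuple(range(size))


class SafeEncoder:
    def __init__(self):
        # Per-instance cache: it is released together with the instance.
        self._grid_cache = {}

    def grid(self, size: int) -> tuple:
        if size not in self._grid_cache:
            self._grid_cache[size] = tuple(range(size))
        return self._grid_cache[size]


def survives_deletion(cls) -> bool:
    """Create an instance, use its cache, drop it, and report whether it is still alive."""
    obj = cls()
    obj.grid(4)
    ref = weakref.ref(obj)
    del obj
    gc.collect()
    return ref() is not None


leaky_survives = survives_deletion(LeakyEncoder)
safe_survives = survives_deletion(SafeEncoder)
print(leaky_survives, safe_survives)
```

In the leaky variant the instance is still reachable through the cache key after `del`, which is why moving the cache onto the instance (or caching a function that does not take `self`) avoids the leak.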

Audio

Audio support was expanded with the addition of AudioFlamingoNext model checkpoints and improved compilability of audio/vision encoders via standalone pure functions. Additional improvements include better error messaging when loading audio from video files and new documentation for audio/video processors.

  • user friendly error when loading audio from video (#45221) by @eustlb in [#45221]
  • [docs] adding audio/video processors (#45795) by @stevhliu in [#45795]
  • Support Audio Flamingo Next checkpoints (#44830) by @lashahub in [#44830]
  • Extract dynamic vision/audio tensors into standalone pure functions (#45396) by @IlyasMoutawwakil in [#45396]

Generation

Fixed generation issues including inputs_embeds and per_layer_inputs handling for Gemma4, an AttributeError in RAG's generate() caused by missing config fields, and flaky VLM generation tests by blocking special image tokens during sampling.

  • Fix Gemma4 generation from inputs_embeds and per_layer_inputs (#46049) by @Cyrilvallez in [#46049]
  • Fix AttributeError in RAG generate() for missing config fields (#46035) by @Sriniketh24 in [#46035]
  • Block image_start/end_token_id in generation test sampling (#45914) by @Rocketknight1 in [#45914]
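
The idea behind blocking special image tokens during sampling can be sketched in plain Python. This is a hedged illustration, not the test helper from #45914: the function name, the token ids, and the logits are all hypothetical. Blocked ids get a logit of negative infinity, so their sampling weight becomes exactly zero.

```python
import math
import random


def sample_with_blocked_ids(logits: list[float], blocked_ids: set[int], rng: random.Random) -> int:
    """Sample a token id from softmax(logits) while never returning a blocked id."""
    masked = [-math.inf if i in blocked_ids else x for i, x in enumerate(logits)]
    peak = max(masked)
    if peak == -math.inf:
        raise ValueError("every token id is blocked")
    # exp(-inf) is 0.0, so blocked ids carry zero weight.
    weights = [math.exp(x - peak) for x in masked]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


# Hypothetical ids standing in for image_start_token_id / image_end_token_id.
IMAGE_START, IMAGE_END = 5, 6
logits = [0.1, 0.3, 0.2, 0.0, 0.4, 9.0, 9.0, 0.5]  # the blocked ids have the largest logits
rng = random.Random(0)
samples = [sample_with_blocked_ids(logits, {IMAGE_START, IMAGE_END}, rng) for _ in range(200)]
print(sorted(set(samples)))
```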

Bugfixes and improvements

  • Remove mask visualization tool from masking_utils.py (#46066) by @Cyrilvallez in [#46066]
  • fix: owned_by field in GET /v1/models returns list instead of string (#46006) by @nileshpatil6 in [#46006]
  • [CB] Remove OpenTelemetry (#45984) by @remi-or in [#45984]
  • docs(readme): use canonical huggingface.co domain in prose links (#46042) by @kiwigitops in [#46042]
  • Fix remaining RAG doc examples that crash on current transformers (#46044) by @Sriniketh24 in [#46044]
  • Init the actual tensor, not a copy (#46030) by @Rocketknight1 in [#46030]
  • docs: sync legacy ACL anthology URLs and update metrics across i18n READMEs (#46027) by @irfaan101 in [#46027]
  • [MultimodalLM] add language_model to the get/set_input_embeddings logic (#46029) by @eustlb in [#46029]
  • [HRM Text] Add integration tests (#46033) by @vasqu in [#46033]
  • hy_v3: add XPU expectations (#45858) by @kaixuanliu in [#45858]
  • exaone4_5: add XPU expectations (#45890) by @kaixuanliu in [#45890]
  • hyperclovax: add XPU Expectations for CI test (#45926) by @kaixuanliu in [#45926]
  • chore(ci): remove dead env vars from circleci-failure-summary-comment.yml (#45972) by @XciD in [#45972]
  • [CB] [Major] Add tensor paralellism (#45821) by @remi-or in [#45821]
  • docs: update models architecture count and sync ACL anthology URLs (#46001) by @irfaan101 in [#46001]
  • bugfix(ci): avoid E2BIG in pr_slow_ci_suggestion (#45983) by @tarekziade in [#45983]
  • RFDetr - use correct Roboflow org for release (#45946) by @sbucaille in [#45946]
  • docs: Fix formatting issues in weightconverter.md (#45988) by @ArjunSrivastava1 in [#45988]
  • Fix colqwen2 test (#45981) by @IlyasMoutawwakil in [#45981]
  • Fix M-RoPE device mismatch in Qwen3VL family under FSDP2 CPU offload (#45861) by @jamesbraza in [#45861]
  • [docs] chat template prefill (#45947) by @stevhliu in [#45947]
  • [docs] decode fast path (#45899) by @stevhliu in [#45899]
  • fix: restore _attn_implementation and fix request offset in generate_batch() (#45943) by @sergiopaniego in [#45943]
  • Expose per_layer_inputs for every Gemma4 variants (#45927) by @Cyrilvallez in [#45927]
  • chore: update benchmark_v2.yml (#45966) by @hf-security-analysis[bot] in [#45966]
  • fix(ci): set persist-credentials: false on actions/checkout and close remaining template injection findings (#45964) by @XciD in [#45964]
  • chore(ci): set default workflow permissions to contents: read (#45961) by @XciD in [#45961]
  • fix(ci): remove template injection on pull_request_target workflows (#45956) by @XciD in [#45956]
  • chore(ci): pin all GitHub Actions and reusable workflows by SHA (#45955) by @XciD in [#45955]
  • [docs] ALMModelTest (#45900) by @stevhliu in [#45900]
  • Enhance apply_chat_template to support custom field prefilling (reasoning_content, thinking, etc.) (#45896) by @Mamiglia in [#45896]
  • BUGFIX: Support hubert models that don't have conv_pos_batch_norm configured (#45921) by @igordertigor in [#45921]
  • Revert 45777 (#45942) by @Rocketknight1 in [#45942]
  • pass the otel secrets (#45933) by @tarekziade in [#45933]
  • Add initial torch_tpu backend support (#45918) by @tengomucho in [#45918]
  • [CB] Hide activation footprint by using the CUDA graph pool (#45911) by @remi-or in [#45911]
  • Require input_ids for repetition penalty (#45389) by @ruben-aghayan in [#45389]
  • Fix undefined 'input' variable (#45895) by @fullyz in [#45895]
  • Fix post processing RF-DETR (#46041) by @yonigozlan (direct commit on v5.9.0)
  • [loading] Free up tensors faster inside ConversionOps (#46110) by @Cyrilvallez (direct commit on v5.9.0)
  • Add new cohere2_moe model (#46115) by @Cyrilvallez (direct commit on v5.9.0)
  • Fix cohere2 tp_plan for release by @Cyrilvallez (direct commit on v5.9.0)
  • Release v5.9.0 by @Cyrilvallez (direct commit on v5.9.0)

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @lmaksym
    • Parakeet tdt (#44171)
  • @eustlb
    • user friendly error when loading audio from video (#45221)
    • [MultimodalLM] add language_model to the get/set_input_embeddings logic (#46029)
  • @remi-or
    • [CB] Remove OpenTelemetry (#45984)
    • [CB] [Major] Add tensor paralellism (#45821)
    • [CB] Hide activation footprint by using the CUDA graph pool (#45911)
  • @abcd1927
    • Add hrm text (#46025)
Transformers · v5.8.1

This release mainly fixes the DeepSeek V4 integration.

Transformers · v5.8.0

Release v5.8.0

New Model additions

DeepSeek-V4

DeepSeek-V4 is the next-generation MoE (Mixture of Experts) language model from DeepSeek that introduces several architectural innovations over DeepSeek-V3. The architecture replaces Multi-head Latent Attention (MLA) with a hybrid local + long-range attention design, swaps residual connections for Manifold-Constrained Hyper-Connections (mHC), and bootstraps the first few MoE layers with a static token-id → expert-id hash table. This implementation covers DeepSeek-V4-Flash, DeepSeek-V4-Pro, and their -Base pretrained variants, which share the same architecture but differ in width, depth, expert count and weights.

Links: Documentation | Paper
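
To make the static token-id → expert-id hash table concrete, here is a toy sketch. It is not DeepSeek-V4's actual routing code: the helper names are hypothetical, and the modulo assignment is an assumption chosen only for illustration (the real table may be built differently). The point is that routing in these layers depends only on the token id, not on the hidden state.

```python
def build_hash_table(vocab_size: int, num_experts: int) -> list[int]:
    """Build a fixed token-id -> expert-id table (assumption: simple modulo assignment)."""
    return [token_id % num_experts for token_id in range(vocab_size)]


def route_by_token_id(token_ids: list[int], table: list[int]) -> list[int]:
    """Look up the expert for each token; no learned router is involved."""
    return [table[token_id] for token_id in token_ids]


table = build_hash_table(vocab_size=16, num_experts=4)
experts = route_by_token_id([0, 5, 9, 5], table)
print(experts)
```

The same token id always lands on the same expert, regardless of context.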

Gemma 4 Assistant

Gemma 4 Assistant is a small, text-only model that enables speculative decoding for Gemma 4 models using the Multi-Token Prediction (MTP) method and associated candidate generator. The model shares the same Gemma4TextModel backbone as other Gemma 4 models but uses KV sharing throughout the entire model, allowing it to reuse the KV cache populated by the target model and skip the pre-fill phase entirely. This architecture includes cross-attention to make the most of the target model's context, allowing the assistant to accurately predict more drafted tokens per drafting round.

Links: Documentation
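
The draft-and-verify loop behind speculative decoding can be sketched with toy "models". This is a generic greedy illustration and not the MTP candidate generator: it does not model KV sharing or cross-attention, the function and variable names are hypothetical, and a real implementation verifies all drafted tokens in a single target forward pass rather than one call per token.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken, tokens: List[int], num_draft: int) -> List[int]:
    # 1) The cheap draft model proposes num_draft tokens autoregressively.
    proposal: List[int] = []
    for _ in range(num_draft):
        proposal.append(draft(tokens + proposal))
    # 2) The target checks each proposal and keeps the longest agreeing prefix.
    accepted: List[int] = []
    for tok in proposal:
        expected = target(tokens + accepted)
        if tok != expected:
            accepted.append(expected)  # replace the first mismatch with the target's token
            return tokens + accepted
        accepted.append(tok)
    # 3) Every draft was accepted: the target contributes one extra token.
    accepted.append(target(tokens + accepted))
    return tokens + accepted


# Toy models: the target counts upward; the draft agrees except right after token 3.
target_model = lambda toks: (toks[-1] + 1) % 10
draft_model = lambda toks: 5 if toks[-1] == 3 else (toks[-1] + 1) % 10

result = speculative_step(target_model, draft_model, [0], num_draft=4)
print(result)
```

The output matches what the target alone would generate greedily, which is the property speculative decoding is designed to preserve. The better the assistant's drafts, the more tokens are accepted per round.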

GraniteSpeechPlus

Granite Speech Plus is a variant of Granite Speech that enhances the projector by consuming the concatenation of the encoder's final hidden states with an arbitrary subset of its intermediate hidden states along the feature dimension. It is a multimodal speech-to-text model that can transcribe audio, provide speaker annotation and word level timestamps by responding to text prompts. The model inherits the same architecture components as Granite Speech including the speech encoder, query transformer projector, language model, and optional LoRA adapter.

Links: Documentation

  • Support for a new Granite-Speech-Plus model (#45695) by @zvik in #45695
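
The projector input described for Granite Speech Plus (final hidden states concatenated with selected intermediate hidden states along the feature dimension) can be sketched with nested lists. This is an illustration under assumptions: the function name is hypothetical, and the ordering (final layer first, then the selected layers) is a guess rather than the model's documented layout.

```python
def concat_hidden_states(all_hidden_states, intermediate_layers):
    """all_hidden_states: list over layers of [time][feature] lists; the last entry is the final layer."""
    final = all_hidden_states[-1]
    selected = [all_hidden_states[i] for i in intermediate_layers]
    # For each time step, append the features of every selected layer to the final features.
    return [
        final[t] + [x for layer in selected for x in layer[t]]
        for t in range(len(final))
    ]


hidden = [
    [[0.0, 0.1], [0.2, 0.3]],  # layer 0: 2 frames x 2 features
    [[1.0, 1.1], [1.2, 1.3]],  # layer 1
    [[2.0, 2.1], [2.2, 2.3]],  # layer 2 (final)
]
projector_input = concat_hidden_states(hidden, intermediate_layers=[0])
print(projector_input)
```

The time dimension is unchanged, while the feature dimension grows by the feature size of each selected layer.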
Granite4Vision

Granite Vision 4.1 is a vision-language model from IBM Research designed for enterprise-grade document data extraction. It specializes in chart extraction (Chart2CSV, Chart2Summary, Chart2Code), table extraction (JSON, HTML, OTSL), and semantic key-value pair extraction. The model builds on LLaVA-NeXT with architectural innovations including SigLIP2 Vision Encoder, Window Q-Former Projectors, and DeepStack Feature Injection with 8 vision-to-LLM injection points.

Links: Documentation

EXAONE-4.5

EXAONE 4.5 is the first open-weight vision language model developed by LG AI Research, integrating a dedicated visual encoder into the existing EXAONE 4.0 framework to expand multimodal capabilities. The model features 33 billion parameters in total, including 1.2 billion parameters from the vision encoder, and achieves competitive performance in general benchmarks while outperforming similar-sized models in document understanding and Korean contextual reasoning. It builds on EXAONE 4.0 with key enhancements including an expanded vocabulary of 153,600 tokens, support for up to 256K token context windows, and a Multi-Token Prediction (MTP) mechanism.

Links: Documentation | Paper | Blog Post

PP-FormulaNet

PP-FormulaNet-L and PP-FormulaNet_plus-L are formula recognition models. They can be used for image-to-text tasks, specifically for detecting and processing mathematical formulas from images.

Links: Documentation

Breaking changes

Apex integration has been removed from the library (including RMSNorm usage in T5 and related models), so users relying on Apex for mixed precision or fused ops should migrate to PyTorch's native equivalents instead.

Tokenization

Fixed tokenizer mapping issues for DeepSeek R1 distilled (Qwen2) and DeepSeek OCR models, and resolved a significant performance regression in PreTrainedTokenizer.convert_ids_to_tokens where skip_special_tokens=True was rebuilding the special token set on every iteration, resulting in a ~300x speedup for that code path.

  • deepseek r1 distilled tokenizer fix for qwen2 mapping (#45741) by @itazap in [#45741]
  • DeepSeek OCR specifies an incorrect tokenizer class on the Hub (#45739) by @hmellor in [#45739]
  • PythonBackend slow tokenizer convert_ids_to_tokens fix (#45728) by @i3hz in [#45728]
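
The shape of the convert_ids_to_tokens regression can be illustrated with a simplified sketch. This is not the library's code: both functions, the vocabulary, and the special-token list are hypothetical, and no timing figures are implied. The slow variant rebuilds the special-token set for every id, while the fast variant builds it once.

```python
def convert_slow(ids, vocab, special_tokens):
    out = []
    for i in ids:
        # Rebuilds the set for every id: extra work proportional to len(special_tokens) per token.
        if vocab[i] in set(special_tokens):
            continue
        out.append(vocab[i])
    return out


def convert_fast(ids, vocab, special_tokens):
    special = set(special_tokens)  # built once, outside the loop
    return [vocab[i] for i in ids if vocab[i] not in special]


vocab = ["<s>", "</s>", "hello", "world", "<pad>"]
special_tokens = ["<s>", "</s>", "<pad>"]
ids = [0, 2, 3, 1, 4, 4]

slow_tokens = convert_slow(ids, vocab, special_tokens)
fast_tokens = convert_fast(ids, vocab, special_tokens)
print(fast_tokens)
```

Both variants return the same tokens. Only the amount of repeated work differs, and that gap grows with the number of special tokens and the sequence length.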

Bugfixes and improvements

  • fix: correct spelling in continuous_api docstring (#45749) by @Dhruv908615 in [#45749]
  • Fix link to modular transformers documentation (#45746) by @SangbumChoi in [#45746]
  • Gemma4: fix failed test cases (#45568) by @kaixuanliu in [#45568]
  • Fix CI: Allow more artifacts to be download in CI (#45785) by @ydshieh in [#45785]
  • Add concurrency to PR CI workflow file (pr-ci-caller.yml) (#45786) by @ydshieh in [#45786]
  • Reorder decorators for autodoc and dataclass (#45702) by @zucchini-nlp in [#45702]
  • Unwrap text_config in AutoModelFor*.from_config (#45770) by @jamesbraza in [#45770]
  • fix: Added Mps support in float fallback backends list (#45687) by @rigen1048 in [#45687]
  • Github Actions PR CI (caller) (#45476) by @ydshieh in [#45476]
  • make sure we call check_auto in CI (#45775) by @tarekziade in [#45775]
  • Fix auto mapping script (#45774) by @Cyrilvallez in [#45774]
  • [MINISTRAL3] Fix conversion script yarn's apply_scale support. (#45744) by @juliendenize in [#45744]
  • [nemotron_h] respect _no_reinit flag on dt_bias and out_proj.weight (#45591) by @vai-minzhou in [#45591]
  • fix(utils): Resolve backbone utils test regressions (#45594) by @harshaljanjani in [#45594]
  • [CB] Better overall script and decode bucketting (#45653) by @remi-or in [#45653]
  • [docs] model testing (#45152) by @stevhliu in [#45152]
  • update dev (#45726) by @vasqu in [#45726]
  • Doc translate to Persian(farsi) (#45664) by @zeoses in [#45664]
  • [OAI Privacy Filter] Add integration test (#45725) by @vasqu in [#45725]
  • Speedup Qwen2VLImageProcessor (#45719) by @lgeiger in [#45719]
  • Remove dead beam-search dummies from dummy_pt_objects.py (#45722) by @jw9603 in [#45722]
  • chore(typing): add ty type checking for 10 utility files (#45703) by @moonbogi in [#45703]
  • Llama3 video fix (#45040) by @sywangyi in [#45040]
  • Fix custom-module copies inheriting read-only permissions (#45686) by @nurpax in [#45686]
  • Python code in model docs (#45608) by @zucchini-nlp in [#45608]
  • fix failed test cases for blt model (#45596) by @kaixuanliu in [#45596]
  • chore(typing): add ty type checking for 3 pipeline files (#45667) by @moonbogi in [#45667]

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @artem-spector
    • Add Granite 4.1 Vision (granite4_vision) (#45597)
  • @SindhuRaghuram97
    • First model (#45788)
  • @nuxlear
    • Add EXAONE 4.5 implementations (#45471)
  • @ArthurZucker
    • Add DeepSeek V4 (#45643)
  • @remi-or
    • [CB] Better overall script and decode bucketting (#45653)
  • @zhang-prog
    • [Model] Add PP-FormulaNet Model Support (#45626)
  • @zvik
    • Support for a new Granite-Speech-Plus model (#45695)
Transformers · v5.7.0

New Model additions

Laguna

Laguna is Poolside's mixture-of-experts language model family that extends standard SwiGLU MoE transformers with two key innovations. It features per-layer head counts allowing different decoder layers to have different query-head counts while sharing the same KV cache shape, and implements a sigmoid MoE router with auxiliary-loss-free load balancing that uses element-wise sigmoid of gate logits plus learned per-expert bias for router scoring.

Links: Documentation
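
A sigmoid router with a per-expert bias can be sketched as follows. This is a minimal illustration, not Laguna's implementation: the function names are hypothetical, and it assumes the bias only influences which experts are selected while the mixing weights come from the unbiased sigmoid scores, renormalized over the selected experts.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def route(gate_logits: list[float], expert_bias: list[float], top_k: int) -> dict[int, float]:
    """Return {expert_id: mixing_weight} for the top_k selected experts."""
    scores = [sigmoid(z) for z in gate_logits]             # element-wise, no softmax across experts
    biased = [s + b for s, b in zip(scores, expert_bias)]  # bias steers load balancing
    chosen = sorted(range(len(scores)), key=lambda e: biased[e], reverse=True)[:top_k]
    total = sum(scores[e] for e in chosen)
    # Assumption: weights use the unbiased scores, normalized over the chosen experts.
    return {e: scores[e] / total for e in chosen}


logits = [2.0, 1.0, 0.0, -1.0]
no_bias = route(logits, [0.0, 0.0, 0.0, 0.0], top_k=2)
with_bias = route(logits, [0.0, -0.5, 0.0, 0.5], top_k=2)
print(sorted(no_bias), sorted(with_bias))
```

Without a bias the two highest-scoring experts win. With the bias, an overloaded expert can be pushed out and an underused one pulled in, without adding an auxiliary loss term.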

DEIMv2

DEIMv2 (DETR with Improved Matching v2) is a real-time object detection model that extends DEIM with DINOv3 features and spans eight model sizes from X to Atto for diverse deployment scenarios. It uses a Spatial Tuning Adapter (STA) for larger variants to convert DINOv3's single-scale output into multi-scale features, while ultra-lightweight models employ pruned HGNetv2 backbones. The unified design achieves superior performance-cost trade-offs, with DEIMv2-X reaching 57.8 AP with only 50.3M parameters and DEIMv2-S being the first sub-10M model to exceed 50 AP on COCO.

Links: Documentation | Paper

Attention

Several attention-related bugs were fixed across multiple models, including a cross-attention cache type error in T5Gemma2 for long inputs, incorrect cached forward behavior in Qwen3.5's gated-delta-net linear attention, and a crash in GraniteMoeHybrid when no Mamba layers are present. Attention function dispatch was also updated to align with the latest model implementations.

  • Fix cross-attention cache layer type for T5Gemma2 long inputs (#45540) by @Beichen-Ma in [#45540]
  • [Qwen3.5] Fix GDN linear attention multi-token cached forward (#45513) by @kashif in [#45513]
  • Fix GraniteMoeHybrid _update_mamba_mask crash on attention-only models (#45514) by @tianhaocui in [#45514]
  • Align latest model attention function dispatch (#45598) by @Cyrilvallez in [#45598]

Tokenizers

A bug in AutoTokenizer caused the wrong tokenizer class to be initialized, which led to regressions in models such as DeepSeek R1. The change that introduced it was reverted.

  • change got reverted (#45680) by @itazap in [#45680]

Generation

Continuous batching generation received several fixes and improvements, including correcting KV deduplication and memory estimation for long sequences (16K+), and removing misleading warnings about num_return_sequences and other unsupported features that were incorrectly firing even when functionality worked correctly. Documentation for per-request sampling parameters was also added.

  • generate: drop stale num_return_sequences warning on continuous batching path (#45582) by @joaquinhuigomez in [#45582]
  • Remove unnecessary generate warnings (#45619) by @Cyrilvallez in [#45619]
  • [CB] Changes for long generation (#45530) by @remi-or in [#45530]
  • [docs] per-request sampling params (#45553) by @stevhliu in [#45553]
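
To see why memory estimation matters at 16K+ tokens, a back-of-the-envelope KV cache calculation helps. This is the standard textbook estimate and not the estimator used by the continuous batching code; the function name and the model dimensions below are hypothetical.

```python
def kv_cache_bytes(num_layers: int, num_kv_heads: int, head_dim: int, seq_len: int, bytes_per_element: int) -> int:
    """Bytes needed to store keys and values for one sequence across all layers."""
    keys_and_values = 2  # one tensor for K and one for V per layer
    return keys_and_values * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_element


# Hypothetical model: 32 layers, 8 KV heads of dimension 128, 16-bit cache, 16K tokens.
cache_bytes = kv_cache_bytes(num_layers=32, num_kv_heads=8, head_dim=128, seq_len=16_384, bytes_per_element=2)
print(cache_bytes / 1024**3, "GiB per sequence")
```

The cost grows linearly with sequence length, so a batch of several long requests can dominate device memory.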

Kernels

Improved kernel support by fixing configuration reading and error handling for FP8 checkpoints (e.g., Qwen3.5-35B-A3B-FP8), enabling custom expert kernels registered from the HF Hub to be properly loaded, and resolving an incompatibility that prevented Gemma3n and Gemma4 from using the rotary kernel.

  • Fix configuration reading and error handling for kernels (#45610) by @hmellor in [#45610]
  • Allow for registered experts from kernels hub (#45577) by @winglian in [#45577]
  • Gemma3n and Gemma4 cannot use rotary kernel (#45564) by @Cyrilvallez in [#45564]

Bugfixes and improvements

  • fixing more typos (#45689) by @vasqu in [#45689]
  • [docs] cb memory management (#45587) by @stevhliu in [#45587]
  • [docs] cpu offloading (#45660) by @stevhliu in [#45660]
  • docs(README_zh-hans): clarify conditions for not using Transformers (#45688) by @GuaiZai233 in [#45688]
  • fix padding side issue for fast_vlm tests (#45592) by @kaixuanliu in [#45592]
  • Fix x_clip: 8 failed test cases (#45394) by @kaixuanliu in [#45394]
  • zero_shot_object_detection ValueError fix for python 3.13 (#45669) by @AnkitAhlawat7742 in [#45669]
  • Fix pageable H2D copies in Gated DeltaNet PyTorch fallback (#45665) by @ruixiang63 in [#45665]
  • Fix UnboundLocalError in shard_and_distribute_module for replicated parameters (#45675) by @Abdennacer-Badaoui in [#45675]
  • [MistralCommonBackend] Soften validation mode and apply_chat_template arguments check (#45628) by @juliendenize in [#45628]
  • Fix NameError: PeftConfigLike triggered by PreTrainedModel.__init_subclass__ (#45658) by @qgallouedec in [#45658]
  • chore(typing): added modeling_utils to ty (#45425) by @tarekziade in [#45425]
  • [gemma4] infer from config instead of hardcoding (#45606) by @eustlb in [#45606]
  • Update quants tests (#45480) by @SunMarc in [#45480]
  • 🔴🔴🔴 fix: skip clean_up_tokenization for BPE tokenizers in PreTrainedTokenizerFast (#44915) by @maxsloef-goodfire in [#44915]
  • Fix colmodernvbert tests (#45652) by @Cyrilvallez in [#45652]
  • [CB] [Major] Add CPU request offloading (#45184) by @remi-or in [#45184]
  • Fix peft constructors (#45622) by @Cyrilvallez in [#45622]
  • chore: speedup modular converter (~30%) (#45046) by @tarekziade in [#45046]
  • Fix whisper return language (#42227) by @FredHaa in [#42227]
  • Add supports_gradient_checkpointing to NemotronHPreTrainedModel (#45625) by @sergiopaniego in [#45625]
  • Raise clear error for problem_type="single_label_classification" with num_labels=1 (#45611) by @gaurav0107 in [#45611]
  • CircleCI with torch 2.11 (#45633) by @ydshieh in [#45633]
  • chore: bump doc-builder SHA for main doc build workflow (#45631) by @rtrompier in [#45631]
  • Allow more artifacts to be download in CI (#45629) by @ydshieh in [#45629]
  • chore(qa): split pipeline and add type checking (#45432) by @tarekziade in [#45432]
  • Skip failing offloading tests (#45624) by @Cyrilvallez in [#45624]
  • fix: compute auxiliary losses when denoising is disabled in D-FINE (#45601) by @Abineshabee in [#45601]
  • qa: bumped mlinter and allow local override (#45585) by @tarekziade in [#45585]
  • Processing Utils: continue when content is a string (#45605) by @RyanMullins in [#45605]
  • SonicMoe (#45433) by @IlyasMoutawwakil in [#45433]
  • fix transformers + torchao nvfp4 serialization (#45573) by @vkuzo in [#45573]
  • [AMD CI] Fix expectations for Gemma3n (#45602) by @Abdennacer-Badaoui in [#45602]
  • [docs] multi-turn tool calling (#45554) by @stevhliu in [#45554]
  • Fix AttributeError on s_aux=None in flash_attention_forward (#45589) by @jamesbraza in [#45589]
  • do not index past decoded chars with special tokens (#45435) by @itazap in [#45435]
  • Update dev version (#45583) by @vasqu in [#45583]
  • Update torchao usage for XPU and CPU (#45560) by @jiqing-feng in [#45560]

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @vasqu
    • fixing more typos (#45689)
    • Update dev version (#45583)
  • @joerowell
    • Laguna XS.2 implementation (#45673)
  • @tarekziade
    • chore(typing): added modeling_utils to ty (#45425)
    • chore: speedup modular converter (~30%) (#45046)
    • chore(qa): split pipeline and add type checking (#45432)
    • qa: bumped mlinter and allow local override (#45585)
  • @harshaljanjani
    • model: Add DEIMv2 to Transformers (#44339)
  • @remi-or
    • [CB] [Major] Add CPU request offloading (#45184)
    • [CB] Changes for long generation (#45530)
Transformers · v5.6.0

New Model additions

OpenAI Privacy Filter

OpenAI Privacy Filter is a bidirectional token-classification model for personally identifiable information (PII) detection and masking in text. It is intended for high-throughput data sanitization workflows where teams need a model that they can run on-premises that is fast, context-aware, and tunable. The model labels an input sequence in a single forward pass, then decodes coherent spans with a constrained Viterbi procedure, predicting probability distributions over 8 privacy-related output categories for each input token.

Links: Documentation
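
The "label every token, then decode coherent spans with a constrained Viterbi procedure" step can be sketched on a toy label set. This is not the Privacy Filter's decoder: the three-label BIO scheme, the transition constraint, and the probabilities are assumptions for illustration (the model itself predicts over 8 privacy-related categories).

```python
import math

LABELS = ["O", "B", "I"]


def allowed(prev: str, cur: str) -> bool:
    # "I" may only continue a span opened by "B" or extended by "I".
    return not (cur == "I" and prev == "O")


def constrained_viterbi(probs: list[dict[str, float]]) -> list[str]:
    """probs[t][label] is the (non-zero) model probability of label at token t."""
    # A sequence may not start inside a span.
    best = {l: (math.log(probs[0][l]) if l != "I" else -math.inf) for l in LABELS}
    back = []
    for t in range(1, len(probs)):
        new_best, pointers = {}, {}
        for cur in LABELS:
            score, prev = max((best[p], p) for p in LABELS if allowed(p, cur))
            new_best[cur] = score + math.log(probs[t][cur])
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # Follow back-pointers from the best final label.
    path = [max(LABELS, key=lambda l: best[l])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


probs = [
    {"O": 0.9, "B": 0.05, "I": 0.05},
    {"O": 0.2, "B": 0.3, "I": 0.5},  # per-token argmax would pick "I" right after "O"
    {"O": 0.1, "B": 0.1, "I": 0.8},
    {"O": 0.9, "B": 0.05, "I": 0.05},
]
greedy = [max(p, key=p.get) for p in probs]
decoded = constrained_viterbi(probs)
print(greedy, decoded)
```

Per-token argmax yields a span that starts with "I", which is incoherent. The constrained decode picks the most probable sequence that respects the transition rule.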

QianfanOCR

Qianfan-OCR is a 4B-parameter end-to-end document intelligence model developed by Baidu that performs direct image-to-text conversion without traditional multi-stage OCR pipelines. It supports a broad range of prompt-driven tasks including structured document parsing, table extraction, chart understanding, document question answering, and key information extraction all within one unified model. The model features a unique "Layout-as-Thought" capability that generates structured layout representations before producing final outputs, making it particularly effective for complex documents with mixed element types.

Links: Documentation | Paper

SAM3-LiteText

SAM3-LiteText is a lightweight variant of SAM3 that replaces the heavy SAM3 text encoder (353M parameters) with a compact MobileCLIP-based text encoder optimized through knowledge distillation, while keeping the SAM3 ViT-H image encoder intact. This reduces text encoder parameters by up to 88% while maintaining segmentation performance comparable to the original model. The model enables efficient vision-language segmentation by addressing the redundancy found in text prompting for segmentation tasks.

Links: Documentation | Paper

SLANet

SLANet and SLANet_plus are lightweight models designed for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. The model improves accuracy and inference speed by adopting a CPU-friendly lightweight backbone network PP-LCNet, a high-low-level feature fusion module CSP-PAN, and a feature decoding module SLA Head that aligns structural and positional information. SLANet was developed by Baidu PaddlePaddle Vision Team as part of their table structure recognition solutions.

Links: Documentation

Breaking changes

The internal rotary_fn is no longer registered as a hidden kernel function, so any code referencing self.rotary_fn(...) within an Attention module will break and must be updated to call the function directly instead.

  • 🚨 [Kernels] Fix kernel function registration (#45420) by @vasqu

Serve

The transformers serve command received several enhancements, including a new /v1/completions endpoint for legacy text completion, multimodal support for audio and video inputs, improved tool-calling via parse_response, proper forwarding of tool_calls/tool_call_id fields, a 400 error on model mismatch when the server is pinned to a specific model, and fixes for the response API. Documentation was also updated to cover new serving options such as --compile and --model-timeout.

  • Add /v1/completions endpoint (OpenAI legacy completions API) to transformers serve (#44558) by @rain-1 in [#44558]
  • Updated the image cache for Paddle models according to the latest API (#45562) by @zhang-prog in [#45562]
  • Raise 400 on model mismatch when transformers serve is pinned (#45443) by @qgallouedec in [#45443]
  • [serve] Update tool call to switch to parse_response (#45485) by @SunMarc in [#45485]
  • Fix response api support (#45463) by @SunMarc in [#45463]
  • [serve] Forward tool_calls/tool_call_id in processor inputs (#45418) by @qgallouedec in [#45418]
  • refactor(qa): extend extras so ty can run on server modules (#45456) by @tarekziade in [#45456]
  • Multimodal serve support (#45220) by @SunMarc in [#45220]
  • [docs] transformers serve (#45174) by @stevhliu in [#45174]
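
A request to the new /v1/completions endpoint might be assembled as below. This is a sketch under assumptions: the host, port, and model name are placeholders, and the body uses the field names of the OpenAI legacy completions API (`model`, `prompt`, `max_tokens`); which fields transformers serve honors should be checked against its documentation. The request is only built here, not sent.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000"  # assumption: address of a locally running server

payload = {
    "model": "my-org/my-model",  # placeholder model id
    "prompt": "Hello",
    "max_tokens": 16,
}

request = urllib.request.Request(
    f"{BASE_URL}/v1/completions",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

# To actually send it against a running server:
#     with urllib.request.urlopen(request) as response:
#         body = json.load(response)
print(request.get_method(), request.full_url)
```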

Vision

Several vision-related bug fixes were applied in this release, including correcting Qwen2.5-VL temporal RoPE scaling for still images, fixing missing/mismatched image processor backends for Emu3 and BLIP, resolving modular image processor class duplication, and preventing accelerate from incorrectly splitting vision encoders in PeVideo/PeAudioVideo models. Image loading performance was also improved by leveraging torchvision's native decode_image in the torchvision backend, yielding up to ~17% speedup over PIL-based loading.

  • Revert "Fix: modular image processors (#45492)" (#45531) by @tarekziade in [#45531]
  • Fix: modular image processors (#45492) by @zucchini-nlp in [#45492]
  • fix: prevent accelerate from splitting vision encoder by setting no… (#43047) in [#43047]
  • Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330) by @Kash6 in [#45330]
  • Use torchvision decode_image to load images in the torchvision backend (#45195) by @yonigozlan in [#45195]
  • Fix missing image processors backends (#45165) by @zucchini-nlp in [#45165]

Parallelization

Fixed several bugs affecting distributed training, including silently wrong results or NaN loss with Expert Parallelism, NaN weights on non-rank-0 FSDP processes, and a resize failure in PP-DocLayoutV3. This release also adds support for loading adapters with Tensor Parallelism, adds MoE to the Gemma4 TP plan, and publishes documentation for TP training.

  • Fix EP: RouterParallel shape, tp_plan property, grouped_mm sentinels (#45473) by @AmineDiro in [#45473]
  • Fix NaN weights on non-rank-0 FSDP processes (#45050) by @albertvillanova in [#45050]
  • Load adapter with TP (#45155) by @michaelbenayoun in [#45155]
  • [docs] tp training (#44613) by @stevhliu in [#44613]
  • Fix resize failure caused by zero-sized masks in PP-DocLayoutV3 (#45281) by @zhang-prog in [#45281]
  • Add MoE to Gemma4 TP plan (#45219) by @sywangyi in [#45219]

Tokenization

Fixed a docstring typo in streamer classes, resolved a Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError, and patched a streaming generation crash for Qwen3VLProcessor caused by incorrect _tokenizer attribute access. Additional housekeeping included moving the GPT-SW3 instruct tokenizer to an internal testing repo and fixing a global state leak in the tokenizer registry during tests.

  • [Doc] Fix 'tokenized' -> 'tokenizer' typo in streamer docstrings (#45508) by @avasis-ai in [#45508]
  • Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError (#45359) by @ArthurZucker in [#45359]
  • fix(serving): resolve rust tokenizer from ProcessorMixin in streaming generation (#45368) by @sharziki in [#45368]
  • [Tokenizers] Move gpt sw3 tokenizer out (#45404) by @vasqu in [#45404]
  • fix: leak in tokenizer registry for test_processors (#45318) by @tarekziade in [#45318]

Cache

Cache handling was improved for Gemma4 and Gemma3n models by dissociating KV state sharing from the Cache class, ensuring KV states are always shared regardless of whether a Cache is used. Additionally, the image cache for Paddle models was updated to align with the latest API.

  • Align gemma3n cache sharing to gemma4 (#45489) by @Cyrilvallez in [#45489]
  • remove cache file from tree (#45392) by @tarekziade in [#45392]
  • [gemma4] Dissociate kv states sharing from the Cache (#45312) by @Cyrilvallez in [#45312]

Audio

Audio models gained vLLM compatibility through targeted fixes across several model implementations. Reliability also improved: audio file downloads now retry with exponential back-off, the text-to-speech pipeline no longer crashes when generation configs contain None values, and test failures for Kyutai Speech-To-Text were corrected.

  • feat[vLLM × v5]: Add vLLM compatibility for audio models (#45326) by @harshaljanjani in [#45326]
  • http retries on audio file downloads (#45126) by @tarekziade in [#45126]
  • fix(testing): Fix Kyutai Speech-To-Text and LongCatFlash test failures on main CI (#44695) by @harshaljanjani in [#44695]
  • Fix text-to-speech pipeline crash when generation config contains None values (#45107) by @jiqing-feng in [#45107]
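
Exponential back-off retries follow a simple pattern, sketched here in plain Python. This is not the helper added in #45126: the function name, the defaults, and the choice to retry on `OSError` are assumptions. The `sleep` parameter is injectable so the example runs instantly.

```python
import time


def retry_with_backoff(fetch, max_attempts: int = 4, base_delay: float = 0.5, sleep=time.sleep):
    """Call fetch() until it succeeds, doubling the wait after each failure."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except OSError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 0.5s, 1s, 2s, ...


# Simulated download that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch() -> bytes:
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return b"audio-bytes"


delays = []
result = retry_with_backoff(flaky_fetch, sleep=delays.append)
print(result, delays)
```

Production code often adds random jitter to the delay so that many clients do not retry in lockstep.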

Bugfixes and improvements

  • [Privacy Filter] Add model (#45580) by @vasqu in [#45580]
  • Add ForSequenceClassification heads for the OLMo family (#45551) by @earino in [#45551]
  • Add IndexCache support for GLM5 DSA (#45424) by @louzongzhi in [#45424]
  • Fix redundant logic in video processing SmolVLM (#45272) by @yonigozlan in [#45272]
  • Fix typos (#45574) by @vasqu in [#45574]
  • [Model] Add SLANet Model Support (#45532) by @zhang-prog in [#45532]
  • refactor(Dots1): drop Dots1MoE override to pass (inherits from DSV3 MoE) (#45572) by @casinca in [#45572]
  • perf: avoid recomputing rotary_emb for each layer in some Google and ModernBERT models (#45555) by @casinca in [#45555]
  • Gemma4 training with text-only samples (#45454) by @zucchini-nlp in [#45454]
  • [nemotron_h] Add support for MLP mixers (#44763) by @xenova in [#44763]
  • add expert parallelism for gemma-4-26B-A4B-it (#45279) by @sywangyi in [#45279]
  • Add full GGUF loading support for GPT‑OSS (fixes #43366, supersedes #43757) latest (#45506) by @sirzechs66 in [#45506]
  • Update Gemma4 weight conversion script (#45328) by @RyanMullins in [#45328]
  • Move some conversion mappings to PrefixChange (#45567) by @Cyrilvallez in [#45567]
  • fix table update versions (#45544) by @tarekziade in [#45544]
  • Add disable_mmap kwarg to from_pretrained with hf-mount auto-detection (#45547) by @rtrompier in [#45547]
  • fix(DSV3): parity between native DeepseekV3MoE and remote official implementation (#45441) by @casinca in [#45441]
  • [modular] Fix modular logic broken in #45045 (#45539) by @Cyrilvallez in [#45539]
  • Fix: propagate quantization_config to text sub-config for composite models in AutoModelForCausalLM (#45494) by @lvliang-intel in [#45494]
  • T5Gemma2: fix prepare_decoder_input_ids_from_labels (#45516) by @Tokarak in [#45516]
  • [Trainer] Add ddp_static_graph option (#45519) by @KeitaW in [#45519]
  • Add dtype config options for Four Over Six (#45367) by @jackcook in [#45367]
  • [Sam3LiteText] Remove unnecessary modules/configs (#45535) by @yonigozlan in [#45535]
  • Fix conditional check for float formatting (#44425) by @qgallouedec in [#44425]
  • Fix AMD CI: rebuild torchvision with libjpeg + refresh expectations (#45533) by @Abdennacer-Badaoui in [#45533]
  • Reapply modular to examples (#45527) by @Cyrilvallez in [#45527]
  • qa: re-run modular converter when the script itself is modified (#45528) by @tarekziade in [#45528]
  • [GGUF] Reduce peak RAM usage by casting dequantized tensors early during load (#45386) by @UsamaKenway in [#45386]
  • Fix CSM TextToAudioPipeline missing <bos> token (#45525) by @jiqing-feng in [#45525]
  • [Conversion Mapping] Small fixups (#45483) by @vasqu in [#45483]
  • fix: return empty tuple from import_protobuf_decode_error when protobuf is unavailable (#45486) by @jw9603 in [#45486]
  • throw error when conversion required (#45078) by @itazap in [#45078]
  • chore: bump doc-builder SHA for PR upload workflow (#45450) by @rtrompier in [#45450]
  • xpu output align with cuda in test case (#45526) by @sywangyi in [#45526]
  • chore(qa): split out mlinter (#45475) by @tarekziade in [#45475]
  • [loading] Clean way to add/remove full parts in checkpoint names (#45448) by @Cyrilvallez in [#45448]
  • Fix Zamba2MambaMixer ignoring use_mamba_kernels=False (#44853) by @sergiopaniego in [#44853]
  • revert sha commit pointing to main for transformers_amd_ci_ workflows (#45495) by @paulinebm in [#45495]
  • Fix ZeRO-3 from_pretrained: load registered buffers in _load_state_dict_into_zero3_model (#45402) by @saslifat-gif in [#45402]
  • Remove redundant condition checks in get_image_size method (#45461) by @JiauZhang in [#45461]
  • Add check-auto in repo-consistency and fix sorting (#45481) by @zucchini-nlp in [#45481]
  • Fix typos in src/transformers/utils/output_capturing.py (#45269) by @ryota-komatsu in [#45269]
  • typing: rule 15 - checks for tie_word_embeddings presence (#44988) by @tarekziade in [#44988]
  • [CB] Fix capture of max_seqlen (#45323) by @remi-or in [#45323]
  • Minor update (#45484) by @ydshieh in [#45484]
  • Add Neuron to auto-compile hardware list (#44757) by @dacorvo in [#44757]
  • Allow loading Qwen Thinker 'base' models without generative head (#45457) by @tomaarsen in [#45457]
  • [fix] Always early return for non-Mistral models in _patch_mistral_regex (#45444) by @tomaarsen in [#45444]
  • Fix spurious position_ids warnings for at least 40 architectures (#45437) by @tomaarsen in [#45437]
  • [fix] Make Qwen2_5OmniProcessor warning a lot less noisy via warning_once (#45455) by @tomaarsen in [#45455]
  • Dynamic auto mapping (#45018) by @zucchini-nlp in [#45018]
  • [docs] vlm addition (#45271) by @stevhliu in [#45271]
  • fix: dont download artifacts from the test hub (#45319) by @tarekziade in [#45319]
  • fix(clipseg): fix 2 failing tests (#45403) by @kaixuanliu in [#45403]
  • [docs] @auto_docstring decorator (#45130) by @stevhliu in [#45130]
  • Fix Sam3Processor missing input_boxes_labels for padded None entries (#45171) by @Kash6 in [#45171]
  • better grad acc tests (#45434) by @SunMarc in [#45434]
  • Add example for iterative chatting with MLLMs (#45398) by @zucchini-nlp in [#45398]
  • Gemma4 resizing per layer inputs (#45324) by @zucchini-nlp in [#45324]
  • Add step3_vl to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS (#45449) by @hmellor in [#45449]
  • Update workflow references to new commit hash (#45442) by @paulinebm in [#45442]
  • [Gemma4] Add docstrings for Per-Layer Embeddings (PLE) pipeline (#45207) by @w4nderlust in [#45207]
  • [Doc] Correct checkpoint path in Dinov2 model_docs (#45430) by @ambroiseodt in [#45430]
  • Fix ty for transformers cli (#45190) by @SunMarc in [#45190]
  • fix(models): Resolve regressions in Wav2Vec2PhonemeCTCTokenizer (wav2vec2-lv-60-espeak-cv-ft) (#45199) by @harshaljanjani in [#45199]
  • Fix Qwen2.5VL temporal grid positions (#45400) by @zucchini-nlp in [#45400]
  • [fix] PEFT integration fixes preventing save/load & integration (#45428) by @tomaarsen in [#45428]
  • Fix the response schema for the gemma4 converter (#45411) by @Rocketknight1 in [#45411]
  • [Doc] MoE routing capture and replay recipe (#44925) by @kashif in [#44925]
  • Fix apply_chat_template crash on tool_call messages without content (#45348) by @qgallouedec in [#45348]
  • [AMD CI] Fix torch.compile/export failures on AMD CI due to untraceable set.contains (#45282) by @Abdennacer-Badaoui in [#45282]
  • [inference_fusion] convert conv3d patch embed to linear (#45041) by @JJJYmmm in [#45041]
  • Fix #45305 + add regression test GAS (#45349) by @florian6973 in [#45349]
  • Update trackio integration to use Buckets and "freeze" Space after training (#45329) by @abidlabs in [#45329]
  • fix(qwen3_moe): correct return type annotation on Qwen3MoeSparseMoeBlock.forward (#45352) by @RudrenduPaul in [#45352]
  • Fix: NotebookProgressCallback crash when evaluating with the Trainer (#44949) by @Charly21r in [#44949]
  • docs: fix 5 docstring errors in Gemma3nTextConfig (typos, grammar, formatting) (#45370) by @RudrenduPaul in [#45370]
  • Less unnecessary RoPE warnings (#45289) by @zucchini-nlp in [#45289]
  • Fix unintended Hub metadata calls from _patch_mistral_regex (#43603) by @vaibhav-research in [#43603]
  • Fix MoE routers returning probabilities instead of logits (#45131) by @yacinemebarki in [#45131]
  • [docs] training on specific hardware (#44799) by @stevhliu in [#44799]
  • [docs] zero + sequence parallelism (#44605) by @stevhliu in [#44605]
  • Fix vlm weight mappings (#45358) by @Cyrilvallez in [#45358]
  • Copy the template resolution logic from the base apply_chat_template to Voxtral (#45117) by @Rocketknight1 in [#45117]
  • add kwargs to all methods in the CallbackHandler class (#45353) by @wilnn in [#45353]
  • Close file handler (#45187) by @ydshieh in [#45187]
  • fix: restore mypy type checking for PreTrainedConfig subclasses (#45071) (#45240) by @shhKnight30 in [#45240]
  • cohere_asr: fix device issue for test_model_parallel_beam_search (#45214) by @kaixuanliu in [#45214]
  • Fix AttributeError in Gemma3ForConditionalGeneration and Gemma3ForSequenceClassification when config.return_dict=False (#45277) by @kamalrajkannan78 in [#45277]
  • fix bug for videomt model device mismatch (#45204) by @kaixuanliu in [#45204]
  • fix gemma4 gradient accumulation loss and last token incorrect labels (#45354) by @winglian in [#45354]
  • Logger has [transformers] prefix in non-verbose mode (#45316) by @zucchini-nlp in [#45316]
  • Fix AttributeError in AssistantToTargetTranslator.unmap_input_ids with cross-vocab models (#45320) by @Regata3010 in [#45320]
  • musicflamingo: add test support for Intel XPU device (#45212) by @kaixuanliu in [#45212]
  • nomic_bert: make the test suitable for general device. (#45209) by @kaixuanliu in [#45209]
  • Skip invalid flash-attn tests for pi0 model (#45011) by @kaixuanliu in [#45011]
  • Add cuda compatibility check for using grouped_mm (#45001) by @Sai-Suraj-27 in [#45001]
  • [docs] optimizers, hyperparam search, training features (#44290) by @stevhliu in [#44290]
  • Remove unused parameters and improve add_tensor_parallel_hooks_t… (#44768) by @michaelbenayoun in [#44768]
  • [gemma4] Fix device map auto (#45347) by @Cyrilvallez in [#45347]
  • Refactor CLIP-like models (#44431) by @zucchini-nlp in [#44431]
  • refactor: display test duration (#45344) by @tarekziade in [#45344]
  • Fix Wav2Vec2Config.vocab_size type to allow None (#45108) by @jiqing-feng in [#45108]
  • Add THD support in ESM (#44145) by @balvisio in [#44145]
  • [gemma4] Remove all shared weights, and silently skip them during loading (#45336) by @Cyrilvallez in [#45336]
  • Fix conversion mappings for vlms (#45340) by @Cyrilvallez in [#45340]
  • chore: added circleci python script to ruff and ty checkers (#45339) by @tarekziade in [#45339]
  • tweak checkers output on errors (#45163) by @tarekziade in [#45163]
  • chore: remove test_hub for now (#45337) by @tarekziade in [#45337]
  • [docs] pipeline cleanup (#44954) by @stevhliu in [#44954]
  • Fix export for gemma4 and add Integration tests (#45285) by @Cyrilvallez in [#45285]
  • Fix vllm cis (#45139) by @ArthurZucker in [#45139]
  • [docs] static model rules (#45232) by @stevhliu in [#45232]
  • fix(security): prevent untrusted users from triggering TRL CI dispatch (#45302) by @jagwar in [#45302]
  • [AMD CI] Fix Qwen2 expectations (#45284) by @Abdennacer-Badaoui in [#45284]
  • Add hasattr(torch.backends.cudnn, "conv") to conftest.py (#45263) by @ydshieh in [#45263]
  • Fix SmolVLM video processor resize using wrong interpolation after backend refactor (#45258) by @ydshieh in [#45258]
  • Fix Qwen2IntegrationTest (#45268) by @ydshieh in [#45268]
  • doc: fix TokenizersBackend.convert_to_native_format docstring (#45262) by @lowzhao in [#45262]
  • empty (#45261) by @ydshieh in [#45261]
  • Fix unexpected TF32 being enabled in testing (#45252) by @ydshieh in [#45252]
  • Fix tf32 issue: set torch.backends.cudnn.conv.fp32_precision explicitly. (#45248) by @ydshieh in [#45248]
  • Nvidia CI with torch 2.11 (#45243) by @ydshieh in [#45243]
  • Update tiny model creation script (#45241) by @ydshieh in [#45241]
  • Update get_test_info.py (related to tiny model creation) (#45238) by @ydshieh in [#45238]
  • More fix for tiny model creation (#45228) by @ydshieh in [#45228]
  • remove unnecessary entries in some auto model mappings (#45224) by @ydshieh in [#45224]
  • fix: hf-doc-builder insallation was failing (#45225) by @tarekziade in [#45225]
  • [CB] Add per-request logits processors (#45026) by @remi-or in [#45026]
  • [docs] formatting (#45196) by @stevhliu in [#45196]
  • fix test_register_result_handler (#45188) by @SunMarc in [#45188]
  • [CB] Tweaks to update and minor fixes (#45179) by @remi-or in [#45179]
  • Fix pypi release (#45210) by @ArthurZucker in [#45210]
  • fix(docs): correct gemma4 docs and examples (#45197) by @douglas-reid in [#45197]
  • Add Turkish (tr) translation for Get Started section (#45158) by @onwp in [#45158]

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @vasqu
    • [Privacy Filter] Add model (#45580)
    • Fix typos (#45574)
    • [Conversion Mapping] Small fixups (#45483)
    • 🚨 [Kernels] Fix kernel function registration (#45420)
    • [Tokenizers] Move gpt sw3 tokenizer out (#45404)
  • @rain-1
    • Add /v1/completions endpoint (OpenAI legacy completions API) to transformers serve (#44558)
  • @zhang-prog
    • Updated the image cache for Paddle models according to the latest API (#45562)
    • [Model] Add SLANet Model Support (#45532)
    • Fix resize failure caused by zero-sized masks in PP-DocLayoutV3 (#45281)
  • @tarekziade
    • fix table update versions (#45544)
    • qa: re-run modular converter when the script itself is modified (#45528)
    • Revert "Fix: modular image processors (#45492)" (#45531)
    • chore(qa): split out mlinter (#45475)
    • typing: rule 15 - checks for tie_word_embeddings presence (#44988)
    • fix: dont download artifacts from the test hub (#45319)
    • refactor(qa): extend extras so ty can run on server modules (#45456)
    • remove cache file from tree (#45392)
    • refactor: display test duration (#45344)
    • http retries on audio file downloads (#45126)
    • chore: added circleci python script to ruff and ty checkers (#45339)
    • tweak checkers output on errors (#45163)
    • fix: leak in tokenizer registry for test_processors (#45318)
    • chore: remove test_hub for now (#45337)
    • fix: hf-doc-builder insallation was failing (#45225)
  • @marvinzh
    • add Qianfan-OCR model definition (#45280)
  • @remi-or
    • [CB] Fix capture of max_seqlen (#45323)
    • [CB] Add per-request logits processors (#45026)
    • [CB] Tweaks to update and minor fixes (#45179)
  • @ydshieh
    • Minor update (#45484)
    • Close file handler (#45187)
    • Add hasattr(torch.backends.cudnn, "conv") to conftest.py (#45263)
    • Fix SmolVLM video processor resize using wrong interpolation after backend refactor (#45258)
    • Fix Qwen2IntegrationTest (#45268)
    • empty (#45261)
    • Fix unexpected TF32 being enabled in testing (#45252)
    • Fix tf32 issue: set torch.backends.cudnn.conv.fp32_precision explicitly. (#45248)
    • Nvidia CI with torch 2.11 (#45243)
    • Update tiny model creation script (#45241)
    • Update get_test_info.py (related to tiny model creation) (#45238)
    • More fix for tiny model creation (#45228)
    • remove unnecessary entries in some auto model mappings (#45224)
  • @NielsRogge
    • Add SAM3-LiteText (#44320)
  • @ArthurZucker
    • Fix IndexError with DeepSpeed ZeRO-3 when kernels rotary is active (#45414)
    • Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex AttributeError (#45359)
    • Fix vllm cis (#45139)
    • Fix pypi release (#45210)
    • update to dev version 5.6.0-dev0
  • @JJJYmmm
    • [inference_fusion] convert conv3d patch embed to linear (#45041)
  • @balvisio
    • Add THD support in ESM (#44145)
  • @onwp
    • Add Turkish (tr) translation for Get Started section (#45158)
Transformers · v5.5.4

This release mostly contains fixes that are good to have as soon as possible, mainly for tokenizers:

  • Fix Kimi-K2.5 tokenizer regression and _patch_mistral_regex Attribute… (#45305) by ArthurZucker

For training:

  • Fix #45305 + add regression test GAS (#45349) by florian6973, SunMarc
  • Fix IndexError with DeepSpeed ZeRO-3 when kernels rotary is active (#…) by ArthurZucker

And for Qwen2.5-VL:

  • Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330) by Kash6, zucchini-nlp

Transformers · v5.5.0

New Model additions

Gemma4

Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameters. The architecture is mostly the same as the previous Gemma versions. The key differences are a vision processor that can output images within a fixed token budget and a spatial 2D RoPE that encodes vision-specific information across the height and width axes.

You can find all the original Gemma 4 checkpoints under the Gemma 4 release.

The key difference from previous Gemma releases is the new design to process images of different sizes using a fixed-budget number of tokens. Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 keeps the image's natural aspect ratio while resizing it to fit the budget. There are a couple of constraints to follow:

  • The total number of pixels must fit within a patch budget
  • Both height and width must be divisible by 48 (= patch size 16 × pooling kernel 3)

Important

Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).

The number of "soft tokens" (aka vision tokens) an image processor can produce is configurable. The supported options are outlined below and the default is 280 soft tokens per image.

| Soft Tokens | Patches (before pooling) | Approx. Image Area |
|---|---|---|
| 70 | 630 | ~161K pixels |
| 140 | 1,260 | ~323K pixels |
| 280 | 2,520 | ~645K pixels |
| 560 | 5,040 | ~1.3M pixels |
| 1,120 | 10,080 | ~2.6M pixels |
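
The arithmetic behind these budgets can be checked with a short sketch. The patch size (16) and pooling kernel (3) come from the constraints above; the function names `patch_budget` and `target_size`, and the choice to round each side down to a multiple of 48, are assumptions made for illustration and are not taken from the actual Gemma 4 image processor.

```python
import math

PATCH_SIZE = 16                            # stated in the release notes
POOL_KERNEL = 3                            # stated in the release notes
SIDE_MULTIPLE = PATCH_SIZE * POOL_KERNEL   # 48


def patch_budget(soft_tokens):
    """Patches before pooling: each soft token pools a 3x3 block of patches."""
    return soft_tokens * POOL_KERNEL ** 2


def target_size(height, width, soft_tokens=280):
    """Hypothetical resize rule: keep the aspect ratio, fit the pixel budget,
    and round both sides down to a multiple of 48."""
    max_pixels = patch_budget(soft_tokens) * PATCH_SIZE ** 2
    scale = math.sqrt(max_pixels / (height * width))
    # Rounding down keeps typical images under the budget; the floor of one
    # 48-pixel block per side means extreme aspect ratios could still exceed it.
    new_h = max(SIDE_MULTIPLE, math.floor(height * scale / SIDE_MULTIPLE) * SIDE_MULTIPLE)
    new_w = max(SIDE_MULTIPLE, math.floor(width * scale / SIDE_MULTIPLE) * SIDE_MULTIPLE)
    return new_h, new_w


print(patch_budget(280))                 # 2520 patches
print(patch_budget(280) * PATCH_SIZE ** 2)  # 645120 pixels, i.e. ~645K
print(target_size(1080, 1920))           # (576, 1056)
```

Under these assumptions a 1080×1920 frame maps to 576×1056, which is 36×66 = 2,376 patches and 12×22 = 264 soft tokens, inside the default budget of 280. Whether the real processor upsamples small images or rounds differently is not stated in the release notes.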

To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The position table stores up to 10,240 positions per axis, which allows the model to handle very large images. Each position is a learned vector of the same dimensions as the patch embedding. The 2D RoPE that Gemma 4 uses independently rotates half of the attention head dimensions for the x-axis and the other half for the y-axis. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
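
The idea of splitting a head vector between the two axes can be sketched in plain Python. This is an illustration of the general 2D RoPE technique, not Gemma 4's implementation: the frequency base of 10000, the half-split pairing of dimensions, and the function names `rope_1d` and `rope_2d` are assumptions.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotary embedding of an even-length vector for one scalar position.

    Dimension i is paired with dimension i + half and rotated by pos * base**(-i/half).
    """
    half = len(vec) // 2
    out = list(vec)
    for i in range(half):
        theta = pos / (base ** (i / half))
        c, s = math.cos(theta), math.sin(theta)
        a, b = vec[i], vec[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def rope_2d(vec, x, y):
    """Rotate the first half of the head dimensions by x and the second half by y."""
    if len(vec) % 4 != 0:
        raise ValueError("head dimension must be divisible by 4")
    half = len(vec) // 2
    return rope_1d(vec[:half], x) + rope_1d(vec[half:], y)
```

Because each half is a pure rotation, the dot product between a query at (x1, y1) and a key at (x2, y2) depends only on the offsets x1 − x2 and y1 − y2, which is what lets attention pick up on relative placement such as "left of" or "above".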

NomicBERT

NomicBERT is a BERT-inspired encoder model that applies Rotary Position Embeddings (RoPE) to create reproducible long context text embeddings. It is the first fully reproducible, open-source text embedding model with 8192 context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short-context MTEB and long context LoCo benchmarks. The model generates dense vector embeddings for various tasks including search, clustering, and classification using specific instruction prefixes.

Links: Documentation | Paper

MusicFlamingo

Music Flamingo is a fully open large audio–language model designed for robust understanding and reasoning over music. It builds upon the Audio Flamingo 3 architecture by including Rotary Time Embeddings (RoTE), which injects temporal position information to enable the model to handle audio sequences up to 20 minutes. The model features a unified audio encoder across speech, sound, and music with special sound boundary tokens for improved audio sequence modeling.

Links: Documentation | Paper

Breaking changes

Mamba and hybrid model caches are now first-class native citizens in the library, so users working with Mamba-based or hybrid (Mamba + attention) models should update their code to use the new native cache classes instead of any previous workarounds.

  • 🚨 [Cache] Native mamba & hybrid cache (#44950) by @Cyrilvallez

Remote code execution support has been removed from the native LightGlue integration, so users who were loading LightGlue with trust_remote_code=True must remove that argument and use the model directly through the standard native API.

  • 🚨 [LightGlue] Remove remote code execution (#45122) by @vasqu

Vision

Several vision-related bugs were fixed in this release, including correcting the Gemma vision mask to support video inputs, resolving a dependency issue that incorrectly required torchvision for PIL-based image processors, and patching bugs in the Janus image generation model and image loading. Local code resolution for tokenizers and image processors was also corrected.

  • Generalize gemma vision mask to videos (#45185) by @zucchini-nlp in [#45185]
  • Fix explicit local code resolution for tokenizers and image processors (#45169) by @hmellor in [#45169]
  • fix bug for janus model image generation (#45044) by @kaixuanliu in [#45044]
  • [Bugfix] Remove incorrect torchvision requirement from PIL backend image processors (#45045) by @Lidang-Jiang in [#45045]
  • Avoid Image.open failure (#44645) by @sywangyi in [#44645]

Cache

Improved the performance of repository checks (check-repo) by introducing file-level and AST-level disk caching, achieving up to a 27x speedup (from ~46s to ~1.6s with a warm cache), and fixed the mlinter cache location in .gitignore.

  • refactoring: speedup static checks with disk cache (#44992) by @tarekziade in [#44992]
  • refactor: added cache in check_repo (#45012) by @tarekziade in [#45012]
  • chore: Fix mlinter cache location (#45052) by @tarekziade in [#45052]
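
A file-level disk cache of the kind described above can be sketched as follows. This is not the code from #44992 or #45012: the function name `cached_check`, the JSON cache layout, and the use of a SHA-256 content digest as the invalidation key are all assumptions chosen to show why a warm cache can skip most of the work.

```python
import hashlib
import json
from pathlib import Path


def cached_check(path, check, cache_file):
    """Run check(text) on a file, reusing a stored result while the content is unchanged.

    Hypothetical helper. Returns (result, was_cache_hit). Results must be
    JSON-serializable because the cache is stored as a JSON file.
    """
    path, cache_file = Path(path), Path(cache_file)
    cache = json.loads(cache_file.read_text()) if cache_file.exists() else {}

    # The digest changes whenever the file content changes, invalidating the entry.
    digest = hashlib.sha256(path.read_bytes()).hexdigest()
    entry = cache.get(str(path))
    if entry is not None and entry["digest"] == digest:
        return entry["result"], True

    # Cache miss: do the expensive work once and persist it for the next run.
    result = check(path.read_text())
    cache[str(path)] = {"digest": digest, "result": result}
    cache_file.write_text(json.dumps(cache))
    return result, False
```

Hashing a file is far cheaper than parsing it, so on a warm cache only files whose content changed pay the full cost of the check.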

Bugfixes and improvements

  • Fix resized LM head weights being overwritten by post_init (#45079) by @javierdejesusda in [#45079]
  • [Qwen3.5 MoE] Add _tp_plan to ForConditionalGeneration (#45124) by @danielquintas8 in [#45124]
  • fix(models): Fix dtype mismatch in SwitchTransformers and TimmWrapperModel (#45074) by @harshaljanjani in [#45074]
  • [misc] fix qwen35 tests: correct the text model type and skip reverse_mapping (#45173) by @JJJYmmm in [#45173]
  • 🔒 Pin GitHub Actions to commit SHAs (#45180) by @paulinebm in [#45180]
  • Use doc-builder runnable example for GLM-ASR (#44277) by @tarekziade in [#44277]
  • [CI] Small T5 expectations updated (#45138) by @Abdennacer-Badaoui in [#45138]
  • fix: correct type annotations across config classes for @strict validation (#45007) by @Krishnachaitanyakc in [#45007]
  • Fix T5Attention shape mismatch under Tensor Parallelism (#45109) by @aws-zhanxun in [#45109]
  • [refactor] Serving into proper modules (#44796) by @SunMarc in [#44796]
  • Re-add regex substitutions to the response parsing spec (#45166) by @Rocketknight1 in [#45166]
  • Fix incorrect TrainingArguments example in training.md (#45150) by @maanas1234 in [#45150]
  • Add parse_response to Processor, make it a bit more official (#45143) by @Rocketknight1 in [#45143]
  • DeepGEMM (#44832) by @IlyasMoutawwakil in [#44832]
  • fix: prefer registered config over remote code in AutoConfig.from_pretrained (#45094) by @HanFa in [#45094]
  • [serving] Fix continuous batching JSON response serialization (#45057) by @NathanHB in [#45057]
  • Fix stupid test fetcher (#45140) by @ydshieh in [#45140]
  • [CB] Add warmup feature (#45112) by @remi-or in [#45112]
  • feature: added import complexity checker (#45013) by @tarekziade in [#45013]
  • Fix tests for janus model (#44739) by @kaixuanliu in [#44739]
  • CB improvements for serving (#45063) by @SunMarc in [#45063]
  • [docs] continuous batching (#44896) by @stevhliu in [#44896]
  • Fix few issues in Qwen_3_Omni_Moe (#44848) by @Sai-Suraj-27 in [#44848]
  • Fix TypeError in rope validation when ignore_keys is a list (#45069) by @Fr0do in [#45069]
  • Remove unused TensorFlow env var (#45065) by @Sai-Suraj-27 in [#45065]
  • fix: add identity reverse_op to dequantize ops for save_pretrained (#44983) by @Hyungkeun-Park-Nota in [#44983]
  • Fix when RoPE params are in kwargs (#45049) by @zucchini-nlp in [#45049]
  • chore: update update_metdata.yml (#45054) by @hf-security-analysis[bot] in [#45054]
  • [FA] Fix BC support for a few versions + add deprecation cycle (#45061) by @vasqu in [#45061]
  • fix(testing): Fix Parakeet, Evolla, Pi0, and Phi-3 test failures on main CI (#45004) by @harshaljanjani in [#45004]
  • Allow advanced users to override model_type in AutoConfig.from_pretrained (#45058) by @hmellor in [#45058]
  • Fix failing SmolLM3IntegrationTest (#45048) by @Sai-Suraj-27 in [#45048]
  • chore: remove old extras (#45024) by @tarekziade in [#45024]
  • Embedding VLMs don't need a head (#45000) by @zucchini-nlp in [#45000]
  • Fix GraniteConfig type hints to accept int for multiplier fields (#45019) by @javierdejesusda in [#45019]
  • fix: preserve rotary_pct across save/load cycle in GPTNeoX configs (#44985) by @Krishnachaitanyakc in [#44985]

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ed22699
    • Internalise the NomicBERT model (#43067)
  • @tarekziade
    • Use doc-builder runnable example for GLM-ASR (#44277)
    • refactoring: speedup static checks with disk cache (#44992)
    • feature: added import complexity checker (#45013)
    • refactor: added cache in check_repo (#45012)
    • chore: remove old extras (#45024)
    • chore: Fix mlinter cache location (#45052)
    • refactor: speed up docstring checker (#45009)
  • @Krishnachaitanyakc
    • fix: correct type annotations across config classes for @strict validation (#45007)
    • fix: preserve rotary_pct across save/load cycle in GPTNeoX configs (#44985)
  • @lashahub
    • Add Music Flamingo (#43538)
  • @Lidang-Jiang
    • [Bugfix] Remove incorrect torchvision requirement from PIL backend image processors (#45045)

🚀 Transformers.js v4

We're excited to announce that Transformers.js v4 is now available on NPM! After a year of development (we started in March 2025 🤯), we're finally ready for you to use it.

npm i @huggingface/transformers

Links: YouTube Video, Blog Post, Demo Collection

New WebGPU backend

The biggest change is undoubtedly the adoption of a new WebGPU Runtime, completely rewritten in C++. We've worked closely with the ONNX Runtime team to thoroughly test this runtime across our ~200 supported model architectures, as well as many new v4-exclusive architectures.

In addition to better operator support (for performance, accuracy, and coverage), this new WebGPU runtime allows the same transformers.js code to be used across a wide variety of JavaScript environments, including browsers, server-side runtimes, and desktop applications. That's right, you can now run WebGPU-accelerated models directly in Node, Bun, and Deno!

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformersjs-v4/webgpu.png" alt="WebGPU Overview" width="100%">

We've proven that it's possible to run state-of-the-art AI models 100% locally in the browser, and now we're focused on performance: making these models run as fast as possible, even in resource-constrained environments. This required completely rethinking our export strategy, especially for large language models. We achieve this by re-implementing new models operation by operation, leveraging specialized ONNX Runtime Contrib Operators like com.microsoft.GroupQueryAttention, com.microsoft.MatMulNBits, or com.microsoft.QMoE to maximize performance.

For example, adopting the com.microsoft.MultiHeadAttention operator, we were able to achieve a ~4x speedup for BERT-based embedding models.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformersjs-v4/speedups.png" alt="Optimized ONNX Exports" width="100%">

New models

Thanks to our new export strategy and ONNX Runtime's expanding support for custom operators, we've been able to add many new models and architectures to Transformers.js v4. These include popular models like GPT-OSS, Chatterbox, GraniteMoeHybrid, LFM2-MoE, HunYuanDenseV1, Apertus, Olmo3, FalconH1, and Youtu-LLM. Many of these required us to implement support for advanced architectural patterns, including Mamba (state-space models), Multi-head Latent Attention (MLA), and Mixture of Experts (MoE). Perhaps most importantly, these models are all compatible with WebGPU, allowing users to run them directly in the browser or server-side JavaScript environments with hardware acceleration. We've released several Transformers.js v4 demos so far... and we'll continue to release more!

Additionally, we've added support for larger models exceeding 8B parameters. In our tests, we've been able to run GPT-OSS 20B (q4f16) at ~60 tokens per second on an M4 Pro Max.

New features

ModelRegistry

The new ModelRegistry API is designed for production workflows. It provides explicit visibility into pipeline assets before loading anything: list required files with get_pipeline_files, inspect per-file metadata with get_file_metadata (quite useful to calculate total download size), check cache status with is_pipeline_cached, and clear cached artifacts with clear_pipeline_cache. You can also query available precision types for a model with get_available_dtypes. Based on this new API, progress_callback now includes a progress_total event, making it easy to render end-to-end loading progress without manually aggregating per-file updates.

<details> <summary>See `ModelRegistry` examples</summary>
import { ModelRegistry, pipeline } from "@huggingface/transformers";

const modelId = "onnx-community/all-MiniLM-L6-v2-ONNX";
const modelOptions = { dtype: "fp32" };

const files = await ModelRegistry.get_pipeline_files(
  "feature-extraction",
  modelId,
  modelOptions
);
// ['config.json', 'onnx/model.onnx', ..., 'tokenizer_config.json']

const metadata = await Promise.all(
  files.map(file => ModelRegistry.get_file_metadata(modelId, file))
);

const downloadSize = metadata.reduce((total, item) => total + item.size, 0);

const cached = await ModelRegistry.is_pipeline_cached(
  "feature-extraction",
  modelId,
  modelOptions
);

const dtypes = await ModelRegistry.get_available_dtypes(modelId);
// ['fp32', 'fp16', 'q4', 'q4f16']

if (cached) {
  await ModelRegistry.clear_pipeline_cache(
    "feature-extraction",
    modelId,
    modelOptions
  );
}

const pipe = await pipeline(
  "feature-extraction",
  modelId,
  {
    progress_callback: e => {
      if (e.status === "progress_total") {
        console.log(`${Math.round(e.progress)}%`);
      }
    },
  }
);
</details>

New Environment Settings

We also added new environment controls for model loading. env.useWasmCache enables caching of WASM runtime files (when cache storage is available), allowing applications to work fully offline after the initial load.

env.fetch lets you provide a custom fetch implementation for use cases such as authenticated model access, custom headers, and abortable requests.

<details> <summary>See env examples</summary>
import { env } from "@huggingface/transformers";

env.useWasmCache = true;

env.fetch = (url, options) =>
  fetch(url, {
    ...options,
    headers: {
      ...options?.headers,
      Authorization: `Bearer ${MY_TOKEN}`,
    },
  });
</details>

Improved Logging Controls

Finally, logging is easier to manage in real-world deployments. ONNX Runtime WebGPU warnings are now hidden by default, and you can set explicit verbosity levels for both Transformers.js and ONNX Runtime. This update, also driven by community feedback, keeps console output focused on actionable signals rather than low-value noise.

<details> <summary>See `logLevel` example</summary>
import { env, LogLevel } from "@huggingface/transformers";

// LogLevel.DEBUG
// LogLevel.INFO
// LogLevel.WARNING
// LogLevel.ERROR
// LogLevel.NONE

env.logLevel = LogLevel.WARNING;
</details>

Repository Restructuring

Developing a new major version gave us the opportunity to invest in the codebase and tackle long-overdue refactoring efforts.

PNPM Workspaces

Until now, the GitHub repository served as our npm package. This worked well as long as the repository only exposed a single library. However, looking to the future, we saw the need for various sub-packages that depend heavily on the Transformers.js core while addressing different use cases, like library-specific implementations, or smaller utilities that most users don't need but are essential for some.

That's why we converted the repository to a monorepo using pnpm workspaces. This allows us to ship smaller packages that depend on @huggingface/transformers without the overhead of maintaining separate repositories.

Modular Class Structure

Another major refactoring effort targeted the ever-growing models.js file. In v3, all available models were defined in a single file spanning over 8,000 lines, becoming increasingly difficult to maintain. For v4, we split this into smaller, focused modules with a clear distinction between utility functions, core logic, and model-specific implementations. This new structure improves readability and makes it much easier to add new models. Developers can now focus on model-specific logic without navigating through thousands of lines of unrelated code.

Examples Repository

In v3, many Transformers.js example projects lived directly in the main repository. For v4, we've moved them to a dedicated repository, allowing us to maintain a cleaner codebase focused on the core library. This also makes it easier for users to find and contribute to examples without sifting through the main repository.

Prettier

We updated the Prettier configuration and reformatted all files in the repository. This ensures consistent formatting throughout the codebase, with all future PRs automatically following the same style. No more debates about formatting... Prettier handles it all, keeping the code clean and readable for everyone.

Standalone Tokenizers.js Library

A frequent request from users was to extract the tokenization logic into a separate library, and with v4, that's exactly what we've done. @huggingface/tokenizers is a complete refactor of the tokenization logic, designed to work seamlessly across browsers and server-side runtimes. At just 8.8kB (gzipped) with zero dependencies, it's incredibly lightweight while remaining fully type-safe.

<details> <summary>See example code</summary>
import { Tokenizer } from "@huggingface/tokenizers";

// Load from Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`
).then(res => res.json());

const tokenizerConfig = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`
).then(res => res.json());

// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Tokenize text
const tokens = tokenizer.tokenize("Hello World");
// ['Hello', 'ĠWorld']

const encoded = tokenizer.encode("Hello World");
// { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], ... }
</details>

This separation keeps the core of Transformers.js focused and lean while offering a versatile, standalone tool that any WebML project can use independently.

New build system

We've migrated our build system from Webpack to esbuild, and the results have been incredible. Build times dropped from 2 seconds to just 200 milliseconds, a 10x improvement that makes development iteration significantly faster. Speed isn't the only benefit, though: bundle sizes also decreased by an average of 10% across all builds. The most notable improvement is in transformers.web.js, our default export, which is now 53% smaller, meaning faster downloads and quicker startup times for users.

Improved types

We've made several quality-of-life improvements across the library. The type system now includes dynamic pipeline types that adapt to their inputs, providing a better developer experience and stronger type safety.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformersjs-v4/types.png" alt="Type Improvements" width="100%">

Bug fixes

Documentation improvements

Miscellaneous improvements

New Contributors

Full Changelog: https://github.com/huggingface/transformers.js/compare/3.8.1...4.0.0

Transformers · v5.4.0

New Model additions

VidEoMT
<img width="1480" height="460" alt="image" src="https://github.com/user-attachments/assets/bec6fc25-b0ab-4227-8c2b-a838554f37f3" />

Video Encoder-only Mask Transformer (VidEoMT) is a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It eliminates the need for dedicated tracking modules by introducing a lightweight query propagation mechanism that carries information across frames and employs a query fusion strategy that combines propagated queries with temporally-agnostic learned queries. VidEoMT achieves competitive accuracy while being 5x-10x faster than existing approaches, running at up to 160 FPS with a ViT-L backbone.

Links: Documentation | Paper

UVDoc
<img width="1765" height="875" alt="image" src="https://github.com/user-attachments/assets/365e510e-8fb8-46cb-8f4b-e8b7082f0ae2" />

UVDoc is a machine learning model for document image rectification. It applies geometric transformations to document images to correct distortion, skew, perspective deformation, and similar problems. It supports both single-image and batched inference.

Links: Documentation

Jina Embeddings v3
<img width="595" height="513" alt="image" src="https://github.com/user-attachments/assets/2aee0692-8286-4c6b-98db-847b95ab2d40" />

Jina-Embeddings-v3 is a multilingual, multi-task text embedding model designed for a variety of NLP applications. Based on the XLM-RoBERTa architecture, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE) to support input sequences of up to 8192 tokens. It also features 5 built-in task-specific LoRA adapters that let the model generate task-specific embeddings (e.g., for retrieval vs. classification) without significantly increasing inference latency.

Links: Documentation | Paper

Mistral4
<img width="2429" height="1787" alt="image" src="https://github.com/user-attachments/assets/a6feb0da-8504-4eab-be65-22d6c676336f" />

Mistral 4 is a hybrid model that can act as both a general instruction model and a reasoning model. It combines the capabilities of three model families - Instruct, Reasoning (previously called Magistral), and Devstral - in a single model. It uses a MoE architecture with 128 experts, 4 of which are active (119B parameters, 6.5B activated per token), offers a 256k context length, and accepts multimodal input covering both text and images.

Links: Documentation

PI0

PI0 is a vision-language-action model for robotic manipulation that jointly processes visual observations and language instructions to generate robot actions. It uses a novel flow matching architecture built on top of a pre-trained vision-language model to inherit Internet-scale semantic knowledge. The model can perform complex dexterous tasks such as folding laundry, cleaning tables, and assembling boxes across multiple robot platforms, including single-arm robots, dual-arm robots, and mobile manipulators.

Links: Documentation | Paper

  • Add model lerobot PI0 to transformers (#44160) by @molbap in #44160

SLANeXt

SLANeXt is a series of dedicated lightweight models for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. The SLANeXt series is a new generation of table structure recognition models independently developed by the Baidu PaddlePaddle Vision Team, with dedicated weights trained separately for wired and wireless tables. The recognition ability for all types of tables has been significantly improved, especially for wired tables.

Links: Documentation

PP-OCRv5_mobile_rec

PP-OCRv5_mobile_rec is a dedicated lightweight model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.

Links: Documentation

  • [Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808) by @zhang-prog in #44808

PP-OCRv5_server_rec

PP-OCRv5_server_rec is a dedicated lightweight model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.

Links: Documentation

  • [Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808) by @zhang-prog in #44808

PP-OCRv5_mobile_det

PP-OCRv5_mobile_det is a dedicated lightweight model for text detection, focusing specifically on efficient detection and understanding of text elements in multi-language documents and natural scenes. It is part of the latest generation of text detection models developed by the PaddleOCR team that efficiently and accurately supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.

Links: Documentation

PPLCNet

PP-LCNet is a family of efficient, lightweight convolutional neural networks designed for real-world document understanding and OCR tasks. It balances accuracy, speed, and model size, making it ideal for both server-side and edge deployment. The model has three main variants optimized for specific tasks: document image orientation classification, table classification, and text line orientation classification.

Links: Documentation

PPLCNetV3

PPLCNetV3 is a lightweight CPU-optimized convolutional backbone designed for efficient image classification and downstream vision tasks. It builds on the PP-LCNet architecture with improved training strategies and structural refinements for better accuracy-latency tradeoffs on CPU hardware.

Links: Documentation | Paper

PP-OCRv5_server_det

PP-OCRv5_server_det is a high-performance text detection model optimized for server-side applications, focusing on accurate detection of multi-language text in documents and natural scenes. It supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.

Links: Documentation

CHMv2

CHMv2 is a global, meter-resolution canopy height mapping model that uses DINOv3 to estimate forest canopy heights from high-resolution optical satellite imagery. Building on the original canopy height maps released in 2024, CHMv2 delivers substantial improvements in accuracy, detail, and global consistency by leveraging Meta's self-supervised vision model. The model is trained against airborne laser scanning data and provides essential information for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure.

Links: Documentation | Paper | Blog Post

Breaking changes

The dual BaseImageProcessor/BaseImageProcessorFast design has been replaced with a unified backend architecture, and the image_processing_utils_fast module has been removed — users should migrate to the new unified image_processing_utils module.

  • 🚨🚨 Refactor Image Processors to support different backends (#43514) by @yonigozlan

PreTrainedConfig and model config classes have been refactored to use @dataclass and no longer accept positional arguments — users must update any config instantiation calls to use keyword arguments only.

Flash Attention 2 (FA2) support now requires version 2.3.3 or newer, and initial Flash Attention 4 (FA4) support has been added — users on older FA2 versions must upgrade to at least 2.3.3.

  • 🚨 [FA4] Initial support (#42435) by @vasqu
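
If you want to check an environment against the new FA2 minimum before upgrading, a small standard-library sketch could look like the following. The helper names are hypothetical, the distribution name `flash-attn` is an assumption about how the package is installed, and the library's own detection logic may differ.

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.3.3' or '2.3.3.post1' into a tuple of its leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component, e.g. 'post1'
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "2.3.3") -> bool:
    """Compare two version strings component by component."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(dist_name: str = "flash-attn") -> Optional[str]:
    """Return the installed version of dist_name, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None
```

A caller would combine the two, for example `v = installed_version(); ok = v is not None and meets_minimum(v)`.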

Weight tying behavior has changed so that weights are now tied even when both keys are already present in a checkpoint — users relying on the previous behavior (e.g., with .bin checkpoints containing duplicate keys) should verify their models load as expected.

  • [tie weights] 🚨 If both weights are present with same weights, still tie them (#44497) by @Cyrilvallez
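
One way to find checkpoints affected by this change is to look for tied key pairs that are both stored. The sketch below uses plain nested lists in place of tensors, and the key names and the `find_duplicate_tied_keys` helper are hypothetical; with real tensors, an equality check such as `torch.equal` would replace the `==` comparison.

```python
def find_duplicate_tied_keys(state_dict, tied_pairs):
    """List tied (target, source) pairs stored twice, and whether values match."""
    report = []
    for target, source in tied_pairs:
        if target in state_dict and source in state_dict:
            same = state_dict[target] == state_dict[source]
            report.append((target, source, same))
    return report


# Toy checkpoint in which both the embedding and the output head are saved.
checkpoint = {
    "embed_tokens.weight": [[0.1, 0.2], [0.3, 0.4]],
    "lm_head.weight": [[0.1, 0.2], [0.3, 0.4]],
}
report = find_duplicate_tied_keys(
    checkpoint, [("lm_head.weight", "embed_tokens.weight")]
)
```

Pairs reported with matching values are the ones the PR title describes as still being tied after loading.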

The cache_position argument has been removed from the forward signatures of most major models — users passing cache_position directly to these models should remove it, as it is now handled internally by generate.

  • [core] 🚨 Completely remove cache positions (#44181) by @Cyrilvallez
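
For calling code that has to work with forward functions both with and without `cache_position`, one possible approach is to filter keyword arguments against the callee's signature. This is a generic sketch, not a Transformers utility; `drop_unsupported_kwargs` and `toy_forward` are hypothetical names.

```python
import inspect


def drop_unsupported_kwargs(fn, kwargs, names=("cache_position",)):
    """Return a copy of kwargs without the listed names that fn cannot accept."""
    params = inspect.signature(fn).parameters
    accepts_var_kwargs = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if accepts_var_kwargs:
        return dict(kwargs)  # fn takes **kwargs, so nothing needs removing
    return {k: v for k, v in kwargs.items() if k not in names or k in params}


def toy_forward(input_ids, attention_mask=None):
    """Hypothetical forward whose signature no longer has cache_position."""
    return len(input_ids)


call_kwargs = {"input_ids": [1, 2, 3], "cache_position": [0, 1, 2]}
safe_kwargs = drop_unsupported_kwargs(toy_forward, call_kwargs)
result = toy_forward(**safe_kwargs)
```

Where only the new signatures need to be supported, simply deleting the `cache_position` argument at the call site is the more direct fix.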

Parallelization

Several bug fixes and improvements were made to pipeline parallel (PP) and tensor parallel (TP) support, including fixing supports_tp/pp_plan detection, resolving attribute errors in PP for Qwen2VL-based models, correcting FSDP loading with meta devices, and ensuring TP weight sharding properly updates parent module attributes (e.g., in_features/out_features) to improve compatibility with libraries like PEFT.

  • Fix several based models' pipeline parallel support (#44699) by @hmellor in [#44699]
  • [Model] Add PP-Chart2Table Model Support (#43767) by @XingweiDeng in [#43767]
  • enable tp for benchmark (#43750) by @sywangyi in [#43750]
  • Fix supports_{tp/pp}_plan (#44696) by @hmellor in [#44696]
  • Allow to disable stdout hiding for TP (#44608) by @michaelbenayoun in [#44608]
  • fix FSDP loading with meta devices (#44473) by @winglian in [#44473]
  • Fix: Conditionally import torch.distributed.fsdp in trainer_seq2seq.py (#44507) by @0xDELUXA in [#44507]
  • Supplement skip logic for XPU in the CPU-only tp tests (#44536) by @YangKai0616 in [#44536]
  • Update parent module attributes when sharding with TP (#44421) by @michaelbenayoun in [#44421]
  • trigger tensor parallel utils test in the CI (#44460) by @3outeille in [#44460]
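
The parent-attribute fix can be pictured with a toy layer: once a weight is sharded along its output dimension, metadata such as `out_features` should describe the local shard, since downstream libraries may read it to size their own parameters. The sketch below is a plain-Python illustration with hypothetical names (`ToyLinear`, `shard_column_parallel`); it is not the library's sharding code.

```python
class ToyLinear:
    """Hypothetical linear layer; weight is out_features rows of in_features."""

    def __init__(self, in_features: int, out_features: int):
        self.in_features = in_features
        self.out_features = out_features
        self.weight = [[0.0] * in_features for _ in range(out_features)]


def shard_column_parallel(layer: ToyLinear, rank: int, world_size: int) -> None:
    """Keep only this rank's slice of output rows and update out_features."""
    if layer.out_features % world_size != 0:
        raise ValueError("out_features must be divisible by world_size")
    shard = layer.out_features // world_size
    layer.weight = layer.weight[rank * shard : (rank + 1) * shard]
    layer.out_features = shard  # keep metadata consistent with the local weight


layer = ToyLinear(in_features=8, out_features=16)
shard_column_parallel(layer, rank=1, world_size=4)
```

Without the final assignment, `layer.out_features` would still report 16 while the local weight holds only 4 rows, which is the kind of mismatch the fix addresses.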

Quantization

Quantization support was improved with up to 30x faster FP8 grouped and batched matmuls, static FP8 expert support for multi-GPU setups, and a torchao minimum version bump to 0.15.0. Additionally, MXFP4 dependency error messages were made more actionable, and AWQ tests were updated to align with the GPTQModel migration.

  • fix: split MXFP4 dependency checks for specific error messages (#44930) by @javierdejesusda in [#44930]
  • Add static FP8 expert support (#44895) by @SunMarc in [#44895]
  • Bump torchao >=0.15 and fix quantization CI (#44604) by @SunMarc in [#44604]
  • Fix AWQ tests for GPTQModel migration (#44654) by @jiqing-feng in [#44654]
  • [Performance] FP8 Grouped and Batched Matmuls (#44231) by @IlyasMoutawwakil in [#44231]
  • Fix PR comment CI for quantization job (#44579) by @ydshieh in [#44579]

Tokenization

Several performance improvements were made to tokenizer loading and saving, including eliminating redundant file parsing and unnecessary deep copies of large vocabularies that caused significant overhead. Additionally, bug fixes were applied for incorrect tokenizer class names on the Hub (DeepSeek V2/V3, ModernBERT), a clean_up_tokenization_spaces misconfiguration in Llama 3 tokenizer conversion, and a string replacement issue in AutoTokenizer class name resolution.

  • fix: improve processor loading performance by avoiding redundant tokenizer parsing (#44927) by @ydshieh in [#44927]
  • fix processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894) by @ydshieh in [#44894]
  • fix: set clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion (#44914) by @maxsloef-goodfire in [#44914]
  • deepseek_v2, deepseek_v3, and modernbert fix for having incorrect tokenizer class on the hub (#44801) by @itazap in [#44801]
  • Add XPU Expectations for vibe voice acoustic tokenizer tests (#44428) by @kaixuanliu in [#44428]
  • fix(tokenizer): Only strip Fast from class names in AutoTokenizer if used as a suffix (#44443) by @harshaljanjani in [#44443]
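
The class name resolution fix comes down to the difference between replacing a substring anywhere and removing it only as a suffix. The sketch below shows that difference with two hypothetical helpers; it is not the actual `AutoTokenizer` code, and the class names are used purely as example strings.

```python
def strip_fast_naive(class_name: str) -> str:
    """Remove every occurrence of 'Fast', including ones inside the name."""
    return class_name.replace("Fast", "")


def strip_fast_suffix(class_name: str) -> str:
    """Remove 'Fast' only when it is the trailing suffix (Python 3.9+)."""
    return class_name.removesuffix("Fast")


name = "FastSpeech2ConformerTokenizer"
naive = strip_fast_naive(name)    # the leading 'Fast' is lost
fixed = strip_fast_suffix(name)   # the name is left unchanged
```

Names that really do end in `Fast`, such as `BertTokenizerFast`, are handled identically by both helpers.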

Kernels

Kernel support has been expanded with Flash Attention 4 fallback integration, a paged_attention kernel for continuous batching, and Neuron device support for custom kernels. Several stability fixes were also made, including bumping the kernels version dependency to prevent crashes and correcting the LFM2 kernel path.

  • [FA4] Add kernels fallback (#44797) by @vasqu in [#44797]
  • Bump kernels version dependency to avoid crashes (#44887) by @Cyrilvallez in [#44887]
  • Fix lfm2 kernel path (#44634) by @Cyrilvallez in [#44634]
  • [CB] Add paged_attention kernel (#44379) by @remi-or in [#44379]
  • Neuron kernels integration (#44417) by @michaelbenayoun in [#44417]

Cache

Several cache-related fixes and improvements were made, including aligning LFM2's cache implementation with other Mamba caches, fixing a tensor indexing crash in KV cache continuation for the transformers serve streaming endpoint, and resolving a generation bug in Idefics3 when using use_cache=False. A caching layer was also added to the model linter to skip unchanged valid files and improve build performance.

  • Align lfm2 cache to other mamba caches (#44866) by @Cyrilvallez in [#44866]
  • feat: added cache to the model linter (#44790) by @tarekziade in [#44790]
  • Fix tensor indexing crash in serve generate_response KV cache continuation (#44735) by @mango766 in [#44735]
  • Idefics3 without cache fix (#44607) by @gabe-l-hart in [#44607]
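
The linter cache idea, skipping files that are unchanged and already known to be valid, can be sketched with a content hash per file. This is an assumption about the general technique rather than a description of the linter's implementation; `lint_with_cache`, the `check` callback, and the dictionary-based cache are all hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Callable, Dict, Iterable, List


def lint_with_cache(
    paths: Iterable[Path],
    check: Callable[[Path], bool],
    cache: Dict[str, str],
) -> List[Path]:
    """Run check(path) only on files not cached as valid; return files checked."""
    checked = []
    for path in paths:
        digest = hashlib.sha256(path.read_bytes()).hexdigest()
        if cache.get(str(path)) == digest:
            continue  # unchanged since it last passed, so skip it
        checked.append(path)
        if check(path):
            cache[str(path)] = digest  # remember only files that passed
        else:
            cache.pop(str(path), None)  # failing files are re-checked next run
    return checked
```

Only passing files are cached, so a file with violations is re-linted on every run until it is fixed.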

Vision

Fixed backward compatibility for full-path imports of Fast Image Processors and resolved a Llama4 vision rotary embedding initialization error where freqs_ci was not registered as a buffer, causing failures when loading models with device_map="auto".

  • Fix backward compatibility for full path imports of Fast Image Processors (#44926) by @yonigozlan in [#44926]
  • fix(models, testing): Fix Llama4 vision rotary meta tensor initialization and MyT5 get_tokenizer signature (#44581) by @harshaljanjani in [#44581]
  • Fix AMD Docker image build timeout by pinning Flash Attention commit (#44546) by @Abdennacer-Badaoui in [#44546]

Generation

The cache_position argument has been fully removed from the generation pipeline, as all models have been updated to no longer use it (with a backward-compatibility path retained for remote code models). Additionally, integration tests for LASR with chunked decoding were added, and outdated references to deprecated pipeline tasks were cleaned up.

  • [generate] Never use cache_position anymore in generation (#44816) by @Cyrilvallez in [#44816]
  • Add an integration test for LASR using pipe and chunked decoding (#42823) by @kho in [#42823]
  • Fix: Remove references to text2text-generation, summarization and translation pipeline tasks (#44510) by @math-hiyoko in [#44510]

Bugfixes and improvements

  • Dynamic weight conversion is recursive (#44300) by @zucchini-nlp in [#44300]
  • Don't run tests_hub if no tests found (#45014) by @ydshieh in [#45014]
  • Fix type hint for attention_chunk_size in Llama4TextConfig (#45002) by @hmellor in [#45002]
  • Fix AutoProcessor.from_pretrained silently dropping hub kwargs (#44710) by @he-yufeng in [#44710]
  • Fix maybe_autocast crashing on meta device tensors (#44984) by @Butanium in [#44984]
  • fix: remove Copied from comments between @torch.jit.script and def for Python 3.13 compat (#44986) by @Krishnachaitanyakc in [#44986]
  • More small vllm fixes (#44990) by @ArthurZucker in [#44990]
  • fix(models): Fix Perceiver interpolate_pos_encoding interpolating to the source size (#44899) by @harshaljanjani in [#44899]
  • Allow mm_token_type be non-padded lists (#44563) by @zucchini-nlp in [#44563]
  • Fix CPU 16 bytes alignment issue using equivalent fallback (#44970) by @IlyasMoutawwakil in [#44970]
  • refactor: unify QA calls (#44879) by @tarekziade in [#44879]
  • Fix tie_word_embedding issues with Qwen2VL (#44976) by @hmellor in [#44976]
  • Support Modular (!!) + Configs in check_auto_docstrings (#44803) by @yonigozlan in [#44803]
  • [ vllm x v5] nit (#44971) by @ArthurZucker in [#44971]
  • LwDetrImageLoss: Fix dtype casting to prevent crash when using amp on cuda device (#44886) by @m-matthias in [#44886]
  • [AMD CI] Gemma3/Gemma3n Expectations (#44972) by @Abdennacer-Badaoui in [#44972]
  • Officially launch parse_response (#44674) by @Rocketknight1 in [#44674]
  • fix load_best_model_checkpoint_at_end do not load the best model chec… (#44583) by @wilnn in [#44583]
  • Fix failing T5ModelIntegrationTest (#44934) by @Sai-Suraj-27 in [#44934]
  • Config kwargs (#44953) by @zucchini-nlp in [#44953]
  • [CB] [Minor] Simplify test suite (#44858) by @remi-or in [#44858]
  • Allow arbitrary template kwargs in processors (#44881) by @zucchini-nlp in [#44881]
  • Fix missing post_processor in DebertaV2Tokenizer causing no special t… (#44570) by @umbilnm in [#44570]
  • incorrect model list update (#44880) by @itazap in [#44880]
  • refactor: mlinter as its own package (#44939) by @tarekziade in [#44939]
  • [CB] Add an option to return logprobs (#44835) by @remi-or in [#44835]
  • [docs] peft (#44804) by @stevhliu in [#44804]
  • Continuous batching thread safety (#44924) by @Qubitium in [#44924]
  • Fix variable shadowing in pipeline example and typo in BART docs (BERT → BART) (#44935) by @VanshikaSohal in [#44935]
  • Fix failing job Update Transformers metadata after #43514 (#44941) by @ydshieh in [#44941]
  • Clearer type hints and fix rope validation in configs (#44943) by @zucchini-nlp in [#44943]
  • Correct docstrings for from_pretrained (url input deprecated) (#44946) by @BSchilperoort in [#44946]
  • fix(i18n): replace broken relative links to awesome-transformers.md with absolute URLs (#44905) by @NicoleRobin in [#44905]
  • chore(typing): added rule 11 (#44865) by @tarekziade in [#44865]
  • fix(camembert): add tie_word_embeddings=True to CamembertConfig (#44931) by @r266-tech in [#44931]
  • Support SizeDict import in get_size_dict (#44903) by @yonigozlan in [#44903]
  • Add big angry code agent warnings! (#44890) by @Rocketknight1 in [#44890]
  • [docs] model cards (#44837) by @stevhliu in [#44837]
  • Add backward compatibility for direct imports from legacy image_processing_utils_fast (#44897) by @yonigozlan in [#44897]
  • Fix core dumped when NemotronH is torch compiled (#44854) by @ydshieh in [#44854]
  • fix(testing): Fix PaliGemma 2 and PaddleOCR-VL test failures on main (#44765) by @harshaljanjani in [#44765]
  • Fix dtype guessing from state dict (#44883) by @Cyrilvallez in [#44883]
  • Add missing dunder methods to SizeDict (#44884) by @hmellor in [#44884]
  • Fix VL model rope_deltas batch size mismatch in online RL training (#44873) by @sergiopaniego in [#44873]
  • Fix layer_types type hint for AFMoE and Llama4 (#44874) by @hmellor in [#44874]
  • Fix nemotron config docstrings (#44878) by @Cyrilvallez in [#44878]
  • Fix nemotron_h modular (#44876) by @Cyrilvallez in [#44876]
  • [Mistral] Fix query scaling for Mistral4 and Ministral3 (#44860) by @Cyrilvallez in [#44860]
  • Update some type hints (#44851) by @zucchini-nlp in [#44851]
  • Fix glm dsa (#44564) by @ArthurZucker in [#44564]
  • Update AFMoE architecture to use v5-style MoE impl (#44063) by @AutumnAurelium in [#44063]
  • Fix KeyError in convert_to_native_format for dict vocab (#44452) by @<NOT FOUND> in [#44452]
  • fix: XLNet: relative_positional_encoding computes on CPU every forward (#44782) by @JiwaniZakir in [#44782]
  • Fix annotations reader for python 3.14 in PreTrainedModel (#44672) by @neo in [#44672]
  • [CB] Better parametrization for compile (#44578) by @remi-or in [#44578]
  • Fix KeyError when patching mistral regex (#43376) by @LeonardoEmili in [#43376]
  • Correct code block formatting in weightconverter.md (#44839) by @zhulinchng in [#44839]
  • feat(ci): added a network debug report (#44636) by @tarekziade in [#44636]
  • Add GreedyLR adaptive learning rate scheduler (#44271) by @balak4 in [#44271]
  • Fix unexpected position_ids keys when loading OwlViT models (#44508) by @KartikPawade in [#44508]
  • Update more modular examples (#44834) by @Cyrilvallez in [#44834]
  • Fix and re-run modular converter on examples (#44833) by @Cyrilvallez in [#44833]
  • Remove cache_position in more models (4 and last one) (#44828) by @Cyrilvallez in [#44828]
  • Fix loading issue in Sam3 (#44831) by @zucchini-nlp in [#44831]
  • feat(integration): Add KubeflowCallback to enable automatic progress … (#44487) by @abhijeet-dhumal in [#44487]
  • Add GGUF support for MiniMax-M2.1 model (#44526) by @JoursBleu in [#44526]
  • Centralize AI agent templates in .ai (#44489) by @tarekziade in [#44489]
  • support xxxFast alias in v5 tokenizers (#44766) by @itazap in [#44766]
  • Remove cache_position in more models (3) (#44759) by @Cyrilvallez in [#44759]
  • [CI] Temporarily skip Mistral4 tests as they almost all fail (#44825) by @Cyrilvallez in [#44825]
  • [Gemma] Update conversion scripts for Transformers v5 Compatibility (#44631) by @RyanMullins in [#44631]
  • fix bug embedding_size mismatch with hidden_size in electra model test (#44657) by @kaixuanliu in [#44657]
  • Fix pegasus conversion (#44571) by @ArthurZucker in [#44571]
  • Fix repo-check bot (#44812) by @ydshieh in [#44812]
  • [docs] is_causal feature (#44777) by @stevhliu in [#44777]
  • docs(tasks): remove references to removed question-answering pipeline (#44787) by @<NOT FOUND> in [#44787]
  • Fix configs with @strict (#44770) by @zucchini-nlp in [#44770]
  • [AMD CI] Fix test failures across important models (#44632) by @Abdennacer-Badaoui in [#44632]
  • Move VLM conversions to the main mapping (#44627) by @zucchini-nlp in [#44627]
  • Fix config loading issues (type issues) (#44789) by @ydshieh in [#44789]
  • Remove is_causal from EuroBertConfig (#44774) by @ydshieh in [#44774]
  • model-linter: Added rule 10 (#44761) by @tarekziade in [#44761]
  • [fix] mistral 4 docs (#44776) by @stevhliu in [#44776]
  • Fix: Eurobert model was missing @strict decorator and invalid test kwargs (#44767) by @tarekziade in [#44767]
  • fix: sig lip import (#44764) by @tarekziade in [#44764]
  • Disable async loading when quantizing on the fly (#44576) by @SunMarc in [#44576]
  • [MistralCommonBackend] Upgrade mistral-common to v1.10.0 (#44656) by @juliendenize in [#44656]
  • Fix mlcd auto config/model/mapping issues (#44730) by @ydshieh in [#44730]
  • Fix bug and add XPU Expectations for qwen2 and jamba tests (#44733) by @kaixuanliu in [#44733]
  • [medasr] doc update (#44633) by @eustlb in [#44633]
  • Fix missing / incorrect config class in some model class definitions (#44715) by @ydshieh in [#44715]
  • Update Nvidia CI docker file to use torch 2.10 (#44712) by @ydshieh in [#44712]
  • [FA] Fix fa detection (#44703) by @vasqu in [#44703]
  • Fix set_encoder (#44698) by @hmellor in [#44698]
  • [docs] cb config (#44675) by @stevhliu in [#44675]
  • Fix more model tester missing parent issue (#44685) by @ydshieh in [#44685]
  • Add register method for ParallelInterface (#44640) by @michaelbenayoun in [#44640]
  • [CB] [Bug] Fix crashes when running without cuda (#44673) by @remi-or in [#44673]
  • Another (small) set of fixes required for tiny model creation (#44666) by @ydshieh in [#44666]
  • Fix CookieCutter (#44334) by @NielsRogge in [#44334]
  • pipelines do not have modelcard (#44621) by @KoichiYasuoka in [#44621]
  • [Chmv2] Fix conversion after capture refactor (#44665) by @vasqu in [#44665]
  • [CB] Add dedicated config (#44434) by @remi-or in [#44434]
  • fix(models): Forward timm model kwargs to timm.create_model for OmDet-Turbo (#44611) by @harshaljanjani in [#44611]
  • Ensure same dtype for subconfig when _from_config (#44629) by @zucchini-nlp in [#44629]
  • Remove cache_position in more models (2) (#44602) by @Cyrilvallez in [#44602]
  • fix: cast to proper dtype in EmbeddingParallel (#44612) by @michaelbenayoun in [#44612]
  • Remove many output_attentions and other traced outputs on 100+ models (#43590) by @molbap in [#43590]
  • fix: raise error if mm_token_type_ids not supplied (#44433) by @leopold-tzafon in [#44433]
  • Fix output capturing for Backbones (#44638) by @Cyrilvallez in [#44638]
  • Fix for VibeVoiceAcousticTokenizer (#44628) by @ydshieh in [#44628]
  • Fix off-by-one in decode_spans boundary check (#44584) by @mvanhorn in [#44584]
  • Fix more wrong HF hub checkpoint names (#44624) by @ydshieh in [#44624]
  • Update agentic contributions guidelines in AGENTS.md to force yielding. (#44411) by @burtenshaw in [#44411]
  • Expand model-structure lint rules with a fast AST-based, ruff-like framework (#44174) by @tarekziade in [#44174]
  • feat: add neuron in tensor parallelism initialization (#44498) by @michaelbenayoun in [#44498]
  • [WIP] FIX Make Mixtral LoRA loading work (#44478) by @BenjaminBossan in [#44478]
  • Fix Llava tests for torch too! (#44476) by @Rocketknight1 in [#44476]
  • Fix training ci and clean some tests (#44491) by @SunMarc in [#44491]
  • Remove useless identity assignment (#44600) by @Cyrilvallez in [#44600]
  • Add Yoni to run-slow workflow (#44598) by @vasqu in [#44598]
  • Add shared VLM tests (#42964) by @Rocketknight1 in [#42964]
  • Fix wrong (non-existing) checkpoints (#44549) by @ydshieh in [#44549]
  • Remove cache_position in more models (#44330) by @Cyrilvallez in [#44330]
  • Fix CircleCI summary report not showing due to missing dependency (#44597) by @ydshieh in [#44597]
  • Fix typos in add_new_model_like docstrings (#43544) by @Olexandr88 in [#43544]
  • Fix UnboundLocalError for tp_plan_alt when tp_plan is empty (#44540) by @YangKai0616 in [#44540]
  • FIX Multiple PEFT errors after v5 transition (#44592) by @BenjaminBossan in [#44592]
  • Fix missing BPE token conversion step in Chameleon (#44582) by @yonigozlan in [#44582]
  • Make paligemma embed tokens standard (#44432) by @zucchini-nlp in [#44432]
  • chore(typing): Add type checking to src/transformers/quantizers (#44412) by @tarekziade in [#44412]
  • Fix: AQLM quantizer to match updated replace_with_aqlm_linear signature (#44577) by @tarekziade in [#44577]
  • [device_map] Fix device_map computation by correctly adjusting memory available (#44565) by @Cyrilvallez in [#44565]
  • Fix error message label and docstring default in load_sharded_checkpoint (#44523) by @jnMetaCode in [#44523]
  • Correct Tapas initialization (#44575) by @Rocketknight1 in [#44575]
  • [fix] Prevent crash with Apertus without xielu installed (#44567) by @tomaarsen in [#44567]
  • Fix failing MusicgenStereo integration tests (#44527) by @Sai-Suraj-27 in [#44527]
  • Fix zamba2 rotary embedding call when use_mem_rope is False (#44551) by @echarlaix in [#44551]
  • [Bugfix] fix video inference of qwen3vl and qwen3.5 series (#44474) by @JJJYmmm in [#44474]
  • add XPU Expectations for higgs_audio_v2 tests (#44482) by @kaixuanliu in [#44482]
  • chameleon added to MODELS_WITH_INCORRECT_HUB_TOKENIZER_CLASS (#44475) by @itazap in [#44475]
  • Revert "test merge queue 1" (#44552) by @ydshieh in [#44552]
  • test merge queue 1 (#44529) by @ydshieh2 in [#44529]
  • fix(testing): Fix MoonshineEncoder UnboundLocalError and Florence2VisionBackbone dtype mismatch (#44503) by @harshaljanjani in [#44503]
  • Fix: Remove references to transformers run command (#44513) by @math-hiyoko in [#44513]
  • [LW-DETR] Fix training (#44441) by @NielsRogge in [#44441]
  • Make _prepare_input_fn and _prepare_output_fn instance methods (#44499) by @michaelbenayoun in [#44499]
  • Fix ShieldGemma2 non-reproducible outputs by adding _tied_weights_keys (#44358) by @hardikmeisheri in [#44358]
  • Tensor Parallelism and mps device (#44506) by @michaelbenayoun in [#44506]
  • Fix failing GPTNeoModelLanguageGenerationTest (#44515) by @Sai-Suraj-27 in [#44515]
  • Fix failing MarianIntegrationTests (#44519) by @Sai-Suraj-27 in [#44519]
  • fix pin_memory for continuous batching (#44455) by @jiqing-feng in [#44455]
  • Fix continuous batching for multimodal models (#44436) by @jw9603 in [#44436]
  • Fix KeyError in _parse_type_hint when Union contains Any (#44525) by @jnMetaCode in [#44525]
  • Fix AssistantTracker.is_active() returning False after activation with empty lists (#44524) by @jnMetaCode in [#44524]
  • Fix and re-enable extra_state tests (#43510) by @pstjohn in [#43510]
  • Fix ansi codes in loading reports when not connected to terminal (#44544) by @Cyrilvallez in [#44544]
  • Follow-up typing checking fixes (#44500) by @tarekziade in [#44500]
  • Fix backend dependency (#44542) by @Cyrilvallez in [#44542]
  • Add a new job in build_pr_documentation.yml (will be the new required job) (#44538) by @ydshieh in [#44538]
  • Update build_pr_documentation workflow for merge_group event (#44532) by @ydshieh in [#44532]
  • Fixed typo in docs/source/en/kv_cache.md (#44501) by @frogNotToad in [#44501]
  • Docs: fix SigLIP2 usage examples (#43641) by @KOKOSde in [#43641]
  • Fix type checker (#44502) by @Cyrilvallez in [#44502]
  • Add MLU bf16 support to is_torch_bf16_gpu_available (#44381) by @carcel-yu in [#44381]
  • fix model parallelism bug for eurobert model (#44490) by @kaixuanliu in [#44490]
  • Update ty to 0.0.20 (#44494) by @tarekziade in [#44494]
  • Add auto-docstring on configs (#44296) by @zucchini-nlp in [#44296]
  • Fix failed unit tests for moonshine_streaming model (#43936) by @kaixuanliu in [#43936]
  • Update distributed tests (#44338) by @SunMarc in [#44338]
  • Add diffusers to CI docker file (#44480) by @ydshieh in [#44480]
  • Replace placeholder tokens as specified in added_tokens_decoder (#44468) by @itazap in [#44468]
  • [vLLM] Fix backward compatibility with hardcoded subprocessors classes in processors (#44447) by @yonigozlan in [#44447]
  • [remote code/vllm] Fix incorrect tied weights (#44469) by @Cyrilvallez in [#44469]
  • Integrate the Neuron device to TrainingArguments (#44302) by @michaelbenayoun in [#44302]
  • Fix failing DepthProModelIntegrationTest (#44456) by @Sai-Suraj-27 in [#44456]
  • [timesfm2_5] fix loss scaling (#44465) by @kashif in [#44465]
  • Fix failing ProphetNetModelIntegrationTest (#44439) by @Sai-Suraj-27 in [#44439]
  • [Trainer] fix SP loss (#44461) by @kashif in [#44461]
  • skip 1 invalid test case for higgs_audio_v2 (#44350) by @kaixuanliu in [#44350]
  • Fix position_ids typo in Qwen3_5TextModel forward pass (#44399) by @<NOT FOUND> in [#44399]

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ydshieh
    • Don't run tests_hub if no tests found (#45014)
    • Fix failing job Update Transformers metadata after #43514 (#44941)
    • fix: improve processor loading performance by avoiding redundant tokenizer parsing (#44927)
    • fix processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894)
    • Fix core dumped when NemotronH is torch compiled (#44854)
    • Fix repo-check bot (#44812)
    • Fix config loading issues (type issues) (#44789)
    • Remove is_causal from EuroBertConfig (#44774)
    • Fix mlcd auto config/model/mapping issues (#44730)
    • Fix missing / incorrect config class in some model class definitions (#44715)
    • Update Nvidia CI docker file to use torch 2.10 (#44712)
    • Fix more model tester missing parent issue (#44685)
    • Another (small) set of fixes required for tiny model creation (#44666)
    • Fix for VibeVoiceAcousticTokenizer (#44628)
    • Fix more wrong HF hub checkpoint names (#44624)
    • Fix wrong (non-existing) checkpoints (#44549)
    • Fix CircleCI summary report not showing due to missing dependency (#44597)
    • Fix PR comment CI for quantization job (#44579)
    • Revert "test merge queue 1" (#44552)
    • Add a new job in build_pr_documentation.yml (will be the new required job) (#44538)
    • Update build_pr_documentation workflow for merge_group event (#44532)
    • Add diffusers to CI docker file (#44480)
  • @NielsRogge
    • Add VidEoMT (#44285)
    • Fix CookieCutter (#44334)
    • [LW-DETR] Fix training (#44441)
  • @tarekziade
    • refactor: unify QA calls (#44879)
    • refactor: mlinter as its own package (#44939)
    • chore(typing): added rule 11 (#44865)
    • feat: added cache to the model linter (#44790)
    • feat(ci): added a network debug report (#44636)
    • Centralize AI agent templates in .ai (#44489)
    • model-linter: Added rule 10 (#44761)
    • Fix: Eurobert model was missing @strict decorator and invalid test kwargs (#44767)
    • fix: sig lip import (#44764)
    • Expand model-structure lint rules with a fast AST-based, ruff-like framework (#44174)
    • chore(typing): Add type checking to src/transformers/quantizers (#44412)
    • Fix: AQLM quantizer to match updated replace_with_aqlm_linear signature (#44577)
    • Follow-up typing checking fixes (#44500)
    • Update ty to 0.0.20 (#44494)
  • @Sai-Suraj-27
    • Fix failing T5ModelIntegrationTest (#44934)
    • Add Jina-Embeddings-V3 Model (#44251)
    • Fix failing MusicgenStereo integration tests (#44527)
    • Fix failing GPTNeoModelLanguageGenerationTest (#44515)
    • Fix failing MarianIntegrationTests (#44519)
    • Fix failing DepthProModelIntegrationTest (#44456)
    • Fix failing ProphetNetModelIntegrationTest (#44439)
  • @remi-or
    • [CB] [Minor] Simplify test suite (#44858)
    • [CB] Add an option to return logprobs (#44835)
    • [CB] Better parametrization for compile (#44578)
    • [CB] [Bug] Fix crashes when running without cuda (#44673)
    • [CB] Add dedicated config (#44434)
    • [CB] Add paged_attention kernel (#44379)
  • @XingweiDeng
    • [Model] Add UVDoc Model Support (#43385)
    • [Model] Add PP-Chart2Table Model Support (#43767)
    • [Model] Add PP-OCRV5_mobile_det Model Support (#43247)
    • [Model] Add PP-OCRV5_server_det Model Support (#43274)
  • @vasqu
    • [FA4] Add kernels fallback (#44797)
    • [FA] Fix fa detection (#44703)
    • 🚨 [FA4] Initial support (#42435)
    • [Chmv2] Fix conversion after capture refactor (#44665)
    • Add Yoni to run-slow workflow (#44598)
  • @liu-jiaxuan
    • [Model] Add SLANeXt Model Support (#43707)
  • @zhang-prog
    • [Model] Add PP-OCRv5_server_rec and PP-OCRv5_mobile_rec models Support (#44808)
  • @balak4
    • Add GreedyLR adaptive learning rate scheduler (#44271)
  • @kaixuanliu
    • fix bug embedding_size mismatch with hidden_size in electra model test (#44657)
    • Fix bug and add XPU Expectations for qwen2 and jamba tests (#44733)
    • Add XPU Expectations for vibe voice acoustic tokenizer tests (#44428)
    • add XPU Expectations for higgs_audio_v2 tests (#44482)
    • fix model parallelism bug for eurobert model (#44490)
    • Fix failed unit tests for moonshine_streaming model (#43936)
    • skip 1 invalid test case for higgs_audio_v2 (#44350)
  • @juliendenize
    • Add Mistral 4 (#44760)
    • [MistralCommonBackend] Upgrade mistral-common to v1.10.0 (#44656)
  • @molbap
    • Add model lerobot PI0 to transformers (#44160)
    • Remove many output_attentions and other traced outputs on 100+ models (#43590)
  • @JJJYmmm
    • [Bugfix] fix video inference of qwen3vl and qwen3.5 series (#44474)
  • @math-hiyoko
    • Fix: Remove references to text2text-generation, summarization and translation pipeline tasks (#44510)
    • Fix: Remove references to transformers run command (#44513)
Latest
May 20, 2026