This patch contains fixes that are good to have as soon as possible, mostly for tokenizers:

* Fix Kimi-K2.5 tokenizer regression and `_patch_mistral_regex` Attribute… (#45305) by @ArthurZucker
For training:

* Fix #45305 + add regression test GAS (#45349) by @florian6973, @SunMarc
* Fix IndexError with DeepSpeed ZeRO-3 when kernels rotary is active (#…) by @ArthurZucker
And for Qwen2.5-VL:

* Fix Qwen2.5-VL temporal RoPE scaling applied to still images (#45330) by @Kash6, @zucchini-nlp
Small patch release to fix device_map support for Gemma4! It contains the following commit:
Small patch dedicated to optimizing Gemma4: it fixes inference with use_cache=False (caused by k/v state sharing between layers), as well as conversion mappings for some models that would inconsistently serialize their weight names. It contains the following PRs:
This patch is very small and focuses on vLLM and Gemma4!
* Fix export for gemma4 and add Integration tests (#45285) by @Cyrilvallez
* Fix vllm cis (#45139) by @ArthurZucker
Gemma 4 is a multimodal model with pretrained and instruction-tuned variants, available in 1B, 13B, and 27B parameters. The architecture is mostly the same as the previous Gemma versions. The key differences are a vision processor that can output images of a fixed token budget and a spatial 2D RoPE to encode vision-specific information across the height and width axes.
<img width="1478" height="1374" alt="image" src="https://github.com/user-attachments/assets/9d88bd1b-02ea-4829-b7d0-fac0e347d436" />

You can find all the original Gemma 4 checkpoints under the Gemma 4 release.
The key difference from previous Gemma releases is the new design to process images of different sizes using a fixed-budget number of tokens. Unlike many models that squash every image into a fixed square (like 224×224), Gemma 4 keeps the image's natural aspect ratio while resizing it to fit the token budget. There are a couple of constraints to follow:
> [!IMPORTANT]
> Gemma 4 does not apply the standard ImageNet mean/std normalization that many other vision models use. The model's own patch embedding layer handles the final scaling internally (shifting values to the [-1, 1] range).
The number of "soft tokens" (aka vision tokens) an image processor can produce is configurable. The supported options are outlined below and the default is 280 soft tokens per image.
| Soft Tokens | Patches (before pooling) | Approx. Image Area |
|---|---|---|
| 70 | 630 | ~161K pixels |
| 140 | 1,260 | ~323K pixels |
| 280 | 2,520 | ~645K pixels |
| 560 | 5,040 | ~1.3M pixels |
| 1,120 | 10,080 | ~2.6M pixels |
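The numbers in the table follow a simple relationship. As a back-of-the-envelope sketch (assuming 16×16-pixel patches and a 3×3 pooling that turns 9 patches into 1 soft token; these factors are inferred from the table, not stated explicitly in the release notes):

```python
# Sketch of the relationship in the table above. PATCH_SIZE and POOL_FACTOR
# are inferred from the table's ratios, not taken from the model config.
PATCH_SIZE = 16    # pixels per patch side (assumption)
POOL_FACTOR = 9    # 3x3 pooling: 9 patches -> 1 soft token (assumption)

def patches_for_budget(soft_tokens: int) -> int:
    """Number of ViT patches before pooling for a given soft-token budget."""
    return soft_tokens * POOL_FACTOR

def approx_image_area(soft_tokens: int) -> int:
    """Approximate image area in pixels covered by a soft-token budget."""
    return patches_for_budget(soft_tokens) * PATCH_SIZE * PATCH_SIZE

for budget in (70, 140, 280, 560, 1120):
    print(budget, patches_for_budget(budget), approx_image_area(budget))
```

For the default budget of 280 soft tokens, this gives 2,520 patches and roughly 645K pixels, matching the table.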
To encode positional information for each patch in the image, Gemma 4 uses a learned 2D position embedding table. The position table stores up to 10,240 positions per axis, which allows the model to handle very large images. Each position is a learned vector of the same dimensions as the patch embedding. The 2D RoPE that Gemma 4 uses independently rotates half of the attention head dimensions for the x-axis and the other half for the y-axis. This allows the model to understand spatial relationships like "above," "below," "left of," and "right of."
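The x/y split described above can be sketched in a few lines. This is a minimal illustration of the idea, not Gemma 4's actual implementation; the function names and the rotation-pair layout are assumptions:

```python
import numpy as np

def rope_1d(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE on a vector of even length: rotate pairs (x[2i], x[2i+1])
    by position-dependent angles. Layout of pairs is an assumption here."""
    d = x.shape[-1]
    inv_freq = base ** (-np.arange(0, d, 2) / d)
    angles = pos * inv_freq
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

def rope_2d(x: np.ndarray, pos_x: int, pos_y: int) -> np.ndarray:
    """2D RoPE sketch: rotate the first half of the head dims by the patch's
    x position and the second half by its y position."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[:half], pos_x), rope_1d(x[half:], pos_y)])
```

Because each half is a pure rotation, a patch at position (0, 0) is left unchanged and vector norms are preserved, which is what lets attention scores depend only on relative x/y offsets.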
NomicBERT is a BERT-inspired encoder model that applies Rotary Position Embeddings (RoPE) to create reproducible long context text embeddings. It is the first fully reproducible, open-source text embedding model with 8192 context length that outperforms both OpenAI Ada-002 and OpenAI text-embedding-3-small on short-context MTEB and long context LoCo benchmarks. The model generates dense vector embeddings for various tasks including search, clustering, and classification using specific instruction prefixes.
Links: Documentation | Paper
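Dense text-embedding models in this family typically produce one vector per input by pooling the encoder's token-level hidden states over non-padding positions. The sketch below shows only that generic pooling step, with illustrative data; it is not NomicBERT's actual pipeline:

```python
import numpy as np

def mean_pool(hidden: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Mean-pool token embeddings over real (non-padding) tokens.
    hidden: (seq, dim) hidden states; mask: (seq,) with 1 for real tokens."""
    mask = mask[:, None].astype(hidden.dtype)
    return (hidden * mask).sum(axis=0) / mask.sum()

hidden = np.array([[1.0, 2.0], [3.0, 4.0], [9.0, 9.0]])  # last row is padding
mask = np.array([1, 1, 0])
print(mean_pool(hidden, mask))  # -> [2.0, 3.0]
```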
Music Flamingo is a fully open large audio–language model designed for robust understanding and reasoning over music. It builds upon the Audio Flamingo 3 architecture by including Rotary Time Embeddings (RoTE), which injects temporal position information to enable the model to handle audio sequences up to 20 minutes. The model features a unified audio encoder across speech, sound, and music with special sound boundary tokens for improved audio sequence modeling.
Links: Documentation | Paper
Mamba and hybrid model caches are now first-class native citizens in the library, so users working with Mamba-based or hybrid (Mamba + attention) models should update their code to use the new native cache classes instead of any previous workarounds.
Remote code execution support has been removed from the native LightGlue integration, so users who were loading LightGlue with trust_remote_code=True must remove that argument and use the model directly through the standard native API.
* [LightGlue] Remove remote code execution (#45122) by @vasqu

Several vision-related bugs were fixed in this release, including correcting the Gemma vision mask to support video inputs, resolving a dependency issue that incorrectly required torchvision for PIL-based image processors, and patching bugs in the Janus image generation model and image loading. Local code resolution for tokenizers and image processors was also corrected.
* `Image.open` failure (#44645) by @sywangyi in [#44645]

Improved the performance of repository checks (`check-repo`) by introducing file-level and AST-level disk caching, achieving up to a 27x speedup (from ~46s to ~1.6s with a warm cache), and fixed the mlinter cache location in `.gitignore`.
* janus model (#44739) by @kaixuanliu in [#44739]
* [FA] Fix BC support for a few versions + add deprecation cycle (#45061) by @vasqu in [#45061]
* model_type in AutoConfig.from_pretrained (#45058) by @hmellor in [#45058]
* SmolLM3IntegrationTest (#45048) by @Sai-Suraj-27 in [#45048]

The following contributors have made significant changes to the library over the last release:
Video Encoder-only Mask Transformer (VidEoMT) is a lightweight encoder-only model for online video segmentation built on a plain Vision Transformer (ViT). It eliminates the need for dedicated tracking modules by introducing a lightweight query propagation mechanism that carries information across frames and employs a query fusion strategy that combines propagated queries with temporally-agnostic learned queries. VidEoMT achieves competitive accuracy while being 5x-10x faster than existing approaches, running at up to 160 FPS with a ViT-L backbone.
Links: Documentation | Paper
UVDoc is a machine learning model designed for document image rectification and correction. The main purpose of this model is to carry out geometric transformation on images to correct document distortion, inclination, perspective deformation and other problems in document images. It provides both single input and batched inference capabilities for processing distorted document images.
Links: Documentation
Jina-Embeddings-v3 is a multilingual, multi-task text embedding model designed for a variety of NLP applications. Based on the XLM-RoBERTa architecture, this model uses Rotary Position Embeddings (RoPE) in place of absolute position embeddings to support long input sequences of up to 8192 tokens. Additionally, it features 5 built-in task-specific LoRA adapters that allow the model to generate task-specific embeddings (e.g., for retrieval vs. classification) without significantly increasing inference latency.
Links: Documentation | Paper
* Jina-Embeddings-V3 Model (#44251) by @Sai-Suraj-27 in #44251

Mistral 4 is a powerful hybrid model with the capability of acting as both a general instruction model and a reasoning model. It unifies the capabilities of three different model families - Instruct, Reasoning (previously called Magistral), and Devstral - into a single, unified model. The model features a MoE architecture with 128 experts and 4 active, 119B parameters with 6.5B activated per token, 256k context length, and supports multimodal input with both text and image processing capabilities.
Links: Documentation
PI0 is a vision-language-action model for robotics manipulation that jointly processes visual observations and language instructions to generate robot actions. It uses a novel flow matching architecture built on top of a pre-trained vision-language model to inherit Internet-scale semantic knowledge. The model can perform complex dexterous tasks like laundry folding, table cleaning, and assembling boxes across multiple robot platforms including single-arm robots, dual-arm robots, and mobile manipulators.
Links: Documentation | Paper
SLANeXt is a series of dedicated lightweight models for table structure recognition, focusing on accurately recognizing table structures in documents and natural scenes. The SLANeXt series is a new generation of table structure recognition models independently developed by the Baidu PaddlePaddle Vision Team, with dedicated weights trained separately for wired and wireless tables. The recognition ability for all types of tables has been significantly improved, especially for wired tables.
Links: Documentation
PP-OCRv5_mobile_rec is a dedicated lightweight model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.
Links: Documentation
PP-OCRv5_server_rec is a dedicated server-side model for text recognition, focusing specifically on efficient recognition and understanding of text elements in multi-language documents and natural scenes. It is designed to efficiently and accurately support the recognition of Simplified Chinese, Traditional Chinese, English, Japanese, as well as complex text scenarios such as handwriting, vertical text, pinyin, and rare characters with a single model. While maintaining recognition performance, it also balances inference speed and model robustness, providing efficient and accurate technical support for document understanding in various scenarios.
Links: Documentation
PP-OCRv5_mobile_det is a dedicated lightweight model for text detection, focusing specifically on efficient detection and understanding of text elements in multi-language documents and natural scenes. It is part of the latest generation of text detection models developed by the PaddleOCR team that efficiently and accurately supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.
Links: Documentation
PP-LCNet is a family of efficient, lightweight convolutional neural networks designed for real-world document understanding and OCR tasks. It balances accuracy, speed, and model size, making it ideal for both server-side and edge deployment. The model has three main variants optimized for specific tasks: document image orientation classification, table classification, and text line orientation classification.
Links: Documentation
PPLCNetV3 is a lightweight CPU-optimized convolutional backbone designed for efficient image classification and downstream vision tasks. It builds on the PP-LCNet architecture with improved training strategies and structural refinements for better accuracy-latency tradeoffs on CPU hardware.
Links: Documentation | Paper
PP-OCRv5_server_det is a high-performance text detection model optimized for server-side applications, focusing on accurate detection of multi-language text in documents and natural scenes. It supports the detection of text in diverse scenarios—including handwriting, vertical, rotated, and curved text—across multiple languages such as Simplified Chinese, Traditional Chinese, English, and Japanese. The model features robust handling of complex layouts, varying text sizes, and challenging backgrounds, making it suitable for practical applications like document analysis, license plate recognition, and scene text detection.
Links: Documentation
CHMv2 is a global, meter-resolution canopy height mapping model that uses DINOv3 to estimate forest canopy heights from high-resolution optical satellite imagery. Building on the original canopy height maps released in 2024, CHMv2 delivers substantial improvements in accuracy, detail, and global consistency by leveraging Meta's self-supervised vision model. The model is trained against airborne laser scanning data and provides essential information for quantifying forest carbon, monitoring restoration and degradation, and assessing habitat structure.
Links: Documentation | Paper | Blog Post
The dual BaseImageProcessor/BaseImageProcessorFast design has been replaced with a unified backend architecture, and the image_processing_utils_fast module has been removed — users should migrate to the new unified image_processing_utils module.
PreTrainedConfig and model config classes have been refactored to use @dataclass and no longer accept positional arguments — users must update any config instantiation calls to use keyword arguments only.
Flash Attention 2 (FA2) support now requires version 2.3.3 or newer, and initial Flash Attention 4 (FA4) support has been added — users on older FA2 versions must upgrade to at least 2.3.3.
* [FA4] Initial support (#42435) by @vasqu

Weight tying behavior has changed so that weights are now tied even when both keys are already present in a checkpoint — users relying on the previous behavior (e.g., with .bin checkpoints containing duplicate keys) should verify their models load as expected.
The cache_position argument has been removed from the forward signatures of most major models — users passing cache_position directly to these models should remove it, as it is now handled internally by generate.
Several bug fixes and improvements were made to pipeline parallel (PP) and tensor parallel (TP) support, including fixing supports_tp/pp_plan detection, resolving attribute errors in PP for Qwen2VL-based models, correcting FSDP loading with meta devices, and ensuring TP weight sharding properly updates parent module attributes (e.g., in_features/out_features) to improve compatibility with libraries like PEFT.
* supports_{tp/pp}_plan (#44696) by @hmellor in [#44696]
* torch.distributed.fsdp in trainer_seq2seq.py (#44507) by @0xDELUXA in [#44507]

Quantization support was improved with up to 30x faster FP8 grouped and batched matmuls, static FP8 expert support for multi-GPU setups, and a torchao minimum version bump to 0.15.0. Additionally, MXFP4 dependency error messages were made more actionable, and AWQ tests were updated to align with the GPTQModel migration.
Several performance improvements were made to tokenizer loading and saving, including eliminating redundant file parsing and unnecessary deep copies of large vocabularies that caused significant overhead. Additionally, bug fixes were applied for incorrect tokenizer class names on the Hub (DeepSeek V2/V3, ModernBERT), a clean_up_tokenization_spaces misconfiguration in Llama 3 tokenizer conversion, and a string replacement issue in AutoTokenizer class name resolution.
* processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894) by @ydshieh in [#44894]
* clean_up_tokenization_spaces=False in Llama 3 tokenizer conversion (#44914) by @maxsloef-goodfire in [#44914]

Kernel support has been expanded with Flash Attention 4 fallback integration, a paged_attention kernel for continuous batching, and Neuron device support for custom kernels. Several stability fixes were also made, including bumping the kernels version dependency to prevent crashes and correcting the LFM2 kernel path.
* [FA4] Add kernels fallback (#44797) by @vasqu in [#44797]

Several cache-related fixes and improvements were made, including aligning LFM2's cache implementation with other Mamba caches, fixing a tensor indexing crash in KV cache continuation for the transformers serve streaming endpoint, and resolving a generation bug in Idefics3 when using use_cache=False. A caching layer was also added to the model linter to skip unchanged valid files and improve build performance.
Fixed backward compatibility for full-path imports of Fast Image Processors and resolved a Llama4 vision rotary embedding initialization error where freqs_ci was not registered as a buffer, causing failures when loading models with device_map="auto".
The cache_position argument has been fully removed from the generation pipeline, as all models have been updated to no longer use it (with a backward-compatibility path retained for remote code models). Additionally, integration tests for LASR with chunked decoding were added, and outdated references to deprecated pipeline tasks were cleaned up.
* cache_position anymore in generation (#44816) by @Cyrilvallez in [#44816]
* text2text-generation, summarization and translation pipeline tasks (#44510) by @math-hiyoko in [#44510]
* tests_hub if no tests found (#45014) by @ydshieh in [#45014]
* attention_chunk_size in Llama4TextConfig (#45002) by @hmellor in [#45002]
* maybe_autocast crashing on meta device tensors (#44984) by @Butanium in [#44984]
* mm_token_type be non-padded lists (#44563) by @zucchini-nlp in [#44563]
* Qwen2VL (#44976) by @hmellor in [#44976]
* check_auto_docstrings (#44803) by @yonigozlan in [#44803]
* [vllm x v5] nit (#44971) by @ArthurZucker in [#44971]
* T5ModelIntegrationTest (#44934) by @Sai-Suraj-27 in [#44934]
* Update Transformers metadata after #43514 (#44941) by @ydshieh in [#44941]
* from_pretrained (url input deprecated) (#44946) by @BSchilperoort in [#44946]
* image_processing_utils_fast (#44897) by @yonigozlan in [#44897]
* NemotronH is torch compiled (#44854) by @ydshieh in [#44854]
* SizeDict (#44884) by @hmellor in [#44884]
* layer_types type hint for AFMoE and Llama4 (#44874) by @hmellor in [#44874]
* PreTrainedModel (#44672) by @neo in [#44672]
* KeyError when patching mistral regex (#43376) by @LeonardoEmili in [#43376]
* position_ids keys when loading OwlViT models (#44508) by @KartikPawade in [#44508]
* .ai (#44489) by @tarekziade in [#44489]
* @strict (#44770) by @zucchini-nlp in [#44770]
* is_causal from EuroBertConfig (#44774) by @ydshieh in [#44774]
* mlcd auto config/model/mapping issues (#44730) by @ydshieh in [#44730]
* config class in some model class definitions (#44715) by @ydshieh in [#44715]
* [FA] Fix fa detection (#44703) by @vasqu in [#44703]
* set_encoder (#44698) by @hmellor in [#44698]
* parent issue (#44685) by @ydshieh in [#44685]
* ParallelInterface (#44640) by @michaelbenayoun in [#44640]
* [Chmv2] Fix conversion after capture refactor (#44665) by @vasqu in [#44665]
* dtype for subconfig when _from_config (#44629) by @zucchini-nlp in [#44629]
* cache_position in more models (2) (#44602) by @Cyrilvallez in [#44602]
* VibeVoiceAcousticTokenizer (#44628) by @ydshieh in [#44628]
* cache_position in more models (#44330) by @Cyrilvallez in [#44330]
* src/transformers/quantizers (#44412) by @tarekziade in [#44412]
* [fix] Prevent crash with Apertus without xielu installed (#44567) by @tomaarsen in [#44567]
* MusicgenStereo integration tests (#44527) by @Sai-Suraj-27 in [#44527]
* higgs_audio_v2 tests (#44482) by @kaixuanliu in [#44482]
* _prepare_input_fn and _prepare_output_fn instance methods (#44499) by @michaelbenayoun in [#44499]
* mps device (#44506) by @michaelbenayoun in [#44506]
* GPTNeoModelLanguageGenerationTest (#44515) by @Sai-Suraj-27 in [#44515]
* MarianIntegrationTests (#44519) by @Sai-Suraj-27 in [#44519]
* build_pr_documentation.yml (will be the new required job) (#44538) by @ydshieh in [#44538]
* build_pr_documentation workflow for merge_group event (#44532) by @ydshieh in [#44532]
* ty to 0.0.20 (#44494) by @tarekziade in [#44494]
* diffusers to CI docker file (#44480) by @ydshieh in [#44480]
* DepthProModelIntegrationTest (#44456) by @Sai-Suraj-27 in [#44456]
* ProphetNetModelIntegrationTest (#44439) by @Sai-Suraj-27 in [#44439]

The following contributors have made significant changes to the library over the last release:
* tests_hub if no tests found (#45014)
* Update Transformers metadata after #43514 (#44941)
* processing_utils.py: avoid deepcopying tokenizer in ProcessorMixin to improve performance (#44894)
* NemotronH is torch compiled (#44854)
* is_causal from EuroBertConfig (#44774)
* mlcd auto config/model/mapping issues (#44730)
* config class in some model class definitions (#44715)
* parent issue (#44685)
* VibeVoiceAcousticTokenizer (#44628)
* build_pr_documentation.yml (will be the new required job) (#44538)
* build_pr_documentation workflow for merge_group event (#44532)
* diffusers to CI docker file (#44480)
* .ai (#44489)
* src/transformers/quantizers (#44412)
* ty to 0.0.20 (#44494)
* T5ModelIntegrationTest (#44934)
* Jina-Embeddings-V3 Model (#44251)
* MusicgenStereo integration tests (#44527)
* GPTNeoModelLanguageGenerationTest (#44515)
* MarianIntegrationTests (#44519)
* DepthProModelIntegrationTest (#44456)
* ProphetNetModelIntegrationTest (#44439)
* [FA4] Add kernels fallback (#44797)
* [FA] Fix fa detection (#44703)
* [FA4] Initial support (#42435)
* [Chmv2] Fix conversion after capture refactor (#44665)
* higgs_audio_v2 tests (#44482)
* text2text-generation, summarization and translation pipeline tasks (#44510)

EuroBERT is a multilingual encoder model based on a refreshed transformer architecture, akin to Llama but with bidirectional attention. It supports a mixture of European and widely spoken languages, with sequences of up to 8192 tokens.
Links: Documentation | Paper | Blog Post
VibeVoice ASR is an automatic speech recognition model from Microsoft that combines acoustic and semantic audio tokenizers with a causal language model for robust speech-to-text transcription. The model uses VibeVoice's acoustic and semantic tokenizers that process audio at 24kHz, paired with a Qwen2-based language decoder for generating transcriptions. It can process up to 60 minutes of continuous audio input, supports customized hotwords, performs joint ASR/diarization/timestamping, and handles over 50 languages with code-switching support.
Links: Documentation | Paper
TimesFM 2.5 is a pretrained time-series foundation model that uses a decoder-only attention architecture with input patching for forecasting. The model is designed to provide accurate zero-shot forecasts across different domains, forecasting horizons and temporal granularities without requiring dataset-specific training. It builds on the original TimesFM architecture with enhancements including rotary attention, QK normalization, per-dimension attention scaling, and continuous quantile prediction.
Links: Documentation | Paper
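The "input patching" idea mentioned above can be sketched simply: a 1-D series is split into fixed-length patches that become the decoder's input tokens. The patch length of 32 here is illustrative only, not TimesFM 2.5's actual value:

```python
import numpy as np

def patchify(series: np.ndarray, patch_len: int) -> np.ndarray:
    """Split a (T,) series into (T // patch_len, patch_len) patches,
    dropping any trailing remainder."""
    n = (len(series) // patch_len) * patch_len
    return series[:n].reshape(-1, patch_len)

series = np.arange(100.0)      # a toy series of 100 steps
patches = patchify(series, 32)
print(patches.shape)           # -> (3, 32): 96 steps kept, 4 dropped
```

Each row of `patches` then plays the role of one "token" for the decoder-only attention stack.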
PP-DocLayoutV2 is a dedicated lightweight model for layout analysis, focusing specifically on element detection, classification, and reading order prediction. The model is composed of two sequentially connected networks: an RT-DETR-based detection model that performs layout element detection and classification, followed by a pointer network that orders these layout elements. It is designed to analyze document layouts by identifying and organizing various layout components in their proper reading sequence.
Links: Documentation
OLMo Hybrid is a hybrid architecture model from Ai2 that combines standard transformer attention layers with linear attention layers using the Gated Deltanet. This hybrid approach aims to improve efficiency while maintaining model quality by interleaving full attention layers with linear attention layers. The model uses a custom cache system that handles both KV cache for attention layers and recurrent state for linear attention layers.
Links: Documentation
ModernVBert is a Vision-Language encoder that combines ModernBert with a SigLIP vision encoder. It is optimized for visual document understanding and retrieval tasks, making it suitable for processing documents that contain both text and visual elements.
Links: Documentation | Paper
ColModernVBert is a model for efficient visual document retrieval that leverages ModernVBert to construct multi-vector embeddings directly from document images, following the ColPali approach. The model enables retrieval and scoring of visual documents by processing both text queries and document images to generate embeddings that can be compared for relevance scoring.
Links: Documentation | Paper
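The ColPali-style relevance scoring mentioned above is commonly implemented as late interaction ("MaxSim"): each query token embedding is matched to its most similar document embedding, and those maxima are summed. A minimal sketch with toy vectors (not ColModernVBert's actual code):

```python
import numpy as np

def maxsim_score(query: np.ndarray, doc: np.ndarray) -> float:
    """Late-interaction (MaxSim) score.
    query: (q, dim) and doc: (d, dim) L2-normalized token embeddings."""
    sims = query @ doc.T              # (q, d) cosine similarities
    return float(sims.max(axis=1).sum())  # best doc match per query token, summed

q = np.eye(2)                                       # two orthonormal query vectors
d = np.array([[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]])  # three doc vectors
print(maxsim_score(q, d))  # -> 2.0
```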
Higgs Audio V2 is a powerful audio foundation model developed by Boson AI that was pretrained on over 10 million hours of audio data and diverse text data. Despite having no post-training or fine-tuning, the model excels in expressive audio generation thanks to its deep language and acoustic understanding. The model supports various audio generation tasks including single-speaker and multi-speaker smart voice, zero-shot voice cloning, and multi-speaker voice cloning.
Links: Documentation
The Higgs Audio V2 Tokenizer is an audio tokenization model that operates at a low frame rate of 25 fps while maintaining high audio quality, effectively halving the frame rate of many baseline models. It uses unified 24 kHz training that mixes speech, music, and sound-event clips in one model to capture both semantic and acoustic details, facilitating the training of audio language models. The model enables fast inference by avoiding diffusion steps, with an encoder/decoder architecture that processes batches quickly for real-time or large-scale tasks.
Links: Documentation
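Quick arithmetic from the numbers above: at 24 kHz input and 25 frames per second, each tokenizer frame compresses 960 audio samples.

```python
# Derived directly from the figures quoted above (24 kHz audio, 25 fps frames).
SAMPLE_RATE = 24_000   # Hz
FRAME_RATE = 25        # tokenizer frames per second

samples_per_frame = SAMPLE_RATE // FRAME_RATE
print(samples_per_frame)   # -> 960 audio samples per frame

# A 10-second clip therefore becomes 250 frames.
print(10 * FRAME_RATE)     # -> 250
```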
Tensor parallelism (TP) support for dense and MoE decoder-only models has been fixed and stabilized, requiring users to update their TP configurations and conversion mappings accordingly.
The Ernie4.5 VL MoE model class and configuration names have been renamed to align with vLLM/SGLang conventions, requiring users to update any references to the old model names in their code.
* [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299) by @vasqu

Several pipeline tasks have been removed or updated in the V5 cleanup (including question-answering, visual-question-answering, and image-to-image), requiring users to migrate to the replacement pipelines or updated task names.
3D position IDs for vision-language models have been unified under a common interface (sourced from qwen2-vl), requiring users of affected VLMs (e.g., Ernie, GLM4V) to update their processors and any code that manually constructs position IDs.
Unigram tokenizers were missing the spm precompiled charsmap support. We ran an overall v4 vs v5 regression test and fixed what we had missed.
This was done in:
Generation input preparation was significantly refactored to stop relying on cache_position and instead pass pre-sliced input_ids/inputs_embeds directly to prepare_inputs_for_generation, simplifying the generation loop and laying groundwork for broader cache_position removal. Several bug fixes were also applied, including correct sampling for HiggsAudioV2, flaky cache-equality test stabilization for Idefics, and restored generation integration tests.
* prepare_inputs_for_generation (#44226) by @Cyrilvallez in [#44226]
* cache_position to prepare inputs (#44130) by @Cyrilvallez in [#44130]

Several tokenization bugs were fixed in this release, including resolving an AttributeError in MLukeTokenizer caused by the v5 rename of additional_special_tokens, correcting the Fuyu tokenizer class mapping, fixing LayoutXLM tokenization test failures from the slow tokenizer removal refactor, and adding olmo_hybrid to the auto-tokenizer mapping. The tokenizer documentation was also updated to reflect the new unified v5 backend architecture and reorganized for clarity.
Fixed several kernel-related issues including a security vulnerability, corrected Mamba kernel loading to handle incompatible import structures, ensured Liger Kernel is properly enabled during hyperparameter search, and expanded Flash Attention to support multiple compatible implementations.
* [Mamba] Fix kernel loading (#44176) by @vasqu in [#44176]
* [Flash Attn] Enable compatible implementations (#44177) by @vasqu in [#44177]

This release adds several new quantization backends and fixes, including MLX quantization support for MPS devices, Four Over Six (4/6) NVFP4 quantization integration for NVIDIA Blackwell GPUs, and CPU support for MXFP4 models, alongside a bug fix for MXFP4 model saving using reverse_op.
Fixed backward compatibility for image processors loaded from older remote code that lack valid_kwargs definitions, and resolved test failures in AMD ROCm CI by adding the missing timm dependency to the Docker image.
* from_dict backward compatibility with old remote code (#44245) by @yonigozlan in [#44245]
* speaking_rate as an optionl forward argument (#43283) by @gau-nernst in [#43283]
* ProcessingKwargs ImagesKwargs etc. to docs (#44269) by @yonigozlan in [#44269]
* has_similar_generate_outputs assertions (#44166) by @tarekziade in [#44166]
* TokenizersBackend for Olmo3 to preserve custom pre_tokenizer (#44294) by @mario-sanz in [#44294]
* [Modular] Fix file type regression (#44283) by @vasqu in [#44283]
* Trainer class docs (compute_loss & hyperparameter_search) (#44268) by @ethanknights in [#44268]
* [fix] Set input_modalities on various architectures that aren't just text (#44078) by @tomaarsen in [#44078]
* VersionComparison.from_string return type mismatch (#43709) by @tarekziade in [#43709]
* AnyToAnyPipeline.__call__ docstring (#44229) by @alvarobartt in [#44229]
* test_generate_with_and_without_position_ids in GLM ORC (#44173) by @tarekziade in [#44173]
* Seq2SeqTrainingArguments documentation (#35258) by @qgallouedec in [#35258]
* __setitem__ on ModelOutput even if the parameter was previously None (#44080) by @tomaarsen in [#44080]
* [simple] Fix up __repr__ whitespace/brackets (#44048) by @tomaarsen in [#44048]
* [chore] Fix incorrect forward type hint for Gemma3n (#44051) by @tomaarsen in [#44051]
* get_audio_features (#44040) by @zucchini-nlp in [#44040]
* Kosmos2ModelTest test (#44061) by @tarekziade in [#44061]
* grouped_mm fallback (#44043) by @IlyasMoutawwakil in [#44043]

The following contributors have made significant changes to the library over the last release:
* has_similar_generate_outputs assertions (#44166)
* VersionComparison.from_string return type mismatch (#43709)
* test_generate_with_and_without_position_ids in GLM ORC (#44173)
* Kosmos2ModelTest test (#44061)
* [Ernie 4.5 VL Moe] Fix up namings to vllm/sglang convention (#44299)
* [Modular] Fix file type regression (#44283)
* [Mamba] Fix kernel loading (#44176)
* [Flash Attn] Enable compatible implementations (#44177)

VoxtralRealtime is a streaming speech-to-text model from Mistral AI, designed for real-time automatic speech recognition (ASR). Unlike the offline Voxtral model which processes complete audio files, VoxtralRealtime is architected for low-latency, incremental transcription by processing audio in chunks as they arrive.
The model combines an audio encoder with a Mistral-based language model decoder, using time conditioning embeddings and causal convolutions with padding caches to enable efficient streaming inference.
The zAI team launches GLM-5 and introduces it as follows:
GLM-5, targeting complex systems engineering and long-horizon agentic tasks. Scaling is still one of the most important ways to improve the intelligence efficiency of Artificial General Intelligence (AGI). Compared to GLM-4.5, GLM-5 scales from 355B parameters (32B active) to 744B parameters (40B active), and increases pre-training data from 23T to 28.5T tokens. GLM-5 also integrates DeepSeek Sparse Attention (DSA), largely reducing deployment cost while preserving long-context capacity.
Reinforcement learning aims to bridge the gap between competence and excellence in pre-trained models. However, deploying it at scale for LLMs is a challenge due to the RL training inefficiency. To this end, we developed slime, a novel asynchronous RL infrastructure that substantially improves training throughput and efficiency, enabling more fine-grained post-training iterations. With advances in both pre-training and post-training, GLM-5 delivers significant improvement compared to GLM-4.7 across a wide range of academic benchmarks and achieves best-in-class performance among all open-source models in the world on reasoning, coding, and agentic tasks, closing the gap with frontier models.
The Qwen team launches Qwen 3.5, and introduces it as such:
We are delighted to announce the official release of Qwen3.5, introducing the open-weight of the first model in the Qwen3.5 series, namely Qwen3.5-397B-A17B. As a native vision-language model, Qwen3.5-397B-A17B demonstrates outstanding results across a full range of benchmark evaluations, including reasoning, coding, agent capabilities, and multimodal understanding, empowering developers and enterprises to achieve significantly greater productivity. Built on an innovative hybrid architecture that fuses linear attention (via Gated Delta Networks) with a sparse mixture-of-experts, the model attains remarkable inference efficiency: although it comprises 397 billion total parameters, just 17 billion are activated per forward pass, optimizing both speed and cost without sacrificing capability. We have also expanded our language and dialect support from 119 to 201, providing broader accessibility and enhanced support to users around the world.
VibeVoice is a novel framework for synthesizing high-fidelity, long-form speech with multiple speakers by employing a next-token diffusion approach within a Large Language Model (LLM) structure. It's designed to capture the authentic conversational "vibe" and is particularly suited for generating audio content like podcasts and multi-participant audiobooks.
One key feature of VibeVoice is the use of two continuous audio tokenizers, one for extracting acoustic features and another for semantic features.
[Attn] New attn mask interface everywhere (#42848) :rotating_light: This one is quite breaking for super super super old models! :rotating_light: :rotating_light:
- `convert_rope_params_to_dict` so it uses `rope_theta` from the config (#43766) by @hmellor
- `AGENTS.md` (#43763) by @tarekziade
- [Modular Dependencies] Fixup qwen rms norms (#43772) by @vasqu
- [Repo Consistency] Fix rms norm (#43803) by @vasqu
- `check_model_inputs` implementation (#43765) by @Cyrilvallez
- `do_sample=False` to qwen2_5_vl model tests to stabilize the output (#43728) by @kaixuanliu
- [Jamba] Fallback to slow path and warn instead of error out (#43889) by @vasqu
- [fix] Use `last_hidden_state` key from `get_image_features` for llama4 (#43882) by @tomaarsen
- `check_model_inputs` into `capture_outputs` and `merge_with_config_defaults` + ensure correctness (#43862) by @Cyrilvallez
- `_keys_to_ignore_on_load_missing` for now (#43893) by @ArthurZucker
- `input_embeds` to `inputs_embeds` everywhere (#43916) by @Cyrilvallez
- `image_url` content support in `apply_chat_template` (#43786) by @kaixuanliu
- `generate` (#43734) by @zucchini-nlp
- `run_*_no-trainer.py` examples (#42769) by @casinca
- `run_*_no-trainer.py` examples (#43947) by @casinca
- `out_features` (#43886) by @zucchini-nlp
- `get_number_of_image_tokens` (#43948) by @zucchini-nlp
- `other_workflow_run_ids` for `issue_comment` in `utils/notification_service.py` (#44036) by @ydshieh

The following contributors have made significant changes to the library over the last release:
- [Jamba] Fallback to slow path and warn instead of error out (#43889)
- [Attn] New attn mask interface everywhere (#42848)
- [Repo Consistency] Fix rms norm (#43803)
- [Modular Dependencies] Fixup qwen rms norms (#43772)

K-EXAONE is a large-scale multilingual language model developed by LG AI Research. Built using a Mixture-of-Experts architecture, K-EXAONE features 236 billion total parameters, with 23 billion active during inference. Performance evaluations across various benchmarks demonstrate that K-EXAONE excels in reasoning, agentic capabilities, general knowledge, multilingual understanding, and long-context processing.
PP-DocLayoutV3 is a unified and high-efficiency model designed for comprehensive layout analysis. It addresses the challenges of complex physical distortions—such as skewing, curving, and adverse lighting—by integrating instance segmentation and reading order prediction into a single, end-to-end framework.
Youtu-LLM is a new, small, yet powerful LLM: it contains only 1.96B parameters, supports 128k long context, and has native agentic talents. On general evaluations, Youtu-LLM significantly outperforms SOTA LLMs of similar size in Commonsense, STEM, Coding, and Long Context capabilities; in agent-related testing, Youtu-LLM surpasses larger-sized leaders and is truly capable of completing multiple end-to-end agent tasks.
GLM-OCR is a multimodal OCR model for complex document understanding, built on the GLM-V encoder–decoder architecture. It introduces Multi-Token Prediction (MTP) loss and stable full-task reinforcement learning to improve training efficiency, recognition accuracy, and generalization. The model integrates the CogViT visual encoder pre-trained on large-scale image–text data, a lightweight cross-modal connector with efficient token downsampling, and a GLM-0.5B language decoder. Combined with a two-stage pipeline of layout analysis and parallel recognition based on PP-DocLayout-V3, GLM-OCR delivers robust and high-quality OCR performance across diverse document layouts.
🚨 T5Gemma2 model structure (#43633) - Makes sure that the attn implementation is set on all sub-configs. The config.encoder.text_config was not getting its attn implementation set because we weren't passing it to PreTrainedModel.__init__. We can't change the model structure without breaking, so a call to self.adjust_attn_implementation was manually re-added in the modeling code.
🚨 Generation cache preparation (#43679) - Refactors cache initialization in generation to ensure sliding window configurations are now properly respected. Previously, some models (like Afmoe) created caches without passing the model config, causing sliding window limits to be ignored. This is breaking because models with sliding window attention will now enforce their window size limits during generation, which may change generation behavior or require adjusting sequence lengths in existing code.
🚨 Delete duplicate code in backbone utils (#43323) - This PR cleans up backbone utilities. Specifically, we currently have 5 different config attributes to decide which backbone to load, most of which can be merged into one and seem redundant. After this PR, we'll have only one config.backbone_config as a single source of truth. The models will load the backbone from_config and load pretrained weights only if the checkpoint has any weights saved. The overall idea is the same as in other composite models. A few config arguments are removed as a result.
🚨 Refactor DETR to updated standards (#41549) - standardizes the DETR model to be closer to other vision models in the library.
🚨Fix floating-point precision in JanusImageProcessor resize (#43187) - replaces an int() with round(), expect light numerical differences
🚨 Remove deprecated AnnotionFormat (#42983) - removes a misnamed class in favour of AnnotationFormat.
- [feat] Allow loading T5Gemma2Encoder with AutoModel (#43559) by @tomaarsen
- `image_sizes` input param (#43678) by @kaixuanliu
- [Attn] Fixup interface usage after refactor (#43706) by @vasqu
- `num_frames` in ASR pipeline (#43546) by @jiqing-feng
- `PreTrainedTokenizerBase` (#43675) by @tarekziade
- `FP8Expert` for DeepSeek R1 (#43616) by @yiliu30
- [HunYuan] Fix RoPE init (#43411) by @vasqu
- [Sam] Fixup training flags (#43567) by @vasqu
- `process_bad_commit_report.py`: avoid items to appear in null author in the report (#43662) by @ydshieh
- `KeyError` in `check_bad_commit.py` (#43655) by @ydshieh
- `tied_weight_keys` in-place (#43619) by @zucchini-nlp
- [Rope] Revert #43410 and make inheritance implicit again (#43620) by @vasqu
- `make_batched_video` with 5D arrays (#43486) by @zucchini-nlp
- `utils/fetch_hub_objects_for_ci.py`: avoid too many requests and/or timeout (#43584) by @ydshieh
- `MistralConverter.extract_vocab_merges_from_model` (#43557) by @tarekziade
- `templates` folder (#43536) by @Cyrilvallez
- [Modular] Allow to add new bases that are not present in the inherited class (#43556) by @vasqu
- `pad_token_id` (#43453) by @Sai-Suraj-27
- [RoPE] Make explicit inheritance (#43410) by @vasqu
- `ShieldGemma2IntegrationTest::test_model` (#43343) by @sywangyi
- `SamHQModelIntegrationTest::test_inference_mask_generation_batched_points_batched_images` for XPU (#43511) by @sywangyi
- `super()` (#43280) by @zucchini-nlp
- `pytest-random-order` for reproducible test randomization (#43483) by @tarekziade
- `markuplm` & `perception_lm` integration tests (#43464) by @Sai-Suraj-27

The following contributors have made significant changes to the library over the last release:
- `PreTrainedTokenizerBase` (#43675)
- `MistralConverter.extract_vocab_merges_from_model` (#43557)
- `pytest-random-order` for reproducible test randomization (#43483)
- [Attn] Fixup interface usage after refactor (#43706)
- [HunYuan] Fix RoPE init (#43411)
- [Sam] Fixup training flags (#43567)
- [Rope] Revert #43410 and make inheritance implicit again (#43620)
- [Modular] Allow to add new bases that are not present in the inherited class (#43556)
- [RoPE] Make explicit inheritance (#43410)
- `process_bad_commit_report.py`: avoid items to appear in null author in the report (#43662)
- `KeyError` in `check_bad_commit.py` (#43655)
- `utils/fetch_hub_objects_for_ci.py`: avoid too many requests and/or timeout (#43584)

We have a migration guide, continuously updated on the main branch; please check it out in case you're facing issues: migration guide.
We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and the release is significant: 1200 commits have been pushed to main since the latest minor release. This release removes a lot of long-due deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.
We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.
This release is the full V5 release. It sets in motion something bigger: going forward, starting with v5, we'll now release minor releases every week, rather than every 5 weeks. Expect v5.1 to follow next week, then v5.2 the week that follows, etc.
We're moving forward with this change to ensure you have access to models as soon as they're supported in the library, rather than a few weeks after.
In order to install this release, please do so with the following:
```
pip install transformers
```
For us to deliver the best package possible, it is imperative that we have feedback on how the toolkit is currently working for you. Please try it out, and open an issue in case you're facing something inconsistent/a bug.
Transformers version 5 is a community endeavor, and we couldn't have shipped such a massive release without the help of the entire community.
We introduce a new weight loading API in transformers, which significantly improves on the previous API. This
weight loading API is designed to apply operations to the checkpoints loaded by transformers.
Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge, and split the layers according to how they're defined in this new API. These operations are often a necessity when working with quantization or parallelism algorithms.
This new API is centered around the new WeightConverter class:
```python
class WeightConverter(WeightTransform):
    operations: list[ConversionOps]
    source_keys: Union[str, list[str]]
    target_keys: Union[str, list[str]]
```
The weight converter is designed to apply a list of operations on the source keys, resulting in target keys. A common operation done on the attention layers is to fuse the query, key, and value layers. Doing so with this API would amount to defining the following conversion:
```python
conversion = WeightConverter(
    ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"],  # The input layers
    "self_attn.qkv_proj",  # The single layer as output
    operations=[Concatenate(dim=0)],
)
```
In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single
layer.
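To make the operation concrete, here is a tiny pure-Python sketch of a dim-0 concatenation over toy weight matrices. Plain nested lists stand in for tensors, and `concatenate_dim0` is a hypothetical helper; this is an illustration of the idea, not the transformers implementation:

```python
def concatenate_dim0(weights):
    """Stack 2D weight matrices along the first (row) dimension."""
    fused = []
    for w in weights:
        fused.extend(w)
    return fused

# Toy per-projection weights: 2 rows x 3 cols each.
state_dict = {
    "self_attn.q_proj": [[1, 1, 1], [1, 1, 1]],
    "self_attn.k_proj": [[2, 2, 2], [2, 2, 2]],
    "self_attn.v_proj": [[3, 3, 3], [3, 3, 3]],
}

# Apply the conversion: three source keys become one target key.
sources = ["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"]
fused = concatenate_dim0([state_dict.pop(key) for key in sources])
state_dict["self_attn.qkv_proj"] = fused

print(len(fused))  # 6 rows: 2 from each projection
```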
This allows us to define a mapping from architecture to a list of weight conversions. Applying those weight conversions
can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method
and helped us remove a lot of technical debt that we accumulated over the past few years.
This results in several improvements:
Linked PR: https://github.com/huggingface/transformers/pull/41580
Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object to be a lot more intuitive. With v5, tokenizer definition is much simpler; one can now initialize an empty LlamaTokenizer and train it directly on your corpus.
Defining a new tokenizer object should be as simple as this:
```python
from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE


class Llama5Tokenizer(TokenizersBackend):
    def __init__(self, unk_token="<unk>", bos_token="<s>", eos_token="</s>", vocab=None, merges=None):
        if vocab is None:
            self._vocab = {
                str(unk_token): 0,
                str(bos_token): 1,
                str(eos_token): 2,
            }
        else:
            self._vocab = vocab
        self._merges = merges
        self._tokenizer = Tokenizer(
            BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
        )
        self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
            replacement="▁", prepend_scheme=_get_prepend_scheme(self.add_prefix_space, self), split=False
        )
        super().__init__(
            tokenizer_object=self._tokenizer,
            unk_token=unk_token,
            bos_token=bos_token,
            eos_token=eos_token,
        )
```
Once the tokenizer is defined as above, you can load it with the following: Llama5Tokenizer(). Doing this returns you an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet :wink:).
The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.
Up to now, transformers maintained two parallel implementations for many tokenizers:
- Slow tokenizers (`tokenization_<model>.py`) - Python-based implementations, often using SentencePiece as the backend.
- Fast tokenizers (`tokenization_<model>_fast.py`) - Rust-based implementations using the 🤗 tokenizers library.

In v5, we consolidate to a single tokenizer file per model: `tokenization_<model>.py`. This file will use the most appropriate backend available:
- the `sentencepiece` library, for SentencePiece-based tokenizers; this backend inherits from `PythonBackend`.
- 🤗 `tokenizers`, which notably allows adding tokens.
- MistralCommon's tokenization library (previously known as the `MistralCommonTokenizer`).

The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use `AutoTokenizer.from_pretrained()` as before. This allows transformers to be future-proof and modular, to easily support future backends.
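As a rough mental model, the backend selection can be sketched as below. The file checks, the `pick_backend` helper, and the `SentencePieceBackend`/`MistralCommonBackend` class names are illustrative assumptions; the actual AutoTokenizer logic also weighs config hints and other signals:

```python
def pick_backend(repo_files, installed):
    """Pick a tokenizer backend from the files present in a checkpoint repo.

    Simplified sketch: real selection in transformers is more involved.
    """
    if "tokenizer.json" in repo_files and "tokenizers" in installed:
        return "TokenizersBackend"
    if any(f.endswith(".model") for f in repo_files) and "sentencepiece" in installed:
        return "SentencePieceBackend"  # a SentencePiece vocab file is present
    if "tekken.json" in repo_files and "mistral-common" in installed:
        return "MistralCommonBackend"
    return "PythonBackend"  # pure-Python fallback

print(pick_backend({"tokenizer.json"}, {"tokenizers"}))  # TokenizersBackend
```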
We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher-level, without relying on those backends.
To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.
If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:
- `encode`
- `decode`
- `vocab_size`
- `get_vocab`
- `convert_tokens_to_ids`
- `convert_ids_to_tokens`
- `from_pretrained`
- `save_pretrained`

Starting with v5, we now enable initializing blank, untrained `tokenizers`-backed tokenizers:
```python
from transformers import LlamaTokenizer

tokenizer = LlamaTokenizer()
```
This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.
These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:
```python
from transformers import LlamaTokenizer

vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]
tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)
```
This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab; therefore enabling the comparison of different pre-tokenizers, normalizers, etc.
⚠️ The vocab_file (as in, a path towards a file containing the vocabulary) cannot be used to initialize the LlamaTokenizer as loading from files is reserved to the from_pretrained method.
The batch_decode and decode methods have been unified to reflect behavior of the encode method. Both single and batch decoding now use the same decode method. See an example of the new behavior below:
```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = ["hey how are you?", "fine"]
tokenizer.decode(tokenizer.encode(inputs))
```
Gives:

```diff
- 'hey how are you?</s> fine</s>'
+ ['hey how are you?</s>', 'fine</s>']
```
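Under the hood, the unified behavior amounts to dispatching on the nesting of the input. Here is a minimal pure-Python sketch of that dispatch (an illustration, not the actual transformers implementation):

```python
def decode(ids, id_to_token):
    """Decode a single sequence, or map over a batch of sequences."""
    if ids and isinstance(ids[0], list):  # batch: list[list[int]]
        return [decode(seq, id_to_token) for seq in ids]
    return "".join(id_to_token[i] for i in ids)  # single sequence: list[int]

vocab = {0: "hey", 1: "fine", 2: "</s>"}
print(decode([0, 2], vocab))            # hey</s>
print(decode([[0, 2], [1, 2]], vocab))  # ['hey</s>', 'fine</s>']
```

The same shape-based dispatch is why a `list[list[int]]` coming out of `generate` now decodes directly into a list of strings.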
We expect `encode` and `decode` to behave as two sides of the same coin: an encode, process, decode round-trip should just work.
[!NOTE] A common use-case would be:
`encode`, `model.generate`, `decode`. However, using `generate` would return `list[list[int]]`, which would then be incompatible with `decode`.
The encode_plus method is deprecated in favor of the single __call__ method.
**`apply_chat_template` returns `BatchEncoding`**

Previously, `apply_chat_template` returned `input_ids` for backward compatibility. Starting with v5, it now consistently returns a `BatchEncoding` dict like other tokenizer methods.
```python
# v5
messages = [
    {"role": "user", "content": "Hello!"},
    {"role": "assistant", "content": "Hi there!"},
]

# Now returns BatchEncoding with input_ids, attention_mask, etc.
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys())  # dict_keys(['input_ids', 'attention_mask'])
```
We simplify the serialization of tokenization attributes:
- `special_tokens_map.json` - special tokens are now stored in `tokenizer_config.json`.
- `added_tokens.json` - added tokens are now stored in `tokenizer.json`.
- `added_tokens_decoder` is only stored when there is no `tokenizer.json`.

When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format. We're gradually moving towards consolidating attributes to fewer files so that other libraries and implementations may depend on them more reliably.
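Conceptually, the consolidation on save is just a merge of the legacy file contents into the main config. A toy sketch with a hypothetical `consolidate` helper (not the transformers serialization code):

```python
import json

def consolidate(tokenizer_config, special_tokens_map):
    """Fold legacy special_tokens_map.json content into tokenizer_config.json."""
    merged = dict(tokenizer_config)
    merged.update(special_tokens_map)  # special tokens now live in the config
    return merged

config = {"model_max_length": 4096}
legacy_map = {"bos_token": "<s>", "eos_token": "</s>"}
print(json.dumps(consolidate(config, legacy_map), sort_keys=True))
```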
Several models that had identical tokenizers now import from their base implementation:
These modules will eventually be removed altogether.
Removed T5-specific workarounds
The internal _eventually_correct_t5_max_length method has been removed. T5 tokenizers now handle max length consistently with other models.
A few testing changes specific to tokenizers have been applied:
- Common tokenizer method tests (`add_tokens`, `encode`, `decode`) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.

For legacy implementations, the original BERT Python tokenizer code (including `WhitespaceTokenizer`, `BasicTokenizer`, etc.) is preserved in `bert_legacy.py` for reference purposes.
Special Tokens Structure:
- `SpecialTokensMixin`: Merged into `PreTrainedTokenizerBase` to simplify the tokenizer architecture.
- `special_tokens_map`: Now only stores named special token attributes (e.g., `bos_token`, `eos_token`). Use `extra_special_tokens` for additional special tokens (formerly `additional_special_tokens`). `all_special_tokens` includes both named and extra tokens.

```python
# v4
tokenizer.special_tokens_map  # Included 'additional_special_tokens'

# v5
tokenizer.special_tokens_map    # Only named tokens
tokenizer.extra_special_tokens  # Additional tokens
```
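The named/extra split can be pictured with a small standalone sketch. This is a loose illustration with simplified semantics (e.g., `extra_special_tokens` is modeled as a plain list here), not the transformers attribute implementation:

```python
NAMED_SPECIAL_TOKENS = ("bos_token", "eos_token", "unk_token", "pad_token")

class SpecialTokensView:
    """Sketch of the v5 split between named and extra special tokens."""

    def __init__(self, **tokens):
        self._named = {k: v for k, v in tokens.items() if k in NAMED_SPECIAL_TOKENS}
        self._extra = [v for k, v in tokens.items() if k not in NAMED_SPECIAL_TOKENS]

    @property
    def special_tokens_map(self):
        return dict(self._named)  # only named tokens

    @property
    def extra_special_tokens(self):
        return list(self._extra)

    @property
    def all_special_tokens(self):
        return list(self._named.values()) + self._extra

view = SpecialTokensView(bos_token="<s>", eos_token="</s>", think_token="<think>")
print(view.special_tokens_map)    # {'bos_token': '<s>', 'eos_token': '</s>'}
print(view.extra_special_tokens)  # ['<think>']
```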
- `special_tokens_map_extended` and `all_special_tokens_extended`: Removed. Access `AddedToken` objects directly from `_special_tokens_map` or `_extra_special_tokens` if needed.
- `additional_special_tokens`: Still accepted for backward compatibility but is automatically converted to `extra_special_tokens`.

Deprecated Methods:
- `sanitize_special_tokens()`: Already deprecated in v4, removed in v5.
- `prepare_seq2seq_batch()`: Deprecated; use `__call__()` with the `text_target` parameter instead.

```python
# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)

# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
```
- `BatchEncoding.words()`: Deprecated; use `word_ids()` instead.

Removed Methods:
- `create_token_type_ids_from_sequences()`: Removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
- `prepare_for_model()`, `build_inputs_with_special_tokens()`, `truncate_sequences()`: Moved from `tokenization_utils_base.py` to `tokenization_python.py` for `PythonBackend` tokenizers. `TokenizersBackend` provides model-ready input via `tokenize()` and `encode()`, so these methods are no longer needed in the base class.
- `_switch_to_input_mode()`, `_switch_to_target_mode()`, `as_target_tokenizer()`: Removed from base class. Use `__call__()` with the `text_target` parameter instead.

```python
# v4
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, ...)

# v5
labels = tokenizer(text_target=tgt_texts, ...)
```
- `parse_response()`: Removed from base class.

The v5 release significantly improves the performance of the MoE models, as can be seen in the graphs below. We improve and optimize MoE performance through batched and grouped experts implementations, and we optimize them for decoding using `batched_mm`.
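The core idea behind grouped execution is to run each expert once over all of its routed tokens instead of once per token. A pure-Python sketch of the routing/grouping step (illustrative only; the `route_and_group` helper is hypothetical and real implementations operate on tensors):

```python
def route_and_group(token_ids, expert_of):
    """Group token indices by their routed expert so each expert runs once."""
    groups = {}
    for idx in token_ids:
        groups.setdefault(expert_of[idx], []).append(idx)
    return groups

# 6 tokens routed across 3 experts (toy top-1 routing table).
expert_of = {0: 2, 1: 0, 2: 2, 3: 1, 4: 0, 5: 2}
groups = route_and_group(range(6), expert_of)
print(sorted(groups.items()))  # [(0, [1, 4]), (1, [3]), (2, [0, 2, 5])]
```

Once tokens are grouped, each expert's matmul runs over its whole group in one batched call, which is where the decoding speedups come from.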
We focus on improving the performance of loading weights on device (which gives speedups up to 6x in tensor parallel situations); this is preliminary work that we'll continue to work on in the coming weeks. Some notable improvements:
**dtype update**

We have updated the default dtype for all models loaded with `from_pretrained` to be `auto`. This will lead to model instantiations respecting the dtype in which the model was saved, rather than forcing it to load in `float32`.
You can, of course, still specify the dtype in which you want to load your model by specifying it as an argument to the from_pretrained method.
The Hugging Face Hub infrastructure has gradually moved to an Xet backend. This will significantly simplify uploads and downloads, with higher download and upload speeds, partial uploads, and, most notably, a higher threshold for accepted file sizes on the Hugging Face Hub.
To reflect this, we're increasing the default shard size of models serialized on the Hub to 50GB (up from 5GB).
**`use_auth_token`**

The `use_auth_token` argument/parameter is deprecated in favor of `token` everywhere.
You should be able to search and replace use_auth_token with token and get the same logic.
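The rename is mechanical enough to script; a rough sketch using Python's `re` module (the `migrate_use_auth_token` helper is named here purely for illustration):

```python
import re

def migrate_use_auth_token(source: str) -> str:
    """Rewrite the deprecated keyword to its v5 name (word-boundary safe)."""
    return re.sub(r"\buse_auth_token\b", "token", source)

old = 'model = AutoModel.from_pretrained("gpt2", use_auth_token=True)'
print(migrate_use_auth_token(old))
# model = AutoModel.from_pretrained("gpt2", token=True)
```

The word boundaries avoid touching identifiers that merely contain the substring.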
Linked PR: https://github.com/huggingface/transformers/pull/41666
We decided to remove some features for the upcoming v5 as they are currently only supported in a few old models and no longer integrated in current model additions. It's recommended to stick to v4.x in case you need them. The following features are affected:
We dropped support for two torch APIs:
- torchscript, in https://github.com/huggingface/transformers/pull/41688
- torch.fx, in https://github.com/huggingface/transformers/pull/41683

Those APIs were deprecated by the PyTorch team, and we're instead focusing on the supported APIs `dynamo` and `export`.
We clean up the quantization API in transformers, and significantly refactor the weight loading as highlighted above.
We drop support for two quantization arguments that have been deprecated for some time:
- `load_in_4bit`
- `load_in_8bit`

We remove them in favor of the `quantization_config` argument, which is much more complete. As an example, here is how
you would load a 4-bit bitsandbytes model using this argument:
```python
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-3B",
    device_map="auto",
    quantization_config=quantization_config,
)
```
- `from_xxx_config` methods are deleted. Configs can be initialized from the `__init__` method in the same way. See #41314.
- RoPE parameters are consolidated under the model's `rope_parameters`, including `rope_theta` and `rope_type`. A model's `config.rope_parameters` is a simple dictionary in most cases, and can also be a nested dict in special cases (i.e. Gemma3 and ModernBert) with different RoPE parameterization for each layer type. Trying to get `config.rope_theta` will throw an attribute error from now on. See #39847 and #42255.
- Composite configs no longer forward sub-config attributes at the top level (e.g. `config.vocab_size`). Users are expected to access keys from their respective sub-configs (`config.text_config.vocab_size`).
- Models that cannot generate (i.e. cannot call `model.generate()`) will no longer have a `generation_config`, and `model.config.generation_config` will throw an attribute error.
- Slow tokenizer files (`tokenization_<model>.py`) will be removed in favor of using fast tokenizer files (`tokenization_<model>_fast.py`), which will be renamed to `tokenization_<model>.py`. As fast tokenizers are 🤗 `tokenizers`-backed, they include a wider range of features that are maintainable and reliable.
- `encode_plus` --> `__call__`
- `batch_decode` --> `decode`
- `apply_chat_template` by default returned naked `input_ids` rather than a `BatchEncoding` dict. This was inconvenient - it should return a `BatchEncoding` dict like `tokenizer.__call__()`, but we were stuck with it for backward compatibility. The method now returns a `BatchEncoding`.
Linked PRs:
- Processor attributes are now serialized in `processor_config.json` as a nested dict, instead of serializing attributes in their own config files. Loading will be supported for all old-format processors (https://github.com/huggingface/transformers/pull/41474).
- `XXXFeatureExtractor` classes are completely removed in favor of the `XXXImageProcessor` class for all vision models (https://github.com/huggingface/transformers/pull/41174).
- `XXXFastImageProcessorKwargs` is removed in favor of `XXXImageProcessorKwargs`, which will be shared between fast and slow processors (https://github.com/huggingface/transformers/pull/40931).
- `RotaryEmbeddings` layers will start returning a dict of tuples, in case the model uses several RoPE configurations (Gemma2, ModernBert). Each value will be a tuple of "cos, sin" per RoPE type.
- The RoPE config of the `RotaryEmbeddings` layer will be unified and accessed via `config.rope_parameters`. The config attr for `rope_theta` might not be accessible anymore for some models, and instead will be in `config.rope_parameters['rope_theta']`. BC will be supported for a while as much as possible, and in the near future we'll gradually move to the new RoPE format (https://github.com/huggingface/transformers/pull/39847).
- Multimodal models no longer expose `model.language_model` directly. It is recommended to either access the module with `model.model.language_model` or `model.get_decoder()`. See #42156.
- Models now accept `kwargs` in their forward methods.
- Legacy generation output classes are removed (e.g. `GreedySearchEncoderDecoderOutput`). We now only have 4 output classes built from the following matrix: decoder-only vs encoder-decoder, uses beams vs doesn't use beams (https://github.com/huggingface/transformers/pull/40998).
- When `generate` doesn't receive any KV Cache argument, the default cache class used is now defined by the model (as opposed to always being `DynamicCache`) (https://github.com/huggingface/transformers/pull/41505).
- If generation parameters are saved in `config.json` for any old model, they will be loaded back into the model's generation config. Users are expected to access or modify generation parameters only with `model.generation_config.do_sample = True`.

**`compute_loss_func` Handling**
- `compute_loss_func` now always takes priority over the model's built-in loss computation, giving users consistent control over custom loss functions.

**`num_items_in_batch` in Prediction Step**
- The `num_items_in_batch` argument is now passed to `compute_loss` during `prediction_step`, enabling proper loss scaling during evaluation.

**`report_to` now defaults to `"none"`**
The following arguments were removed from `TrainingArguments` due to low usage:

- `mp_parameters` -> legacy param that was later on added to the Sagemaker trainer
- `_n_gpu` -> not intended for users to set; we will initialize it correctly instead of putting it in the `TrainingArguments`
- `overwrite_output_dir` -> replaced by `resume_from_checkpoint`, and it was only used in the examples scripts, no impact on `Trainer`
- `logging_dir` -> only used for tensorboard; set the `TENSORBOARD_LOGGING_DIR` env var instead
- `jit_mode_eval` -> use `use_torch_compile` instead, as torchscript is not recommended anymore
- `tpu_num_cores` -> it is actually better to remove it, as it is not recommended to set the number of cores. By default, all TPU cores are used. Set the `TPU_NUM_CORES` env var instead
- `past_index` -> it was only used for a very small number of models with a special architecture like transformersxl, and it was not documented at all how to train those models
- `ray_scope` -> only a minor arg for the ray integration. Set the `RAY_SCOPE` env var instead
- `warmup_ratio` -> use `warmup_steps` instead. We combined both args together by allowing passing float values in `warmup_steps`.

Renamed in `TrainingArguments`:

- `fsdp_min_num_params` and `fsdp_transformer_layer_cls_to_wrap` -> use `fsdp_config`
- `tpu_metrics_debug` -> `debug`
- `push_to_hub_token` -> `hub_token`
- `push_to_hub_model_id` and `push_to_hub_organization` -> `hub_model_id`
- `include_inputs_for_metrics` -> `include_for_metrics`
- `per_gpu_train_batch_size` -> `per_device_train_batch_size`
- `per_gpu_eval_batch_size` -> `per_device_eval_batch_size`
- `use_mps_device` -> mps will be used by default if detected
- `fp16_backend` and `half_precision_backend` -> we will only rely on `torch.amp` as everything has been upstreamed to torch
- `no_cuda` -> `use_cpu`
- `include_tokens_per_second` -> `include_num_input_tokens_seen`
- `use_legacy_prediction_loop` -> we only use the `evaluation_loop` function from now on

Renamed in `Trainer`:

- `tokenizer` in initialization -> `processing_class`
- `model_path` in `train()` -> `resume_from_checkpoint`

During training with `Trainer`, `use_cache` in the model config will be set to `False`.
You can still change the cache value through the `TrainingArguments` `use_cache` argument if needed.

- Removed `organization` and `repo_url` from `PushToHubMixin`. You must pass a `repo_id` instead.
- Removed `ignore_metadata_errors` from `PushToHubMixin`. In practice, if we ignore errors while loading the model card, we won't be able to push the card back to the Hub, so it's better to fail early and not provide the option to fail later.
- `push_to_hub` does not accept `**kwargs` anymore. All accepted parameters are explicitly documented.
- Arguments of `push_to_hub` are now keyword-only to avoid confusion. Only `repo_id` can be positional since it's the main arg.
- Removed the `use_temp_dir` argument from `push_to_hub`. We now use a tmp dir in all cases.

Linked PR: https://github.com/huggingface/transformers/pull/42391.
The deprecated `transformers-cli ...` command has been removed; `transformers ...` is now the only CLI entry point.
The transformers CLI has been migrated to Typer, making it easier to maintain and adding some nice features out of the box (improved --help section, autocompletion).
The biggest breaking change is in transformers chat. This command starts a terminal UI to interact with a chat model.
It used to also be able to start a Chat Completion server powered by transformers and chat with it. In this revamped
version, this feature has been removed in favor of transformers serve. The goal of splitting transformers chat
and transformers serve is to define clear boundaries between client and server code. It helps with maintenance
but also makes the commands less bloated. The new signature of transformers chat is:
```
Usage: transformers chat [OPTIONS] BASE_URL MODEL_ID [GENERATE_FLAGS]...

  Chat with a model from the command line.
```
It works hand in hand with transformers serve, which means that if transformers serve is running on its default endpoint, transformers chat can be launched as follows:
```shell
transformers chat HuggingFaceTB/SmolLM3-3B
```
It can however use any OpenAI API compatible HTTP endpoint:
```shell
transformers chat HuggingFaceTB/SmolLM3-3B https://router.huggingface.co/v1
```
Linked PRs:
**`run` method**

The `transformers run` command (previously `transformers-cli run`) is an artefact of the past, was not documented nor tested,
and isn't part of any public documentation. We're removing it for now and ask you to please let us know in case
this is a method you are using; in which case we should bring it back with better support.
Linked PR: https://github.com/huggingface/transformers/pull/42447
- TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, and PYTORCH_PRETRAINED_BERT_CACHE have been removed. Please use HF_HOME instead.
- HUGGINGFACE_CO_EXAMPLES_TELEMETRY, HUGGINGFACE_CO_PREFIX, and HUGGINGFACE_CO_RESOLVE_ENDPOINT have been removed. Please use huggingface_hub.constants.ENDPOINT instead.

Linked PR: https://github.com/huggingface/transformers/pull/42391.
transformers v5 pins the huggingface_hub version to >=1.0.0. See this migration guide to learn more about this major release. Here are the main aspects to know about:
- Migration from requests to httpx. This change was made to improve performance and to support both synchronous and asynchronous requests the same way. If you are currently catching requests.HTTPError errors in your codebase, you'll need to switch to httpx.HTTPError.
- Proxies are now configured through the HTTP_PROXY / HTTPS_PROXY environment variables.
- hf_transfer (and therefore HF_HUB_ENABLE_HF_TRANSFER) has been completely dropped in favor of hf_xet. This should be transparent for most users. Please let us know if you notice any downside!
- typer-slim has been added as a required dependency, used to implement both the hf and transformers CLIs.
The Code World Model (CWM) model was proposed in CWM: An Open-Weights LLM for Research on Code Generation with World Models by Meta FAIR CodeGen Team. CWM is an LLM for code generation and reasoning about code that has, in particular, been trained to better represent and reason about how code and commands affect the state of a program or system. Specifically, we mid-trained CWM on a large number of observation-action trajectories from Python execution traces and agentic interactions in containerized environments. We post-trained with extensive multi-task RL in verifiable coding, math, and multi-turn software engineering environments.
SAM3 (Segment Anything Model 3) was introduced in SAM 3: Segment Anything with Concepts.
The SAM3 release adds four new architectures:
SAM3 performs Promptable Concept Segmentation (PCS) on images. PCS takes text and/or image exemplars as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept.
Sam3Tracker and Sam3TrackerVideo perform Promptable Visual Segmentation (PVS) on images. PVS takes interactive visual prompts (points, boxes, masks) or text inputs to segment a specific object instance per prompt. This is the task that SAM 1 and SAM 2 focused on, and SAM 3 improves upon it. Sam3Tracker and Sam3TrackerVideo are updated versions of SAM2 Video that maintain the same API while providing improved performance and capabilities.
SAM3 Video performs Promptable Concept Segmentation (PCS) on videos. PCS takes text as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames. The model combines a detection module (SAM3) with a tracking module (SAM2-style tracker) to enable robust object tracking across video frames using text prompts.
LFM2-MoE is a Mixture-of-Experts (MoE) variant of LFM2. The LFM2 family is optimized for on-device inference by combining short‑range, input‑aware gated convolutions with grouped‑query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.
LFM2‑MoE keeps this fast backbone and introduces sparse MoE feed‑forward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).
The VideoLLaMA3 model is a major update to VideoLLaMA2 from Alibaba DAMO Academy.
Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio–text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.
The model checkpoint is available at: nvidia/audio-flamingo-3-hf
Highlights:
NanoChat is a compact decoder-only transformer model designed for educational purposes and efficient training. The model features several fundamental architectural components common in modern transformer models, making it a good starting point for understanding the principles behind them. NanoChat is a variant of the Llama architecture, with a simplified attention mechanism and normalization layers.
FastVLM is an open-source vision-language model featuring a novel hybrid vision encoder, FastViTHD. Leveraging reparameterizable convolutional layers, scaled input resolution, and a reduced number of visual tokens, FastVLM delivers high accuracy with exceptional efficiency. Its optimized architecture enables deployment even on edge devices, achieving ultra-low TTFT (time to first token) without sacrificing performance.
PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (joint) embedding space. The model enables cross-modal retrieval and understanding between audio and text.
Text input
Audio input
The resulting embeddings can be used for:
Jais 2 is a next-generation Arabic open-weight LLM trained on the richest Arabic-first dataset to date. Built from the ground up with 8B and 70B parameters, Jais 2 understands Arabic the way it's truly spoken across dialects, culture, and modern expression. It is developed by MBZUAI, Inception, and Cerebras Systems and is based on the transformer architecture with modifications including:
Pixio is a vision foundation model that uses ViT as a feature extractor for multiple downstream tasks like depth estimation, semantic segmentation, feed-forward 3D reconstruction, robotics, and image classification. It is built on the Masked Autoencoder (MAE) pre-training framework, with four minimal yet critical updates: 1) deeper decoder, 2) larger masking granularity, 3) more class tokens, and 4) web-scale curated training data.
The Ernie 4.5 VL MoE model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes. The Vision-Language series in particular is composed of a novel multimodal heterogeneous structure, sharing parameters across modalities and dedicating parameters to specific modalities. This becomes especially apparent in the Mixture of Experts (MoE) which is composed of
This architecture has the advantage of enhancing multimodal understanding without compromising, and even improving, performance on text-related tasks. A more detailed breakdown is given in the Technical Report.
[Ernie 4.5] Ernie VL models by @vasqu in https://github.com/huggingface/transformers/pull/39585

GLM-ASR-Nano-2512 is a robust, open-source speech recognition model with 1.5B parameters. Designed for real-world complexity, it outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact size.
Key capabilities include:
Exceptional Dialect Support Beyond standard Mandarin and English, the model is highly optimized for Cantonese (粤语) and other dialects, effectively bridging the gap in dialectal speech recognition.
Low-Volume Speech Robustness Specifically trained for "Whisper/Quiet Speech" scenarios. It captures and accurately transcribes extremely low-volume audio that traditional models often miss.
SOTA Performance Achieves the lowest average error rate (4.10) among comparable open-source models, showing significant advantages in Chinese benchmarks (Wenet Meeting, Aishell-1, etc.).
This model was contributed by Eustache Le Bihan and Yuxuan Zhang. You can check the model card for more details, as well as our GitHub repo.
GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.
We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. Code, models and more information are released at https://github.com/zai-org/GLM-V
LW-DETR proposes a light-weight Detection Transformer (DETR) architecture designed to compete with and surpass the dominant YOLO series for real-time object detection. It achieves a new state-of-the-art balance between speed (latency) and accuracy (mAP) by combining recent transformer advances with efficient design choices.
The LW-DETR architecture is characterized by its simple and efficient structure: a plain ViT Encoder, a Projector, and a shallow DETR Decoder. It enhances the DETR architecture for efficiency and speed using the following core modifications:
Efficient ViT Encoder: Uses a plain ViT with interleaved window/global attention and a window-major organization to drastically reduce attention complexity and latency.
Richer Input: Aggregates multi-level features from the encoder and uses a C2f Projector (YOLOv8) to pass two-scale features (1/8 and 1/32).
Faster Decoder: Employs a shallow 3-layer DETR decoder with deformable cross-attention for lower latency and faster convergence.
Optimized Queries: Uses a mixed-query scheme combining learnable content queries and generated spatial queries.
LightOnOcr combines a Vision Transformer encoder (Pixtral-based) with a lightweight text decoder (Qwen3-based) distilled from high-quality open VLMs. It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages.
- [JetMoe] Fix jetmoe after #40132 by @ArthurZucker in #41324
- gemma3 by @Sai-Suraj-27 in #41354
- PretrainedConfig to PreTrainedConfig by @Cyrilvallez in #41300
- [ModularChecker] QOL for the modular checker by @ArthurZucker in #41361
- [v5] Remove relative position embeddings (for bert like models) by @vasqu in #41170
- apply_chat_template by @Samoed in #41355
- test_longcat_generation_cpu by @ydshieh in #41368
- [CB] Refactors the way we access paged by @ArthurZucker in #41370
- [v5] Sync Bert and Bart eager attention by @vasqu in #41248
- TypeError exception for invalid type by @Sai-Suraj-27 in #41346
- update_device_map for GPTQ quantizer by @Sai-Suraj-27 in #41328
- prune_heads by @gante in #41417
- [JetMoe] Fix KV head repetition and padding free by @vasqu in #41423
- JetMoeIntegrationTest by @ydshieh in #41377
- past_key_value in BERT-like models by @zucchini-nlp in #41448
- utils/tf_ops/ by @gante in #41402
- [Attention Masks] Bidirectional masks for encoder and encoder-decoder models by @vasqu in #41265
- past_index by @SunMarc in #41384
- report_to default changed to "none" + cleaning deprecated env var by @SunMarc in #41375
- overwrite_output_dir by @SunMarc in #41323
- [CI] Fix copies on main by @vasqu in #41486
- jit_mode_eval by @SunMarc in #41376
- local_rank arg from TrainingArguments by @SunMarc in #41382
- pickle - BloomTokenizerFast by @ydshieh in #41466
- glm4v by @Sai-Suraj-27 in #41483
- truncation to False in Qwen3Omni to avoid default truncation by @BakerBunker in #41473
- local_rank deletion and some cleaning by @SunMarc in #41504
- tpu_num_cores by @SunMarc in #41383
- HunYuanMoEV1IntegrationTest:test_model_generation by @ydshieh in #41373
- generate delegates default cache initialization to the model by @gante in #41505
- [from_pretrained] Small refactor from_pretrained: move around unrelated stuff by @ArthurZucker in #41445
- transformers serve by @LysandreJik in #41446
- logits_to_keep to many older CausalLM models by @philiproeleveld in #41335
- torch.compile recompiled part of th… by @sywangyi in #41558
- [Docs] Fix changed references by @vasqu in #41614
- expand_device_map instead of redefining it by @Cyrilvallez in #41608
- tp_plan in from_pretrained directly by @Cyrilvallez in #41435
- [Executorch] Simplify for encoder models by @vasqu in #41627
- [Ernie 4.5 Moe] Fix Moe and offloading by @vasqu in #41385
- [Masks] Fix mask handling in eager for vision models by @vasqu in #41625
- utils/check_bad_commit.py by @ydshieh in #41658
- use_cache default to False by @SunMarc in #41585
- chat_extras.md to Korean by @Judy-Choi in #39863
- big_bird.md to Korean by @ssum21 in #40445
- code_llama.md to Korean by @Judy-Choi in #40558
- ko-LFM2.md to Korean by @ssum21 in #41502
- use_auth_token parameter by @Wauplin in #41666
- [Attn] Allow dynamic causality in SDPA via Kwargs by @vasqu in #41692
- run_name docs in TrainingArguments by @tobiasofsn in #41705
- utils/check_bad_commit.py by @ydshieh in #41658
- videos from image processing classes by @zucchini-nlp in #41607
- @staticmethod from module-level get_device_and_memory_breakdown by @albertvillanova in #41747
- [Onnx docs] Remove some traces by @vasqu in #41791
- utils/check_bad_commit.py by @ydshieh in #41658
- [Clip] Fix masking and enable flash attention on all model types by @vasqu in #41750
- test_tensor_parallel.py by @3outeille in #41918
- detectron2 installation in docker files by @ydshieh in #41975
- autoawq[kernels] installation in quantization docker file by @ydshieh in #41978
- torchcodec version in quantization docker file by @ydshieh in #41988
- run slow v2: empty report when there is only one model by @ydshieh in #42002
- torch+deepspeed docker file by @ydshieh in #41985
- logging_dir by @SunMarc in #42013
- deeepspeed in AMD docker file by @ydshieh in #42025
- huggingface_hub dependency version by @hanouticelina in #42033
- pr_slow_ci_suggestion.yml after #42023 by @ydshieh in #42049
- Argument list too long in pr_slow_ci_suggestion.yml by @ydshieh in #42061
- setattr as well by @zucchini-nlp in #41808
- [Attn Masks] Non-vmap default for attention masks by @vasqu in #41852
- image_transforms.py by @yaswanth19 in #42044
- prepare_inputs_for_generation cache slicing condition by @albertvillanova in #41764
- [T5Gemma] Fix cross attention cache by @vasqu in #41890
- streaming by @McPatate in #42102
- pytest<9 for now by @ydshieh in #42162
- [Pop2Piano] Fix cache usage by @vasqu in #42170
- [PEFT] Fix prefix tuning by @vasqu in #41696
- FqnToConfig by @jcaip in #41894
- [PEFT] Fix the general test for prefix tuning by @vasqu in #42185
- [Pop2Piano] Fix tied weights by @vasqu in #42193
- [BLT] Fix cache usage by @vasqu in #42188
- test_dynamic_cache_exportability_multiple_run (failing on torch 2.10 nightly) by @ydshieh in #42212
- AttentionMaskConverter._unmask_unattended for xpu device before by @kaixuanliu in #42230
- base_model by @zucchini-nlp in #41589
- batch_size by @ydshieh in #42213
- batch_size" by @ydshieh in #42258
- get_decoder() for multimodal and delete redundant code 🔪 by @zucchini-nlp in #42156
- cwm by @ydshieh in #42261
- torch.get_autocast_dtype instead of torch.get_autocast_gpu_dtype by @qgallouedec in #42055
- WhisperFeatureExtractor by @TopCoder2K in #42286
- [CI] Skip EfficientLoFTR test by @vasqu in #42327
- [Attn Masks] Lift bidirectional mask restriction on eager by @vasqu in #42325
- torch.distributed imports by @Cyrilvallez in #42361
- [Attn Masks] Add skip option for non-packed sequences by @vasqu in #42367
- [Mistral Tokenizers] Fix tokenizer detection by @vasqu in #42389
- get_encoder() by @zucchini-nlp in #42295
- [FA] Cleanup loading logic by @vasqu in #41427
- huggingface_hub constants + cleanup in PushToHubMixin by @Wauplin in #42391
- [CI] Add to run slow by @vasqu in #42459
- transformers chat launched without base_url has a direct tie to localhost:8000 by @LysandreJik in #42463
- rotary_partial_emb to RopeParams and delete unnecessary code 🔪 by @zucchini-nlp in #42255
- add_prefix_space default value by @SunMarc in #42481

The following contributors have made significant changes to the library over the last release:
- [JetMoe] Fix jetmoe after #40132 (#41324)
- [ModularChecker] QOL for the modular checker (#41361)
- [CB] Refactors the way we access paged (#41370)
- [from_pretrained] Small refactor from_pretrained: move around unrelated stuff (#41445)
- [v5] Remove relative position embeddings (for bert like models) (#41170)
- [v5] Sync Bert and Bart eager attention (#41248)
- [JetMoe] Fix KV head repetition and padding free (#41423)
- [Attention Masks] Bidirectional masks for encoder and encoder-decoder models (#41265)
- [CI] Fix copies on main (#41486)
- [Docs] Fix changed references (#41614)
- [Executorch] Simplify for encoder models (#41627)
- [Ernie 4.5 Moe] Fix Moe and offloading (#41385)
- [Masks] Fix mask handling in eager for vision models (#41625)
- [Attn] Allow dynamic causality in SDPA via Kwargs (#41692)
- [Onnx docs] Remove some traces (#41791)
- [Clip] Fix masking and enable flash attention on all model types (#41750)
- [Attn Masks] Non-vmap default for attention masks (#41852)
- [T5Gemma] Fix cross attention cache (#41890)
- [Pop2Piano] Fix cache usage (#42170)
- [PEFT] Fix prefix tuning (#41696)
- [PEFT] Fix the general test for prefix tuning (#42185)
- [Pop2Piano] Fix tied weights (#42193)
- [BLT] Fix cache usage (#42188)
- [CI] Skip EfficientLoFTR test (#42327)
- [Attn Masks] Lift bidirectional mask restriction on eager (#42325)
- [Attn Masks] Add skip option for non-packed sequences (#42367)
- [Mistral Tokenizers] Fix tokenizer detection (#42389)
- [FA] Cleanup loading logic (#41427)
- [CI] Add to run slow (#42459)
- test_longcat_generation_cpu (#41368)
- JetMoeIntegrationTest (#41377)
- pickle - BloomTokenizerFast (#41466)
- HunYuanMoEV1IntegrationTest:test_model_generation (#41373)
- utils/check_bad_commit.py (#41658)
- utils/check_bad_commit.py (#41658) (#41690)
- utils/check_bad_commit.py (#41658) (#41815)
- detectron2 installation in docker files (#41975)
- autoawq[kernels] installation in quantization docker file (#41978)
- torchcodec version in quantization docker file (#41988)
- run slow v2: empty report when there is only one model (#42002)
- torch+deepspeed docker file (#41985)
- deeepspeed in AMD docker file (#42025)
- pr_slow_ci_suggestion.yml after #42023 (#42049)
- Argument list too long in pr_slow_ci_suggestion.yml (#42061)
- pytest<9 for now (#42162)
- test_dynamic_cache_exportability_multiple_run (failing on torch 2.10 nightly) (#42212)
- batch_size (#42213)
- batch_size" (#42258)
- cwm (#42261)
- prune_heads (#41417)
- utils/tf_ops/ (#41402)
- generate delegates default cache initialization to the model (#41505)
- use_auth_token parameter (#41666)
- huggingface_hub constants + cleanup in PushToHubMixin (#42391)
- logits_to_keep to many older CausalLM models (#41335)

We are getting closer and closer to the official release! This RC is focused on removing more of the deprecated stuff, fixing some minor issues, and updating docs.
- _get_num_multimodal_tokens by @Abhinavexists in https://github.com/huggingface/transformers/pull/43137
- BartModelIntegrationTest by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/43160
- auto_doctring in Processors by @yonigozlan in https://github.com/huggingface/transformers/pull/42101
- BitModelIntegrationTest by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/43164
- [Fp8] Fix experts by @vasqu in https://github.com/huggingface/transformers/pull/43154
- salesforce-ctrl, xlm & gpt-neo model generation tests by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/43180
- [Generate] Allow custom config values in generate config by @vasqu in https://github.com/huggingface/transformers/pull/43181
- Pix2StructIntegrationTest by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/43229
- PhiIntegrationTests by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/43214
- HF_TOKEN directly and remove require_read_token by @ydshieh in https://github.com/huggingface/transformers/pull/43233
- Owlv2ModelIntegrationTest & OwlViTModelIntegrationTest by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/43182
- add_dates by @yonigozlan in https://github.com/huggingface/transformers/pull/43199
- Vip-llava model integration test by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/43252
- position_ids in all apply_rotary_pos_emb by @Cyrilvallez in https://github.com/huggingface/transformers/pull/43255
- _get_test_info in testing_utils.py by @ydshieh in https://github.com/huggingface/transformers/pull/43259
- Hiera, SwiftFormer & LED Model integration tests by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/43225
- _toctree.yml by @Cyrilvallez in https://github.com/huggingface/transformers/pull/43264
- PegasusX, Mvp & LED model integration tests by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/43245

Full Changelog: https://github.com/huggingface/transformers/compare/v5.0.0rc2...v5.0.0rc3
Another fix for Qwen VL models that prevented the associated model type from loading correctly - this works together with https://github.com/huggingface/transformers/pull/41808 from the previous patch release.
Full Changelog: https://github.com/huggingface/transformers/compare/v4.57.5...v4.57.6
Should not have said last patch :wink: These should be the last remaining fixes that got lost in between patches and the transition to v5.
Full Changelog: https://github.com/huggingface/transformers/compare/v4.57.4...v4.57.5
Last patch release for v4: We have a few small fixes for remote generation methods (e.g. group beam search), vLLM, and an offline tokenizer fix (if it's already been cached).
Full Changelog: https://github.com/huggingface/transformers/compare/v4.57.3...v4.57.4
This release candidate is focused on fixing AutoTokenizer, expanding the dynamic weight loading support, and improving performance with MoEs!
The main issue with the tokenization refactor was that tokenizer_class is now "enforced" even though in most cases it is wrong. This took a while to properly isolate, and we now try to use TokenizersBackend whenever we can. #42894 has a much more detailed description of the big changes!
- TokenizersBackend by @ArthurZucker in https://github.com/huggingface/transformers/pull/42894
- [Tokenizers] Change treatment of special tokens by @vasqu in https://github.com/huggingface/transformers/pull/42903

Here we focused on boosting the performance of loading weights on device!
- post_init and fix all of them by @Cyrilvallez in https://github.com/huggingface/transformers/pull/42873
- _init_weights for ALL models by @Cyrilvallez in https://github.com/huggingface/transformers/pull/42309
- [Ernie 4.5] Ernie VL models by @vasqu in https://github.com/huggingface/transformers/pull/39585

Mostly around processors!
- convert_segmentation_map_to_binary_masks to EoMT by @simonreise in https://github.com/huggingface/transformers/pull/43073

Thanks again to everyone!
Full Changelog: https://github.com/huggingface/transformers/compare/v5.0.0rc1...v5.0.0rc2
This release candidate was focused mostly on quantization support with the new dynamic weight loader, and a few notable 🚨 breaking changes🚨:
from_pretrained is now auto!

Mostly QOL and fixes + support back CPU offloading.
Mostly added support for fbgemm and quanto.
The dynamic weight loader broke small things; this adds glue for all models but MoEs.
Tokenization needed more refactoring; this time it's a lot cleaner!
- rope_parameters to empty dict if there is something to put in it by @hmellor in https://github.com/huggingface/transformers/pull/42651

We omitted a lot of other commits for clarity, but thanks to everyone and the new contributors!
Full Changelog: https://github.com/huggingface/transformers/compare/v5.0.0rc0...v5.0.0rc1
We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and the release is significant: 800 commits have been pushed to main since the latest minor release. This release removes a lot of long-due deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.
We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.
This release is a release candidate (RC). It is not the final v5 release, and we will publish it on PyPI as a pre-release. This means that the current release is purely opt-in: installing transformers without specifying this exact release will install the latest stable version instead (v4.57.3 as of writing).
In order to install this release, please do so with the following:
pip install transformers --pre
For us to deliver the best package possible, it is imperative that we have feedback on how the toolkit is currently working for you. Please try it out, and open an issue in case you're facing something inconsistent/a bug.
Transformers version 5 is a community endeavor, and this is the last mile. Let's ship this together!
[!NOTE] 👀 Nothing is final and things are still in active movement. We have a section dedicated to what is planned for future release candidates but is known not to work in RC0. Look for "Disclaimers for the RC0".
We'll be eagerly awaiting your feedback in our GitHub issues!
We introduce a new weight loading API in transformers, which significantly improves on the previous API. This
weight loading API is designed to apply operations to the checkpoints loaded by transformers.
Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge, and split the layers according to how they're defined in this new API. These operations are often a necessity when working with quantization or parallelism algorithms.
This new API is centered around the new WeightConverter class:
class WeightConverter(WeightTransform):
operations: list[ConversionOps]
source_keys: Union[str, list[str]]
target_keys: Union[str, list[str]]
The weight converter is designed to apply a list of operations on the source keys, resulting in target keys. A common operation done on the attention layers is to fuse the query, key, values layers. Doing so with this API would amount to defining the following conversion:
conversion = WeightConverter(
["self_attn.q_proj", "self_attn.k_proj", "self_attn.v_proj"], # The input layers
"self_attn.qkv_proj", # The single layer as output
operations=[Concatenate(dim=0)],
)
In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single
layer.
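For intuition, the fusion that Concatenate(dim=0) performs can be sketched with plain tensors. This is not the transformers API itself, just the underlying tensor operation the conversion applies to the serialized weights:

```python
import torch

hidden = 8
# Stand-ins for the three serialized projection weights
q = torch.randn(hidden, hidden)  # self_attn.q_proj
k = torch.randn(hidden, hidden)  # self_attn.k_proj
v = torch.randn(hidden, hidden)  # self_attn.v_proj

# Concatenate(dim=0): stack along the output dimension into one fused weight
qkv = torch.cat([q, k, v], dim=0)  # the single self_attn.qkv_proj weight
print(qkv.shape)  # torch.Size([24, 8])
```

The reverse mapping (splitting a fused checkpoint back into separate q/k/v layers) is simply the inverse slicing of the same tensor.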
This allows us to define a mapping from architecture to a list of weight conversions. Applying those weight conversions
can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method
and helped us remove a lot of technical debt that we accumulated over the past few years.
This results in several improvements:
While this is being implemented, expect varying levels of support across different release candidates.
Linked PR: https://github.com/huggingface/transformers/pull/41580
Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object to be a lot more intuitive. With v5, tokenizer definition is much simpler; one can now initialize an empty LlamaTokenizer and train it directly on your corpus.
Defining a new tokenizer object should be as simple as this:
from transformers import TokenizersBackend, generate_merges
from tokenizers import pre_tokenizers, Tokenizer
from tokenizers.models import BPE
class Llama5Tokenizer(TokenizersBackend):
def __init__(self, unk_token="<unk>",bos_token="<s>", eos_token="</s>", vocab=None, merges=None ):
if vocab is None:
self._vocab = {
str(unk_token): 0,
str(bos_token): 1,
str(eos_token): 2,
}
else:
self._vocab = vocab
if merges is not None:
self._merges = merges
else:
self._merges = generate_merges(self._vocab)
self._tokenizer = Tokenizer(
BPE(vocab=self._vocab, merges=self._merges, fuse_unk=True)
)
self._tokenizer.pre_tokenizer = pre_tokenizers.Metaspace(
replacement="▁", prepend_scheme="first", split=False  # "first" corresponds to add_prefix_space=True
)
super().__init__(
tokenizer_object=self._tokenizer,
unk_token=unk_token,
bos_token=bos_token,
eos_token=eos_token,
)
Once the tokenizer is defined as above, you can load it with the following: Llama5Tokenizer(). Doing this returns you an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet :wink:).
The above is the main motivation towards refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.
Up to now, transformers maintained two parallel implementations for many tokenizers:
- Slow tokenizers (tokenization_<model>.py) - Python-based implementations, often using SentencePiece as the backend.
- Fast tokenizers (tokenization_<model>_fast.py) - Rust-based implementations using the 🤗 tokenizers library.

In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:
- A backend based on the sentencepiece library. It inherits from PythonBackend.
- A backend based on 🤗 tokenizers. Basically allows adding tokens.
- A backend based on MistralCommon's tokenization library. (Previously known as the MistralCommonTokenizer)

The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use AutoTokenizer.from_pretrained() as before. This allows transformers to be future-proof and modular, easily supporting future backends.
We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece or mistral-common, but we offer the possibility to design the tokenizer at a higher-level, without relying on those backends.
To do so, you can import the PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.
If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:
- encode
- decode
- vocab_size
- get_vocab
- convert_tokens_to_ids
- convert_ids_to_tokens
- from_pretrained
- save_pretrained

Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers:
from transformers import LlamaTokenizer
tokenizer = LlamaTokenizer()
This tokenizer will therefore follow the definition of the LlamaTokenizer as defined in its class definition. It can then be trained on a corpus as can be seen in the tokenizers documentation.
These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:
from transformers import LlamaTokenizer
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}
merges = [("h", "e"), ("l", "l"), ("o", " ")]
tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)
This tokenizer will behave as a Llama-like tokenizer, with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab; therefore enabling the comparison of different pre-tokenizers, normalizers, etc.
⚠️ The vocab_file (as in, a path towards a file containing the vocabulary) cannot be used to initialize the LlamaTokenizer as loading from files is reserved to the from_pretrained method.
The batch_decode and decode methods have been unified to reflect behavior of the encode method. Both single and batch decoding now use the same decode method. See an example of the new behavior below:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("t5-small")
inputs = ["hey how are you?", "fine"]
tokenizer.decode(tokenizer.encode(inputs))
Gives:
- 'hey how are you?</s> fine</s>'
+ ['hey how are you?</s>', 'fine</s>']
We expect `encode` and `decode` to behave as two sides of the same coin: an encode, process, decode round trip should just work.
[!NOTE] A common use-case would be: `encode`, `model.generate`, `decode`. However, `generate` returns `list[list[int]]`, which would previously have been incompatible with `decode`.
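The unified behavior boils down to dispatching on input nesting. A minimal pure-Python sketch (not the actual implementation; `decode_one` stands in for real token-to-text decoding):

```python
# Sketch: one `decode` entry point handling both single and batched inputs.
def decode(ids, decode_one=lambda seq: " ".join(str(i) for i in seq)):
    if ids and isinstance(ids[0], (list, tuple)):
        return [decode_one(seq) for seq in ids]  # batch: list of sequences
    return decode_one(ids)                       # single sequence
```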
The encode_plus method is deprecated in favor of the single __call__ method.
**`apply_chat_template` returns `BatchEncoding`**

Previously, `apply_chat_template` returned `input_ids` for backward compatibility. Starting with v5, it now consistently returns a `BatchEncoding` dict like other tokenizer methods.
# v5
messages = [
{"role": "user", "content": "Hello!"},
{"role": "assistant", "content": "Hi there!"}
]
# Now returns BatchEncoding with input_ids, attention_mask, etc.
outputs = tokenizer.apply_chat_template(messages, return_tensors="pt")
print(outputs.keys()) # dict_keys(['input_ids', 'attention_mask'])
We simplify the serialization of tokenization attributes:
- `special_tokens_map.json` is no longer written: special tokens are now stored in `tokenizer_config.json`.
- `added_tokens.json` is no longer written: added tokens are now stored in `tokenizer.json`.
- `added_tokens_decoder` is only stored when there is no `tokenizer.json`.

When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format. We're gradually consolidating attributes into fewer files so that other libraries and implementations can depend on them more reliably.
Several models that had identical tokenizers now import from their base implementation:
These modules will eventually be removed altogether.
Removed T5-specific workarounds
The internal _eventually_correct_t5_max_length method has been removed. T5 tokenizers now handle max length consistently with other models.
A few testing changes specific to tokenizers have been applied:
- Common behavior tests (e.g. `add_tokens`, `encode`, `decode`) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.

For legacy implementations, the original BERT Python tokenizer code (including `WhitespaceTokenizer`, `BasicTokenizer`, etc.) is preserved in `bert_legacy.py` for reference purposes.
Special Tokens Structure:
- `SpecialTokensMixin`: merged into `PreTrainedTokenizerBase` to simplify the tokenizer architecture.
- `special_tokens_map`: now only stores named special token attributes (e.g., `bos_token`, `eos_token`). Use `extra_special_tokens` for additional special tokens (formerly `additional_special_tokens`). `all_special_tokens` includes both named and extra tokens.

# v4
tokenizer.special_tokens_map # Included 'additional_special_tokens'
# v5
tokenizer.special_tokens_map # Only named tokens
tokenizer.extra_special_tokens # Additional tokens
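The v4 -> v5 split can also be illustrated with plain dicts and lists; the token values below are made up for illustration:

```python
# Illustrative values only; real tokenizers define their own special tokens.
named = {"bos_token": "<s>", "eos_token": "</s>"}  # v5 special_tokens_map: named tokens only
extra = ["<tool_call>", "</tool_call>"]            # v5 extra_special_tokens

# all_special_tokens includes both named and extra tokens
all_special = list(named.values()) + extra
```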
- `special_tokens_map_extended` and `all_special_tokens_extended`: removed. Access `AddedToken` objects directly from `_special_tokens_map` or `_extra_special_tokens` if needed.
- `additional_special_tokens`: still accepted for backward compatibility, but automatically converted to `extra_special_tokens`.

Deprecated Methods:
- `sanitize_special_tokens()`: already deprecated in v4, removed in v5.
- `prepare_seq2seq_batch()`: deprecated; use `__call__()` with the `text_target` parameter instead.

# v4
model_inputs = tokenizer.prepare_seq2seq_batch(src_texts, tgt_texts, max_length=128)
# v5
model_inputs = tokenizer(src_texts, text_target=tgt_texts, max_length=128, return_tensors="pt")
model_inputs["labels"] = model_inputs.pop("input_ids_target")
- `BatchEncoding.words()`: deprecated; use `word_ids()` instead.

Removed Methods:
- `create_token_type_ids_from_sequences()`: removed from base class. Subclasses that need custom token type ID creation should implement this method directly.
- `clean_up_tokenization()`: removed from base class. Now defined at model class level for models that need it (e.g., PLBart, CLVP, Wav2Vec2).
- `prepare_for_model()`, `build_inputs_with_special_tokens()`, `truncate_sequences()`: moved from `tokenization_utils_base.py` to `tokenization_python.py` for `PythonBackend` tokenizers. `TokenizersBackend` provides model-ready input via `tokenize()` and `encode()`, so these methods are no longer needed in the base class.
- `_switch_to_input_mode()`, `_switch_to_target_mode()`, `as_target_tokenizer()`: removed from base class. Use `__call__()` with the `text_target` parameter instead.

# v4
with tokenizer.as_target_tokenizer():
labels = tokenizer(tgt_texts, ...)
# v5
labels = tokenizer(text_target=tgt_texts, ...)
- `parse_response()`: removed from base class.

Because we are switching away from the naive MoE layout (`nn.ModuleList` for experts), we currently have an issue with MoEs that have adapters. For more details, see https://github.com/huggingface/transformers/issues/42491#issuecomment-3591485649.
We aim for this to be fixed and released in a following release candidate in the week that follows RC0.
We are streamlining the MoE support with vLLM; while this is being implemented, tensor parallelism and expert parallelism aren't working as expected. This is known and actively being worked on.
We aim for this to be fixed and released in a following release candidate in the week that follows RC0.
For anyone inheriting from a transformers PreTrainedModel, the weights are automatically initialized with the common scheme:
@torch.no_grad()
def _init_weights(self, module):
"""
Initialize the weights. This is quite general on purpose, in the spirit of what we usually do. For more complex
initialization scheme, it should be overridden by the derived `PreTrainedModel` class. In case a model adds an explicit
`nn.Parameter`, this method should also be overridden in order to initialize it correctly.
"""
if hasattr(self.config, "initializer_range"):
std = self.config.initializer_range or 0.02
elif hasattr(self.config, "init_std"):
std = self.config.init_std
elif hasattr(self.config, "initializer_factor"):
std = self.config.initializer_factor
else:
# 0.02 is the standard default value across the library
std = getattr(self.config.get_text_config(), "initializer_range", 0.02)
if isinstance(module, (nn.Linear, nn.Conv1d, nn.Conv2d, nn.Conv3d, nn.ConvTranspose1d, nn.ConvTranspose2d)):
if getattr(module, "weight", None) is not None:
init.normal_(module.weight, mean=0.0, std=std)
if getattr(module, "bias", None) is not None:
init.zeros_(module.bias)
elif isinstance(module, nn.Embedding):
if getattr(module, "weight", None) is not None:
init.normal_(module.weight, mean=0.0, std=std)
            # Here we need the check explicitly, as we slice the weight in the `zeros_` call, so it loses the flag
if module.padding_idx is not None and not getattr(module.weight, "_is_hf_initialized", False):
init.zeros_(module.weight[module.padding_idx])
elif isinstance(module, nn.MultiheadAttention):
# This uses torch's original init
module._reset_parameters()
# We cannot use `isinstance` on the RMSNorms or LayerNorms, as they usually are custom modules which change names
# between modelings (because they are prefixed with the model name)
elif (
isinstance(module, (nn.GroupNorm, nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d))
or "LayerNorm" in module.__class__.__name__
or "RMSNorm" in module.__class__.__name__
):
# Norms can exist without weights (in which case they are None from torch primitives)
if hasattr(module, "weight") and module.weight is not None:
init.ones_(module.weight)
if hasattr(module, "bias") and module.bias is not None:
init.zeros_(module.bias)
If you want to avoid that, for now you should just do:
class CustomModel(Qwen3VLForConditionalGeneration):
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
self.action_head = nn.Linear(1024, 7)
self.positional_embedding = nn.Parameter(torch.randn(16, 1152))
self.post_init()
def _init_weights(self, module):
pass
There is a tracker for that here: https://github.com/huggingface/transformers/issues/42418.
**`use_auth_token`**

The `use_auth_token` argument is deprecated in favor of `token` everywhere.
You should be able to search and replace use_auth_token with token and get the same logic.
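As a sketch of what that search-and-replace amounts to, here is a hypothetical helper (not part of transformers) forwarding the deprecated kwarg to its replacement:

```python
# Hypothetical helper illustrating the rename; not part of transformers.
def migrate_kwargs(kwargs: dict) -> dict:
    if "use_auth_token" in kwargs:
        value = kwargs.pop("use_auth_token")
        kwargs.setdefault("token", value)  # an explicitly passed `token` wins
    return kwargs
```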
Linked PR: https://github.com/huggingface/transformers/pull/41666
We decided to remove some features for the upcoming v5, as they are currently only supported in a few old models and are no longer integrated into new model additions. It's recommended to stick to v4.x in case you need them. The following features are affected:
We dropped support for two torch APIs:
- torchscript in https://github.com/huggingface/transformers/pull/41688
- torch.fx in https://github.com/huggingface/transformers/pull/41683

Those APIs were deprecated by the PyTorch team, and we're instead focusing on the supported APIs, dynamo and export.
We clean up the quantization API in transformers, and significantly refactor the weight loading as highlighted above.
We drop support for two quantization arguments that have been deprecated for some time:
- `load_in_4bit`
- `load_in_8bit`

We remove them in favor of the `quantization_config` argument, which is much more complete. As an example, here is how
you would load a 4-bit bitsandbytes model using this argument:
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
quantization_config = BitsAndBytesConfig(load_in_4bit=True)
model_4bit = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-3B",
device_map="auto",
quantization_config=quantization_config
)
- `from_xxx_config` methods are deleted. Configs can be initialized from the `__init__` method in the same way. See #41314.
- RoPE-related config attributes are grouped under `config.rope_parameters`, including `rope_theta` and `rope_type`. A model's `config.rope_parameters` is a simple dictionary in most cases, and can also be a nested dict in special cases (i.e. Gemma3 and ModernBert) with a different RoPE parameterization for each layer type. Trying to get `config.rope_theta` will throw an attribute error from now on. See #39847 and #42255.
- Composite configs no longer expose sub-config keys at the top level (e.g. `config.vocab_size`). Users are expected to access keys from their respective sub-configs (`config.text_config.vocab_size`).
- Models that cannot generate (i.e. have no `model.generate()`) will no longer have a `generation_config`, and `model.config.generation_config` will throw an attribute error.
- Slow tokenizer files (`tokenization_<model>.py`) will be removed in favor of the fast tokenizer files (`tokenization_<model>_fast.py`), which will be renamed to `tokenization_<model>.py`. As fast tokenizers are 🤗 `tokenizers`-backed, they include a wider range of features that are maintainable and reliable.
- `encode_plus` --> `__call__`
- `batch_decode` --> `decode`
- `apply_chat_template` previously returned naked `input_ids` by default rather than a `BatchEncoding` dict. This was inconvenient - it should return a `BatchEncoding` dict like `tokenizer.__call__()`, but we were stuck with
it for backward compatibility. The method now returns a `BatchEncoding`.
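To illustrate the `rope_parameters` layout, here is a sketch with made-up values; the per-layer-type keys mirror the Gemma3/ModernBert-style nesting, but the exact key names and numbers are assumptions:

```python
# Simple case: one flat dict of RoPE parameters (values are illustrative).
rope_parameters = {"rope_type": "default", "rope_theta": 10_000.0}

# Nested case (models with a different RoPE per layer type); key names assumed.
nested_rope_parameters = {
    "full_attention": {"rope_type": "default", "rope_theta": 1_000_000.0},
    "sliding_attention": {"rope_type": "default", "rope_theta": 10_000.0},
}

# config.rope_theta is no longer accessible; read the value from the dict instead.
theta = nested_rope_parameters["sliding_attention"]["rope_theta"]
```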
Linked PRs:
- Processor attributes are now serialized in `processor_config.json` as a nested dict, instead of serializing attributes in their own config files. Loading will be supported for all old-format processors (https://github.com/huggingface/transformers/pull/41474)
- `XXXFeatureExtractor` classes are completely removed in favor of the `XXXImageProcessor` class for all vision models (https://github.com/huggingface/transformers/pull/41174)
- `XXXFastImageProcessorKwargs` is removed in favor of `XXXImageProcessorKwargs`, which will be shared between fast and slow processors (https://github.com/huggingface/transformers/pull/40931)
- `RotaryEmbeddings` layers will start returning a dict of tuples in case the model uses several RoPE configurations (Gemma2, ModernBert). Each value will be a tuple of "cos, sin" per RoPE type.
- Parameterization of the `RotaryEmbeddings` layer will be unified and accessed via `config.rope_parameters`. The config attribute `rope_theta` might not be accessible anymore for some models, and will instead live in `config.rope_parameters['rope_theta']`. BC will be supported for a while as much as possible, and in the near future we'll gradually move to the new RoPE format (https://github.com/huggingface/transformers/pull/39847)
- Multimodal models no longer expose `model.language_model` directly. It is recommended to either access the module with `model.model.language_model` or `model.get_decoder()`. See #42156
- Generate output classes have been consolidated (e.g., `GreedySearchEncoderDecoderOutput` is removed). We now only have 4 output classes built from the following matrix: decoder-only vs encoder-decoder, uses beams vs doesn't use beams (https://github.com/huggingface/transformers/pull/40998)
- If `generate` doesn't receive any KV cache argument, the default cache class used is now defined by the model (as opposed to always being `DynamicCache`) (https://github.com/huggingface/transformers/pull/41505)
- If generation parameters are stored in `config.json` for an old model, they will be loaded back into the model's generation config. Users are expected to access or modify generation parameters only through `model.generation_config`, e.g. `model.generation_config.do_sample = True`.

**`compute_loss_func` Handling**

- `compute_loss_func` now always takes priority over the model's built-in loss computation, giving users consistent control over custom loss functions.

**`num_items_in_batch` in Prediction Step**

- The `num_items_in_batch` argument is now passed to `compute_loss` during `prediction_step`, enabling proper loss scaling during evaluation.

**`report_to` now defaults to `"none"`**
The following arguments were removed from `TrainingArguments` due to low usage:

- `mp_parameters` -> legacy param that was later on added to the Sagemaker trainer
- `_n_gpu` -> not intended for users to set; we will initialize it correctly instead of putting it in the `TrainingArguments`
- `overwrite_output_dir` -> replaced by `resume_from_checkpoint`; it was only used in the example scripts, no impact on `Trainer`
- `logging_dir` -> only used for tensorboard; set the `TENSORBOARD_LOGGING_DIR` env var instead
- `jit_mode_eval` -> use `use_torch_compile` instead, as torchscript is not recommended anymore
- `tpu_num_cores` -> it is actually better to remove it, as it is not recommended to set the number of cores; by default, all TPU cores are used. Set the `TPU_NUM_CORES` env var instead
- `past_index` -> it was only used for a very small number of models with special architectures like TransformerXL, and it was not documented at all how to train those models
- `ray_scope` -> only a minor arg for the Ray integration. Set the `RAY_SCOPE` env var instead
- `warmup_ratio` -> use `warmup_steps` instead. We combined both args together by allowing float values in `warmup_steps`.

The following deprecated arguments were removed or renamed in `TrainingArguments`:

- `fsdp_min_num_params` and `fsdp_transformer_layer_cls_to_wrap` -> use `fsdp_config`
- `tpu_metrics_debug` -> `debug`
- `push_to_hub_token` -> `hub_token`
- `push_to_hub_model_id` and `push_to_hub_organization` -> `hub_model_id`
- `include_inputs_for_metrics` -> `include_for_metrics`
- `per_gpu_train_batch_size` -> `per_device_train_batch_size`
- `per_gpu_eval_batch_size` -> `per_device_eval_batch_size`
- `use_mps_device` -> mps will be used by default if detected
- `fp16_backend` and `half_precision_backend` -> we will only rely on `torch.amp` as everything has been upstreamed to torch
- `no_cuda` -> `use_cpu`
- `include_tokens_per_second` -> `include_num_input_tokens_seen`
- `use_legacy_prediction_loop` -> we only use the `evaluation_loop` function from now on

Renamed in `Trainer`:

- `tokenizer` in initialization -> `processing_class`
- `model_path` in `train()` -> `resume_from_checkpoint`

Other `Trainer` changes:

- `use_cache` in the model config will be set to `False`.
You can still change the cache value through the `TrainingArguments` `use_cache` argument if needed.

`PushToHubMixin` changes:

- Removed `organization` and `repo_url` from `PushToHubMixin`. You must pass a `repo_id` instead.
- Removed `ignore_metadata_errors` from `PushToHubMixin`. In practice, if we ignore errors while loading the model card, we won't be able to push the card back to the Hub, so it's better to fail early and not provide the option to fail later.
- `push_to_hub` does not accept `**kwargs` anymore. All accepted parameters are explicitly documented.
- Arguments to `push_to_hub` are now keyword-only to avoid confusion. Only `repo_id` can be positional since it's the main arg.
- Removed the `use_temp_dir` argument from `push_to_hub`. We now use a tmp dir in all cases.

Linked PR: https://github.com/huggingface/transformers/pull/42391.
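The merge of `warmup_ratio` into the step-based warmup argument can be sketched as a simple resolution rule. This is illustrative only, not the actual `Trainer` source; the function name is made up:

```python
# Hypothetical resolution rule: floats in (0, 1) are treated as a ratio of
# total training steps, anything else as an absolute step count.
def resolve_warmup_steps(warmup_steps, total_training_steps: int) -> int:
    if isinstance(warmup_steps, float) and 0.0 < warmup_steps < 1.0:
        return int(total_training_steps * warmup_steps)
    return int(warmup_steps)
```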
The previously deprecated `transformers-cli ...` command has been removed; `transformers ...` is now the only CLI entry point.
transformers CLI has been migrated to Typer, making it easier to maintain + adding some nice features out of
the box (improved --help section, autocompletion).
Biggest breaking change is in transformers chat. This command starts a terminal UI to interact with a chat model.
It used to also be able to start a Chat Completion server powered by transformers and chat with it. In this revamped
version, this feature has been removed in favor of transformers serve. The goal of splitting transformers chat
and transformers serve is to define clear boundaries between client and server code. It helps with maintenance
but also makes the commands less bloated. The new signature of transformers chat is:
Usage: transformers chat [OPTIONS] BASE_URL MODEL_ID [GENERATE_FLAGS]...
Chat with a model from the command line.
It works hand in hand with transformers serve, which means that if transformers serve is running on its default endpoint, transformers chat can be launched as follows:
transformers chat HuggingFaceTB/SmolLM3-3B
It can however use any OpenAI API compatible HTTP endpoint:
transformers chat HuggingFaceTB/SmolLM3-3B https://router.huggingface.co/v1
Linked PRs:
**Removed `run` command**

The `transformers run` command (previously `transformers-cli run`) is an artefact of the past: it was neither documented nor tested.
We're removing it for now; please let us know if this is a command you are using, in which case we should bring it back with better support.
Linked PR: https://github.com/huggingface/transformers/pull/42447
- `TRANSFORMERS_CACHE`, `PYTORCH_TRANSFORMERS_CACHE`, and `PYTORCH_PRETRAINED_BERT_CACHE` have been removed. Please use `HF_HOME` instead.
- `HUGGINGFACE_CO_EXAMPLES_TELEMETRY`, `HUGGINGFACE_CO_PREFIX`, and `HUGGINGFACE_CO_RESOLVE_ENDPOINT` have been removed. Please use `huggingface_hub.constants.ENDPOINT` instead.

Linked PR: https://github.com/huggingface/transformers/pull/42391.
transformers v5 pins the huggingface_hub version to >=1.0.0. See this migration guide to learn more about this major release. Here are the main aspects to know about:
- `huggingface_hub` switched its HTTP backend from `requests` to `httpx`. This change was made to improve performance and to support both synchronous and asynchronous requests the same way. If you are currently catching `requests.HTTPError` errors in your codebase, you'll need to switch to `httpx.HTTPError`.
- Proxies are configured through the `HTTP_PROXY` / `HTTPS_PROXY` environment variables.
- `hf_transfer`, and therefore `HF_HUB_ENABLE_HF_TRANSFER`, have been completely dropped in favor of `hf_xet`. This should be transparent for most users. Please let us know if you notice any downside!
- `typer-slim` has been added as a required dependency, used to implement both the `hf` and `transformers` CLIs.
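If your code catches download errors, the switch looks roughly like this. A minimal sketch: the `HubHTTPError` alias, the `ImportError` fallback, and `safe_call` are ours, added only so the snippet runs even without `httpx` installed:

```python
try:
    import httpx
    HubHTTPError = httpx.HTTPError  # v5: catch httpx errors instead of requests ones
except ImportError:
    class HubHTTPError(Exception):  # fallback so this sketch runs without httpx
        pass

def safe_call(fn):
    # Wrap a hub call and turn HTTP failures into a readable message.
    try:
        return fn()
    except HubHTTPError as err:
        return f"request failed: {err}"
```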
The Code World Model (CWM) model was proposed in CWM: An Open-Weights LLM for Research on Code Generation with World Models by Meta FAIR CodeGen Team. CWM is an LLM for code generation and reasoning about code that has, in particular, been trained to better represent and reason about how code and commands affect the state of a program or system. Specifically, we mid-trained CWM on a large number of observation-action trajectories from Python execution traces and agentic interactions in containerized environments. We post-trained with extensive multi-task RL in verifiable coding, math, and multi-turn software engineering environments.
SAM3 (Segment Anything Model 3) was introduced in SAM 3: Segment Anything with Concepts.
The SAM3 addition adds four new architectures:
SAM3 performs Promptable Concept Segmentation (PCS) on images. PCS takes text and/or image exemplars as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept.
Sam3Tracker and Sam3TrackerVideo perform Promptable Visual Segmentation (PVS) on images. PVS takes interactive visual prompts (points, boxes, masks) or text inputs to segment a specific object instance per prompt. This is the task that SAM 1 and SAM 2 focused on, and SAM 3 improves upon it. Sam3Tracker and Sam3TrackerVideo are updated versions of SAM2 Video that maintain the same API while providing improved performance and capabilities.
SAM3 Video performs Promptable Concept Segmentation (PCS) on videos. PCS takes text as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames. The model combines a detection module (SAM3) with a tracking module (SAM2-style tracker) to enable robust object tracking across video frames using text prompts.
LFM2-MoE is a Mixture-of-Experts (MoE) variant of LFM2. The LFM2 family is optimized for on-device inference by combining short‑range, input‑aware gated convolutions with grouped‑query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.
LFM2‑MoE keeps this fast backbone and introduces sparse MoE feed‑forward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).
The VideoLLaMA3 model is a major update to VideoLLaMA2 from Alibaba DAMO Academy.
Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio–text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.
The model checkpoint is available at: nvidia/audio-flamingo-3-hf
Highlights:
NanoChat is a compact decoder-only transformer model designed for educational purposes and efficient training. The model features several fundamental architectural innovations which are common in modern transformer models. Therefore, it is a good model to use as a starting point to understand the principles of modern transformer models. NanoChat is a variant of the Llama architecture, with simplified attention mechanism and normalization layers.
- JetMoe Fix jetmoe after #40132 by @ArthurZucker in #41324
- gemma3 by @Sai-Suraj-27 in #41354
- PretrainedConfig to PreTrainedConfig by @Cyrilvallez in #41300
- [ModularChecker] QOL for the modular checker by @ArthurZucker in #41361
- [v5] Remove relative position embeddings (for bert like models) by @vasqu in #41170
- apply_chat_template by @Samoed in #41355
- test_longcat_generation_cpu by @ydshieh in #41368
- [CB] Refactors the way we access paged by @ArthurZucker in #41370
- [v5] Sync Bert and Bart eager attention by @vasqu in #41248
- TypeError exception for invalid type by @Sai-Suraj-27 in #41346
- update_device_map for GPTQ quantizer by @Sai-Suraj-27 in #41328
- prune_heads by @gante in #41417
- [JetMoe] Fix KV head repetition and padding free by @vasqu in #41423
- JetMoeIntegrationTest by @ydshieh in #41377
- past_key_value in BERT-like models by @zucchini-nlp in #41448
- utils/tf_ops/ by @gante in #41402
- [Attention Masks] Bidirectional masks for encoder and encoder-decoder models by @vasqu in #41265
- past_index by @SunMarc in #41384
- report_to default changed to "none" + cleaning deprecated env var by @SunMarc in #41375
- overwrite_output_dir by @SunMarc in #41323
- [CI] Fix copies on main by @vasqu in #41486
- jit_mode_eval by @SunMarc in #41376
- local_rank arg from TrainingArguments by @SunMarc in #41382
- pickle - BloomTokenizerFast by @ydshieh in #41466
- glm4v by @Sai-Suraj-27 in #41483
- truncation to False in Qwen3Omni to avoid default truncation by @BakerBunker in #41473
- local_rank deletion and some cleaning by @SunMarc in #41504
- tpu_num_cores by @SunMarc in #41383
- HunYuanMoEV1IntegrationTest:test_model_generation by @ydshieh in #41373
- generate delegates default cache initialization to the model by @gante in #41505
- [from_pretrained] Small refactor from_pretrained: move around unrelated stuff by @ArthurZucker in #41445
- transformers serve by @LysandreJik in #41446
- logits_to_keep to many older CausalLM models by @philiproeleveld in #41335
- torch.compile recompiled part of th… by @sywangyi in #41558
- [Docs] Fix changed references by @vasqu in #41614
- expand_device_map instead of redefining it by @Cyrilvallez in #41608
- tp_plan in from_pretrained directly by @Cyrilvallez in #41435
- [Executorch] Simplify for encoder models by @vasqu in #41627
- [Ernie 4.5 Moe] Fix Moe and offloading by @vasqu in #41385
- [Masks] Fix mask handling in eager for vision models by @vasqu in #41625
- utils/check_bad_commit.py by @ydshieh in #41658
- use_cache default to False by @SunMarc in #41585
- chat_extras.md to Korean by @Judy-Choi in #39863
- big_bird.md to Korean by @ssum21 in #40445
- code_llama.md to Korean by @Judy-Choi in #40558
- ko-LFM2.md to Korean by @ssum21 in #41502
- use_auth_token parameter by @Wauplin in #41666
- [Attn] Allow dynamic causality in SDPA via Kwargs by @vasqu in #41692
- run_name docs in TrainingArguments by @tobiasofsn in #41705
- utils/check_bad_commit.py by @ydshieh in #41658
- videos from image processing classes by @zucchini-nlp in #41607
- @staticmethod from module-level get_device_and_memory_breakdown by @albertvillanova in #41747
- [Onnx docs] Remove some traces by @vasqu in #41791
- utils/check_bad_commit.py by @ydshieh in #41658
- [Clip] Fix masking and enable flash attention on all model types by @vasqu in #41750
- test_tensor_parallel.py by @3outeille in #41918
- detectron2 installation in docker files by @ydshieh in #41975
- autoawq[kernels] installation in quantization docker file by @ydshieh in #41978
- torchcodec version in quantization docker file by @ydshieh in #41988
- run slow v2: empty report when there is only one model by @ydshieh in #42002
- torch+deepspeed docker file by @ydshieh in #41985
- logging_dir by @SunMarc in #42013
- deeepspeed in AMD docker file by @ydshieh in #42025
- huggingface_hub dependency version by @hanouticelina in #42033
- pr_slow_ci_suggestion.yml after #42023 by @ydshieh in #42049
- Argument list too long in pr_slow_ci_suggestion.yml by @ydshieh in #42061
- setattr as well by @zucchini-nlp in #41808
- [Attn Masks] Non-vmap default for attention masks by @vasqu in #41852
- image_transforms.py by @yaswanth19 in #42044
- prepare_inputs_for_generation cache slicing condition by @albertvillanova in #41764
- [T5Gemma] Fix cross attention cache by @vasqu in #41890
- streaming by @McPatate in #42102
- pytest<9 for now by @ydshieh in #42162
- [Pop2Piano] Fix cache usage by @vasqu in #42170
- [PEFT] Fix prefix tuning by @vasqu in #41696
- FqnToConfig by @jcaip in #41894
- [PEFT] Fix the general test for prefix tuning by @vasqu in #42185
- [Pop2Piano] Fix tied weights by @vasqu in #42193
- [BLT] Fix cache usage by @vasqu in #42188
- test_dynamic_cache_exportability_multiple_run (failing on torch 2.10 nightly) by @ydshieh in #42212
- AttentionMaskConverter._unmask_unattended for xpu device before by @kaixuanliu in #42230
- base_model by @zucchini-nlp in #41589
- batch_size by @ydshieh in #42213
- batch_size" by @ydshieh in #42258
- get_decoder() for multimodal and delete redundant code 🔪 by @zucchini-nlp in #42156
- cwm by @ydshieh in #42261
- torch.get_autocast_dtype instead of torch.get_autocast_gpu_dtype by @qgallouedec in #42055
- WhisperFeatureExtractor by @TopCoder2K in #42286
- [CI] Skip EfficientLoFTR test by @vasqu in #42327
- [Attn Masks] Lift bidirectional mask restriction on eager by @vasqu in #42325
- torch.distributed imports by @Cyrilvallez in #42361
- [Attn Masks] Add skip option for non-packed sequences by @vasqu in #42367
- [Mistral Tokenizers] Fix tokenizer detection by @vasqu in #42389
- get_encoder() by @zucchini-nlp in #42295
- [FA] Cleanup loading logic by @vasqu in #41427
- huggingface_hub constants + cleanup in PushToHubMixin by @Wauplin in #42391
- [CI] Add to run slow by @vasqu in #42459
- transformers chat launched without base_url has a direct tie to localhost:8000 by @LysandreJik in #42463
- rotary_partial_emb to RopeParams and delete unnecessary code 🔪 by @zucchini-nlp in #42255
- add_prefix_space default value by @SunMarc in #42481

The following contributors have made significant changes to the library over the last release:
- JetMoe Fix jetmoe after #40132 (#41324)
- [ModularChecker] QOL for the modular checker (#41361)
- [CB] Refactors the way we access paged (#41370)
- [from_pretrained] Small refactor from_pretrained: move around unrelated stuff (#41445)
- [v5] Remove relative position embeddings (for bert like models) (#41170)
- [v5] Sync Bert and Bart eager attention (#41248)
- [JetMoe] Fix KV head repetition and padding free (#41423)
- [Attention Masks] Bidirectional masks for encoder and encoder-decoder models (#41265)
- [CI] Fix copies on main (#41486)
- [Docs] Fix changed references (#41614)
- [Executorch] Simplify for encoder models (#41627)
- [Ernie 4.5 Moe] Fix Moe and offloading (#41385)
- [Masks] Fix mask handling in eager for vision models (#41625)
- [Attn] Allow dynamic causality in SDPA via Kwargs (#41692)
- [Onnx docs] Remove some traces (#41791)
- [Clip] Fix masking and enable flash attention on all model types (#41750)
- [Attn Masks] Non-vmap default for attention masks (#41852)
- [T5Gemma] Fix cross attention cache (#41890)
- [Pop2Piano] Fix cache usage (#42170)
- [PEFT] Fix prefix tuning (#41696)
- [PEFT] Fix the general test for prefix tuning (#42185)
- [Pop2Piano] Fix tied weights (#42193)
- [BLT] Fix cache usage (#42188)
- [CI] Skip EfficientLoFTR test (#42327)
- [Attn Masks] Lift bidirectional mask restriction on eager (#42325)
- [Attn Masks] Add skip option for non-packed sequences (#42367)
- [Mistral Tokenizers] Fix tokenizer detection (#42389)
- [FA] Cleanup loading logic (#41427)
- [CI] Add to run slow (#42459)
- test_longcat_generation_cpu (#41368)
- JetMoeIntegrationTest (#41377)
- pickle - BloomTokenizerFast (#41466)
- HunYuanMoEV1IntegrationTest:test_model_generation (#41373)
- utils/check_bad_commit.py (#41658)
- utils/check_bad_commit.py (#41658) (#41690)
- utils/check_bad_commit.py (#41658) (#41815)
- detectron2 installation in docker files (#41975)
- autoawq[kernels] installation in quantization docker file (#41978)
- torchcodec version in quantization docker file (#41988)
- run slow v2: empty report when there is only one model (#42002)
- torch+deepspeed docker file (#41985)
- deeepspeed in AMD docker file (#42025)
- pr_slow_ci_suggestion.yml after #42023 (#42049)
- Argument list too long in pr_slow_ci_suggestion.yml (#42061)
- pytest<9 for now (#42162)
- test_dynamic_cache_exportability_multiple_run (failing on torch 2.10 nightly) (#42212)
- batch_size (#42213)
- batch_size" (#42258)
- cwm (#42261)
- prune_heads (#41417)
- utils/tf_ops/ (#41402)
- generate delegates default cache initialization to the model (#41505)
- use_auth_token parameter (#41666)
- huggingface_hub constants + cleanup in PushToHubMixin (#42391)
- logits_to_keep to many older CausalLM models (#41335)

There was a hidden bug when loading models with `local_files_only=True` and a typo related to the recent patch.
The main fix is: https://github.com/huggingface/transformers/commit/b6055550a15a8fab367cf983b743ff68cc58d81a.
We are really sorry that this slipped through, our CIs just did not catch it.
As it affects a lot of users, we are going to yank the previous release.
This patch most notably fixes an issue on some Mistral tokenizers. It contains the following commits:
- `@staticmethod` from module-level `get_device_and_memory_breakdown` (#41747)

This patch most notably fixes an issue with an optional dependency (optax), which resulted in parsing errors with poetry. It contains the following fixes: