The Qwen3-Next series represents the Qwen team's next-generation foundation models, optimized for extreme context length and large-scale parameter efficiency. The series introduces a suite of architectural innovations designed to maximize performance while minimizing computational cost:
Built on this architecture, they trained and open-sourced Qwen3-Next-80B-A3B — 80B total parameters, only 3B active — achieving extreme sparsity and efficiency.
Despite its ultra-efficiency, it outperforms Qwen3-32B on downstream tasks — while requiring less than 1/10 of the training cost. Moreover, it delivers over 10x higher inference throughput than Qwen3-32B when handling contexts longer than 32K tokens.
For more details, please see the Qwen3-Next blog post.
VaultGemma is a text-only decoder model derived from Gemma 2. Notably, it drops the norms after the Attention and MLP blocks, and uses full attention in all layers instead of alternating between full attention and local sliding attention. VaultGemma is available as a pretrained model with 1B parameters and a 1024-token sequence length.
VaultGemma was trained from scratch with sequence-level differential privacy (DP). Its training data includes the same mixture as the Gemma 2 models, consisting of a number of documents of varying lengths. Additionally, it is trained using DP stochastic gradient descent (DP-SGD) and provides a (ε ≤ 2.0, δ ≤ 1.1e-10)-sequence-level DP guarantee, where a sequence consists of 1024 consecutive tokens extracted from heterogeneous data sources. Specifically, the privacy unit of the guarantee is for the sequences after sampling and packing of the mixture.
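DP-SGD, which the guarantee above relies on, clips each per-example (here, per-sequence) gradient to a fixed L2 norm and adds calibrated Gaussian noise before the parameter update. A minimal plain-Python sketch of one such step (illustrative only, not VaultGemma's training code; gradients are flat lists):

```python
import math
import random

def dp_sgd_step(params, per_example_grads, clip_norm, noise_multiplier, lr, rng):
    """One DP-SGD step (sketch): clip each per-example gradient to
    L2 norm <= clip_norm, sum them, add Gaussian noise with standard
    deviation clip_norm * noise_multiplier per coordinate, average,
    and apply the gradient update."""
    n = len(per_example_grads)
    summed = [0.0] * len(params)
    for grad in per_example_grads:
        norm = math.sqrt(sum(g * g for g in grad))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0  # clipping factor
        for i, g in enumerate(grad):
            summed[i] += g * scale
    noisy_avg = [
        (s + rng.gauss(0.0, clip_norm * noise_multiplier)) / n for s in summed
    ]
    return [p - lr * g for p, g in zip(params, noisy_avg)]
```

The (ε, δ) bound then follows from privacy accounting over the noise multiplier, the sampling rate, and the number of steps.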
Qwen3-VL is a multimodal vision-language model series, encompassing both dense and MoE variants, as well as Instruct and Thinking versions.
Building upon its predecessors, Qwen3-VL delivers significant improvements in visual understanding while maintaining strong pure text capabilities. Key architectural advancements include: enhanced MRope with interleaved layout for better spatial-temporal modeling, DeepStack integration to effectively leverage multi-level features from the Vision Transformer (ViT), and improved video understanding through text-based time alignment—evolving from T-RoPE to text timestamp alignment for more precise temporal grounding.
These innovations collectively enable Qwen3-VL to achieve superior performance in complex multimodal tasks.
The LongCatFlash model was proposed in LongCat-Flash Technical Report by the Meituan LongCat Team. LongCat-Flash is a 560B parameter Mixture-of-Experts (MoE) model that activates 18.6B-31.3B parameters dynamically (average ~27B). The model features a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and advanced reasoning capabilities.
The abstract from the paper is the following:
We present LongCat-Flash, a 560 billion parameter Mixture-of-Experts (MoE) language model featuring a dynamic computation mechanism that activates 18.6B-31.3B parameters based on context (average ~27B). The model incorporates a shortcut-connected architecture enabling high inference speed (>100 tokens/second) and demonstrates strong performance across multiple benchmarks including 89.71% accuracy on MMLU and exceptional agentic tool use capabilities.
FlexOlmo is a new class of language models (LMs) that supports (1) distributed training without data sharing, where different model parameters are independently trained on closed datasets, and (2) data-flexible inference, where these parameters along with their associated data can be flexibly included or excluded from model inferences with no further training. FlexOlmo employs a mixture-of-experts (MoE) architecture where each expert is trained independently on closed datasets and later integrated through a new domain-informed routing without any joint training. FlexOlmo is trained on FlexMix, a corpus we curate comprising publicly available datasets alongside seven domain-specific sets, representing realistic approximations of closed sets.
You can find all the original FlexOlmo checkpoints under the FlexOlmo collection.
LFM2-VL is the first series of vision-language foundation models developed by Liquid AI. These multimodal models are designed for low-latency, device-aware deployment. LFM2-VL extends the LFM2 family of open-weight Liquid Foundation Models (LFMs) into the vision-language space, supporting both text and image inputs with variable resolutions.
LFM2-VL consists of three main components: a language model backbone, a vision encoder, and a multimodal projector. LFM2-VL builds upon the LFM2 backbone, inheriting from either LFM2-1.2B (for LFM2-VL-1.6B) or LFM2-350M (for LFM2-VL-450M). For the vision tower, LFM2-VL uses SigLIP2 NaFlex encoders to convert input images into token sequences. Two variants are implemented:
The encoder processes images at their native resolution up to 512×512 pixels, efficiently handling smaller images without upscaling and supporting non-standard aspect ratios without distortion. Larger images are split into non-overlapping square patches of 512×512 each, preserving detail. In LFM2-VL-1.6B, the model also receives a thumbnail (a small, downscaled version of the original image capturing the overall scene) to enhance global context understanding and alignment. Special tokens mark each patch’s position and indicate the thumbnail’s start. The multimodal connector is a 2-layer MLP connector with pixel unshuffle to reduce image token count.
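The tiling arithmetic described above can be sketched as follows; the function name and return structure are illustrative assumptions, not LFM2-VL's actual API:

```python
import math

TILE = 512  # maximum native resolution handled by the encoder (pixels)

def image_patch_plan(width, height, use_thumbnail=False):
    """Sketch of the tiling scheme: images up to 512x512 are kept at
    native resolution; larger images are split into non-overlapping
    512x512 patches via ceiling division. Optionally one extra
    thumbnail patch carries global context (as in LFM2-VL-1.6B)."""
    if width <= TILE and height <= TILE:
        return {"grid": (1, 1), "patches": 1, "thumbnail": False}
    cols = math.ceil(width / TILE)
    rows = math.ceil(height / TILE)
    return {
        "grid": (cols, rows),
        "patches": cols * rows + (1 if use_thumbnail else 0),
        "thumbnail": use_thumbnail,
    }
```

So a 1024×768 image would be covered by a 2×2 grid of patches, plus a thumbnail when enabled.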
The BLT model was proposed in Byte Latent Transformer: Patches Scale Better Than Tokens by Artidoro Pagnoni, Ram Pasunuru, Pedro Rodriguez, John Nguyen, Benjamin Muller, Margaret Li, Chunting Zhou, Lili Yu, Jason Weston, Luke Zettlemoyer, Gargi Ghosh, Mike Lewis, Ari Holtzman, Srinivasan Iyer. BLT is a byte-level LLM that achieves tokenization-level performance through entropy-based dynamic patching.
The abstract from the paper is the following:
We introduce the Byte Latent Transformer (BLT), a new byte-level LLM architecture that, for the first time, matches tokenization-based LLM performance at scale with significant improvements in inference efficiency and robustness. BLT encodes bytes into dynamically sized patches, which serve as the primary units of computation. Patches are segmented based on the entropy of the next byte, allocating more compute and model capacity where increased data complexity demands it. We present the first flop controlled scaling study of byte-level models up to 8B parameters and 4T training bytes. Our results demonstrate the feasibility of scaling models trained on raw bytes without a fixed vocabulary. Both training and inference efficiency improve due to dynamically selecting long patches when data is predictable, along with qualitative improvements on reasoning and long tail generalization. Overall, for fixed inference costs, BLT shows significantly better scaling than tokenization-based models, by simultaneously growing both patch and model size.
Dual Model Architecture: BLT consists of two separately trained models:
Dynamic Patching: The model uses entropy-based dynamic patching where:
Local Encoder: Processes byte sequences with cross-attention to patch embeddings
Global Transformer: Processes patch-level representations with full attention across patches
Local Decoder: Generates output with cross-attention back to the original byte sequence
Byte-Level Tokenizer: Unlike traditional tokenizers that use learned vocabularies, BLT's tokenizer simply converts text to UTF-8 bytes and maps each byte to a token ID. There is no need for a vocabulary.
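The entropy-based patching idea above can be sketched as follows, using toy next-byte distributions in place of the trained entropy model; the names and threshold value are illustrative:

```python
import math

def entropy(dist):
    """Shannon entropy (in bits) of a next-byte distribution."""
    return -sum(p * math.log2(p) for p in dist if p > 0)

def segment_into_patches(byte_seq, next_byte_dists, threshold):
    """Sketch of entropy-based dynamic patching: walk the byte
    sequence and start a new patch whenever the entropy of the
    predicted next-byte distribution exceeds the threshold, so
    hard-to-predict regions get shorter patches (more compute)."""
    patches, current = [], []
    for byte, dist in zip(byte_seq, next_byte_dists):
        if current and entropy(dist) > threshold:
            patches.append(current)
            current = []
        current.append(byte)
    if current:
        patches.append(current)
    return patches
```

The byte-level "tokenizer" itself then reduces to `list(text.encode("utf-8"))`: each UTF-8 byte is its own ID, with no vocabulary.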
The Qwen2.5-Omni model is a unified multiple modalities model proposed in Qwen2.5-Omni Technical Report from Qwen team, Alibaba Group.
- Use [Qwen2_5OmniForConditionalGeneration] to generate audio and text output. To generate only one output type, use [Qwen2_5OmniThinkerForConditionalGeneration] for text-only and [Qwen2_5OmniTalkerForConditionalGeneration] for audio-only outputs.
- [Qwen2_5OmniForConditionalGeneration] supports only a batch size of 1 at the moment.
- Image resolution can be capped via processor.max_pixels. By default the maximum is set to a very large value, and high-resolution visuals will not be resized unless their resolution exceeds processor.max_pixels.
- Use the [~ProcessorMixin.apply_chat_template] method to convert chat messages to model inputs.

Parakeet models, introduced by NVIDIA NeMo, combine a Fast Conformer encoder with a connectionist temporal classification (CTC), recurrent neural network transducer (RNNT), or token-and-duration transducer (TDT) decoder for automatic speech recognition.
Model Architecture
(See [ParakeetEncoder] for the encoder implementation and details.)

The EdgeTAM model was proposed in EdgeTAM: On-Device Track Anything Model by Chong Zhou, Chenchen Zhu, Yunyang Xiong, Saksham Suri, Fanyi Xiao, Lemeng Wu, Raghuraman Krishnamoorthi, Bo Dai, Chen Change Loy, Vikas Chandra, Bilge Soran.
EdgeTAM is an efficient adaptation of SAM 2 that introduces a 2D Spatial Perceiver architecture to optimize memory attention mechanisms for real-time video segmentation on mobile devices.
More details to come soon :eyes:
We are introducing Continuous Batching (CB) in this release, and we consider it a stable feature. The main use case for CB is batched generation, which makes it very efficient in the context of GRPO training or evaluation. Thanks to CB, researchers and model developers are now free to use transformers in these contexts without having to spin up an additional inference engine.
CB currently supports both full attention and sliding window attention: this means that the vast majority of models are supported, like llama, gemma3, gpt-oss.
CB is also integrated with transformers serve, which means that you can deploy transformers as an OpenAI-compatible HTTP server. Here is a small snippet on how to use it:
import datasets
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.generation import GenerationConfig
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-4B-Instruct-2507", dtype=torch.bfloat16, _attn_implementation="sdpa_paged", device_map="auto"
)
model.generation_config.max_new_tokens = 32
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B-Instruct-2507", padding_side="left")
dataset = datasets.load_dataset("openai/gsm8k", "socratic", split="test")
tokenized_datasets = dataset.map(lambda x: tokenizer(x["question"]), batched=True)
simple_batch_inputs = [item["input_ids"] for item in tokenized_datasets]
batch_outputs = model.generate_batch(inputs=simple_batch_inputs)
for request in batch_outputs:
    print(tokenizer.decode(batch_outputs[request].generated_tokens))
"""
Let's break down the problem step by step:
1. **Total eggs laid per day**:
Janet’s ducks lay **16 eggs per day**
Let's break down the problem step by step:
1. **Blue fiber**: The robe takes **2 bolts** of blue fiber.
2. **White fiber
To determine Josh's profit from flipping the house, let's go step by step.
---
### Step 1: Initial cost of the house
Josh buys the
To find the total distance James runs in a week, we can break down the problem step by step:
1. **Sprints per session**: James runs
To determine how many cups of feed Wendi needs to give her chickens in the final meal of the day, let's go step by step.
"""
- check_model_inputs in core VLMs by @zucchini-nlp in #40342
- input_feature length and attention_mask length in WhisperFeatureExtractor by @BakerBunker in #39221
- _prepare_generation_config by @manueldeprada in #40715
- center_crop fast equivalent to slow by @yonigozlan in #40856
- pytest-rerunfailures<16.0 by @ydshieh in #40561
- test_all_params_have_gradient=False for DeepseekV2ModelTest by @ydshieh in #40566
- test_eager_matches_sdpa_inference not run for CLIP by @ydshieh in #40581
- remi-or to run-slow by @ydshieh in #40590
- get_*_features methods + update doc snippets by @qubvel in #40555
- TvpImageProcessingTest::test_slow_fast_equivalence by @ydshieh in #40593
- siglip flaky test_eager_matches_sdpa_inference by @ydshieh in #40584
- [Tests] Fixup duplicated mrope logic by @vasqu in #40592
- TokenizerTesterMixin temporarily by @ydshieh in #40611
- transformers serve by @McPatate in #40479
- too many request caused by AutoModelTest::test_dynamic_saving_from_local_repo by @ydshieh in #40614
- JambaModelTest.test_load_balancing_loss by @ydshieh in #40617
- deepseek_v3.md to Korean by @ssum21 in #39649
- too many requests in TestMistralCommonTokenizer by @ydshieh in #40623
- test_prompt_lookup_decoding_matches_greedy_search for voxtral by @ydshieh in #40643
- LongformerModelTest::test_attention_outputs as flaky by @ydshieh in #40655
- custom_generate Callables and unify generation args structure by @manueldeprada in #40586
- check_determinism inside test_determinism by @ydshieh in #40661
- test_fast_is_faster_than_slow for Owlv2ImageProcessingTest by @ydshieh in #40663
- test_prompt_lookup_decoding_matches_greedy_search for qwen2_audio by @ydshieh in #40664
- GitModelTest::test_beam_search_generate by @ydshieh in #40666
- tolist instead of list comprehension calling .item() by @McPatate in #40646
- Aimv2ModelTest::test_eager_matches_sdpa_inference_04_fp16_pad_right_sdpa_kernels as flaky by @ydshieh in #40683
- T5GemmaModelTest::test_eager_matches_sdpa_inference being flaky by @ydshieh in #40702
- hf_hub_download by @ydshieh in #40710
- self in post-process methods by @framonmar7 in #40711
- or for grounding dino mask by @lmarshall12 in #40625
- [Gemma Embedding] Fix SWA by @vasqu in #40700
- VitMatteImageProcessingTest::test_fast_is_faster_than_slow by @ydshieh in #40713
- request_id to headers by @McPatate in #40722
- and/or_mask_function by @Cyrilvallez in #40753
- --continuous_batching by @McPatate in #40618
- continue_final_message in apply_chat_template to prevent substring matching issues by @abdokaseb in #40732
- public.cloud.experiment_url api error by @Zeyi-Lin in #40763
- PromptLookupCandidateGenerator won't generate forbidden tokens by @gante in #40726
- test_past_key_values_format and delete overwrites by @gante in #40701
- generate by @gante in #40375
- [Jetmoe] Fix RoPE by @vasqu in #40819
- self.loss_function by @qubvel in #40764
- test_modeling_common.py by @gante in #40854
- past_key_values by @gante in #40803
- rsqrt by @thalahors in #40848
- [VaultGemma] Update expectations in integration tests by @vasqu in #40855
- imageprocessor.md to Korean by @HyunZ118 in #39557
- Gemma3nAudioFeatureExtractionTest::test_dither by @ydshieh in #40902
- get_mask_sizes by @Cyrilvallez in #40907
- Glm4vIntegrationTest by @ydshieh in #40905
- runner_map by @ydshieh in #40880
- test_fast_is_faster_than_slow by @ydshieh in #40909
- Gemma3ForConditionalGeneration compatible with assisted generation by @gante in #40791
- image_sizes arg and deprecate vision_feature_layer by @yaswanth19 in #40832
- import torch.utils.checkpoint by @gante in #40934
- Glm4vMoeIntegrationTest by @ydshieh in #40930
- Glm4vModelTest::test_eager_matches_fa2_generate by @ydshieh in #40947
- test_speculative_generation by @ydshieh in #40949

The following contributors have made significant changes to the library over the last release:
- imageprocessor.md to Korean (#39557)

A new model is added to transformers: Vault-Gemma
It is added on top of the v4.56.1 release, and can be installed from the following tag: v4.56.1-Vault-Gemma-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.56.1-Vault-Gemma-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Vault-Gemma model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.57.0.
VaultGemma is a text-only decoder model derived from Gemma 2. Notably, it drops the norms after the Attention and MLP blocks, and uses full attention in all layers instead of alternating between full attention and local sliding attention. VaultGemma is available as a pretrained model with 1B parameters and a 1024-token sequence length.
VaultGemma was trained from scratch with sequence-level differential privacy (DP). Its training data includes the same mixture as the Gemma 2 models, consisting of a number of documents of varying lengths. Additionally, it is trained using DP stochastic gradient descent (DP-SGD) and provides a (ε ≤ 2.0, δ ≤ 1.1e-10)-sequence-level DP guarantee, where a sequence consists of 1024 consecutive tokens extracted from heterogeneous data sources. Specifically, the privacy unit of the guarantee is for the sequences after sampling and packing of the mixture.
The example below demonstrates how to chat with the model with pipeline:
from transformers import pipeline
pipe = pipeline(
    task="text-generation",
    model="google/vaultgemma-1b",
    dtype="auto",
    device_map="auto",
)
text = "Tell me an unknown interesting biology fact about the brain."
outputs = pipe(text, max_new_tokens=32)
response = outputs[0]["generated_text"]
print(response)
Or generate directly with the AutoModelForCausalLM class:
# pip install accelerate
from transformers import AutoTokenizer, AutoModelForCausalLM
model_id = "google/vaultgemma-1b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="auto", dtype="auto")
text = "Tell me an unknown interesting biology fact about the brain."
input_ids = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=32)
print(tokenizer.decode(outputs[0]))
or with transformers chat:
transformers chat google/vaultgemma-1b
This patch most notably fixes an issue with the new dtype argument (replacing torch_dtype) in pipelines!
A new model is added to transformers: Embedding Gemma
It is added on top of the v4.56.0 release, and can be installed from the following tag: v4.56.0-Embedding-Gemma-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.56.0-Embedding-Gemma-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the EmbeddingGemma model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.57.0.
Today, Google releases EmbeddingGemma, a state-of-the-art multilingual embedding model perfect for on-device use cases. Designed for speed and efficiency, the model features a compact size of 308M parameters and a 2K context window, unlocking new possibilities for mobile RAG pipelines, agents, and more. EmbeddingGemma is trained to support over 100 languages and is the highest-ranking text-only multilingual embedding model under 500M on the Massive Text Embedding Benchmark (MTEB) at the time of writing.
EmbeddingGemma can be found on the Hugging Face Hub. It is integrated in sentence-transformers, which depends on transformers.
See below for sentence-transformers examples using the model:
from sentence_transformers import SentenceTransformer
# Download from the 🤗 Hub
model = SentenceTransformer("google/embeddinggemma-300m")
# Run inference with queries and documents
query = "Which planet is known as the Red Planet?"
documents = [
    "Venus is often called Earth's twin because of its similar size and proximity.",
    "Mars, known for its reddish appearance, is often referred to as the Red Planet.",
    "Jupiter, the largest planet in our solar system, has a prominent red spot.",
    "Saturn, famous for its rings, is sometimes mistaken for the Red Planet."
]
query_embeddings = model.encode_query(query)
document_embeddings = model.encode_document(documents)
print(query_embeddings.shape, document_embeddings.shape)
# (768,) (4, 768)
# Compute similarities to determine a ranking
similarities = model.similarity(query_embeddings, document_embeddings)
print(similarities)
# tensor([[0.3011, 0.6359, 0.4930, 0.4889]])
# Convert similarities to a ranking
ranking = similarities.argsort(descending=True)[0]
print(ranking)
# tensor([1, 2, 3, 0])
DINOv3 is a family of versatile vision foundation models that outperforms the specialized state of the art across a broad range of settings, without fine-tuning. DINOv3 produces high-quality dense features that achieve outstanding performance on various vision tasks, significantly surpassing previous self- and weakly-supervised foundation models.
You can find all the original DINOv3 checkpoints under the DINOv3 collection.
<img width="814" height="658" alt="image" src="https://github.com/user-attachments/assets/740a5c3d-a5a1-45d9-9e4c-d9117837205d" />

The X-Codec model was proposed in Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model by Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, Wei Xue.
The X-Codec model is a neural audio codec that integrates semantic information from self-supervised models (e.g., HuBERT) alongside traditional acoustic information. This enables:
Ovis2 is an updated version of the Ovis model developed by the AIDC-AI team at Alibaba International Digital Commerce Group.
Ovis2 is the latest advancement in multi-modal large language models (MLLMs), succeeding Ovis1.6. It retains the architectural design of the Ovis series, which focuses on aligning visual and textual embeddings, and introduces major improvements in data curation and training methods.
<img src="https://cdn-uploads.huggingface.co/production/uploads/637aebed7ce76c3b834cea37/XB-vgzDL6FshrSNGyZvzc.png" width="600">

MetaCLIP 2 is a replication of the original CLIP model trained on 300+ languages. It achieves state-of-the-art (SOTA) results on multilingual benchmarks (e.g., XM3600, CVQA, Babel‑ImageNet), surpassing previous SOTA such as mSigLIP and SigLIP‑2. The authors show that English and non-English worlds can mutually benefit and elevate each other.
<img width="805" height="408" alt="image" src="https://github.com/user-attachments/assets/72eaa441-9362-4a6a-a834-f505d6727a2a" />

Florence-2 is an advanced vision foundation model that uses a prompt-based approach to handle a wide range of vision and vision-language tasks. Florence-2 can interpret simple text prompts to perform tasks like captioning, object detection, and segmentation. It leverages the FLD-5B dataset, containing 5.4 billion annotations across 126 million images, to master multi-task learning. The model's sequence-to-sequence architecture enables it to excel in both zero-shot and fine-tuned settings, proving to be a competitive vision foundation model.
<img width="864" height="565" alt="image" src="https://github.com/user-attachments/assets/d09dfe3a-6dda-45a3-8dd3-0254d8503b4e" />

SAM2 (Segment Anything Model 2) was proposed in Segment Anything in Images and Videos by Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, Eric Mintun, Junting Pan, Kalyan Vasudev Alwala, Nicolas Carion, Chao-Yuan Wu, Ross Girshick, Piotr Dollár, Christoph Feichtenhofer.
The model can be used to predict segmentation masks of any object of interest given an input image or video, and input points or bounding boxes.
<img width="960" height="540" alt="image" src="https://github.com/user-attachments/assets/0ab42e5c-6951-4cbc-9d5d-ff8bf0c2dbf1" />

The Kosmos-2.5 model was proposed in KOSMOS-2.5: A Multimodal Literate Model by Microsoft.
The abstract from the paper is the following:
We present Kosmos-2.5, a multimodal literate model for machine reading of text-intensive images. Pre-trained on large-scale text-intensive images, Kosmos-2.5 excels in two distinct yet cooperative transcription tasks: (1) generating spatially-aware text blocks, where each block of text is assigned its spatial coordinates within the image, and (2) producing structured text output that captures styles and structures into the markdown format. This unified multimodal literate capability is achieved through a shared Transformer architecture, task-specific prompts, and flexible text representations. We evaluate Kosmos-2.5 on end-to-end document-level text recognition and image-to-markdown text generation. Furthermore, the model can be readily adapted for any text-intensive image understanding task with different prompts through supervised fine-tuning, making it a general-purpose tool for real-world applications involving text-rich images. This work also paves the way for the future scaling of multimodal large language models.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/kosmos2_5_ocr.png" alt="drawing" width="600"/>
More information at release 🤗
Beyond a large refactor of the caching system in Transformers, making it much more practical and general, models using sliding window or chunked attention no longer waste memory when caching past states. This was made possible most notably by:
See the following improvements on memory usage for Mistral (using only sliding layers) and GPT-OSS (1 out of 2 layers is sliding) respectively: <img width="569" height="431" alt="image" src="https://github.com/user-attachments/assets/7f1688f4-b077-4840-a62c-bfa6131fe806" /> <img width="574" height="431" alt="image" src="https://github.com/user-attachments/assets/bb4a284f-961e-413d-b7e1-783bb5d8fb39" />
Beyond memory usage, it will also improve generation/forward speed by a large margin for large contexts, as only necessary states are passed to the attention computation, which is very sensitive to the sequence length.
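A back-of-the-envelope sketch of why sliding layers save memory: a full-attention layer caches keys/values for the whole context, while a sliding layer only ever needs the last `window` tokens. The function below is illustrative only; the dimensions are hypothetical, not any model's actual config:

```python
def kv_cache_bytes(context_len, n_layers_full, n_layers_sliding, window,
                   n_kv_heads, head_dim, bytes_per_elem=2):
    """Rough KV-cache size: 2 (K and V) * tokens_cached * n_kv_heads
    * head_dim * bytes_per_elem, summed over layers. Sliding layers
    cache at most `window` tokens; full layers cache the whole context."""
    per_token = 2 * n_kv_heads * head_dim * bytes_per_elem
    full = n_layers_full * context_len * per_token
    sliding = n_layers_sliding * min(context_len, window) * per_token
    return full + sliding
```

For a 131K-token context with a 4096-token window, a model whose layers are all sliding would cache 32x fewer KV states than the same model with all-full layers, matching the order of improvement shown in the plots above.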
Since the GPT-OSS release, which introduced the MXFP4 quantization type, several improvements have been made to its support, which should now stabilize.
- swiglu_limit not passed in for MXFP4 by @danielhanchen in #40197
- [Mxfp4] Add a way to save with a quantization method by @ArthurZucker in #40176

Now that we have deprecated TensorFlow and JAX, we felt that torch_dtype was not only misaligned with torch, but also redundant and hard to remember. For this reason, we switched to the much more standard dtype argument!
torch_dtype will still be a valid usage for as long as needed to ensure a smooth transition, but new code should use dtype, and we encourage you to update older code as well!
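This is not the actual transformers implementation, but a sketch of how such a backward-compatible rename is typically handled: prefer the new argument, accept the legacy one with a deprecation warning, and reject conflicting values:

```python
import warnings

def resolve_dtype(dtype=None, torch_dtype=None):
    """Sketch of a backward-compatible argument rename (illustrative,
    not the transformers code): `dtype` wins; `torch_dtype` still
    works but emits a FutureWarning; passing conflicting values for
    both is an error."""
    if torch_dtype is not None:
        warnings.warn(
            "`torch_dtype` is deprecated, use `dtype` instead.",
            FutureWarning,
        )
        if dtype is not None and dtype != torch_dtype:
            raise ValueError("Pass either `dtype` or `torch_dtype`, not both.")
        return torch_dtype
    return dtype
```

This pattern keeps old call sites working during the transition while nudging users toward the new spelling.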
The following commits are breaking changes in workflows that were either buggy or not working as expected.
On models where the hub checkpoint specifies cache_implementation="hybrid" (static sliding window hybrid cache), this value is now unset, so the model uses dynamic sliding window layers by default.
The old default caused widespread, very slow first generate calls on models with hybrid caches; this should no longer be the case.
- cache_implementation="hybrid" hub defaults by @gante in #40135

The computation of sine positional embeddings for MaskFormer is now cached, resulting in a 6% performance improvement.
Adds explicit cache initialization to prepare for the deprecation of the from_legacy_cache utility.
fullgraph=False

Having fullgraph set to True during compilation ended up being very restrictive, especially with the arrival of widely-used MoEs.
The DoLa decoding strategy has been moved to the following remote-code repository a few versions ago: https://huggingface.co/transformers-community/dola
The Contrastive Search decoding strategy has been moved to the following remote-code repository a few versions ago: https://huggingface.co/transformers-community/contrastive-search
Both have now been removed from the library as a result.
Flash Attention used sliding window sizes that were off by one. This affected generations whose initial context was larger than the sliding window size.
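To make the off-by-one concrete, here is a sketch of a causal sliding-window visibility rule; the convention shown (each query sees itself plus the previous window − 1 tokens) is an assumption for illustration, not the Flash Attention source:

```python
def sliding_window_allowed(query_pos, key_pos, window):
    """Causal sliding-window visibility: query at position q may
    attend key at position k iff k <= q and q - k < window, i.e.
    exactly `window` keys (itself plus window - 1 previous ones).
    An off-by-one such as `q - k <= window` would silently admit
    window + 1 keys instead."""
    return key_pos <= query_pos and query_pos - key_pos < window

def visible_keys(query_pos, window):
    """All key positions visible to the query under the rule above."""
    return [k for k in range(query_pos + 1)
            if sliding_window_allowed(query_pos, k, window)]
```

Errors like this only show up once the context exceeds the window, which is why short-context tests did not catch it.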
- [Flash Attention] Fix sliding window size by @vasqu in #40163

Torch 2.1 support has been unreliable for some time, so we've now made it official and bumped our minimum version to 2.2.
- GptOss fixes for green CI by @gante in #39929
- utils/check_bad_commit.py failing due to rate limit (requesting api.github.com) by @ydshieh in #39918
- torch.device('cpu').index being None by @manueldeprada in #39933
- torchcodec is updated by @ydshieh in #39951
- triton_kernels dep with kernels instead by @SunMarc in #39926
- fix_and_overwrite mode of utils/check_docstring.py by @manueldeprada in #39369
- find_file_type by @yonigozlan in #39897
- past_key_value to past_key_valueS everywhere by @Cyrilvallez in #39956
- notification_service.py about time_spent by @ydshieh in #40037
- notification_service.py about time_spent by @ydshieh in #40044
- torchcodec==0.5.0 and use torch 2.8 on daily CI by @ydshieh in #40072
- time_spent in notification_service.py by @ydshieh in #40081
- [GPT Big Code] Fix attention scaling by @vasqu in #40041
- ForConditionalGeneration by @qgallouedec in #39973
- is_fast to ImageProcessor by @MilkClouds in #39603
- logger.warning with logger.warning_once in GradientCheckpointingLayer by @qgallouedec in #40091
- [Flash Attention] Fix flash attention integration by @vasqu in #40002
- custom_generate collections by @gante in #39894
- tiny_agents.md to Korean by @AhnJoonSung in #39913
- content inputs for LLMs by @gante in #39829
- decoding_method argument in generate by @manueldeprada in #40085
- generation_config by @gante in #40127
- main_classes/processors.md to Korean by @TaskerJang in #39519
- jamba.md to Korean by @skwh54 in #39890
- main_classes/optimizer_schedules.md to Korean by @luckyvickyricky in #39713
- gpt2.md to Korean by @taemincode in #39808
- optimizers.md to Korean by @chelsseeey in #40011
- pipelines.md to Korean by @xhaktm00 in #39577
- gemma3.md to Korean by @seopp in #39865
- torch_compile_test and torch_export_test by @ydshieh in #39950
- self.tokenizer by self.processing_class by @qgallouedec in #40119
- too long with no output by @ydshieh in #40201
- model_input_names for PixtralImageProcessor by @rohitrango in #40226
- chat_template (jinja2) as an extra dependency by @tboerstad in #40128
- [CI] Fix repo consistency by @vasqu in #40249
- k_proj weight and bias slicing in D-FINE by @notkisk in #40257
- id=usage to <hfoptions> tag in LayoutLM model card by @Jin-HoMLee in #40273
- torch.compile tests with fullgraph=True by @ydshieh in #40164
- [FA] Fix dtype in varlen with position ids by @vasqu in #40295
- [fix] Pass adamw optimizer parameters to StableAdamW by @emapco in #40184
- find_executable_batch_size to match new 0.9 ratio by @MilkClouds in #40206
- [Flash Attention] Fix sliding window size by @vasqu in #40163
- _tp_plan attribute by @rishub-tamirisa in #39944
- natten by @ydshieh in #40287
- [GPT OSS] Refactor the tests as it was not properly checking the outputs by @ArthurZucker in #40288
- get_placeholder_mask in Ovis2 by @thisisiron in #40280
- /en/model_doc by @gante in #40311
- /en/model_doc by @gante in #40344
- test_spm_converter_bytefallback_warning by @ydshieh in #40284
- [FA] Fix some model tests by @vasqu in #40350
- label_names as an argument to TrainingArguments by @huzaifa-jawad367 in #40353
- skip_special_tokens in the main text generation pipelines by @gante in #40356
- dtype instead of torch_dtype everywhere! by @Cyrilvallez in #39782
- tokenizer_kwargs argument to the text generation pipeline by @Joshua-Chin in #40364
- transformers TF classes/methods by @gante in #40429
- models.md to Korean by @Judy-Choi in #39518
- main by @ydshieh in #40451
- qwen2_moe tests by @ydshieh in #40494
- merge to main by @ydshieh in #40503

The following contributors have made significant changes to the library over the last release:
- get_placeholder_mask in Ovis2 (#40280)

There was a mix-up on our side when cherry-picking commit #40197, which led to a wrong commit in the patch! Sorry everyone 😭
This patch is just the official fix for #40197!
Focused on stabilizing FlashAttention-2 on Ascend NPU, improving FSDP behavior for generic-task models, and fixing MXFP4 integration for GPT-OSS.
This broke FA2 generations! 😢 Well, sorry everyone, sometimes shit can happen...
4.55.1 was broken because of 🥁 git merge conflict.
I cherry-picked https://github.com/huggingface/transformers/pull/40002 without https://github.com/huggingface/transformers/pull/40029, so from ..modeling_flash_attention_utils import prepare_fa_kwargs_from_position_ids was missing; and since this is a slow test, nothing caught it.
Will work to remediate and write the post-mortem when yanking the release.
Mostly focused on stabilizing MXFP4 for the GPT-OSS model!
New model added by the Z.ai team to transformers!
GLM-4.5V is a new multimodal reasoning model based on GLM-4.5-Air, which has 106B total and 12B active parameters.
It's performant across 42 benchmarks across various categories:
To use it, install the transformers preview release:

```shell
pip install transformers-v4.55.0-GLM-4.5V-preview
```
Then you can run:
```python
from transformers import AutoProcessor, Glm4vMoeForConditionalGeneration
import torch

MODEL_PATH = "zai-org/GLM-4.5V"
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "url": "https://upload.wikimedia.org/wikipedia/commons/f/fa/Grayscale_8bits_palette_sample_image.png"
            },
            {
                "type": "text",
                "text": "describe this image"
            }
        ],
    }
]
processor = AutoProcessor.from_pretrained(MODEL_PATH)
model = Glm4vMoeForConditionalGeneration.from_pretrained(
    pretrained_model_name_or_path=MODEL_PATH,
    torch_dtype="auto",
    device_map="auto",
)
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt"
).to(model.device)
inputs.pop("token_type_ids", None)
generated_ids = model.generate(**inputs, max_new_tokens=8192)
output_text = processor.decode(generated_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=False)
print(output_text)
```
For more detailed information about this model, we recommend reading the following blogpost: https://huggingface.co/blog/welcome-openai-gpt-oss
GPT OSS is a hugely anticipated open-weights release by OpenAI, designed for powerful reasoning, agentic tasks, and versatile developer use cases. It comprises two models: a big one with 117B parameters (gpt-oss-120b), and a smaller one with 21B parameters (gpt-oss-20b). Both are mixture-of-experts (MoEs) and use a 4-bit quantization scheme (MXFP4), enabling fast inference (thanks to fewer active parameters, see details below) while keeping resource usage low. The large model fits on a single H100 GPU, while the small one runs within 16GB of memory and is perfect for consumer hardware and on-device applications.
The following snippet shows simple inference with the 20B model. It runs on 16 GB GPUs when using mxfp4, or ~48 GB in bfloat16.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
)
messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
The models use attention sinks, a technique the vLLM team made compatible with Flash Attention 3. We have packaged and integrated their optimized kernel in kernels-community/vllm-flash-attn3. At the time of writing, this super-fast kernel has been tested on Hopper cards with PyTorch 2.7 and 2.8. We expect increased coverage in the coming days. If you run the models on Hopper cards (for example, H100 or H200), you need to `pip install --upgrade kernels` and add the following line to your snippet:
```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Flash Attention with Sinks
+   attn_implementation="kernels-community/vllm-flash-attn3",
)
messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
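For intuition, an attention sink can be pictured as an extra logit that competes in the softmax but contributes no value vector, so some probability mass is absorbed instead of being forced onto real tokens. The following is a toy numpy sketch of that idea only (not the vLLM flash-attn3 kernel, and the sink here is a plain scalar rather than a learned per-head parameter):

```python
import numpy as np

def softmax_with_sink(scores, sink_logit):
    """Softmax over attention scores plus a 'sink' logit.

    The sink competes in the denominator but produces no value output,
    so the attention weights over real tokens can sum to less than 1.
    """
    logits = np.concatenate([scores, [sink_logit]])
    logits -= logits.max()              # numerical stability
    exp = np.exp(logits)
    probs = exp / exp.sum()
    return probs[:-1]                   # drop the sink slot

scores = np.array([2.0, 1.0, 0.5])
weights = softmax_with_sink(scores, sink_logit=3.0)
print(weights.sum())  # less than 1: part of the mass went to the sink
```

With a very negative sink logit the function degenerates to an ordinary softmax, which is a handy sanity check.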
Even though the 120B model fits on a single H100 GPU (using mxfp4), you can also run it easily on multiple GPUs using accelerate or torchrun. Transformers provides a default parallelization plan, and you can leverage optimized attention kernels as well. The following snippet can be run with torchrun --nproc_per_node=4 generate.py on a system with 4 GPUs:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers.distributed import DistributedConfig
import torch

model_path = "openai/gpt-oss-120b"
tokenizer = AutoTokenizer.from_pretrained(model_path, padding_side="left")
device_map = {
    "tp_plan": "auto",  # Enable Tensor Parallelism
}
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype="auto",
    attn_implementation="kernels-community/vllm-flash-attn3",
    **device_map,
)
messages = [
    {"role": "user", "content": "Explain how expert parallelism works in large language models."}
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1000)

# Decode and print
response = tokenizer.decode(outputs[0])
print("Model response:", response.split("<|channel|>final<|message|>")[-1].strip())
```
If you have a Hopper GPU or better, we recommend you use mxfp4 for the reasons explained above. If you can additionally use Flash Attention 3, then by all means do enable it!
[!TIP] If your GPU is not compatible with mxfp4, then we recommend you use MegaBlocks MoE kernels for a nice speed bump. To do so, you just need to adjust your inference code like this:
```diff
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "openai/gpt-oss-20b"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype="auto",
+   # Optimize MoE layers with downloadable MegaBlocksMoeMLP
+   use_kernels=True,
)
messages = [
    {"role": "user", "content": "How many rs are in the word 'strawberry'?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_tensors="pt",
    return_dict=True,
).to(model.device)
generated = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.decode(generated[0][inputs["input_ids"].shape[-1]:]))
```
[!TIP] MegaBlocks optimized MoE kernels require the model to run on bfloat16, so memory consumption will be higher than running on mxfp4. We recommend you use mxfp4 if you can, otherwise opt in to MegaBlocks via use_kernels=True.
You can use transformers serve to experiment locally with the models, without any other dependencies. You can launch the server with just: transformers serve
To which you can send requests using the Responses API.
```shell
# responses API
curl -X POST http://localhost:8000/v1/responses \
  -H "Content-Type: application/json" \
  -d '{"input": [{"role": "system", "content": "hello"}], "temperature": 1.0, "stream": true, "model": "openai/gpt-oss-120b"}'
```
You can also send requests using the standard Completions API:
```shell
# completions API
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "system", "content": "hello"}], "temperature": 1.0, "max_tokens": 1000, "stream": true, "model": "openai/gpt-oss-120b"}'
```
Command A Vision is a state-of-the-art multimodal model designed to seamlessly integrate visual and textual information for a wide range of applications. By combining advanced computer vision techniques with natural language processing capabilities, Command A Vision enables users to analyze, understand, and generate insights from both visual and textual data.
The model excels at tasks including image captioning, visual question answering, document understanding, and chart understanding. This makes it a versatile tool for AI practitioners. Its ability to process complex visual and textual inputs makes it useful in settings where text-only representations are imprecise or unavailable, like real-world image understanding and graphics-heavy document processing.
Command A Vision is built upon a robust architecture that leverages the latest advancements in VLMs. It's highly performant and efficient, even when dealing with large-scale datasets. The model's flexibility makes it suitable for a wide range of use cases, from content moderation and image search to medical imaging analysis and robotics.
The MM Grounding DINO model was proposed in An Open and Comprehensive Pipeline for Unified Object Grounding and Detection by Xiangyu Zhao, Yicheng Chen, Shilin Xu, Xiangtai Li, Xinjiang Wang, Yining Li, and Haian Huang.
MM Grounding DINO improves upon Grounding DINO with a refined contrastive class head and by removing parameter sharing in the decoder, boosting zero-shot detection performance on both COCO (50.6 (+2.2) AP) and LVIS (31.9 (+11.8) val AP and 41.4 (+12.6) minival AP).
You can find all the original MM Grounding DINO checkpoints under the MM Grounding DINO collection. This model also supports LLMDet inference. You can find LLMDet checkpoints under the LLMDet collection.
- FastSpeech2Conformer by @bvantuan in #39689
- `classmethod` by @zucchini-nlp in #38812
- [CI] Add Eric to comment slow ci by @vasqu in #39601
- `QAPipelineTests::test_large_model_course` after #39193 by @ydshieh in #39666
- `Glm4MoeModelTest::test_torch_compile_for_training` by @ydshieh in #39670
- `Qwen2AudioForConditionalGeneration.forward()` and `test_flash_attn_kernels_inference_equivalence` by @ebezzam in #39503
- `models/__init__.py` for typo checking by @hebangwen in #39745
- `GemmaIntegrationTest::test_model_2b_bf16_dola` again by @ydshieh in #39731
- `--gpus all` in workflow files by @ydshieh in #39752
- `libcst` to `extras["testing"]` in setup.py by @ydshieh in #39761
- main_classes/peft.md by @luckyvickyricky in #39515
- tvp.md to Korean by @Kim-Ju-won in #39578
- tokenizer.md to Korean by @seopp in #39532
- pipeline_gradio.md to Korean by @AhnJoonSung in #39520
- perf_train_gpu_one.md to Korean by @D15M4S in #39552
- how_to_hack_models.md to Korean by @skwh54 in #39536
- `run_name` when none by @qgallouedec in #39695
- model_results.json by @ydshieh in #39783
- [attn_implementation] remove recursive, allows custom kernels with wrappers by @ArthurZucker in #39823
- `plot_keypoint_matching`, make `visualize_keypoint_matching` as a standard by @sbucaille in #39830
- `TrackioCallback` to work when pynvml is not installed by @qgallouedec in #39851
- `is_wandb_available` function to verify WandB installation by @qgallouedec in #39875
- `sub_configs` by @qubvel in #39855
- `Tokenizer` with `PreTrainedTokenizerFast` in `ContinuousBatchProcessor` by @qgallouedec in #39858
- `torch.backends.cudnn.allow_tf32 = False` for CI by @ydshieh in #39885
- `AutoModelForCausalLM` and `AutoModelForImageTextToText` by @qubvel in #39881
- `ModernBertForMultipleChoice` by @netique in #39232
- [Exaone4] Fixes the attn implementation! by @ArthurZucker in #39906

The following contributors have made significant changes to the library over the last release:
We had quite a lot of bugs that got through! The release was a bit rushed, sorry everyone! 🤗 Mostly cache fixes, as we now have a layered cache, and fixes to distributed training.
In order to become the source of truth, we recognize that we need to address two common and long-heard critiques about transformers:
- transformers is bloated
- transformers is slow

Our team has focused on improving both aspects, and we are now ready to announce this.
The modeling files for the standard Llama models are down to 500 LOC and should be much more readable, keeping just the core of the modeling and hiding the "powerful transformers features."
<img width="1583" height="974" alt="image" src="https://github.com/user-attachments/assets/f1075598-d63e-4184-b3af-c0d4b31cdde5" />
The MoEs are getting some kernel magic, enabling the use of the efficient megablocks kernels, setting a good precedent to allow the community to leverage any of the most powerful kernels developed for quantization as well!
It should also be much more convenient to use with any attention implementation you want. This opens the door to some optimizations such as leveraging flash-attention on Metal (MPS Torch backend).
<img width="2050" height="752" alt="image" src="https://github.com/user-attachments/assets/23ebfb20-7626-46a5-b264-76ffb8b8c811" />
This is but the tip of the iceberg: with the work on kernels that we're heavily pushing forward, expect speed-ups on several backends in the coming months!!
This release also includes the first steps to enabling efficient distributed training natively in transformers. Loading a 100B model takes ~3 seconds on our cluster — we hope this will be the norm for everyone! We are working on distributed checkpointing as well, and want to make sure our API can be easily used for any type of parallelism.
We want the community to benefit from all of the advances, and as always, include all hardware and platforms! We believe the kernels library will give the tools to optimize everything, making a big difference for the industry!
The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes. This model specifically targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.
Other models from the family can be found at Ernie 4.5 MoE.
<div class="flex justify-center"> <img src="https://ernie.baidu.com/blog/posts/ernie4.5/overview.png"/> </div>

- [Ernie 4.5] Add ernie text models by @vasqu in #39228

Voxtral is an upgrade of Ministral 3B and Mistral Small 3B, extending its language capabilities with audio input support. It is designed to handle tasks such as speech transcription, translation, and audio understanding.
You can read more in Mistral's release blog post.
The model is available in two checkpoints:
Voxtral builds on Ministral-3B by adding audio processing capabilities:
LFM2 represents a new generation of Liquid Foundation Models developed by Liquid AI, specifically designed for edge AI and on-device deployment.
The models are available in three sizes (350M, 700M, and 1.2B parameters) and are engineered to run efficiently on CPU, GPU, and NPU hardware, making them particularly well-suited for applications requiring low latency, offline operation, and privacy.
The DeepSeek-V2 model was proposed in DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model by DeepSeek-AI Team.
The model uses Multi-head Latent Attention (MLA) and the DeepSeekMoE architecture for efficient inference and cost-effective training. After pre-training on 8.1 trillion tokens, the model went through Supervised Fine-Tuning and Reinforcement Learning stages and can be used for various language tasks.
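For intuition, MLA's core trick is caching a small low-rank latent per position instead of the full keys and values, which are re-materialized on demand via up-projections. Below is a toy numpy sketch of that compression with made-up dimensions (illustrative only, not the actual DeepSeek-V2 implementation, which also splits off a RoPE branch):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads, d_head, d_latent = 256, 8, 32, 16   # hypothetical sizes

# Down-projection to a shared low-rank latent, and per-use up-projections
W_down = rng.standard_normal((d_model, d_latent)) / np.sqrt(d_model)
W_up_k = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)
W_up_v = rng.standard_normal((d_latent, n_heads * d_head)) / np.sqrt(d_latent)

hidden = rng.standard_normal((128, d_model))          # 128 cached positions

latent_cache = hidden @ W_down                        # this is all that gets cached
k = latent_cache @ W_up_k                             # re-materialized on demand
v = latent_cache @ W_up_v

naive_cache = 128 * n_heads * d_head * 2              # full K and V entries
print(latent_cache.size / naive_cache)                # 0.03125, i.e. 32x smaller
```

The ratio shows why MLA shrinks the KV cache so dramatically: only the latent is stored per position.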
ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
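For intuition on one of those improvements: rotary position embeddings encode position by rotating query/key feature pairs, so attention scores depend only on the relative offset between tokens. A minimal numpy sketch of this property (illustrative, not the ModernBERT implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding to a vector of even length."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation speeds
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # 2D rotation of each (x1[i], x2[i]) pair by its own angle
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# The q.k score depends only on the relative offset between positions
s1 = rope(q, 10) @ rope(k, 3)      # offset 7
s2 = rope(q, 107) @ rope(k, 100)   # offset 7, shifted far along the sequence
print(np.isclose(s1, s2))  # True
```

This relative-position property is what lets RoPE models extrapolate the same attention pattern anywhere in a long sequence.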
The Encoder-only Mask Transformer (EoMT) model was introduced in the CVPR 2025 Highlight Paper Your ViT is Secretly an Image Segmentation Model by Tommie Kerssies, Niccolò Cavagnero, Alexander Hermans, Narges Norouzi, Giuseppe Averta, Bastian Leibe, Gijs Dubbelman, and Daan de Geus. EoMT reveals Vision Transformers can perform image segmentation efficiently without task-specific components.
<div style="text-align: center;"> <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/eomt_architecture.png" alt="drawing" width="500"/> </div>

Doge is a series of small language models based on the Doge architecture, which aims to combine the advantages of state-space and self-attention algorithms. It calculates dynamic masks from cached value states using the zero-order hold method, addressing the problem of existing mainstream language models getting lost in context. It uses the wsd_scheduler scheduler to pre-train on the smollm-corpus, and can continue training on new datasets or add sparse activation feedforward networks from stable-stage checkpoints.
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/refs%2Fpr%2F426/transformers/model_doc/doge_architecture.png" alt="drawing" width="600"/>
The AIMv2 model was proposed in Multimodal Autoregressive Pre-training of Large Vision Encoders by Enrico Fini, Mustafa Shukor, Xiujun Li, Philipp Dufter, Michal Klein, David Haldimann, Sai Aitharaju, Victor Guilherme Turrisi da Costa, Louis Béthune, Zhe Gan, Alexander T Toshev, Marcin Eichner, Moin Nabi, Yinfei Yang, Joshua M. Susskind, Alaaeldin El-Nouby.
The abstract from the paper is the following:
We introduce a novel method for pre-training of large-scale vision encoders. Building on recent advancements in autoregressive pre-training of vision models, we extend this framework to a multimodal setting, i.e., images and text. In this paper, we present AIMV2, a family of generalist vision encoders characterized by a straightforward pre-training process, scalability, and remarkable performance across a range of downstream tasks. This is achieved by pairing the vision encoder with a multimodal decoder that autoregressively generates raw image patches and text tokens. Our encoders excel not only in multimodal evaluations but also in vision benchmarks such as localization, grounding, and classification. Notably, our AIMV2-3B encoder achieves 89.5% accuracy on ImageNet-1k with a frozen trunk. Furthermore, AIMV2 consistently outperforms state-of-the-art contrastive models (e.g., CLIP, SigLIP) in multimodal image understanding across diverse settings.
The PerceptionLM model was proposed in PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding by Jang Hyun Cho et al. It's a fully open, reproducible model for transparent research in image and video understanding. PLM consists of a vision encoder with a small scale (<8B parameters) LLM decoder.
The EfficientLoFTR model was proposed in Efficient LoFTR: Semi-Dense Local Feature Matching with Sparse-Like Speed by Yifan Wang, Xingyi He, Sida Peng, Dongli Tan and Xiaowei Zhou.
This model consists of matching two images together by finding pixel correspondences. It can be used to estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
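As a sketch of what such pixel correspondences enable, here is a minimal Direct Linear Transform (DLT) homography fit from four noise-free point pairs (illustrative numpy code for the downstream geometry, unrelated to the model's internals):

```python
import numpy as np

def fit_homography(src, dst):
    """Estimate the 3x3 homography mapping src -> dst via the DLT algorithm."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        # Each correspondence contributes two linear constraints on H's entries
        A.append([-x, -y, -1, 0, 0, 0, u * x, u * y, u])
        A.append([0, 0, 0, -x, -y, -1, v * x, v * y, v])
    _, _, vt = np.linalg.svd(np.array(A))
    H = vt[-1].reshape(3, 3)        # null-space vector of the constraint matrix
    return H / H[2, 2]

# Ground-truth homography, recovered exactly from 4 noise-free correspondences
H_true = np.array([[1.2, 0.1, 3.0], [-0.2, 0.9, 1.0], [0.001, 0.002, 1.0]])
src = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
pts = np.c_[src, np.ones(4)] @ H_true.T
dst = pts[:, :2] / pts[:, 2:]

H = fit_homography(src, dst)
print(np.allclose(H, H_true, atol=1e-6))  # True
```

With real, noisy matches one would use many correspondences inside a RANSAC loop, but the linear core is the same.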
Evolla is an advanced 80-billion-parameter protein-language generative model designed to decode the molecular language of proteins. It integrates information from protein sequences, structures, and user queries to generate precise and contextually nuanced insights into protein function. Trained on an unprecedented AI-generated dataset of 546 million protein question-answer pairs and 150 billion word tokens, Evolla significantly advances research in proteomics and functional genomics, providing expert-level insights and shedding light on the molecular logic encoded in proteins.
Deepseek-VL was introduced by the DeepSeek AI team. It is a vision-language model (VLM) designed to process both text and images for generating contextually relevant responses. The model leverages LLaMA as its text encoder, while SigLip is used for encoding images.
The xLSTM model was proposed in xLSTM: Extended Long Short-Term Memory by Maximilian Beck*, Korbinian Pöppel*, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter and Sepp Hochreiter. xLSTM updates the original LSTM architecture to be competitive with Transformer models by introducing exponential gating, matrix memory expansion, and parallelizable training and ingestion.
The 7B model variant was trained by the xLSTM team Maximilian Beck, Korbinian Pöppel, Phillip Lippe, Richard Kurle, Patrick Blies, Sebastian Böck and Sepp Hochreiter at NXAI.
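The exponential gating can be sketched as a toy scalar update in the spirit of the paper's sLSTM cell, with the standard log-space stabilizer that keeps the exponentials from overflowing (an illustrative sketch, not the NXAI 7B implementation):

```python
import numpy as np

def slstm_step(c, n, m, z, i_tilde, f_tilde):
    """One scalar sLSTM-style update with exponential gating.

    m is a running log-scale stabilizer so exp() never overflows;
    c is the cell state and n a normalizer for the output.
    """
    m_new = max(f_tilde + m, i_tilde)
    i = np.exp(i_tilde - m_new)          # stabilized exponential input gate
    f = np.exp(f_tilde + m - m_new)      # stabilized forget gate
    c_new = f * c + i * z
    n_new = f * n + i
    h = c_new / n_new                    # normalized hidden output
    return c_new, n_new, m_new, h

c, n, m = 0.0, 0.0, -np.inf
for z, it, ft in [(1.0, 50.0, 2.0), (2.0, 60.0, 3.0), (0.5, 55.0, 1.0)]:
    c, n, m, h = slstm_step(c, n, m, z, it, ft)
print(h)  # stays finite despite huge gate pre-activations
```

Without the stabilizer `m`, gate pre-activations of 50-60 would overflow `exp()` immediately; this is the trick that makes exponential gating practical.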
EXAONE 4.0 is a language model that integrates a non-reasoning mode and a reasoning mode, combining the excellent usability of EXAONE 3.5 with the advanced reasoning abilities of EXAONE Deep. To pave the way for the agentic AI era, EXAONE 4.0 incorporates essential features such as agentic tool use, and its multilingual capabilities are extended to support Spanish in addition to English and Korean.
The EXAONE 4.0 model series consists of two sizes: a mid-size 32B model optimized for high performance, and a small-size 1.2B model designed for on-device applications.
We've added Expert Parallel support for Llama4; the next release will include it for all models! You can simply set a distributed_config with enable_expert_parallel=True. This enables efficient training of sparse Mixture-of-Experts (MoE) models across multiple devices: each expert in the MoE layer runs in parallel (instead of the previous TP approach, which requires more communication), significantly improving scalability and memory efficiency.
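The idea can be sketched in a few lines of numpy: each "device" owns one complete expert and only processes the tokens routed to it, rather than every device holding a shard of every weight (a single-process simulation of the dispatch/combine pattern, not the transformers implementation):

```python
import numpy as np

rng = np.random.default_rng(0)
n_tokens, d, n_experts = 16, 8, 4

tokens = rng.standard_normal((n_tokens, d))
experts = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(n_experts)]
router_logits = tokens @ rng.standard_normal((d, n_experts))
assignment = router_logits.argmax(axis=-1)     # top-1 routing

# Expert parallelism: each "device" owns one whole expert and processes
# only the tokens routed to it (all-to-all dispatch, then combine).
out = np.zeros_like(tokens)
for e in range(n_experts):                     # conceptually runs concurrently
    mask = assignment == e
    out[mask] = tokens[mask] @ experts[e]      # local matmul, no weight sharding

print(out.shape)
```

Because each expert's weights live whole on one device, the only communication needed is moving token activations, which is what makes EP cheaper than TP for sparse MoE layers.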
FP-Quant is a quantization method optimized for Blackwell-generation Nvidia GPUs, supporting efficient post-training quantization (PTQ) and quantization-aware training (QAT) of LLMs using MXFP4 and NVFP4 formats.
Currently, only PTQ with MXFP4 is available. You can quantize models on-the-fly using transformers:
```python
import torch
from transformers import AutoModelForCausalLM, FPQuantConfig

model = AutoModelForCausalLM.from_pretrained(
    "qwen/Qwen3-8B",
    quantization_config=FPQuantConfig(),
    device_map="cuda",
    torch_dtype=torch.bfloat16,
)
```
FP-Quant requires a Blackwell GPU and runs via the QuTLASS library. No Blackwell GPU? Use FPQuantConfig(pseudoquant=True) to emulate quantization (no QuTLASS needed).
The following results show the inference speedup of QuTLASS MXFP4 over PyTorch BF16 in Transformers. MXFP4 gives consistent speedups across all batch sizes, reaching up to 4× faster at larger scales.
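For intuition about where those speedups come from: MXFP4 stores each block of values as 4-bit FP4 (E2M1) codes sharing one power-of-two scale, so weights shrink roughly 4x versus bf16. The following is a toy pseudo-quantization round-trip in numpy (illustrative only, not the QuTLASS kernels; real MXFP4 packs the 4-bit codes and E8M0 scales into bytes):

```python
import numpy as np

# The values representable in FP4 E2M1: sign x {0, .5, 1, 1.5, 2, 3, 4, 6}
E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
E2M1 = np.concatenate([E2M1, -E2M1])

def mxfp4_pseudoquant(x, block=32):
    """Toy round-trip: per-block power-of-two scale + nearest FP4 value."""
    x = x.reshape(-1, block)
    amax = np.abs(x).max(axis=1, keepdims=True)
    # Shared E8M0-style scale: smallest power of two with amax <= 6 * scale
    scale = 2.0 ** np.ceil(np.log2(amax / 6.0 + 1e-30))
    idx = np.abs(x / scale - E2M1[:, None, None]).argmin(axis=0)
    return (E2M1[idx] * scale).reshape(-1)

w = np.random.default_rng(0).standard_normal(128).astype(np.float32)
wq = mxfp4_pseudoquant(w)
print(np.abs(w - wq).max())   # small but nonzero quantization error
```

This is essentially what `FPQuantConfig(pseudoquant=True)` emulates: the numerics of the 4-bit round-trip without the packed storage or fast kernels.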
<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/qwen3-8b-end-to-end-prefill-speedup-mxfp4-vs-bf16-on-rtx5090.svg" alt="drawing" width="600">

The kernels project aims to become the single trusted source for high-performance kernels in the Transformers ecosystem. We're working toward centralizing all kernels on the Hub, so updates, bug fixes, and improvements can happen in one place—no more scattered repos and no compilation headaches!
You can already try it out today by setting use_kernels=True in from_pretrained. Any contributor can build their kernel, register it and use it right away—no extra setup, more on this here
Even better: want to use Flash Attention 3? No need to deal with tricky compilation and missing symbols issues! Just drop in:
```python
model.set_attn_implementation("kernels-community/flash-attn3")
```
This automatically fetches the right build for your setup (e.g. CUDA and PyTorch versions).
We’re also teaming up with amazing kernel devs from Unsloth, Liger, vLLM, and more to bring their work directly to the Hub—making it easier than ever to access amazing performance with a single line of code.
https://github.com/user-attachments/assets/9928f62b-543c-4b8a-b81b-4a6e262c229e
Over the past few months, we have been putting more and more functionality in the transformers chat utility, which offers a CLI-based app to chat with chat models. We've chosen to push this further by splitting the backend of transformers chat in a new, separate utility called transformers serve.
This is ideal for experimentation purposes, or to run models locally for personal and private use. It does not aim to compete with dedicated inference engines such as vLLM or SGLang.
Models of diverse modalities supported by transformers may be served with the transformers serve CLI. It spawns a local server that offers compatibility with the OpenAI SDK, which is the de-facto standard for LLM conversations and other related tasks. This way, you can use the server from many third party applications, or test it using the transformers chat CLI (docs).
The server supports the following REST APIs:
- /v1/chat/completions
- /v1/responses
- /v1/audio/transcriptions
- /v1/models

Relevant commits:
- transformers chat and transformers serve by @LysandreJik in #38443
- transformers serve by @LysandreJik in #39149
- generation_config by @gante in #39230
- transformers serve by @LysandreJik in #39155
- (/v1/audio/transcriptions) by @gante in #39434

Significant refactors have been underway in transformers, aiming to reduce the complexity of the code. One metric we track to see how the refactors impact our code is the number of lines in a given model file; we try to reduce it as much as possible, while keeping everything related to the forward pass and model definition in that file.
See the evolution here:
<img width="1200" height="600" alt="image" src="https://github.com/user-attachments/assets/c232bc8d-7d7c-4192-baa8-a60efe5eb2ff" />

Some notable refactors:
KV caches are now defined per layer, enabling new hybrid caches that mix different attention types. CacheProcessors also encapsulate cache quantization and offloading, making them easy to customize.
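The per-layer design can be pictured with a toy structure in which each layer owns its own storage policy, e.g. full attention next to a sliding window (hypothetical names for illustration, not the actual transformers Cache API):

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class LayerCache:
    """Toy per-layer KV cache: each layer owns its own storage policy."""
    keys: list = field(default_factory=list)
    values: list = field(default_factory=list)
    window: Optional[int] = None       # sliding-window layers keep only a suffix

    def update(self, k, v):
        self.keys.append(k)
        self.values.append(v)
        if self.window is not None:
            self.keys = self.keys[-self.window:]
            self.values = self.values[-self.window:]
        return self.keys, self.values

# A hybrid cache mixing a full-attention layer and a sliding-window layer
cache = [LayerCache(), LayerCache(window=4)]
for step in range(10):
    for layer in cache:
        layer.update(f"k{step}", f"v{step}")

print(len(cache[0].keys), len(cache[1].keys))  # 10 4
```

Because each layer is self-contained, policies like quantization or offloading can be swapped in per layer without touching the others, which is what the CacheProcessor abstraction enables.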
`output_attentions` / `output_hidden_states`

Such attributes require very specific handling within the forward call, while they're not important for understanding how the model works. We removed that code but kept the functionality by providing a better utility to handle it.
We refactor the way to explicitly set the attention implementation so that it has a method dedicated to it.
- `average_tokens_across_devices` by default in `TrainingArguments` by @Krish0909 in #39395
- [Flex Attn] Fix torch 2.5.1 incompatibilities by @vasqu in #37406
- `test_compare_unprocessed_logit_scores` by @ydshieh in #39053
- `t5gemma` tests by @ydshieh in #39052
- `layoutlmv3` tests by @ydshieh in #39050
- `Gemma3nProcessorTest` by @ydshieh in #39068
- `mistral3` tests by @ydshieh in #38989
- `dots1` tests by @ydshieh in #39088
- `test_is_split_into_words` in test_pipelines_token_classification.py by @st81 in #39079
- `test_sdpa_can_dispatch_on_flash` by @ydshieh in #39092
- `@lru_cache()` to `@lru_cache` to match styles from #38883 by @rasmi in #39093
- `run-slow` by @ydshieh in #39100
- `llama` tests by @ydshieh in #39161
- [Dia] Change ckpt path in docs by @vasqu in #39181
- `from_pretrained` by @qubvel in #39184
- `fastspeech2_conformer` tests by @ydshieh in #39229
- `is not None` -> `isinstance(..., dict)` by @qubvel in #39145
- `segmentation_maps` support to `MobileNetV2ImageProcessor` by @simonreise in #37312
- tests/generation/test_utils.py by @ydshieh in #39254
- `test_eager_matches` sdpa generate and update an integration test for blip-like models by @ydshieh in #39248
- `smollm3` by @gante in #39271
- `PretrainedConfig.__init__` method to make it more explicit by @qubvel in #39158
- `test_generate_compile_model_forward` by @ydshieh in #39276
- datasets 4.0 by @lhoestq in #39156
- `aria` tests by @ydshieh in #39277
- `test_torchscript_*` for now until the majority of the community ask for it by @ydshieh in #39307
- stevhliu to the list in self-comment-ci.yml by @ydshieh in #39315
- `src/` for doctest (for now) by @ydshieh in #39316
- `max_length_q` and `max_length_k` types to `flash_attn_varlen_func` by @HollowMan6 in #37206
- `phi3` tests by @ydshieh in #39312
- `position_ids` in `masking_utils` by @Cyrilvallez in #39310
- `test_sdpa_can_dispatch_on_flash` by @ydshieh in #39259
- `timm` (for perception_lm) by @ydshieh in #39380
- `/v1/models` output payload by @alvarobartt in #39414
- `set_tracer_provider` and `set_meter_provider` calls by @McPatate in #39422
- `JetMoeForCausalLM` by @Phoenix-Shen in #37830
- `ContinuousBatchProcessor` by @qgallouedec in #39372
- [CI] Fix partially red CI by @vasqu in #39448
- `GemmaIntegrationTest::test_model_2b_bf16_dola` by @ydshieh in #39362
- datasets pin by @gante in #39500
- args_doc.py to auto_docstring.py by @yonigozlan in #39439
- `_supports_flash_attn_2` in examples and tests by @zucchini-nlp in #39471
- `TypeError` instead of `ValueError` for invalid types by @Sai-Suraj-27 in #38660
- `MambaCache` to modeling_mamba.py by @manueldeprada in #38086
- perf_infer_gpu_multi.md to Korean by @luckyvickyricky in #39441
- [CI] Fix post merge ernie 4.5 by @vasqu in #39561
- docs/source/ko/_toctree.yml by @jungnerd in #39516
- `supports_static_cache` to `can_compile_fullgraph` by @zucchini-nlp in #39505
- `device_mesh` have multiple dim by @S1ro1 in #38949
- `test_export_static_cache` by @gante in #39662
- [Ernie 4.5] Post merge adaptations by @vasqu in #39664
- `kyutai` tests by @ydshieh in #39416
- `typing.Literal` as type of tool parameters or return value by @grf53 in #39633

The following contributors have made significant changes to the library over the last release:
- `segmentation_maps` support to `MobileNetV2ImageProcessor` (#37312)
- docs/source/ko/_toctree.yml (#39516)

Two new models are added to transformers: Ernie 4.5, and its MoE variant, Ernie 4.5 MoE.
They are added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-Ernie-4.5-preview.
In order to install this version, please install with the following command:
```shell
pip install git+https://github.com/huggingface/transformers@v4.53.2-Ernie-4.5-preview
```
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
As the tag implies, this tag is a preview of the Ernie-4.5 models. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.
The Ernie 4.5 model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes.
This model specifically targets the base text model without mixture of experts (MoE), with 0.3B parameters in total. It uses the standard Llama architecture at its core.
This model specifically targets the text models with mixture of experts (MoE): one with 21B total and 3B active parameters, and another with 300B total and 47B active parameters. They use the standard Llama architecture at their core, combined with a specialized MoE based on Mixtral with additional shared experts.
Ernie-4.5 can be found on the Hugging Face Hub.
Generating text with Ernie:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "baidu/ERNIE-4.5-0.3B-PT"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# prepare the model input
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(
    **model_inputs,
    max_new_tokens=32,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()

# decode the generated ids
generate_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(generate_text)
```
See below for an example leveraging the MoE variant:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "baidu/ERNIE-4.5-21B-A3B-PT"
# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
device_map="auto",
torch_dtype=torch.bfloat16,
)
# prepare the model input
inputs = tokenizer("Hey, are you conscious? Can you talk to me?", return_tensors="pt")
prompt = "Hey, are you conscious? Can you talk to me?"
messages = [
{"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], add_special_tokens=False, return_tensors="pt").to(model.device)
# conduct text completion
generated_ids = model.generate(
**model_inputs,
max_new_tokens=32,
)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
# decode the generated ids
generated_text = tokenizer.decode(output_ids, skip_special_tokens=True)
print(generated_text)
A small patch for OpenTelemetry fixes! Sorry for the delay!
* refactor: remove set_tracer_provider and set_meter_provider calls (https://github.com/huggingface/transformers/pull/39422) from @McPatate
A new model is added to transformers: ModernBERT Decoder
It is added on top of the v4.53.2 release, and can be installed from the following tag: v4.53.2-modernbert-decoder-preview.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.53.2-modernbert-decoder-preview
If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.
As the tag implies, this tag is a preview of the ModernBERT Decoder model. This tag is a tagged version of the main branch and does not follow semantic versioning. This model will be included in the next minor release: v4.54.0.
ModernBERT Decoder is the same architecture as ModernBERT but trained from scratch with a causal language modeling (CLM) objective. This allows for using the same architecture for comparing encoders and decoders. This is the decoder architecture implementation of ModernBERT, designed for autoregressive text generation tasks.
Like the encoder version, ModernBERT Decoder incorporates modern architectural improvements such as rotary positional embeddings to support sequences of up to 8192 tokens, unpadding to avoid wasting compute on padding tokens, GeGLU layers, and alternating attention patterns. However, it uses causal (unidirectional) attention to enable autoregressive generation.
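The causal (unidirectional) attention that separates the decoder from the bidirectional encoder can be illustrated with a minimal mask builder in plain Python; this only shows the masking pattern, not the full attention computation:

```python
def causal_mask(seq_len):
    # True means "may attend": position i sees only positions j <= i,
    # unlike a bidirectional encoder where every position sees all others.
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(4):
    print(["x" if ok else "." for ok in row])
```

Each generated token can therefore condition only on its prefix, which is what makes autoregressive generation possible with the otherwise-unchanged ModernBERT architecture.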
ModernBERT Decoder can be found on the Hugging Face Hub.
Using pipeline:
import torch
from transformers import pipeline
generator = pipeline(
task="text-generation",
model="blab-jhu/test-32m-dec",
torch_dtype=torch.float16,
device=0
)
generator("The future of artificial intelligence is", max_length=50, num_return_sequences=1)
# For sequence classification
classifier = pipeline(
task="text-classification",
model="blab-jhu/test-32m-dec",
torch_dtype=torch.float16,
device=0
)
classifier("This movie is really great!")
Using AutoModel:
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("blab-jhu/test-32m-dec")
model = AutoModelForCausalLM.from_pretrained(
"blab-jhu/test-32m-dec",
torch_dtype=torch.float16,
device_map="auto",
)
prompt = "The future of artificial intelligence is"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_length=50,
num_return_sequences=1,
temperature=0.7,
do_sample=True,
pad_token_id=tokenizer.eos_token_id
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Generated text: {generated_text}")
# For sequence classification
from transformers import AutoModelForSequenceClassification
classifier_model = AutoModelForSequenceClassification.from_pretrained(
"blab-jhu/test-32m-dec",
torch_dtype=torch.float16,
device_map="auto",
num_labels=2
)
text = "This movie is really great!"
inputs = tokenizer(text, return_tensors="pt").to(model.device)
with torch.no_grad():
outputs = classifier_model(**inputs)
predictions = torch.nn.functional.softmax(outputs.logits, dim=-1)
predicted_class = torch.argmax(predictions, dim=-1)
print(f"Predicted class: {predicted_class.item()}")
print(f"Prediction probabilities: {predictions}")
Using the transformers CLI:
echo "The future of artificial intelligence is" | transformers run --task text-generation --model your-username/modernbert-decoder-base --device 0
This patch contains the following bug fixes:
* smollm3 (#39271)
* position_ids in masking_utils (#39310)
This patch contains several bug fixes. The following commits are included:
Gemma 3n models are designed for efficient execution on low-resource devices. They are capable of multimodal input, handling text, image, video, and audio, and generating text outputs, with open weights for pre-trained and instruction-tuned variants. These models were trained with data in over 140 spoken languages.
Gemma 3n models use selective parameter activation technology to reduce resource requirements. This technique allows the models to operate at an effective size of 2B and 4B parameters, which is lower than the total number of parameters they contain. For more information on Gemma 3n's efficient parameter management technology, see the Gemma 3n page.
from transformers import pipeline
import torch
pipe = pipeline(
"image-text-to-text",
torch_dtype=torch.bfloat16,
model="google/gemma-3n-e4b",
device="cuda",
)
output = pipe(
"https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
text="<image_soft_token> in this image, there is"
)
print(output)
Dia is an open-source text-to-speech (TTS) model (1.6B parameters) developed by Nari Labs. It can generate highly realistic dialogue from a transcript, including nonverbal communication such as laughter and coughing. Furthermore, emotion and tone control is also possible via audio conditioning (voice cloning).
Model Architecture: Dia is an encoder-decoder transformer based on the original transformer architecture. However, some more modern features such as rotary positional embeddings (RoPE) are also included. For its text portion (encoder), a byte tokenizer is utilized, while for the audio portion (decoder), a pretrained codec model, DAC, is used - DAC encodes speech into discrete codebook tokens and decodes them back into audio.
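Byte-level tokenization of the kind used for the text encoder can be sketched in a few lines of plain Python; this is a generic illustration of the idea, not Dia's exact tokenizer:

```python
def byte_tokenize(text: str) -> list[int]:
    # Every UTF-8 byte becomes one token id (0-255): a 256-entry vocabulary
    # with no out-of-vocabulary text, at the cost of longer sequences.
    return list(text.encode("utf-8"))

def byte_detokenize(ids: list[int]) -> str:
    return bytes(ids).decode("utf-8")

ids = byte_tokenize("(laughs) Hi!")
print(ids)
print(byte_detokenize(ids))
```

A byte vocabulary keeps the text side trivially robust to transcript annotations like "(laughs)", since any string round-trips losslessly.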
Kyutai STT is a speech-to-text model architecture based on the Mimi codec, which encodes audio into discrete tokens in a streaming fashion, paired with a Moshi-like autoregressive decoder. Kyutai has released two model checkpoints:
Read more about the model in the documentation
V-JEPA 2 is a self-supervised approach to training video encoders developed by FAIR, Meta. Using internet-scale video data, V-JEPA 2 attains state-of-the-art performance on motion understanding and human action anticipation tasks. V-JEPA 2-AC is a latent action-conditioned world model post-trained from V-JEPA 2 (using a small amount of robot trajectory interaction data) that solves robot manipulation tasks without environment-specific data collection or task-specific training or calibration.
Read more about the model in the documentation.
Arcee is a decoder-only transformer model based on the Llama architecture with a key modification: it uses ReLU² (ReLU-squared) activation in the MLP blocks instead of SiLU, following recent research showing improved training efficiency with squared activations. This architecture is designed for efficient training and inference while maintaining the proven stability of the Llama design.
The Arcee model is architecturally similar to Llama but uses x * relu(x) in MLP layers for improved gradient flow and is optimized for efficiency in both training and inference scenarios.
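As a minimal sketch of the activation (not Arcee's actual module), x * relu(x) coincides with relu(x)² for every input, since both vanish when x <= 0:

```python
def relu2(x: float) -> float:
    # ReLU-squared: x * relu(x); zero for x <= 0, x**2 for x > 0.
    return x * max(x, 0.0)

print(relu2(3.0))   # 9.0
print(relu2(-2.0))  # 0.0 -- negative inputs are fully gated, as with ReLU
```

In the model this simply replaces SiLU inside the otherwise-standard Llama MLP block.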
Read more about the model in the documentation.
ColQwen2 is a variant of the ColPali model designed to retrieve documents by analyzing their visual features. Unlike traditional systems that rely heavily on text extraction and OCR, ColQwen2 treats each page as an image. It uses the Qwen2-VL backbone to capture not only text, but also the layout, tables, charts, and other visual elements to create detailed multi-vector embeddings that can be used for retrieval by computing pairwise late interaction similarity scores. This offers a more comprehensive understanding of documents and enables more efficient and accurate retrieval.
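The pairwise late-interaction (MaxSim) scoring can be sketched in plain Python; the tiny 2-d embeddings below are purely illustrative stand-ins for real query-token and page-patch vectors:

```python
def maxsim_score(query_embs, doc_embs):
    # For each query vector, take the maximum dot product over all document
    # vectors, then sum these maxima: each query token "picks" its best match.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)

query = [[1.0, 0.0], [0.0, 1.0]]             # two query-token embeddings
doc = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]   # three page-patch embeddings
print(maxsim_score(query, doc))  # 1.0 + 2.0 = 3.0
```

Because each query token is matched independently, the score stays sensitive to fine-grained visual details (a table cell, a chart label) rather than one pooled page vector.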
Read more about the model in the documentation.
MiniMax is a powerful language model with 456 billion total parameters, of which 45.9 billion are activated per token. To better unlock the long-context capabilities of the model, MiniMax adopts a hybrid architecture that combines Lightning Attention, softmax attention, and Mixture-of-Experts (MoE). Leveraging advanced parallel strategies and innovative compute-communication overlap methods such as Linear Attention Sequence Parallelism Plus (LASP+), varlen ring attention, and Expert Tensor Parallel (ETP), MiniMax extends its training context length to 1 million tokens and can handle contexts of up to 4 million tokens during inference. On various academic benchmarks, MiniMax also demonstrates the performance of a top-tier model.
The architecture of MiniMax is briefly described as follows:
For more details refer to the release blog post.
Read more about the model in the documentation.
T5Gemma (aka encoder-decoder Gemma) was proposed in a research paper by Google. It is a family of encoder-decoder large language models, developed by adapting pretrained decoder-only models into encoder-decoder ones. T5Gemma includes pretrained and instruction-tuned variants. The architecture is based on the transformer encoder-decoder design following T5, with improvements from Gemma 2: GQA, RoPE, GeGLU activation, RMSNorm, and interleaved local/global attention.
T5Gemma has two groups of model sizes: 1) Gemma 2 sizes (2B-2B, 9B-2B, and 9B-9B), which are based on the official Gemma 2 models (2B and 9B); and 2) T5 sizes (Small, Base, Large, and XL), which are pretrained under the Gemma 2 framework following the T5 configuration. In addition, we also provide a model at ML size (medium-large, ~2B in total), which sits between T5 Large and T5 XL.
The pretrained variants are trained with two objectives: prefix language modeling with knowledge distillation (PrefixLM) and UL2, separately. We release both variants for each model size. The instruction-tuned variants were post-trained with supervised fine-tuning and reinforcement learning.
Read more about the model in the documentation.
The GLM-4.1V model architecture is added to transformers; no models have yet been released with that architecture. Stay tuned for the GLM team's upcoming releases!
Read more about the model in the documentation.
The FalconH1 model was developed by the TII Pretraining team. A comprehensive research paper covering the architecture, pretraining dynamics, experimental results, and conclusions is forthcoming. You can read more about this series in this website.
Read more about the model in the documentation.
The LightGlue model was proposed in LightGlue: Local Feature Matching at Light Speed by Philipp Lindenberger, Paul-Edouard Sarlin and Marc Pollefeys.
Similar to SuperGlue, this model matches two sets of local features extracted from two images; its goal is to be faster than SuperGlue. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching and homography estimation.
The abstract from the paper is the following:
We introduce LightGlue, a deep neural network that learns to match local features across images. We revisit multiple design decisions of SuperGlue, the state of the art in sparse matching, and derive simple but effective improvements. Cumulatively, they make LightGlue more efficient - in terms of both memory and computation, more accurate, and much easier to train. One key property is that LightGlue is adaptive to the difficulty of the problem: the inference is much faster on image pairs that are intuitively easy to match, for example because of a larger visual overlap or limited appearance change. This opens up exciting prospects for deploying deep matchers in latency-sensitive applications like 3D reconstruction. The code and trained models are publicly available at this https URL
Read more about the model in the documentation.
The abstract from the report is the following:
Mixture of Experts (MoE) models have emerged as a promising paradigm for scaling language models efficiently by activating only a subset of parameters for each input token. In this report, we present dots.llm1, a large-scale MoE model that activates 14B parameters out of a total of 142B parameters, delivering performance on par with state-of-the-art models while reducing training and inference costs. Leveraging our meticulously crafted and efficient data processing pipeline, dots.llm1 achieves performance comparable to Qwen2.5-72B after pretraining on high-quality corpus and post-training to fully unlock its capabilities. Notably, no synthetic data is used during pretraining. To foster further research, we open-source intermediate training checkpoints spanning the entire training process, providing valuable insights into the learning dynamics of large language models.
Read more about the model in the documentation.
SmolLM3 is a fully open, compact language model designed for efficient deployment while maintaining strong performance. It uses a transformer decoder architecture with Grouped Query Attention (GQA) to reduce the KV cache size, and NoPE (rotary position embeddings removed in some layers), enabling improved performance on long-context tasks. It is trained using a multi-stage training approach on high-quality public datasets across web, code, and math domains. The model is multilingual and supports very large context lengths. The instruct variant is optimized for reasoning and tool use.
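A back-of-the-envelope calculation shows why GQA shrinks the KV cache: each key/value head is shared by a group of query heads, so fewer KV heads need to be cached. The layer and head counts below are illustrative assumptions, not SmolLM3's actual configuration:

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bytes_per_elem=2):
    # Keys and values are each cached as [num_kv_heads, seq_len, head_dim]
    # per layer; bytes_per_elem=2 assumes bf16/fp16 storage.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Illustrative config: 32 layers, head_dim 128, 64K-token context.
mha = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=65536)
gqa = kv_cache_bytes(num_layers=32, num_kv_heads=4, head_dim=128, seq_len=65536)
print(mha // gqa)  # cutting KV heads 32 -> 4 shrinks the cache 8x
```

At long context lengths the KV cache, not the weights, dominates memory, which is why this reduction matters for deployment.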
Read more about the model in the documentation.
In previous versions, installing the kernels library would automatically activate the custom kernels added to transformers, because the @use_kernel_forward_from_the_hub decorator directly swapped out the model's forward method. This implicit behavior caused several issues for users, including problems with torch.compile, non-determinism, and inconsistent outputs.
To address this, we've introduced a new opt-in mechanism called kernelize. You can now enable kernel usage explicitly by passing use_kernels=True to from_pretrained. The use_kernel_forward_from_the_hub decorator now simply stores the kernel name that the user wants to use, and kernelize handles the rest under the hood.
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Llama-3.2-1B-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda",
use_kernels=True
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B-Instruct")
prompt = "Hello"
input_ids = tokenizer(prompt, return_tensors="pt").to(model.device).input_ids
output = model.generate(input_ids, max_new_tokens=100)
print(tokenizer.decode(output[0], skip_special_tokens=True))
More kernels will be added over time — this will be a collaborative, community-driven effort to make transformers lighter and faster 🤗
Support for Flash Attention 3 is added across the most popular models.
Several efforts refactoring the repository are happening in parallel. The direction is to greatly simplify the library, removing unnecessary codepaths. While the efforts are spread across the library, they're particularly visible in individual models, where non-modeling-specific code will be simplified and eventually removed.
We operate under the assumption that model-agnostic utilities shouldn't be in the modeling code. Things like attention outputs, hidden states, and router logits are important for end users but don't need to be explicitly displayed in the modeling code.
Several minimal breaking changes aiming to bring clearer defaults while greatly simplifying the library have been merged.
* dtype for pipelines to auto by @Vaibhavs10 in #38882
* output_attentions=True and the attn implementation is wrong by @ArthurZucker in #38288
* [Attention] Refactor Attention Interface for Bart-based Models by @vasqu in #38108
* [Attention] Attention refactor for Whisper-based models by @vasqu in #38235
* [compile] re-enable for Qwen-VL models by @zucchini-nlp in #38127
* forced_decoder_ids by @gante in #38232
* liger-kernel to docker file by @ydshieh in #38292
* transformers env output by @yao-matrix in #38274
* forced_decoder_ids deletion by @gante in #38316
* beam_indices by @gante in #38259
* custom_generate and trust_remote_code by @gante in #38304
* vasqu to self-comment-ci.yml by @ydshieh in #38324
* [FlexAttention] Reenable flex for encoder-decoder and make the test more robust by @vasqu in #38321
* kernels for AMD docker images by @ydshieh in #38354
* [OPT] Fix attention scaling by @vasqu in #38290
* get_default_device for torch<2.3 by @Cyrilvallez in #38376
* utils/notification_service.py by @ydshieh in #38379
* initialize_weights by @Cyrilvallez in #38382
* tokenizer -> tokenize by @foldl in #38357
* generation_config.json as base parameterization by @gante in #38330
* test_offloaded_cache_implementation by @gante in #37896
* pixel_values with inputs_embeds by @dxoigmn in #38334
* CsmForConditionalGenerationIntegrationTest by @ydshieh in #38424
* huggingface/transformers by @ydshieh in #38413
* from_pretrained by @pstjohn in #38155
* from_args_and_dict ProcessorMixin by @yonigozlan in #38296
* microsoft/python-type-stubs (post dropping support for Python 3.8) by @Avasam in #38335
* BatchFeature and BatchEncoding by @lgeiger in #38459
* Gemma3IntegrationTest by @ydshieh in #38471
* SinkCache to a custom_generate repo by @gante in #38399
* Gemma2IntegrationTest by @ydshieh in #38492
* av by @ydshieh in #38548
* python3 by @S1ro1 in #38555
* utils/notification_service.py by @ydshieh in #38556
* chameleon tests by @ydshieh in #38565
* utils/notification_service.py for AMD vs Nvidia by @ydshieh in #38563
* deepseekv3 by @ydshieh in #38562
* [FlexAttn] Fix models with unique characteristics by @vasqu in #38433
* repository field to benchmarks table by @McPatate in #38582
* mlm_probability to be set to None when mlm=False in DataCollatorForLanguageModeling by @KameniAlexNea in #38522
* isort from dependencies by @Sai-Suraj-27 in #38616
* return_dict=False giving errors in a few VLM models by @ydshieh in #38519
* MiniMax (docs and integration tests checkpoint) by @geetu040 in #38575
* test_initialization by @ydshieh in #38607
* ColQwen2ModelIntegrationTest by @ydshieh in #38583
* test_initialization for SwiftFormer by @ydshieh in #38636
* AriaForConditionalGenerationModelTest on CircleCI by @ydshieh in #38615
* InternVL integration test by @ydshieh in #38612
* aya_vision test by @ydshieh in #38674
* is_bitsandbytes_available() by @ved1beta in #38528
* llava tests by @ydshieh in #38722
* None instead of try/except by @zucchini-nlp in #38561
* average_tokens_across_devices=True and world size = 1 by @qgallouedec in #38785
* qwen_2_5 omni by @ydshieh in #38658
* llava_onevision tests by @ydshieh in #38791
* mllama by @ydshieh in #38704
* low_cpu_mem_usage by @Cyrilvallez in #38792
* llava_next tests by @ydshieh in #38813
* wandb.run.url instead of wandb.run.get_url() (deprecated) by @qgallouedec in #38817
* align_to_words=True in QuestionAnsweringPipeline can lead to duplicate answers by @yushi2006 in #38761
* qwen2_5_vl tests by @ydshieh in #38845
* auxiliary_in_channels default behavior in UperNet by @simonreise in #37540
* qwen3 tests by @ydshieh in #38862
* phi4_multimodal tests by @ydshieh in #38816
* qwen3_moe tests by @ydshieh in #38865
* raise from e in hub.py utility by @Wauplin in #37241
* fsmt tests by @ydshieh in #38904
* FalconMambaIntegrationTests by @ydshieh in #38566
* ALL_LAYERNORM_LAYERS by @Cyrilvallez in #38922
* test_initialization by @ydshieh in #38932
* mistral and mistral3 tests by @ydshieh in #38978
* is_split_into_words in the TokenClassificationPipeline by @yushi2006 in #38818
* rag by @ydshieh in #38585
* [Attention] Small fix on output attentions by @vasqu in #38948
* require_tf by @gante in #38944

The following contributors have made significant changes to the library over the last release:
* liger-kernel to docker file (#38292)
* vasqu to self-comment-ci.yml (#38324)
* kernels for AMD docker images (#38354)
* utils/notification_service.py (#38379)
* CsmForConditionalGenerationIntegrationTest (#38424)
* huggingface/transformers (#38413)
* Gemma3IntegrationTest (#38471)
* Gemma2IntegrationTest (#38492)
* av (#38548)
* utils/notification_service.py (#38556)
* chameleon tests (#38565)
* utils/notification_service.py for AMD vs Nvidia (#38563)
* deepseekv3 (#38562)
* return_dict=False giving errors in a few VLM models (#38519)
* test_initialization (#38607)
* ColQwen2ModelIntegrationTest (#38583)
* test_initialization for SwiftFormer (#38636)
* AriaForConditionalGenerationModelTest on CircleCI (#38615)
* InternVL integration test (#38612)
* aya_vision test (#38674)
* llava tests (#38722)
* qwen_2_5 omni (#38658)
* llava_onevision tests (#38791)
* mllama (#38704)
* llava_next tests (#38813)
* qwen2_5_vl tests (#38845)
* qwen3 tests (#38862)
* phi4_multimodal tests (#38816)
* qwen3_moe tests (#38865)
* fsmt tests (#38904)
* FalconMambaIntegrationTests (#38566)
* test_initialization (#38932)
* mistral and mistral3 tests (#38978)
* rag (#38585)
* output_attentions=True and the attn implementation is wrong (#38288)
* transformers env output (#38274)
* [Attention] Refactor Attention Interface for Bart-based Models (#38108)
* [FlexAttention] Reenable flex for encoder-decoder and make the test more robust (#38321)
* [OPT] Fix attention scaling (#38290)
* [Attention] Attention refactor for Whisper-based models (#38235)
* [FlexAttn] Fix models with unique characteristics (#38433)
* [Attention] Small fix on output attentions (#38948)
* microsoft/python-type-stubs (post dropping support for Python 3.8) (#38335)
* MiniMax (docs and integration tests checkpoint) (#38575)