Since the release of Llama 4, we have fixed a few issues that we are now releasing in patch v4.51.1.
Thanks to everyone for your patience!
Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).
For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
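If you want to try the on-the-fly quantized path for Scout, here is a minimal sketch assuming a bitsandbytes install and access to the gated Scout checkpoint; it is an illustration rather than an official recipe, and the exact settings needed to fit your GPU may differ.

# Hedged sketch: load Llama 4 Scout with on-the-fly 4-bit quantization via bitsandbytes.
# Assumes `pip install -U transformers bitsandbytes accelerate` and access to the gated checkpoint.
from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay in 4-bit
)

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)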
Getting started with Llama 4 using transformers is straightforward. Make sure you have transformers v4.51.0 or later installed:
pip install -U transformers[hf_xet]
Here's a quick example using the instruction-tuned Maverick model responding about two images, using tensor parallel for maximum speed. You need to run this script on an instance with 8 GPUs, using a command like:
torchrun --nproc-per-node=8 script.py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
Make sure to check the model cards on the repos (Llama 4 Maverick (~400B) and Llama 4 Scout (~109B)) for detailed usage instructions, including multimodal examples, specific prompt formats (like system prompts), quantization details, and advanced configuration options!
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and Phi-4.0 models. The model processes text, image, and audio inputs, generates text outputs, and comes with a 128K-token context length. The model underwent an enhancement process, incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modality supports are the following:
DeepSeek-v3 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
The model is detailed in the following paper.
The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.
The abstract from the paper is the following:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
The Qwen3 architecture has been contributed to transformers and is available in v4.51.0. At the time of release, the models themselves had not yet been published; stay tuned for a release from the Qwen team!
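Since only the architecture is in place for now, any checkpoint name is necessarily a placeholder; the sketch below simply shows how a Qwen3 checkpoint would load through the standard auto classes once the Qwen team publishes weights.

# Hedged sketch: the checkpoint id below is hypothetical at the time of writing.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Qwen/Qwen3-8B"  # placeholder id; replace with a released checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Give me a short introduction to large language models.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))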
Model docs are getting a significant overhaul, providing much-needed, ready-to-use examples that can be copy-pasted into your modules or consoles. We will adapt these examples to each model, with the goal of providing relevant examples on a per-model basis.
A very large PR was provided by @nikosanto13 that helped add modular files to all speech models in the library; seeing the differences between them is now much simpler, as are maintenance and eventual refactors.
- original_max_position_embeddings to YARN rope_scaling optional keys by @JustinTong0323 in #36877
- trainer_pt_utils.py docstrings for consistency by @ethanknights in #36912
- DataCollatorForWholeWordMask by @capemox in #36903
- uv for installing packages by @Sai-Suraj-27 in #36957
- networkx==3.2.1 manually in some CircleCI jobs after #36957 by @ydshieh in #37000
- to_py_obj for python-native numeric lists and scalars by @n0gu-furiosa in #36885
- qwen2_vl.md to Korean by @MinJu-Ha in #36750
- AwqConfigTest by @faaany in #37032
- test_assisted_decoding_in_different_gpu test on XPU by @yao-matrix in #37120
- _VALID_DICT_FIELDS to class attribute for shared dict parsing in subclasses by @Tavish9 in #36736
- [ModernBERT] Never save 'reference_compile' config; should be set based on end user by @tomaarsen in #36305
- 307 in RequestCounter by @ydshieh in #36953
- TASK_MAPPING by @saattrupdan in #37107
- min_new_tokens to prevent flaky length checks by @gante in #37175
- num_items_in_batch if necessary by @regisss in #36967
- utils/check_bad_commit.py by @ydshieh in #37272
- return_tensors in audio chat templates by @zucchini-nlp in #34601
- 0.11.2 by @ydshieh in #36962
- lru_cache for tokenization tests by @ydshieh in #36818
- return_dict logic to remove complicated if/else paths by @qubvel in #36794

The following contributors have made significant changes to the library over the last release:
A new model is added to transformers: DeepSeek 3 (also known as DeepSeek R1). It is added on top of the v4.50.3 release, and can be installed from the following tag: v4.50.3-DeepSeek-3.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.50.3-DeepSeek-3
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The model is detailed in the following paper.
The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.
The abstract from the paper is the following:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
We are super happy to make this code community-powered, and would love to see how you can help optimize the following:
- get_packed_weights from integrations/tensor_parallel

The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
You can run the model in FP8 automatically; two nodes of 8 H100s each should be more than enough!
# `run_deepseek_r1.py`
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(30)
tokenizer = AutoTokenizer.from_pretrained("deepseek-r1")
chat = [
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
{"role": "user", "content": "I'd like to show off how chat templating works!"},
]
model = AutoModelForCausalLM.from_pretrained("deepseek-r1", device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
This generated:
<|Assistant|><think>
Okay, the user wants to demonstrate how chat templating works. Let me break down what that means. Chat templating is about structuring the conversation data, especially for models that need specific input formats. Maybe they're referring to something like how messages are formatted with roles (user, assistant, system) in APIs like OpenAI.
First, I should explain what chat templating is. It's the process of formatting conversation data into a structured format that the model can understand. This usually includes roles and content. For example, user messages, assistant responses, and system messages each have their own role tags.
They might want an example. Let me think of a simple conversation. The user says "Hello, how are you?" and the assistant responds "I'm doing great. How can I help you today?" Then the user follows up with wanting to show off chat templating. So the example should include the history and the new message.
In some frameworks, like Hugging Face's Transformers, chat templates are applied using Jinja2 templates. The template might look something like combining system messages, then looping through user and assistant messages with appropriate tags. For instance, using {% for message in messages %} and assigning roles like <|user|>, <|assistant|>, etc.
I should structure the example with the messages array, showing each role and content. Then apply a hypothetical template to convert that into a formatted string the model uses. Also, mention that different models have different templating requirements, like using special tokens or varying role labels.
Wait, the user mentioned "chat templating" in the context of showing off. Maybe they want a practical example they can present. So providing a code snippet or a structured data example would be helpful. Let me outline a typical messages array and then the templated output.
Also, it's important to note that proper templating ensures the model knows the conversation flow, which is crucial for generating coherent responses. Maybe include a note about why it's important, like maintaining context and role-specific processing.
Let me check if there are any common mistakes or things to avoid. For example, not closing tags properly, or mismatching roles. But maybe that's too detailed unless the user asks. Focus on the positive example first.
Putting it all together, the response should have an example messages array, the applied template, and the final formatted string. Maybe use angle brackets or special tokens as placeholders. Also, mention that this helps in training or fine-tuning models with structured data.
I think that's a solid approach. Let me structure it step by step to make it clear.
</think>
Chat templating is a way to structure conversation data (e.g., user/assistant interactions) into a format that language models understand. This is especially important for models trained to handle multi-turn dialogues, where the input must explicitly separate roles (user, assistant, system, etc.) and messages. Let’s break this down with an example!
---
### **Step 1: Raw Conversation History**
Suppose we have this conversation:
- **User**: "Hello, how are you?"
- **Assistant**: "I'm doing great. How can I help you today?"
- **User**: "I'd like to show off how chat templating works!"
---
### **Step 2: Structured Messages**
In frameworks like Hugging Face Transformers or OpenAI, conversations are often formatted as a list of dictionaries with `role` and `content`:
```python
messages = [
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
{"role": "user", "content": "I'd like to show off how chat templating works!"},
]
```
---
### **Step 3: Apply a Chat Template**
A **chat template** converts this structured data into a single string formatted for the model. For example, using a Jinja-style template (common in Hugging Face):
```jinja
{% for message in messages %}
{% if message['role'] == 'user' %}
<|user|>{{ message['content'] }}<|end|>
{% elif message['role'] == 'assistant' %}
<|assistant|>{{ message['content'] }}<|end|>
{% endif %}
{% endfor %}
<|assistant|>
```
---
### **Step 4: Final Templated Output**
Applying the template to our `messages` list would produce:
```text
<|user|>Hello, how are you?<|end|>
<|assistant|>I'm doing great. How can I help you today?<|end|>
<|user|>I'd like to show off how chat templating works!<|end|>
<|assistant|>
```
This tells the model:
1. The conversation history (user/assistant turns).
2. The model’s turn to generate a response (`<|assistant|>` at the end).
---
### **Key Notes**:
- **Role Separation**: Tags like `<|user|>` and `<|assistant|>` help the model distinguish speakers.
- **Special Tokens**: Models often use unique tokens (e.g., `<|end|>`) to mark message boundaries.
- **Flexibility**: Templates vary by model (e.g., OpenAI uses `{"role": "user", "content": "..."}` instead of tags).
---
### **Why This Matters**:
- **Consistency**: Ensures the model understands dialogue structure.
- **Context Preservation**: Maintains the flow of multi-turn conversations.
- **Alignment**: Matches the format the model was trained on for better performance.
Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI, Llama, Mistral)? Let me know! 😊<|end▁of▁sentence|>
Use the following command to run it:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py
If you see the following error:

[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found

it means NCCL was probably not loaded.
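As a quick sanity check (a diagnostic sketch, not part of the original notes), you can verify that your PyTorch build ships NCCL before launching torchrun:

# Hedged diagnostic: confirms NCCL is available in the local PyTorch build.
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
print("NCCL available:", dist.is_nccl_available())
# If NCCL cannot find a usable socket interface, pointing it at one explicitly often helps, e.g.:
#   export NCCL_SOCKET_IFNAME=eth0
#   export NCCL_DEBUG=INFO  # prints which interfaces NCCL tries to use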
Thanks to the vLLM team, we caught a few more bugs that had slipped in!
[generate] beam search -- fix output cropping (#37080) by @gante
[blip-2] Fix dtype mismatch when keep in fp32 (#37068) by @zucchini-nlp
Fix PixtralProcessor patch_size when spatial_merge_size is used (#37019)
I completely forgot to put these in the previous patch, sorry! They should put the transformers backend in a good spot!
[Utils] torch version checks optionally accept dev versions (#36847) by @gante
Fix processor kwargs qwen2 vl (#36890) by @yonigozlan
Fix Pan and Scan on batched images Gemma3 (#36864) by @yonigozlan
There were some very minor bugs with the new hub kernels and with remote code that we had to fix.
Deprecate #36741 and map Causal to Conditional (#36917) by @zucchini-nlp
Fix pytorch deform attn path (#36923) by @qubvel
[chameleon] fix num image token check (#36918) by @zucchini-nlp
Fix torch version guard at import (#36907) by @zucchini-nlp
Starting with version v4.49.0, we have been doing model-based releases, in addition to our traditional, software-based monthly releases. These model-based releases provide a tag from which models may be installed.
Contrary to our software releases, these are not pushed to PyPI and are kept on our GitHub. Each release has a tag attributed to it, such as:

- v4.49.0-Gemma-3
- v4.49.0-AyaVision

⚠️ As bugs are identified and fixed on each model, the release tags are updated so that installing from that tag always gives the best experience possible with that model.
Each new model release will always be based on the current state of the main branch at the time of its creation. This ensures that new models start with the latest features and fixes available.
For example, if two models—Gemma-3 and AyaVision—are released from main, and then a fix for gemma3 is merged, it will look something like this:
                                       o---- v4.49.0-Gemma-3 (includes AyaVision, plus main fixes)
                                      /
---o--o--o--o--o-- (fix for gemma3) --o--o--o main
      \
       o---- v4.49.0-AyaVision
We strive to merge model specific fixes on their respective branches as fast as possible!
Gemma 3 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder, linked by a multimodal linear projection.
It cuts an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed a certain aspect ratio. For images that exceed this aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.
One particularity is that the model uses bidirectional attention on all the image tokens. In addition, the model interleaves sliding-window local attention with full causal attention in the language backbone, where every sixth layer is a full causal attention layer.
ShieldGemma 2, built on Gemma 3, is a 4 billion (4B) parameter model that checks the safety of both synthetic and natural images against key categories to help you build robust datasets and models. With this addition to the Gemma family of models, researchers and developers can now easily minimize the risk of harmful content in their models across key areas of harm as defined below:
We recommend using ShieldGemma 2 as an input filter to vision language models, or as an output filter of image generation systems. To train a robust image safety model, we curated training datasets of natural and synthetic images and instruction-tuned Gemma 3 to demonstrate strong performance.
AyaVision is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.
Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model, further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, in contrast, uses Aya Expanse 32B as its language model.
Key features of Aya Vision include:
Mistral 3.1 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
It is ideal for:
SmolVLM-2 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
SmolVLM2 is an adaptation of the Idefics3 model with two main differences:
SigLIP-2 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.
The model comes in two variants:

- FixRes: works with fixed-resolution images (backward compatible with SigLIP v1)
- NaFlex: works with variable image aspect ratios and resolutions (SigLIP2 in transformers)

PromptDepthAnything is a high-resolution, accurate metric depth estimation model that leverages prompting, inspired by its success in vision-language models (VLMs) and large language models (LLMs). Using iPhone LiDAR as a prompt, the model generates precise depth maps at up to 4K resolution, unlocking the potential of depth foundation models.
We add a new tool to transformers to visualize the attention layout of a given model. It only requires a model ID as input, and will load the relevant tokenizer/model and display what the attention mask looks like. Some examples:
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
visualizer = AttentionMaskVisualizer("meta-llama/Llama-3.2-3B-Instruct")
visualizer("A normal attention mask")
visualizer = AttentionMaskVisualizer("mistralai/Mistral-Small-24B-Instruct-2501")
visualizer("A normal attention mask with a long text to see how it is displayed, and if it is displayed correctly")
visualizer = AttentionMaskVisualizer("google/paligemma2-3b-mix-224")
visualizer("<img> You are an assistant.", suffix = "What is on the image?")
visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("You are an assistant. Make sure you print me") # we should have slidiing on non sliding side by side
visualizer = AttentionMaskVisualizer("google/gemma-3-27b-it")
visualizer("<img>You are an assistant. Make sure you print me") # we should have slidiing on non sliding side by side
We are deprecating transformers.agents in favour of the smolagents library. Read more about smolagents here.
We support adding custom quantization methods by using the @register_quantization_config and @register_quantizer decorators:

# Note: the import paths below are indicative of where these helpers live; adjust them to your transformers version if needed.
from transformers import AutoModelForCausalLM
from transformers.quantizers import HfQuantizer
from transformers.quantizers.auto import register_quantization_config, register_quantizer
from transformers.utils.quantization_config import QuantizationConfigMixin


@register_quantization_config("custom")
class CustomConfig(QuantizationConfigMixin):
    pass


@register_quantizer("custom")
class CustomQuantizer(HfQuantizer):
    pass


quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=CustomConfig(), torch_dtype="auto"
)
AMD is developing its in-house quantizer named Quark, released under the MIT license, which supports a broad range of quantization pre-processing, algorithms, dtypes and target hardware. You can now load a model quantized with the Quark library:

# pip install amd-quark
from transformers import AutoModelForCausalLM

model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")
Torchao is augmented with autoquant support, CPU-quantization, as well as new AOBaseConfig object instances for more advanced configuration.
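For reference, here is a minimal sketch of what torchao quantization looks like through the transformers integration; the quant type and checkpoint are illustrative assumptions, so check the torchao docs for the options supported by your versions.

# Hedged sketch: quantize a model with torchao through transformers (requires `pip install torchao`).
from transformers import AutoModelForCausalLM, TorchAoConfig

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",  # illustrative checkpoint
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)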
At loading time, the parallelization is now applied module-by-module, so that no memory overhead is required compared to what the final weight distribution will be!
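If you want to see the loading path this refers to, here is a short sketch; it assumes a multi-GPU node, a torchrun launch, and the tp_plan="auto" argument from the distributed inference docs.

# Hedged sketch: tensor-parallel loading, launched with e.g. `torchrun --nproc-per-node=4 tp_example.py`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # shards supported modules across the GPUs in the process group
)

inputs = tokenizer("The secret to baking a good cake is ", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))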
This release includes two speed upgrades to generate:
- do_sample=True;

from transformers import pipeline
import torch
prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"
pipe = pipeline(
"text-generation",
model=checkpoint,
assistant_model=assistant_checkpoint,
do_sample=True
)
pipe_output = pipe(prompt, max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])
- num_beams. The speedup is more visible on smaller models, where model.forward doesn't dominate the total run time.
- CandidateGenerator by @keyboardAnt, @jmamou, and @gauravjain14 in #35029

A significant redesign of our documentation has wrapped up. The goal was to greatly simplify the transformers documentation, making it much easier to navigate. Let us know what you think!
The research examples folder that was hosted in transformers is no more. We have moved it out of transformers and into the following repo: github.com/huggingface/transformers-research-projects/
We have updated our flex attention support to bring it on par with our Flash Attention 2 support.
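As a reminder of how this is enabled (a short sketch using the same attention-implementation flag shown elsewhere in these notes), you opt into flex attention per model at load time:

# Hedged sketch: opt into flex attention at load time (requires a PyTorch version that ships FlexAttention).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",  # illustrative checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flex_attention",
    device_map="auto",
)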
- EsmModelIntegrationTest::test_inference_bitsandbytes by @faaany in #36225
- LlavaForConditionalGenerationModelTest::test_config after #36077 by @ydshieh in #36230
- /generation by @gante in #36235
- test_export_to_onnx by @gante in #36241
- test_fast_is_faster_than_slow by @ydshieh in #36240
- Speech2TextFeatureExtractor API. by @KarelVesely84 in #34638
- pt_tf equivalence tests by @gante in #36253
- test_from_pretrained_low_cpu_mem_usage_equal less flaky by @gante in #36255
- GenerationTesterMixin inheritance is correct 🐛 🔫 by @gante in #36180
- main by @ydshieh in #36375
- is_causal fail with compile by @Cyrilvallez in #36374
- benchmark.yml by @ydshieh in #36402
- CandidateGenerator by @keyboardAnt in #35029
- contents: write by @ydshieh in #36445
- torch.distributed-compatible DynamicCache by @gante in #36373
- src/transformers/image_utils.py by @hmellor in #36435
- hub_retry by @ydshieh in #36449
- TRUST_REMOTE_CODE for RealmRetriever for security by @ydshieh in #36511
- input_ids passed to PrefixConstrainedLogitsProcessor is zero by @HiDolen in #36489
- DataCollatorForLanguageModeling by @capemox in #36457
- [HybridCache] disable automatic compilation by @gante in #36620
- make fix-copies by @gante in #36664
- from_pretrained by @Cyrilvallez in #36033
- meta device by @gante in #36543
- gc.collect() if only 1 shard is used by @gante in #36721
- test_eager_matches_sdpa_inference by @gante in #36650
- generation_config, overwrite default values with the model's base generation_config by @gante in #36684
- TrainingArguments.torch_empty_cache_steps post_init check by @pkuderov in #36734
- test_eager_matches_sdpa_inference by @gante in #36740
- is_decoder usage in PretrainedConfig documentation by @d-kleine in #36724
- tj-actions/changed-files by @ydshieh in #36795
- "dist": "loadfile" for pytest in CircleCI jobs by @ydshieh in #36811
- Trainer.collator.tokenizer in when Trainer.processing_class is None by @innerNULL in #36552
- GenerationMixin by @gante in #36605
- DataCollatorForLanguageModeling by @capemox in #36497
- .item in get_batch_samples by @regisss in #36861
- deformable_detr kernel from the Hub by @danieldk in #36853

The following contributors have made significant changes to the library over the last release:
- CandidateGenerator (#35029)
- deformable_detr kernel from the Hub (#36853)

A new model is added to transformers: Mistral 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Mistral-3.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-Mistral-3
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The model is detailed in the following blog post.
The models are available on the Hub with the following tag: mistral3
Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
It is ideal for:
This model was contributed by cyrilvallez and yonigozlan.
The original code can be found here and here.
Here is how you can use the image-text-to-text pipeline to perform inference with the Mistral3 models in just a few lines of code:
>>> from transformers import pipeline
>>> import torch
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "image",
... "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
... },
... {"type": "text", "text": "Describe this image."},
... ],
... },
... ]
>>> pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'
This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
"The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...
This example shows how to generate text using the Mistral3 model without providing any image input.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = ".mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
>>> user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."
>>> messages = [
... {"role": "system", "content": SYSTEM_PROMPT},
... {"role": "user", "content": user_prompt},
... ]
>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=text, return_tensors="pt").to(0, dtype=torch.float16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
>>> decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]
>>> print(decoded_output)
"1. À plus tard!
2. Salut, à plus!
3. À toute!
4. À la prochaine!
5. Je me casse, à plus!
```
/\_/\
( o.o )
> ^ <
```"
Mistral3 models also support batched image and text inputs.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path"
, "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]
This implementation of the Mistral3 models supports batched text and image inputs with a different number of images for each text.
This example also shows how to use BitsAndBytes to load the model in 4-bit quantization.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> model = AutoModelForImageTextToText.from_pretrained(
... model_checkpoint, quantization_config=quantization_config
... )
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n", "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]
A new model is added to transformers: Gemma 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Gemma-3.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The model is detailed in the following blog post. The models and demos using the model are available in the following collection.
A Space to play around with the 12B-it flavor is available here.
The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder, linked by a multimodal linear projection.
It cuts an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed a certain aspect ratio. For images that exceed this aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.
One particularity is that the model uses bidirectional attention on all the image tokens. In addition, the model interleaves sliding-window local attention with full causal attention in the language backbone, where every sixth layer is a full causal attention layer.
- Gemma3ForConditionalGeneration.
- Gemma3ForCausalLM for generation to avoid loading the vision tower.
- "<start_of_image>" token where the images should be inserted.
- apply_chat_template method to convert chat messages to text that can then be passed as text to the processor. You can also get a vectorized output from apply_chat_template. See the examples below for more details on how to use it.

The model supports cropping images into smaller patches when the image aspect ratio exceeds a certain value. By default the images are not cropped and only the base image is forwarded to the model. Users can set do_pan_and_scan=True to obtain several crops per image along with the base image to improve the quality in DocVQA or similar tasks requiring higher-resolution images.
Pan and scan is an inference time optimization to handle images with skewed aspect ratios. When enabled, it improves performance on tasks related to document understanding, infographics, OCR, etc.
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
# load the model as well so that `.to(model.device)` below has a target
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    do_pan_and_scan=True,
).to(model.device)
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# decode only the newly generated tokens (skip the prompt)
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
url_stop = "https://www.ilankelman.org/stopsigns/australia.jpg"
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url_cow},
            {"type": "image", "url": url_stop},
            {"type": "text", "text": "Are these two images identical?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# decode only the newly generated tokens (skip the prompt)
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
from transformers import AutoTokenizer, Gemma3ForCausalLM
model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma3ForCausalLM.from_pretrained(model_id, device_map="auto")
input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=100)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text)
A new model is added to transformers: Aya Vision. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-AyaVision.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-AyaVision
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The model is detailed in the following blog post.
The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.
Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model, further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, in contrast, uses Aya Expanse 32B as its language model.
Key features of Aya Vision include:
Here's an example usage of the Aya Vision model.
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
model_id = "CohereForAI/aya-vision-32b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id, device_map="auto", torch_dtype=torch.float16
)
# Format message with the aya-vision chat template
messages = [
{"role": "user",
"content": [
{"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
{"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
]},
]
inputs = processor.apply_chat_template(
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
gen_tokens = model.generate(
**inputs,
max_new_tokens=300,
do_sample=True,
temperature=0.3,
)
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
A new model is added to transformers: SigLIP-2.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SigLIP-2.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-SigLIP-2
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The paper page for the model is available here. It is detailed in the following blog post.
The models and demos using the model are available in the following collection.
The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.
The model comes in two variants:

- FixRes: works with fixed-resolution images (backward compatible with SigLIP v1)
- NaFlex: works with variable image aspect ratios and resolutions (SigLIP2 in transformers)

The abstract from the paper is the following:
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe—this includes decoder-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification (best SigLIP 2 ViT-g/16 achieves 85.0% ImageNet zero-shot accuracy), image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To provide users with the ability to trade-off inference cost with performance, we release model checkpoints at four sizes (ViT-B/86M, L/303M, So400m/400M, and g/1B).
- torch.distributed utilities, which may limit the scalability of batch size. However, DDP and FSDP work on single-node multi-GPU setups.
- [GemmaTokenizerFast] make sure to pass padding="max_length" and max_length=64, as that's how the model was trained.
- max_num_patches parameter in the Processor. The default value is max_num_patches=256. Increasing max_num_patches to 1024 (4x) will approximately double the processed image height and width, while preserving the aspect ratio.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip2_metrics_table.png" alt="drawing" width="600"/>
This model was contributed by qubvel. The original code can be found here.
There are two main ways to use SigLIP2: either using the pipeline API, which abstracts away all the complexity for you, or by using the Siglip2Model class yourself.

Pipeline API

The pipeline allows you to use the model in a few lines of code:
>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests
>>> # load pipe
>>> image_classifier = pipeline(
... task="zero-shot-image-classification",
... model="google/siglip2-base-patch16-224",
... )
>>> # load image
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> # inference
>>> candidate_labels = ["2 cats", "a plane", "a remote"]
>>> outputs = image_classifier(image, candidate_labels=candidate_labels)
>>> outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
>>> print(outputs)
[{'score': 0.1499, 'label': '2 cats'}, {'score': 0.0008, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}]
Using the model yourself
If you want to do the pre- and postprocessing yourself, here's how to do that:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch
>>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
>>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
>>> texts = [f"This is a photo of {label}." for label in candidate_labels]
# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
>>> inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
15.0% that image 0 is '2 cats'
NaFlex combines ideas from FlexiViT, i.e. supporting multiple, predefined sequence lengths with a single ViT model, and NaViT, namely processing images at their native aspect ratio. This enables processing different types of images at appropriate resolution, e.g. using a larger resolution to process document images, while at the same time minimizing the impact of aspect ratio distortion on certain inference tasks, e.g. on OCR.
Given a patch size and target sequence length, NaFlex preprocesses the data by first resizing the input image such that the height and width after resizing are multiples of the patch size, while
1. keeping the aspect ratio distortion as small as possible
2. producing a sequence length of at most the desired target sequence length (`max_num_patches`)
The resulting distortion in width and height is at most (patch_size - 1) / width and
(patch_size - 1) / height, respectively, which tends to be small for common resolutions and aspect ratios.
After resizing, the image is split into a sequence of patches, and a mask with padding information is added.
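To make the resizing rule concrete, here is a small illustrative computation (not the library's actual implementation) of target dimensions that are multiples of the patch size and keep the patch count within max_num_patches, under the simplifying assumption of uniform scaling:

# Illustrative sketch of the resize rule described above; the real processor may differ in details.
import math

def naflex_target_size(height, width, patch_size=16, max_num_patches=256):
    """Return (new_height, new_width): multiples of patch_size whose patch grid fits max_num_patches."""
    # Uniform scale that would make the patch grid use roughly max_num_patches patches.
    scale = math.sqrt(max_num_patches * patch_size**2 / (height * width))
    # Snap each side to a multiple of the patch size (rounding keeps aspect-ratio distortion small).
    new_height = max(patch_size, round(height * scale / patch_size) * patch_size)
    new_width = max(patch_size, round(width * scale / patch_size) * patch_size)
    # Rounding can overshoot the budget; shrink the longer side until the grid fits.
    while (new_height // patch_size) * (new_width // patch_size) > max_num_patches:
        if new_height >= new_width:
            new_height -= patch_size
        else:
            new_width -= patch_size
    return new_height, new_width

# Example: a 768x1024 image with 16x16 patches and a budget of 256 patches.
print(naflex_target_size(768, 1024))  # (224, 288) -> a 14 x 18 grid of 252 patches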
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch
>>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex")
>>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
>>> texts = [f"This is a photo of {label}." for label in candidate_labels]
# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing
# higher values e.g. `max_num_patches=512`
>>> inputs = processor(text=texts, images=image, max_num_patches=256, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
21.1% that image 0 is '2 cats'
A new model is added to transformers: SmolVLM-2.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SmolVLM-2.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
SmolVLM-2 is detailed in the following blog post.
The models and demos using the model are available in the following collection.
SmolVLM2 is an adaptation of the Idefics3 model with two main differences:
Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: do_resize and size.
Videos should not be upsampled.
If do_resize is set to True, the model resizes images so that the longest edge is 4*512 pixels by default.
The default resizing behavior can be customized by passing a dictionary to the size parameter. For example, {"longest_edge": 4 * 512} is the default, but you can change it to a different value if needed.
Here’s how to control resizing and set a custom size:
from transformers import SmolVLMImageProcessor

image_processor = SmolVLMImageProcessor(do_resize=True, size={"longest_edge": 2 * 512}, max_image_size=512)
Additionally, the max_image_size parameter, which controls the size of each square patch the image is decomposed into, is set to 512 by default but can be adjusted as needed. After resizing (if applicable), the image processor decomposes the images into square patches based on the max_image_size parameter.
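As a purely illustrative bit of arithmetic (not the actual processor code), this is how the square-patch decomposition plays out for a concrete image under the defaults described above:

# Illustrative arithmetic for the patch decomposition described above.
import math

max_image_size = 512        # side length of each square patch (default)
height, width = 1536, 2048  # example image, already within the default longest-edge limit of 4 * 512

rows = math.ceil(height / max_image_size)  # 3 rows of patches
cols = math.ceil(width / max_image_size)   # 4 columns of patches
print(f"{rows * cols} patches, plus the downscaled base image")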
This model was contributed by orrzohar.
The model can accept both images and videos as input, but you should use only one of the modalities at a time. Here's an example code for that.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Instruct")
model = AutoModelForImageTextToText.from_pretrained(
"HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda"
)
conversation = [
{
"role": "user",
"content":[
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": "Describe this image."}
]
}
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_texts = processor.batch_decode(output_ids, skip_special_tokens=True)
print(generated_texts)
# Video
conversation = [
{
"role": "user",
"content": [
{"type": "video", "path": "/path/to/video.mp4"},
{"type": "text", "text": "Describe this video in detail"}
]
},
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.
<img width="860" alt="image" src="https://github.com/user-attachments/assets/52e91b74-5572-46a6-93e5-058730411675" />The Qwen2.5-VL model is an update to Qwen2-VL from Qwen team, Alibaba Group.
The abstract from this update is the following:
Qwen2.5-VL marks a major step forward from Qwen2-VL, built upon the latest Qwen2.5 LLM. We’ve accelerated training and testing through the strategic implementation of window attention within the ViT. The ViT architecture itself has been refined with SwiGLU and RMSNorm, aligning it more closely with the LLM’s structure. A key innovation is the expansion of native dynamic resolution to encompass the temporal dimension, in addition to spatial aspects. Furthermore, we’ve upgraded MRoPE, incorporating absolute time alignment on the time axis to allow the model to effectively capture temporal dynamics, regardless of frame rate, leading to superior video understanding.
The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.
This model consists of matching two sets of interest points detected in an image. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
<img width="424" alt="image" src="https://github.com/user-attachments/assets/1d81983f-f9ce-4d82-adb7-e76098df543a" />The Granite Vision model is a variant of LLaVA-NeXT, leveraging a Granite language model alongside a SigLIP visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to VipLlava. It also uses a larger set of image grid pinpoints than the original LlaVa-NeXT models to support additional aspect ratios.
Zamba2 is a large language model (LLM) trained by Zyphra, and made available under an Apache 2.0 license.
Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B are hybrid models combining state-space models (specifically Mamba) and transformer blocks, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 Mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively.
GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like pdftex, mathpix, matplotlib, tikz, verovio or pyecharts. The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region’s bounding box.
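For a quick look at plain-text OCR, here is a hedged sketch; the checkpoint id follows the Hub naming at the time of writing and the generation settings are illustrative, so double-check both against the model card.

# Hedged sketch: plain document OCR with GOT-OCR2 through the image-text-to-text classes.
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "stepfun-ai/GOT-OCR-2.0-hf"  # assumed checkpoint id; verify on the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

image = Image.open("path/to/document.png")  # any image containing text
inputs = processor(image, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(processor.decode(generated_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))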
DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.
DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.
An improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 refines RT-DETR by introducing selective multi-scale feature extraction and a discrete sampling operator for broader deployment compatibility. These improvements yield a 0.3 to 1.4 increase in mAP on the COCO dataset, all while maintaining the same parameter count and frames-per-second (FPS) performance.
Transformers' CLI welcomes a new command: chat. This command starts a conversation with the model of your choosing directly in your terminal.
This feature exists in TRL and has been migrated to transformers for easier usage.
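For example, a session can be started against a Hub model like this (the model id is just an illustration, and the exact entry point may differ between versions):

```bash
transformers-cli chat --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct
```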
Ongoing work aims to standardize the image processors so that their APIs are equivalent. Additionally, the processors are being given fast variants so that they are never a bottleneck in image processing pipelines.
In this release, several processors have been standardized and have received their fast versions.
DPT image processors did not support segmentation_maps, accepting only images. This has been fixed.
This adds an argument to the preprocess method, so users passing arguments positionally to that method may see changed behavior. We recommend using keyword arguments for such methods so that new features can be added without affecting your code.
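A minimal sketch of the keyword-argument style recommended above, assuming an illustrative DPT checkpoint and dummy data:

```python
import numpy as np
from PIL import Image
from transformers import DPTImageProcessor

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")  # illustrative checkpoint
image = Image.new("RGB", (384, 384))                      # dummy image
segmentation_map = np.zeros((384, 384), dtype=np.uint8)   # dummy segmentation map

# Pass both arguments by keyword so future additions to preprocess() cannot shift them.
inputs = processor(images=image, segmentation_maps=segmentation_map, return_tensors="pt")
print(inputs.keys())
```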
- segmentation maps support for DPT image processor by @simonreise in #34345

The problem_type in the config.json file was read incorrectly by the pipeline, which mapped single-label to multi-label losses, and vice-versa. This has been fixed.
The description of the pull request is the easiest way to understand the problem, why it exists, and how it is solved; please read the description below:
The ignore_index property of the llava configuration has been removed as it was not serving a purpose.
Quantization has received several improvements and fixes, including the contribution of FP8 quantization and the HIGGS quantization interface.
Additionally, we're replacing the AutoGPTQ implementation with GPTQModel from ModelCloud (see the repository here).
GPTQModel originated as a major refactor of AutoGPTQ, but it is now a full stand-in replacement with a cleaner API, up-to-date model support, faster inference, and higher-quality quants.
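The user-facing API is unchanged: quantization still goes through GPTQConfig, with GPTQModel picked up as the backend when it is installed. A minimal sketch (the small checkpoint is chosen only for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization calibrated on the c4 dataset.
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
```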
- max_length by @gante in #36120
- generate-related objects and methods scheduled for removal in v4.48 by @gante in #35677
- GenerationConfig(cache_implementation="static") by @gante in #35679
- SequenceBiasLogitsProcessor by @gante in #35699
- torch.compile(model.forward) as a fast test by @gante in #34544

Pipelines have received several bug fixes and improvements which are detailed below.
- test_custom_4d_attention_mask by @ydshieh in #35606
- EarlyStoppingCallback not require load_best_model_at_end by @muellerzr in #35101
- test_beam_search_low_memory by @ydshieh in #35611
- MobileNetV1ModelTest::test_batching_equivalence for now by @ydshieh in #35614
- Phi] bias should be True by @ArthurZucker in #35650
- Compile] Only test compiling model forward pass by @ArthurZucker in #35658
- zero_shot_image_classification documentation guide link in SigLIP by @aretrace in #35671
- Trainer cannot correctly call torch_jit_model_eval by @Wanguy in #35722
- pt_to_tf by @gante in #35672
- check_circleci_user job by @Sai-Suraj-27 in #32866
- MimiModel with DeepSpeed ZeRO-3 by @anferico in #34735
- PeftModel by @ambroser53 in #35680
- MimiModel with DeepSpeed ZeRO-3" by @eustlb in #35755
- self-comment-ci.yml by @ydshieh in #35548
- timm import behaviour by @rwightman in #35800
- test_batching_equivalence's flakiness by @ydshieh in #35729
- TimmWrapper by @ariG23498 in #35744
- timm tag to timm-wrapper models. by @pcuenca in #35794
- get_cached_models by @Wauplin in #35809
- docs/source/ar/tasks/masked_language_modeling.md into Arabic by @AhmedAlmaghz in #35198
- benchmark code by @gante in #35730
- self-comment-ci.yml by @ydshieh in #35816
- working-directory in self-comment-ci.yml by @ydshieh in #35833
- head_dim in config extracted from Gemma2 GGUF model by @Isotr0py in #35818
- tests] remove some flash attention class tests by @ArthurZucker in #35817
- num_logits_to_keep as Tensor + add flag by @Cyrilvallez in #35757
- test_pipelines_video_classification that was always failing by @CalOmnie in #35842
- Rocketknight1 to self-comment-ci.yml by @ydshieh in #35881
- _supports_static_cache = True for some model classes by @ydshieh in #34975
- test_generated_length_assisted_generation by @keyboardAnt in #34935
- unwrap_and_save_reload_schedule to use weights_only=False by @ydshieh in #35952
- squad_convert_example_to_features to work with numpy v2 by @ydshieh in #35955
- test_assisted_decoding_matches_greedy_search by @ydshieh in #35951
- transformers-pytorch-deepspeed-latest-gpu by @ydshieh in #35940
- Tester object has no attribute '_testMethodName' by @faaany in #35781
- TimmBackboneModelTest::test_batching_equivalence by @ydshieh in #35971
- benchmark.yml by @ydshieh in #35974
- generation / quantization) by @ydshieh in #35341
- self-comment-ci.yml by @ydshieh in #36030
- Qwen2VLImageProcessorFast into Qwen2VLProcessor by @yeliudev in #35987
- past_key_values by @yaswanth19 in #35890
- test_flash_attn_2_can_dispatch_composite_models by @ydshieh in #36050
- trainer.md by @faaany in #36066
- perf_infer_gpu_one.md by @faaany in #36087
- torch.export and fix some vision models by @qubvel in #35124
- output_dir Optional in TrainingArguments #27866 by @sambhavnoobcoder in #35735
- PretrainedConfig and PreTrainedModel by @hmellor in #36091
- test_initialization for VitPoseBackboneModelTest for now by @ydshieh in #36154
- get_default_model_revision by @MarcoGorelli in #35982
- DataCollatorForMultipleChoice from the docs to the package by @bauwenst in #34763
- check_repository_consistency run faster by MP by @ydshieh in #36175
- test-save-trainer by @zucchini-nlp in #36191

The following contributors have made significant changes to the library over the last release:
- docs/source/ar/tasks/masked_language_modeling.md into Arabic (#35198)
- head_dim in config extracted from Gemma2 GGUF model (#35818)
- DataCollatorForMultipleChoice from the docs to the package (#34763)

This mostly ends the Python 3.9 issues!
For some very niche cases, the new RoPE embeddings introduced device failures.
- num_items_in_batch

Finally, the fix to Gemma2 is propagated to PaliGemma2!
Sorry, the fixes for num_items_in_batch are not done yet 😓 To follow along, see this PR; a new patch will be available soon!
Now, we mostly had BC issues with Python 3.9:
Then we had a small regression for DBRX saving:
Finally, we have a fix for Gemma and the hybrid attention architectures:
Miscellaneous:
Yet again we ship a gradient accumulation fix! The attention refactoring also let a small typo slip in; we made sure Phi is no longer broken!
Moonshine had a small issue when wrapping generate, so we removed that!
🤗
The ModernBert model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.
It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.
Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
We add a TimmWrapper set of classes such that timm models can be loaded as transformers models into the library.
Here's a general usage example:
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoConfig, AutoModelForImageClassification, AutoImageProcessor
checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)
with torch.no_grad():
    logits = model(**inputs).logits
top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)
Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc:
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import pipeline
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))
Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.
The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.
In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performances while reducing training costs:
- One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
- Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B parameter model by duplicating the redundant layers and continuing pre-training with 2TT of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
- Knowledge distillation for better tiny models: To provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
Check out all Bamba-9B model checkpoints here.
ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation”.
The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.
The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:
The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on VQ-VAE model. Discretized visual tokens are later fused with text token ids for image and text generation.
Emu3 outperforms leading models like SDXL and LLaVA-1.6 in both generation and perception tasks, without relying on diffusion or compositional methods.
A new Cohere update was added through a new "Cohere2" set of classes.
TextNet is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants TextNet-T, TextNet-S, and TextNet-B (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.
Differential Transformer combines the Llama architecture with Differential Transformer's Attention.
The conversion script needed a few updates, while the modeling code was barely changed!
Moonshine is an autoregressive speech recognition encoder-decoder model that improves upon Whisper's architecture. Namely, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper, which is restricted to fixed 30-second windows. It was introduced by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, and Pete Warden in Moonshine: Speech Recognition for Live Transcription and Voice Commands.
From the VPTQ contributors:
VPTQ is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B, and even 405B, models to 1-2 bits without retraining while maintaining high accuracy. More details here: https://github.com/microsoft/vptq
From the contributors:
HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper.
Runtime support for HIGGS is implemented through the FLUTE library.
This PR adds support for HIGGS+FLUTE into transformers allowing for low-error 0-shot quantization and fast LLM inference.
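As a hedged sketch only (we assume the HiggsConfig quantization config exposed in this release and the FLUTE kernels being installed; the checkpoint name is illustrative and the exact options should be checked against the quantization docs):

```python
import torch
from transformers import AutoModelForCausalLM, HiggsConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative checkpoint
quantization_config = HiggsConfig(bits=4)       # 0-shot: no calibration data needed

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
```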
We merged a cleanup for vision-language models to make sure all models are standardized.
Many models in Transformers include scripts to convert the original model checkpoints into a Transformers-compatible format. These scripts can be found in the repo using the glob pattern models/**/convert_*.py. They were a recurring source of vulnerability reports and CVEs because many models were originally released using insecure formats like older PyTorch .bin weights or pickle files. The conversion scripts had to open these formats, and this meant that they were vulnerable to maliciously crafted inputs.
In practice, we do not see this as a serious vulnerability. The conversion scripts are never imported or called by the rest of the library; each script is standalone, and so the only way to exploit the vulnerability is to create a malicious checkpoint, induce a user to download it, and then also induce them to manually call a specific conversion script on it.
However, even if there is little practical risk of an exploit, we are aware that open vulnerability reports create a compliance problem for users, and so beginning with this release we will be excluding these conversion scripts from release branches and wheels. They will remain accessible to developers on the main branch.
A regular expression used within the Nougat code has been modified to ensure it does not hang. The method should output the same results but we cannot guarantee it; we recommend upgrading to the latest transformers if you use this model to ensure your code is performance-optimized.
This PR finalizes work that aims to enable short-form (< 30 secs) and long-form generation using temperature fallback. It is a significant improvement to the Whisper codebase, but it does result in the following breaking changes:
➡️ Previously:
• Short-form: Returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: Returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.
➡️ From now on:
Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.
Decoder input IDs and EOS token IDs are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True).
In this case, the output will be a ModelOutput, which is the result of the underlying call to GenerationMixin’s generate. Indeed, return_timestamps=False ensures no seeking occurs; only a single call to generate is made. Therefore, this output includes both decoder input IDs and the EOS token ID.
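A minimal sketch of the new behavior for a short-form input (the checkpoint and dummy dataset are illustrative choices, not part of the release notes):

```python
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")

# return_dict_in_generate=True with return_timestamps=False: a single underlying call
# to generate is made, so the ModelOutput keeps decoder input IDs and the EOS token.
out = model.generate(**inputs, return_dict_in_generate=True, return_timestamps=False)
print(processor.batch_decode(out.sequences, skip_special_tokens=True)[0])
```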
In order to have cleaner, isolated, future-proof code for the attention layers, they have been refactored so that model-specific attention code stays within each model's files, while attention definitions relating to SDPA, Flash Attention, and other types of attention have been moved to a common file.
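The user-facing way of selecting an attention backend remains the attn_implementation argument at load time; a minimal illustration (the model name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    attn_implementation="sdpa",  # or "eager", "flash_attention_2"
    torch_dtype=torch.bfloat16,
)
```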
- num_items_in_batch not being an integer by @xspirus in #35115
- docs/source/ar/community.md into Arabic by @AhmedAlmaghz in #33027
- AssistedCandidateGenerator for Improved Modularity and Reusability by @keyboardAnt and @jmamou in #35009
- Thread for SF conversion by @ydshieh in #35236
- rsfE with pytest by @ydshieh in #35119
- benchmark job in push-important-models.yml by @ydshieh in #35292
- benchmarks_entrypoint.py by @McPatate in #34495
- text by @probicheaux in #35201
- docs] Add link to ModernBERT Text Classification GLUE finetuning script by @tomaarsen in #35347
- Mamba2] Fix caching, slow path, and multi-gpu by @vasqu in #35154
- _make_causal_mask by @jiwoong-choi in #35291
- weights_only=True with torch.load for transfo_xl by @ydshieh in #35241
- test_generate_with_static_cache even less flaky by @ydshieh in #34995
- is_causal is passed explicitly by @Cyrilvallez in #35390
- PaliGemmaProcessor by @alvarobartt in #35278
- .github/workflows/self-comment-ci.yml for now by @ydshieh in #35366
- GPTQ, CompressedTensors] Fix unsafe imports and metada check by @vasqu in #34815
- ACCELERATE_MIN_VERSION on error by @KSafran in #35189
- model_accepts_loss_kwargs for timm model by @qubvel in #35257
- sdpa_kernel by @jla524 in #35410
- docs/source/ar/tasks/question_answering.md into Arabic by @AhmedAlmaghz in #35196
- docs/source/ar/tasks/summarization.md into Arabic by @AhmedAlmaghz in #35195
- sdpa_kernel by @jla524 in #35461

The following contributors have made significant changes to the library over the last release:
- Thread for SF conversion (#35236)
- rsfE with pytest (#35119)
- benchmark job in push-important-models.yml (#35292)
- weights_only=True with torch.load for transfo_xl (#35241)
- test_generate_with_static_cache even less flaky (#34995)
- .github/workflows/self-comment-ci.yml for now (#35366)

We waited a little bit to make sure it was stable; thanks @winglian for double checking and everyone for the fixes!
Fix GA loss bugs and add unit test (#35121) Contributed by @techkang and @ArthurZucker.
Fix num_items_in_batch not being an integer (#35115) Contributed by @xspirus.
Fix FSDP no longer working (#35212) Contributed by @muellerzr.
Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35212) Contributed by @winglian.
Only import torch.distributed if it is available (#35133) Contributed by @GaetanLepage.
[Whisper] Patch float type on MPS (#35295) Contributed by @eustlb. 🔜 we should probably have MPS CIs to avoid repeating this!
PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.
PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.
<img width="743" alt="image" src="https://github.com/user-attachments/assets/55cda8a6-b463-4a58-b7d3-f7d50ee2fa11">The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.
<img width="413" alt="image" src="https://github.com/user-attachments/assets/561ca9d7-0327-477a-96b8-61d2af0caf34">The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.
The architectural changes from the original OLMo model to this model are:
Commits:
We add support for Meta's Layer-Skip Llama 3.2 1B model.
The Llama 3.2 1B model was continually pretrained with the LayerSkip recipe, early exit loss and layer dropout, as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, and is capable of performing self-speculative decoding: decode with earlier layers and verify with remaining layers.
<img width="854" alt="image" src="https://github.com/user-attachments/assets/4a9e3596-e44e-419f-804d-9f4d03f8f680">This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example).
The motivation is multi-fold:
to make the modeling code as simple as in the single-worker case:
all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.
to make tensor parallelism easily accessible by users:
added a model.tensor_parallel(device_mesh) method that allows users to turn a single-proc model into a parallel model.
This is the first PR of many to simplify and enable Tensor Parallel across models.
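A minimal sketch of the user-facing flow under these assumptions: the script is launched with one process per GPU (e.g. via torchrun), the model name is illustrative, and model.tensor_parallel(device_mesh) is the method described in this PR:

```python
import os

import torch
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM

# Launched with: torchrun --nproc-per-node=8 tp_example.py
world_size = int(os.environ["WORLD_SIZE"])
device_mesh = init_device_mesh("cuda", (world_size,))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # illustrative checkpoint
    torch_dtype=torch.bfloat16,
)
# Turn the single-process model into a tensor-parallel one (method added by this PR).
model.tensor_parallel(device_mesh)
```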
Python 3.8 reaches end of life, and, as such, we drop it from our CI.
Several improvements have been done to the GGUF support in transformers; notably by adding new architectures to the list of supported architectures.
- use_parallel_residual and qkv_bias for StableLM GGUF config extraction by @Isotr0py in #34450

We continue the work to improve the speed of fast processors as detailed in this roadmap.
We contribute a fast processor to RT-DETR.
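The fast variant can be requested through use_fast=True when loading the image processor; a small sketch with an illustrative checkpoint and dummy image:

```python
from PIL import Image
from transformers import AutoImageProcessor

# use_fast=True requests the torchvision-backed fast processor when one is available.
processor = AutoImageProcessor.from_pretrained("PekingU/rtdetr_r50vd", use_fast=True)

image = Image.new("RGB", (640, 480))  # dummy image, replace with a real one
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
```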
A new pipeline has been added to transformers: image-text-to-text!
The pipeline supports the following inputs:
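A short usage sketch (the model and image URL are illustrative; any image-text-to-text model on the Hub works the same way):

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
out = pipe(text=messages, max_new_tokens=40, return_full_text=False)
print(out[0]["generated_text"])
```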
We have had several issues with chat templates because they're stored as single lines in the JSON config files:
- processor templates in chat_template.json and tokenizer templates in tokenizer_config.json causing confusion

The solution:
- chat_template.jinja file in the repo
- Processor classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
- If a chat_template.jinja file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any chat_template entry in tokenizer_config.json.

For now, we continue saving in the old format by default. We'll probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it's ready for deployment when we do the switch.
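Nothing changes in how templates are used; the gain is readability, since a template like the toy one below can live as a plain multi-line chat_template.jinja file instead of a single escaped JSON line (the template and tokenizer here are purely illustrative):

```python
from transformers import AutoTokenizer

# A toy, readable chat template; as a standalone Jinja file it keeps its line breaks.
template = """{% for message in messages %}
<|{{ message['role'] }}|>{{ message['content'] }}<|end|>
{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}"""

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer
tokenizer.chat_template = template
print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
    tokenize=False,
))
```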
This PR largely reworks the logic we use in the modular converter. It is (hopefully) clearer and more maintainable. Instead of going in all directions, adding stuff, then deleting it if not needed, we now do the following:
- convert_tokens_to_ids by @winstxnhdw in #34030
- torch.fx issue related to the new loss_kwargs keyword argument by @michaelbenayoun in #34380
- test_eager_matches_sdpa_generate by @gante in #34386
- tensorflow_probability<0.22 in docker files by @ydshieh in #34381
- "best" for args.save_strategy. by @seanswyi in #31817
- model_doc/barthez.md to Korean by @Jwaminju in #33980
- docs/source/ar/fast_tokenizers.md into Arabic by @AhmedAlmaghz in #33034
- post_process_depth_estimation for GLPN by @alex-bene in #34413
- head_dim for mixtral model by @wavy-jung in #34281
- optimizer_cls_and_kwargs to Trainer.__init__ by @apoorvkh in #34358
- generate tests to the right mixin and delete redundant tests by @gante in #34464
- gc.collect and cuda.empty_cache by @ydshieh in #34514
- input_ids-inputs_embeds equivalence check by @gante in #34535
- test_eager_matches_sdpa_inference less flaky by @ydshieh in #34512
- docs/source/ar/multilingual.md into Arabic by @AhmedAlmaghz in #33048
- query_pre_attn_scalar different of num_heads in default gemma2 config by @molbap in #34540
- isin_mps_friendly can support 0D tensors by @gante in #34538
- @slow for test_eager_matches_sdpa_inference by @ydshieh in #34558
- convbert.md to Korean by @ahnjj in #34599
- timesformer.md to Korean by @mreraser in #33972
- docs/source/ar/trainer.md into Arabic by @AhmedAlmaghz in #33080
- Tool.from_space() by @aymeric-roucher in #34561
- docs/source/ar/torchscript.md into Arabic by @AhmedAlmaghz in #33079
- continue_final_message=True by @lewtun in #34253
- patch_size -> num_image_tokens in processing by @zucchini-nlp in #33424
- empty_cache device-agnostic by @faaany in #34774
- test_medium_seamless_m4t_pt in subprocess to avoid many failures by @ydshieh in #34812
- check_training_gradient_checkpointing by @ydshieh in #34806
- torch.export by @philkuz in #34103
- max_steps overriding num_train_epochs by @qgallouedec in #34810
- use_cache by @zucchini-nlp in #34274
- Deberta/Deberta-v2] Refactor code base to support compile, export, and fix LLM by @ArthurZucker in #22105
- peft] Given that self.active_adapter is deprecated, avoid using it by @tomaarsen in #34804
- test_auto_backbone_timm_model_from_pretrained by @ydshieh in #34877
- docs/source/ar/benchmarks.md into Arabic by @AhmedAlmaghz in #33023
- FlexAttention] Update gemma2 by @ArthurZucker in #34942
- get_max_length by @ydshieh in #34971
- Thread by @ydshieh in #34966
- release_memory() by @faaany in #34911
- VisitWebpageTool by @sergiopaniego in #34978
- save_pretrained for partially offloaded models by @kylesayrs in #34890
- utils/check_bad_commit.py (for auto ping in CI) by @ydshieh in #34943
- PixtralImageProcessorFast by @mgoin in #34836
- .from_pretrained type annotations by @qubvel in #34973
- FillMaskPipeline.__call__ signature and docstring by @alvarobartt in #35006
- cu_seqlens when tracing by @xenova in #35016
- test_eager_matches_sdpa_inference for XPU backend by @dvrogozh in #34889
- docs/source/ar/notebooks.md into Arabic by @AhmedAlmaghz in #33049
- BertGeneration by @ydshieh in #35043
- cuda by @faaany in #35047
- pad_token_tensor is None in warning by @tshu-w in #34005
- GPTNeoX] Flex Attention + Refactor by @vasqu in #34896
- tokenizers] bump to 0.21 by @ArthurZucker in #34972
- tie_word_embeddings handling for GGUF models by @Isotr0py in #35085
- trainer] fix the GA model_accepts_loss_kwargs by @ArthurZucker in #34915
- test_trainer.py) by @ydshieh in #35062

The following contributors have made significant changes to the library over the last release:
- docs/source/ar/fast_tokenizers.md into Arabic (#33034)
- docs/source/ar/multilingual.md into Arabic (#33048)
- docs/source/ar/trainer.md into Arabic (#33080)
- docs/source/ar/torchscript.md into Arabic (#33079)
- docs/source/ar/benchmarks.md into Arabic (#33023)
- PixtralImageProcessorFast (#34836)

One small fix for the FSDP + gradient accumulation loss issue!