Hugging Face / Transformers

Releases: 13 · Avg 4/mo · Versions: v4.57.4 → v5.5.3
Apr 8, 2025
Patch release v4.51.1

Since the release of Llama 4, we have fixed a few issues that we are now releasing in patch v4.51.1:

  • Fixing flex attention for torch=2.6.0 (#37285)
  • more fixes for post-training llama4 (#37329)
  • Remove HQQ from caching allocator warmup (#37347)
  • fix derived berts _init_weights (#37341)
  • Fix init empty weights without accelerate (#37337)
  • Fix deepspeed with quantization (#37324)
  • fix llama4 training (#37319)
  • fix flex attn when optional args aren't passed (#37327)
  • Multiple llama4 fixes (#37353)

Thanks to all for your patience!

Apr 5, 2025
v4.51.0: Llama 4, Phi4-Multimodal, DeepSeek-v3, Qwen3

New Model Additions

Llama 4

Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:

  • The highly capable Llama 4 Maverick with 17B active parameters out of ~400B total, with 128 experts.
  • The efficient Llama 4 Scout also has 17B active parameters out of ~109B total, using just 16 experts.

Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens on data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).

For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
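As a rough back-of-the-envelope check (illustrative arithmetic only: it ignores activations, the KV cache, and per-layer overhead), the memory needed just for the weights of a ~109B-parameter model at different precisions can be estimated as follows:

```python
# Rough weight-memory estimate for a ~109B-parameter model such as Llama 4 Scout.
# Illustrative arithmetic only: ignores activations, KV cache, and overhead.
PARAMS = 109e9

def weight_memory_gb(num_params, bits_per_param):
    """Bytes for the weights alone, expressed in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

print(f"bf16 : {weight_memory_gb(PARAMS, 16):.1f} GB")  # → bf16 : 218.0 GB
print(f"int8 : {weight_memory_gb(PARAMS, 8):.1f} GB")   # → int8 : 109.0 GB
print(f"int4 : {weight_memory_gb(PARAMS, 4):.1f} GB")   # → int4 : 54.5 GB
```

At 4-bit precision, ~54.5 GB of weights fits comfortably within a single 80 GB server-grade GPU, which is what makes the single-GPU deployment of Scout plausible.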

Getting started with Llama 4 using transformers is straightforward. Make sure you have transformers v4.51.0 or later installed:

pip install -U transformers[hf_xet]

Here's a quick example using the instruction-tuned Maverick model responding about two images, using tensor parallel for maximum speed. You need to run this script on an instance with 8 GPUs, using a command like:

torchrun --nproc-per-node=8 script.py

from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)

response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])

Make sure to check the model cards on the repos (Llama 4 Maverick (~400B) and Llama 4 Scout (~109B)) for detailed usage instructions, including multimodal examples, specific prompt formats (like system prompts), quantization details, and advanced configuration options!

Phi4-Multimodal

<img width="898" alt="image" src="https://github.com/user-attachments/assets/847a18d8-0d6a-4767-b45c-3cc9d6ff392e" />

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generates text outputs, and comes with a 128K-token context length. The model underwent an enhancement process incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages supported by each modality are the following:

  • Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
  • Vision: English
  • Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese
  • Add Phi4 multimodal by @Cyrilvallez in #36939
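As an illustrative sketch only (the helper below is hypothetical, and the numbered `<|image_N|>` / `<|audio_N|>` placeholder convention is an assumption carried over from earlier Phi model cards; check the Phi-4-multimodal-instruct model card for the authoritative prompt format), a prompt mixing several modalities can be assembled by numbering each media input:

```python
# Hypothetical helper: builds a Phi-style prompt with numbered media placeholders.
# The <|image_N|>/<|audio_N|> convention is assumed from earlier Phi model cards;
# verify against the Phi-4-multimodal-instruct model card before relying on it.
def build_prompt(text, num_images=0, num_audios=0):
    image_tags = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    audio_tags = "".join(f"<|audio_{i}|>" for i in range(1, num_audios + 1))
    return f"<|user|>{image_tags}{audio_tags}{text}<|end|><|assistant|>"

prompt = build_prompt("What is shown in this image?", num_images=1)
print(prompt)  # → <|user|><|image_1|>What is shown in this image?<|end|><|assistant|>
```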

DeepSeek-v3

DeepSeek-v3 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relevant to that model.

The model is detailed in the following paper.

Overview

The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.

The abstract from the paper is the following:

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

  • [WIP] add deepseek-v3 by @bzantium in #35926
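The auxiliary-loss-free load-balancing idea from the abstract can be sketched in a few lines. This is a toy illustration, not the DeepSeek implementation: a per-expert bias is added to the routing scores only when selecting experts, and is nudged after each batch so overloaded experts become less likely to be picked, without any auxiliary loss term.

```python
# Toy sketch of auxiliary-loss-free load balancing: a per-expert routing bias
# is nudged each batch to counteract skewed router scores. Illustrative only.
import random

random.seed(0)

NUM_EXPERTS, TOP_K, BIAS_STEP = 4, 1, 0.05
bias = [0.0] * NUM_EXPERTS  # used for expert *selection* only, not output weighting

def route(scores):
    """Pick the top-k experts by score + bias."""
    ranked = sorted(range(NUM_EXPERTS), key=lambda e: scores[e] + bias[e], reverse=True)
    return ranked[:TOP_K]

for _ in range(200):  # simulate batches of tokens with a skewed router
    load = [0] * NUM_EXPERTS
    for _ in range(64):
        # Expert 0 gets systematically higher raw scores than the others.
        scores = [random.gauss(0.5 if e == 0 else 0.0, 0.1) for e in range(NUM_EXPERTS)]
        for e in route(scores):
            load[e] += 1
    mean = sum(load) / NUM_EXPERTS
    for e in range(NUM_EXPERTS):  # overloaded experts get a lower bias, and vice versa
        bias[e] -= BIAS_STEP if load[e] > mean else -BIAS_STEP

print(bias)  # expert 0's bias ends up well below the others, offsetting its skewed scores
```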

Qwen3

The Qwen3 architecture has been contributed to transformers and is available in v4.51.0. At the time of release, the models themselves have not yet been released - stay tuned for a release from the Qwen team!

  • Adding Qwen3 and Qwen3MoE by @bozheng-hit in #36878

Documentation

Model docs are getting a significant overhaul, providing much-needed, ready-to-use examples that can be copy-pasted into modules and consoles. We will adapt these examples to each model, with the goal of providing relevant examples on a per-model basis.

  • [docs] Model docs by @stevhliu in #36469

Significant model improvements

A very large PR by @nikosanto13 added modular files to all speech models in the library; seeing the differences between them is now much simpler, as are maintenance and eventual refactors.

  • Introduce modular files for speech models by @nikosanto13 in #35902

Bugfixes and improvements

  • fix: loss computation after embeddings resize - mllama by @Ssukriti in #36840
  • Simplify keep_in_fp32_modules logic by @Cyrilvallez in #36722
  • Fix Pan and Scan on batched images Gemma3 by @yonigozlan in #36864
  • Update installation.md by @ariG23498 in #36826
  • fix Gemma3 Config by @eljandoubi in #36893
  • Fix torch version guard at import by @zucchini-nlp in #36907
  • [Fix] Add original_max_position_embeddings to YARN rope_scaling optional keys by @JustinTong0323 in #36877
  • tests: fix asyncio.wait() usage for python>=3.11 by @dvrogozh in #36898
  • [chameleon] fix num image token check by @zucchini-nlp in #36918
  • Fix Compressed tensors to_dict_diff by @MekkCyber in #36922
  • Use another repo. for Mistral3 processor testing by @ydshieh in #36925
  • Fix typos by @omahs in #36910
  • Update trainer_pt_utils.py docstrings for consistency by @ethanknights in #36912
  • [2/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36857
  • Fix pytorch deform attn path by @qubvel in #36923
  • More precise comment by @ydshieh in #36935
  • Added support for seed in DataCollatorForWholeWordMask by @capemox in #36903
  • Fix processor kwargs qwen2 vl by @yonigozlan in #36890
  • Disallow Offload to disk for gguf files by @MekkCyber in #36933
  • Deprecate #36741 and map Causal to Conditional by @zucchini-nlp in #36917
  • Fixing _pre_quantization_dtype when torch_dtype is None by @MekkCyber in #36930
  • Export for Phi4-mini by @guangy10 in #36780
  • fix typos in the tests directory by @threewebcode in #36932
  • Fix cuda index issue in cache allocator by @SunMarc in #36937
  • [Utils] torch version checks optionally accept dev versions by @gante in #36847
  • Update after #36962 by @ydshieh in #36965
  • Change GPUS to GPUs by @zhanluxianshen in #36945
  • typo fixed in README_fr.md by @NargiT in #36951
  • Updated docker files to use uv for installing packages by @Sai-Suraj-27 in #36957
  • update examples after ruff being updated by @ydshieh in #36972
  • Remove extra tensor clone in PyTorch code by @cyyever in #36748
  • [docs] Fix image link by @stevhliu in #36869
  • Add ruff target-version by @cyyever in #36971
  • update bot comment again by @ydshieh in #36974
  • 🚨Deprecate legacy argument for image-text-to-text models and adopt new behavior by default by @yonigozlan in #36307
  • Fix tensor dtype mismatch by @cyyever in #36985
  • byebye CircleCI TF jobs by @ydshieh in #36998
  • Use torch.expm1 by @cyyever in #36995
  • Install networkx==3.2.1 manually in some CircleCI jobs after #36957 by @ydshieh in #37000
  • Fix Optional type annotation by @cyyever in #36841
  • Fix get_device_properties by @ivarflakstad in #36997
  • Allow easy registration of custom attention functions by @Cyrilvallez in #36889
  • Fix removing "cpu" from frozenset in bitsandbytes.py to allow better ROCm support. by @anadon in #36975
  • Fix device_map check for ggml files by @MekkCyber in #37003
  • Log the correct learning rate by @SunMarc in #36973
  • fix typos in the code comments and error messages by @threewebcode in #36993
  • Remove deprecated training arguments by @cyyever in #36946
  • [docs] Attention mask image by @stevhliu in #36970
  • fix transformers_cli import relative path issue by @yao-matrix in #36989
  • Support QuestionAnswering Module for ModernBert based models. by @bakrianoo in #35566
  • Fix PixtralProcessor patch_size when spatial_merge_size is used by @mgoin in #37019
  • [Modeling] Load FP8 safetensors such as DeepSeek by @kylesayrs in #36828
  • Mark 2 tests as flaky for now by @ydshieh in #37038
  • remove redundant code in trainer by @hiyouga in #36994
  • Skip FP8 linear tests For device capability 9.0 by @MekkCyber in #37008
  • Add Distill Any Depth by @keetrap in #36614
  • fix pegasus init weights and other copied models by @jiqing-feng in #36844
  • Optimize to_py_obj for python-native numeric lists and scalars by @n0gu-furiosa in #36885
  • Fixup for distill_any_depth conversion script by @qubvel in #37043
  • [chat templates] support loading audio from video by @zucchini-nlp in #36955
  • [audio utils] fix fft_bin_width computation by @eustlb in #36603
  • [generate, cache] handle more complex device maps by @gante in #37014
  • clean pipeline question_answering. by @zhanluxianshen in #36986
  • Avoid unnecessary device operations in loss computing by @cyyever in #36950
  • Set weights_only in torch.load by @cyyever in #36991
  • Replace default split function with jnp.split() in flax models by @premmurugan229 in #37001
  • Remove deprecated batch_size parameter by @cyyever in #37007
  • fixed typo by @finnoh in #37036
  • fix: Fully remove legacy cache from Llama by @Wheest in #36958
  • Fix SDPA implementation in Qwen2-VL (issues with torch==2.6.0) by @ManuelFay in #36891
  • fix: AttributeError: 'LlavaProcessor' object has no attribute 'image_token_id' by @jp1924 in #37026
  • Fix some typos about benchmark scripts. by @zhanluxianshen in #37027
  • Change deprecated PT functions by @cyyever in #37041
  • [blip-2] Fix dtype mismatch when keep in fp32 by @zucchini-nlp in #37068
  • fix tied weigths issue by @ydshieh in #37031
  • Update w/ new account by @muellerzr in #37084
  • Fix state_dict map location when quantized by @Cyrilvallez in #37086
  • Fix AttentionInterface following feedback by @Cyrilvallez in #37010
  • fixed typo. by @zhanluxianshen in #37057
  • [generate] beam search -- fix output cropping by @gante in #37080
  • [Cache] rename dtype attribute 🚨 🚨 by @gante in #37044
  • Kenlm by @ydshieh in #37091
  • 🌐 [i18n-KO] Translated qwen2_vl.md to Korean by @MinJu-Ha in #36750
  • Gaudi: Fix the pipeline failed issue with hpu device by @yuanwu2017 in #36990
  • Support passing flash_attn_kwargs when gradient_checkpointing is enabled by @efsotr in #37037
  • Fix 4090/ada not detected as having FP8 support by @Qubitium in #37067
  • enable tp on CPU by @jiqing-feng in #36299
  • fix whisper re-compile by @jiqing-feng in #36712
  • [MLU] Fix FA2 check error, remove deepspeed-mlu deps. by @huismiling in #36159
  • Fix Gemma3 embedding scaling by @gau-nernst in #37109
  • RWKV: fix mask warning typo by @RobinKa in #37114
  • Remove deprecated code by @cyyever in #37059
  • [tests] remove cuda-only test marker in AwqConfigTest by @faaany in #37032
  • Export T5 (encoder-decoder) to ExecuTorch by @guangy10 in #36486
  • skip by @ydshieh in #37141
  • [qwen3] fix generation tests by @zucchini-nlp in #37142
  • Fix more inefficient PT operations by @cyyever in #37060
  • Fix std initialization in Idefics variants by @yaswanth19 in #37100
  • add gpt2 test on XPU by @jiqing-feng in #37028
  • Fix llava xpu tests. by @jiqing-feng in #37130
  • enable test_assisted_decoding_in_different_gpu test on XPU by @yao-matrix in #37120
  • Use public export API on torch 2.5 and future by @guangy10 in #36781
  • Convert _VALID_DICT_FIELDS to class attribute for shared dict parsing in subclasses by @Tavish9 in #36736
  • Only count num items in batch when needed by @IlyasMoutawwakil in #36867
  • Make canine model exportable by removing unncessary complicated logic by @tugsbayasgalan in #37124
  • [ModernBERT] Never save 'reference_compile' config; should be set based on end user by @tomaarsen in #36305
  • fix XPU UT error case brough by RNG difference btw XPU and CUDA by @yao-matrix in #37121
  • Fixes the inconsistency of the optionality of attention_mask by @Zephyr271828 in #37153
  • Avoid pipeline test failing related to Hub call by @ydshieh in #37170
  • Fix meta state dict loading with quantizers by @Cyrilvallez in #37136
  • Revert #37031 by @Cyrilvallez in #37178
  • [doc] Fix link for Quark quantization page by @BowenBao in #37179
  • [chat-template] fix video loading by @zucchini-nlp in #37146
  • Skip code 307 in RequestCounter by @ydshieh in #36953
  • Add device workaround for int4 weight only quantization after API update by @jerryzh168 in #36980
  • Fixes DynamicCache export issues due to control flow and inplace modifications by @xadupre in #36652
  • Try to avoid/reduce some remaining CI job failures by @ydshieh in #37202
  • fix: Add 'image-text-to-text' to TASK_MAPPING by @saattrupdan in #37107
  • Fix some code annotation typos. by @zhanluxianshen in #37102
  • Merge tensor operations with device transfer operations by @cyyever in #37097
  • [3/N] Use pyupgrade --py39-plus to improve code by @cyyever in #36936
  • Add py.typed by @cyyever in #37022
  • No more dtype_byte_size() by @Rocketknight1 in #37144
  • [Tests] add min_new_tokens to prevent flaky length checks by @gante in #37175
  • Stop DOSing the Hub in the CI by @Rocketknight1 in #37209
  • More ReDOS fixes! by @Rocketknight1 in #36964
  • Updated the model card for CLIP by @purusharthmalik in #37040
  • Update falcon model card by @ricalanis in #37184
  • Updated model card for Qwen2 by @Aravind-11 in #37192
  • Fix static cache export by @guangy10 in #37229
  • [Phi4] add multimodal chat template by @zucchini-nlp in #36996
  • Add new dim to num_items_in_batch if necessary by @regisss in #36967
  • Fix test by @Cyrilvallez in #37213
  • [tests] fix mamba integration simple inference precision issue by @faaany in #37193
  • [CI] lazy loading external datasets by @gante in #37218
  • enable 2 types of case on XPU by @yao-matrix in #37198
  • Fix AST parsing when looking for remote code imports by @Rocketknight1 in #37245
  • Add support for fast image processing in image-pretraining example by @jafraustro in #37021
  • Allow flexible generation params arg when checking pipeline specs by @Rocketknight1 in #37211
  • [CI] green llama tests by @gante in #37244
  • Adding links to ShieldGemma 2 technical report by @RyanMullins in #37247
  • feat: updated model card for qwen_2.5_vl by @arkhamHack in #37099
  • Update model card for Cohere by @bimal-gajera in #37056
  • chore: Update model doc for code_llama by @AbhishekRP2002 in #37115
  • Update Model Card for ModernBERT by @ParagEkbote in #37052
  • Update model card for electra by @Wu-n0 in #37063
  • [qwen-vl] fix image processor by @zucchini-nlp in #37258
  • update error msg by @itazap in #37207
  • Fix utils/check_bad_commit.py by @ydshieh in #37272
  • Support return_tensors in audio chat templates by @zucchini-nlp in #34601
  • Update ruff to 0.11.2 by @ydshieh in #36962
  • Fix typing for None valued variables by @cyyever in #37004
  • Use lru_cache for tokenization tests by @ydshieh in #36818
  • Create and Expose SamVisionModel as public for better accessibility by @geetu040 in #36493
  • [Feature] Support using FlashAttention2 on Ascend NPU by @FightingZhen in #36696
  • Remove low_cpu_mem_usage and _fast_init by @Cyrilvallez in #36963
  • Refactor return_dict logic to remove complicated if/else paths by @qubvel in #36794
  • Refactor attention for SigLIP based models by @qubvel in #36981
  • Add Optional to types by @cyyever in #37163
  • Purge unused ModelTester code by @Rocketknight1 in #37085

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @cyyever
    • [2/N] Use pyupgrade --py39-plus to improve code (#36857)
    • Remove extra tensor clone in PyTorch code (#36748)
    • Add ruff target-version (#36971)
    • Fix tensor dtype mismatch (#36985)
    • Use torch.expm1 (#36995)
    • Fix Optional type annotation (#36841)
    • Remove deprecated training arguments (#36946)
    • Avoid unnecessary device operations in loss computing (#36950)
    • Fix typing for None valued variables (#37004)
    • Set weights_only in torch.load (#36991)
    • Remove deprecated batch_size parameter (#37007)
    • Change deprecated PT functions (#37041)
    • Remove deprecated code (#37059)
    • Fix more inefficient PT operations (#37060)
    • Merge tensor operations with device transfer operations (#37097)
    • [3/N] Use pyupgrade --py39-plus to improve code (#36936)
    • Add py.typed (#37022)
    • Add Optional to types (#37163)
  • @bzantium
    • [WIP] add deepseek-v3 (#35926)
  • @bozheng-hit
    • Adding Qwen3 and Qwen3MoE (#36878)
  • @geetu040
    • Create and Expose SamVisionModel as public for better accessibility (#36493)
  • @FightingZhen
    • [Feature] Support using FlashAttention2 on Ascend NPU (#36696)
  • @nikosanto13
    • Introduce modular files for speech models (#35902)
Mar 28, 2025
Deepseek v3 (based on 4.50.3)

A new model is added to transformers: DeepSeek 3 (also known as DeepSeek R1). It is added on top of the v4.50.3 release, and can be installed from the following tag: v4.50.3-DeepSeek-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.50.3-DeepSeek-3

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

DeepSeek 3 (Also known as DeepSeek R1)

The model is detailed in the following paper.

Overview

The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.

The abstract from the paper is the following:

We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.

Limitations and call for contribution!

We are super happy to make this code community-powered, and would love to see how you can help optimize the following:

  • the current implementation uses the "naive" attention computation (so not really MLA)
  • the current implementation loops through the experts; this should be replaced, using get_packed_weights from integrations/tensor_parallel as a pointer
  • the current implementation uses the EleutherAI formula for RoPE; using the original one would be more efficient! (it should still follow our API)
  • static cache is not supported (this should be just a generation config issue / config shape issue)
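To make the second point concrete, here is a toy sketch of the kind of per-expert loop the current implementation uses, which a packed/batched formulation would replace. This is pure Python for illustration, not the actual modeling code: the "experts" are scalar weights rather than MLPs, and all routing is invented.

```python
# Toy MoE forward pass that loops over experts one by one, mirroring the
# "naive" pattern described above. All shapes and weights are illustrative.
def expert_forward(expert_weight, x):
    # Each "expert" is just a scalar weight here; real experts are MLPs.
    return [expert_weight * v for v in x]

def naive_moe_forward(tokens, expert_ids, expert_weights):
    """Route each token to its chosen expert by looping over experts."""
    out = [None] * len(tokens)
    for e, w in enumerate(expert_weights):          # loop over experts
        selected = [i for i, eid in enumerate(expert_ids) if eid == e]
        if not selected:
            continue
        batch = [tokens[i] for i in selected]       # gather this expert's tokens
        results = [expert_forward(w, x) for x in batch]
        for i, r in zip(selected, results):
            out[i] = r                              # scatter back into place
    return out

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(naive_moe_forward(tokens, expert_ids=[0, 1, 0], expert_weights=[2.0, 10.0]))
# → [[2.0, 4.0], [30.0, 40.0], [10.0, 12.0]]
```

A packed formulation instead stacks the per-expert weights into one tensor and dispatches all tokens in a single batched operation, avoiding the Python-level loop.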

Usage tips

The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.

You can run the model in FP8 automatically; 2 nodes of 8 H100s should be more than enough!

# `run_deepseek_r1.py`
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(30)

tokenizer = AutoTokenizer.from_pretrained("deepseek-r1")

chat = [
  {"role": "user", "content": "Hello, how are you?"},
  {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
  {"role": "user", "content": "I'd like to show off how chat templating works!"},
]


model = AutoModelForCausalLM.from_pretrained("deepseek-r1", device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)

outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))

This generated:

<|Assistant|><think>
Okay, the user wants to demonstrate how chat templating works. Let me break down what that means. Chat templating is about structuring the conversation data, especially for models that need specific input formats. Maybe they're referring to something like how messages are formatted with roles (user, assistant, system) in APIs like OpenAI.

First, I should explain what chat templating is. It's the process of formatting conversation data into a structured format that the model can understand. This usually includes roles and content. For example, user messages, assistant responses, and system messages each have their own role tags.

They might want an example. Let me think of a simple conversation. The user says "Hello, how are you?" and the assistant responds "I'm doing great. How can I help you today?" Then the user follows up with wanting to show off chat templating. So the example should include the history and the new message.

In some frameworks, like Hugging Face's Transformers, chat templates are applied using Jinja2 templates. The template might look something like combining system messages, then looping through user and assistant messages with appropriate tags. For instance, using {% for message in messages %} and assigning roles like <|user|>, <|assistant|>, etc.

I should structure the example with the messages array, showing each role and content. Then apply a hypothetical template to convert that into a formatted string the model uses. Also, mention that different models have different templating requirements, like using special tokens or varying role labels.

Wait, the user mentioned "chat templating" in the context of showing off. Maybe they want a practical example they can present. So providing a code snippet or a structured data example would be helpful. Let me outline a typical messages array and then the templated output.

Also, it's important to note that proper templating ensures the model knows the conversation flow, which is crucial for generating coherent responses. Maybe include a note about why it's important, like maintaining context and role-specific processing.

Let me check if there are any common mistakes or things to avoid. For example, not closing tags properly, or mismatching roles. But maybe that's too detailed unless the user asks. Focus on the positive example first.

Putting it all together, the response should have an example messages array, the applied template, and the final formatted string. Maybe use angle brackets or special tokens as placeholders. Also, mention that this helps in training or fine-tuning models with structured data.

I think that's a solid approach. Let me structure it step by step to make it clear.
</think>

Chat templating is a way to structure conversation data (e.g., user/assistant interactions) into a format that language models understand. This is especially important for models trained to handle multi-turn dialogues, where the input must explicitly separate roles (user, assistant, system, etc.) and messages. Let’s break this down with an example!

---

### **Step 1: Raw Conversation History**
Suppose we have this conversation:
- **User**: "Hello, how are you?"
- **Assistant**: "I'm doing great. How can I help you today?"
- **User**: "I'd like to show off how chat templating works!"

---

### **Step 2: Structured Messages**
In frameworks like Hugging Face Transformers or OpenAI, conversations are often formatted as a list of dictionaries with `role` and `content`:
```python
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {"role": "assistant", "content": "I'm doing great. How can I help you today?"},
    {"role": "user", "content": "I'd like to show off how chat templating works!"},
]
```

---

### **Step 3: Apply a Chat Template**
A **chat template** converts this structured data into a single string formatted for the model. For example, using a Jinja-style template (common in Hugging Face):

```jinja
{% for message in messages %}
    {% if message['role'] == 'user' %}
        <|user|>{{ message['content'] }}<|end|>
    {% elif message['role'] == 'assistant' %}
        <|assistant|>{{ message['content'] }}<|end|>
    {% endif %}
{% endfor %}
<|assistant|>
```

---

### **Step 4: Final Templated Output**
Applying the template to our `messages` list would produce:
```text
<|user|>Hello, how are you?<|end|>
<|assistant|>I'm doing great. How can I help you today?<|end|>
<|user|>I'd like to show off how chat templating works!<|end|>
<|assistant|>
```

This tells the model:  
1. The conversation history (user/assistant turns).  
2. The model’s turn to generate a response (`<|assistant|>` at the end).  

---

### **Key Notes**:
- **Role Separation**: Tags like `<|user|>` and `<|assistant|>` help the model distinguish speakers.
- **Special Tokens**: Models often use unique tokens (e.g., `<|end|>`) to mark message boundaries.
- **Flexibility**: Templates vary by model (e.g., OpenAI uses `{"role": "user", "content": "..."}` instead of tags).

---

### **Why This Matters**:
- **Consistency**: Ensures the model understands dialogue structure.
- **Context Preservation**: Maintains the flow of multi-turn conversations.
- **Alignment**: Matches the format the model was trained on for better performance.

Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI, Llama, Mistral)? Let me know! 😊<|end▁of▁sentence|>

Use the following to run it

torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py

If you have:

[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found

error, it means NCCL was probably not loaded.

Patch release v4.50.3

Thanks to the vLLM team, we have fixed a few more bugs that slipped in!

  • [generate] beam search -- fix output cropping (#37080) by @gante

  • [blip-2] Fix dtype mismatch when keep in fp32 (#37068) by @zucchini-nlp

  • Fix PixtralProcessor patch_size when spatial_merge_size is used (#37019)

Mar 27, 2025
Patch release v4.50.2

I completely forgot to put these in the previous patch, sorry! These should put the transformers backend in a good spot!

  • [Utils] torch version checks optionally accept dev versions (#36847) by @gante

  • Fix processor kwargs qwen2 vl (#36890) by @yonigozlan

  • Fix Pan and Scan on batched images Gemma3 (#36864) by @yonigozlan

Mar 25, 2025
Patch release v4.50.1


There were some very minor bugs with the new hub kernels and with remote code that we had to fix.

  • Deprecate #36741 and map Causal to Conditional (#36917) by @zucchini-nlp

  • Fix pytorch deform attn path (#36923) by @qubvel

  • [chameleon] fix num image token check (#36918) by @zucchini-nlp

  • Fix torch version guard at import (#36907) by @zucchini-nlp

Mar 21, 2025
Release v4.50.0

New Model Additions

Model-based releases

Starting with version v4.49.0, we have been doing model-based releases in addition to our traditional, software-based monthly releases. These model-based releases provide a tag from which models may be installed.

Contrary to our software releases, these are not pushed to PyPI and are kept on our GitHub. Each release has a tag attributed to it, such as:

  • v4.49.0-Gemma-3
  • v4.49.0-AyaVision

⚠️ As bugs are identified and fixed on each model, the release tags are updated so that installing from that tag always gives the best experience possible with that model.

Each new model release will always be based on the current state of the main branch at the time of its creation. This ensures that new models start with the latest features and fixes available.

For example, if two models—Gemma-3 and AyaVision—are released from main, and then a fix for gemma3 is merged, it will look something like this:

              o---- v4.49.0-Gemma-3 (includes AyaVision, plus main fixes)
            /                  \  
---o--o--o--o--o-- (fix for gemma3) --o--o--o main
       \          
        o---- v4.49.0-AyaVision

We strive to merge model specific fixes on their respective branches as fast as possible!

Gemma 3

Gemma 3 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relevant to that model.

The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder, linked by a multimodal linear projection.

It cuts an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed a certain aspect ratio. For images that exceed that aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.

One particularity is that the model uses bidirectional attention on all the image tokens. The model also interleaves sliding-window local attention with full causal attention in the language backbone, where every sixth layer is a full causal attention layer.
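The interleaved layout described above can be sketched as follows (illustrative only: the exact per-layer pattern comes from the model config, and `full_every=6` is an assumption taken from the description):

```python
def gemma3_layer_types(num_layers, full_every=6):
    # Every sixth layer uses full causal attention; the others use
    # sliding-window local attention (pattern taken from the description above).
    return [
        "full" if (i + 1) % full_every == 0 else "sliding"
        for i in range(num_layers)
    ]

layout = gemma3_layer_types(12)
print(layout)
```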

  • Gemma3 by @RyanMullins in #36658

Shield Gemma2

ShieldGemma 2, built on Gemma 3, is a 4 billion (4B) parameter model that checks the safety of both synthetic and natural images against key categories to help you build robust datasets and models. With this addition to the Gemma family of models, researchers and developers can now easily minimize the risk of harmful content in their models across the key areas of harm defined below:

  • No Sexually Explicit content: The image shall not contain content that depicts explicit or graphic sexual acts (e.g., pornography, erotic nudity, depictions of rape or sexual assault).
  • No Dangerous Content: The image shall not contain content that facilitates or encourages activities that could cause real-world harm (e.g., building firearms and explosive devices, promotion of terrorism, instructions for suicide).
  • No Violence/Gore content: The image shall not contain content that depicts shocking, sensational, or gratuitous violence (e.g., excessive blood and gore, gratuitous violence against animals, extreme injury or moment of death).

We recommend using ShieldGemma 2 as an input filter to vision language models, or as an output filter of image generation systems. To train a robust image safety model, we curated training datasets of natural and synthetic images and instruction-tuned Gemma 3 to demonstrate strong performance.

  • Shieldgemma2 #36678 by @RyanMullins

Aya Vision

Aya Vision is covered in depth in its dedicated model-based release, which we recommend reading if you want all the information relative to that model.

The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.

Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model, further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, in contrast, uses Aya Expanse 32B as the language model.

Key features of Aya Vision include:

  • Multimodal capabilities in 23 languages
  • Strong text-only multilingual capabilities inherited from CommandR-7B post-trained with the Aya Expanse recipe and Aya Expanse 32B
  • High-quality visual understanding using the Siglip2-so400-384-14 vision encoder
  • Seamless integration of visual and textual information in 23 languages.
  • Add aya by @ArthurZucker in #36521

Mistral 3.1

Mistral 3.1 is covered in depth in its dedicated model-based release, which we recommend reading if you want all the information relative to that model.

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

It is ideal for:

  • Fast-response conversational agents.
  • Low-latency function calling.
  • Subject matter experts via fine-tuning.
  • Local inference for hobbyists and organizations handling sensitive data.
  • Programming and math reasoning.
  • Long document understanding.
  • Visual understanding.
  • Add Mistral3 by @Cyrilvallez in #36790

Smol VLM 2

SmolVLM-2 is covered in depth in its dedicated model-based release, which we recommend reading if you want all the information relative to that model.

SmolVLM2 is an adaptation of the Idefics3 model with two main differences:

  • It uses SmolLM2 for the text model.
  • It supports multi-image and video inputs
  • SmolVLM2 by @orrzohar in #36126

SigLIP-2

SigLIP-2 is covered in depth in its dedicated model-based release, which we recommend reading if you want all the information relative to that model.

The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.

The model comes in two variants:

  1. FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
  2. NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in transformers)
  • Add SigLIP 2 by @qubvel in #36323
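As a rough illustration of the difference between the two variants (the patch size of 16 and the helper names below are assumptions for this sketch, not the model's actual API):

```python
import math

PATCH = 16  # assumed patch size, for illustration only

def fixres_tokens(size=224):
    # FixRes: inputs are resized to a fixed square, so the token count is constant.
    return (size // PATCH) ** 2

def naflex_tokens(height, width):
    # NaFlex: the native aspect ratio is kept, so the token count follows the input.
    return math.ceil(height / PATCH) * math.ceil(width / PATCH)

print(fixres_tokens())           # 196 tokens for every 224x224 input
print(naflex_tokens(224, 448))   # 392 tokens for a 2:1 image
```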

Prompt Depth Anything

PromptDepthAnything is a high-resolution, accurate metric depth estimation model that leverages prompting, inspired by the success of prompting in vision-language models (VLMs) and large language models (LLMs). Using iPhone LiDAR as a prompt, the model generates precise depth maps at up to 4K resolution, unlocking the potential of depth foundation models.

  • Add Prompt Depth Anything Model by @haotongl in #35401

New tool: attention visualization

We add a new tool to transformers to visualize the attention layout of a given model. It only requires a model ID as input, and will load the relevant tokenizer/model and display what the attention mask looks like. Some examples:


from transformers.utils.attention_visualizer import AttentionMaskVisualizer
visualizer = AttentionMaskVisualizer("meta-llama/Llama-3.2-3B-Instruct")
visualizer("A normal attention mask")

visualizer = AttentionMaskVisualizer("mistralai/Mistral-Small-24B-Instruct-2501")
visualizer("A normal attention mask with a long text to see how it is displayed, and if it is displayed correctly")

visualizer = AttentionMaskVisualizer("google/paligemma2-3b-mix-224")
visualizer("<img> You are an assistant.", suffix = "What is on the image?")

visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("You are an assistant. Make sure you print me") # sliding and non-sliding attention should be displayed side by side

visualizer = AttentionMaskVisualizer("google/gemma-3-27b-it")
visualizer("<img>You are an assistant. Make sure you print me") # sliding and non-sliding attention should be displayed side by side

  • Add attention visualization tool by @ArthurZucker in #36630

Deprecating transformers.agents in favor of smolagents

We are deprecating transformers.agents in favour of the smolagents library. Read more about smolagents here.

  • Deprecate transformers.agents by @aymeric-roucher in #36415

Quantization

We support adding custom quantization methods by using the @register_quantization_config and @register_quantizer decorators:

@register_quantization_config("custom")
class CustomConfig(QuantizationConfigMixin):
   pass

@register_quantizer("custom")
class CustomQuantizer(HfQuantizer):
   pass

quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=CustomConfig(), torch_dtype="auto"
)
  • Added Support for Custom Quantization by @keetrap in #35915
  • Add Example for Custom quantization by @MekkCyber in #36286

AMD is developing its in-house quantizer, named Quark and released under the MIT license, which supports a broad range of quantization pre-processing, algorithms, dtypes and target hardware. You can now load a model quantized with the quark library:

# pip install amd-quark
from transformers import AutoModelForCausalLM

model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")
  • Support loading Quark quantized models in Transformers by @fxmarty-amd and @BowenBao in #36372

Torchao is augmented with autoquant support, CPU-quantization, as well as new AOBaseConfig object instances for more advanced configuration.

  • Add autoquant support for torchao quantizer by @jerryzh168 in #35503
  • enable torchao quantization on CPU by @jiqing-feng in #36146
  • Add option for ao base configs by @drisspg in #36526

Tensor Parallelism implementation changes

At loading time, the parallelization is now applied module-by-module, so that no memory overhead is incurred beyond what the final weight distribution requires!

  • TP initialization module-by-module by @Cyrilvallez in #35996
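A toy illustration of why this matters (made-up module sizes, not the actual implementation): sharding each module as it is loaded keeps peak memory near the final per-rank footprint, at most one full module above it, instead of materializing the whole model first.

```python
def load_then_shard(module_sizes, world_size):
    # Naive approach: load the full model, then shard -> peak equals the full model.
    peak = sum(module_sizes)
    resident = peak / world_size
    return peak, resident

def shard_module_by_module(module_sizes, world_size):
    # Module-by-module: at any time we hold the already-sharded modules
    # plus at most one full module currently being sharded.
    resident = 0.0
    peak = 0.0
    for size in module_sizes:
        peak = max(peak, resident + size)  # full module briefly in memory
        resident += size / world_size      # keep only this rank's shard
    return peak, resident

sizes = [4.0, 4.0, 4.0, 4.0]  # GB per module, made-up numbers
print(load_then_shard(sizes, world_size=4))         # (16.0, 4.0)
print(shard_module_by_module(sizes, world_size=4))  # (7.0, 4.0)
```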

Generation

This release includes two speed upgrades to generate:

  1. Assisted generation now works with ANY model as an assistant, even with do_sample=True;
from transformers import pipeline
import torch

prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"

pipe = pipeline(
    "text-generation",
    model=checkpoint,
    assistant_model=assistant_checkpoint,
    torch_dtype=torch.bfloat16,
)
pipe_output = pipe(prompt, max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])
  2. Beam search was vectorized, and should be significantly faster with a large num_beams. The speedup is more visible on smaller models, where model.forward doesn't dominate the total run time.
  • Universal Speculative Decoding CandidateGenerator by @keyboardAnt, @jmamou, and @gauravjain14 in #35029
  • [generate] ✨ vectorized beam search ✨ by @gante in #35802
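The vectorization can be understood with a small sketch (pure Python, illustrative only; the real implementation works on batched tensors): instead of looping over beams, the per-beam running scores and next-token log-probabilities are combined into one flat candidate list, and the next beams are selected with a single top-k over it.

```python
def beam_step(beam_scores, log_probs, num_beams):
    # Combine every (beam, token) pair into one flat candidate list...
    flat = [
        (beam_scores[b] + log_probs[b][t], b, t)
        for b in range(num_beams)
        for t in range(len(log_probs[b]))
    ]
    # ...and pick the best `num_beams` continuations in one shot,
    # the moral equivalent of a single top-k on a flattened tensor.
    flat.sort(key=lambda cand: cand[0], reverse=True)
    return flat[:num_beams]

beam_scores = [0.0, -0.5]           # running score per beam
log_probs = [[-0.5, -2.0, -3.0],    # next-token log-probs, beam 0
             [-0.25, -0.75, -4.0]]  # next-token log-probs, beam 1
print(beam_step(beam_scores, log_probs, num_beams=2))
# [(-0.5, 0, 0), (-0.75, 1, 0)]
```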

Documentation

A significant redesign of our documentation has wrapped up. The goal was to greatly simplify the transformers documentation, making it much easier to navigate. Let us know what you think!

  • [docs] Redesign by @stevhliu in #31757

Notable repo maintenance

The research examples folder that was hosted in transformers is no more. We have moved it out of transformers and into the following repo: github.com/huggingface/transformers-research-projects/

  • Remove research projects by @Rocketknight1 in #36645

We have updated our flex attention support to bring it on par with our Flash Attention 2 support.

  • Proper_flex by @ArthurZucker in #36643

More models support flex attention now thanks to @qubvel

  • Refactor Attention implementation for ViT-based models by @qubvel in #36545

First integration of hub kernels for deformable detr!

  • Use deformable_detr kernel from the Hub (#36853) by @danieldk

Bugfixes and improvements

  • [tests] fix EsmModelIntegrationTest::test_inference_bitsandbytes by @faaany in #36225
  • Fix LlavaForConditionalGenerationModelTest::test_config after #36077 by @ydshieh in #36230
  • AMD DeepSpeed image additional HIP dependencies by @ivarflakstad in #36195
  • [generate] remove cache v4.47 deprecations by @gante in #36212
  • Add missing atol to torch.testing.assert_close where rtol is specified by @ivarflakstad in #36234
  • [tests] remove tf/flax tests in /generation by @gante in #36235
  • [generate] Fix encoder decoder models attention mask by @eustlb in #36018
  • Add compressed tensor in quant dockerfile by @SunMarc in #36239
  • [tests] remove test_export_to_onnx by @gante in #36241
  • Au revoir flaky test_fast_is_faster_than_slow by @ydshieh in #36240
  • Fix TorchAoConfig not JSON serializable by @andrewor14 in #36206
  • Remove flakiness in VLMs by @zucchini-nlp in #36242
  • feat: add support for tensor parallel training workflow with accelerate by @kmehant in #34194
  • Fix XGLM loss computation (PyTorch and TensorFlow) by @damianoamatruda in #35878
  • GitModelIntegrationTest - flatten the expected slice tensor by @ivarflakstad in #36260
  • Added Support for Custom Quantization by @keetrap in #35915
  • Qwen2VL fix cos,sin dtypes to float when used with deepspeed by @ArdalanM in #36188
  • Uniformize LlavaNextVideoProcessor kwargs by @yonigozlan in #35613
  • Add support for post-processing kwargs in image-text-to-text pipeline by @yonigozlan in #35374
  • Add dithering to the Speech2TextFeatureExtractor API. by @KarelVesely84 in #34638
  • [tests] remove pt_tf equivalence tests by @gante in #36253
  • TP initialization module-by-module by @Cyrilvallez in #35996
  • [tests] deflake dither test by @gante in #36284
  • [tests] remove flax-pt equivalence and cross tests by @gante in #36283
  • [tests] make test_from_pretrained_low_cpu_mem_usage_equal less flaky by @gante in #36255
  • Add Example for Custom quantization by @MekkCyber in #36286
  • docs: Update README_zh-hans.md by @hyjbrave in #36269
  • Fix callback handler reference by @SunMarc in #36250
  • Make cache traceable by @IlyasMoutawwakil in #35873
  • Fix broken CI on release branch due to missing conversion files by @ydshieh in #36275
  • Ignore conversion files in test fetcher by @ydshieh in #36251
  • SmolVLM2 by @orrzohar in #36126
  • Fix typo in Pixtral example by @12v in #36302
  • fix: prevent second save in the end of training if last step was saved already by @NosimusAI in #36219
  • [smolvlm] make CI green by @gante in #36306
  • Fix default attention mask of generate in MoshiForConditionalGeneration by @cyan-channel-io in #36171
  • VLMs: even more clean-up by @zucchini-nlp in #36249
  • Add SigLIP 2 by @qubvel in #36323
  • [CI] Check test if the GenerationTesterMixin inheritance is correct 🐛 🔫 by @gante in #36180
  • [tests] make quanto tests device-agnostic by @faaany in #36328
  • Uses Collection in transformers.image_transforms.normalize by @CalOmnie in #36301
  • Fix exploitable regexes in Nougat and GPTSan/GPTJNeoXJapanese by @Rocketknight1 in #36121
  • [tests] enable bnb tests on xpu by @faaany in #36233
  • Improve model loading for compressed tensor models by @rahul-tuli in #36152
  • Change slack channel for mi250 CI to amd-hf-ci by @ivarflakstad in #36346
  • Add autoquant support for torchao quantizer by @jerryzh168 in #35503
  • Update amd pytorch index to match base image by @ivarflakstad in #36347
  • fix(type): padding_side type should be Optional[str] by @shenxiangzhuang in #36326
  • [Modeling] Reduce runtime when loading missing keys by @kylesayrs in #36312
  • notify new model merged to main by @ydshieh in #36375
  • Update modeling_llava_onevision.py by @yinsong1986 in #36391
  • Load models much faster on accelerator devices!! by @Cyrilvallez in #36380
  • [modular] Do not track imports in functions by @Cyrilvallez in #36279
  • Fix is_causal fail with compile by @Cyrilvallez in #36374
  • enable torchao quantization on CPU by @jiqing-feng in #36146
  • Update _get_eval_sampler to reflect Trainer.tokenizer is deprecation self.tokenizer -> self.processing_class by @yukiman76 in #36315
  • Fix doc formatting in forward passes & modular by @Cyrilvallez in #36243
  • Added handling for length <2 of suppress_tokens for whisper by @andreystarenky in #36336
  • addressing the issue #34611 to make FlaxDinov2 compatible with any batch size by @MHRDYN7 in #35138
  • tests: revert change of torch_require_multi_gpu to be device agnostic by @dvrogozh in #35721
  • [tests] enable autoawq tests on XPU by @faaany in #36327
  • fix audio classification pipeline fp16 test on cuda by @jiqing-feng in #36359
  • chore: fix function argument descriptions by @threewebcode in #36392
  • Fix pytorch integration tests for SAM by @qubvel in #36397
  • [CLI] add import guards by @gante in #36376
  • Fix convert_to_rgb for SAM ImageProcessor by @MSt-10 in #36369
  • Security fix for benchmark.yml by @ydshieh in #36402
  • Fixed VitDet for non-squre Images by @cjfghk5697 in #35969
  • Add retry hf hub decorator by @muellerzr in #35213
  • Deprecate transformers.agents by @aymeric-roucher in #36415
  • Fixing the docs corresponding to the breaking change in torch 2.6. by @Narsil in #36420
  • add recommendations for NPU using flash_attn by @zheliuyu in #36383
  • fix: prevent model access error during Optuna hyperparameter tuning by @emapco in #36395
  • Universal Speculative Decoding CandidateGenerator by @keyboardAnt in #35029
  • Fix compressed tensors config by @MekkCyber in #36421
  • Update form pretrained to make TP a first class citizen by @ArthurZucker in #36335
  • Fix Expected output for compressed-tensors tests by @MekkCyber in #36425
  • restrict cache allocator to non quantized model by @SunMarc in #36428
  • Change PR to draft when it is (re)opened by @ydshieh in #36417
  • Fix permission by @ydshieh in #36443
  • Fix another permission by @ydshieh in #36444
  • Add contents: write by @ydshieh in #36445
  • [save_pretrained ] Skip collecting duplicated weight by @wejoncy in #36409
  • [generate] torch.distributed-compatible DynamicCache by @gante in #36373
  • Lazy import libraries in src/transformers/image_utils.py by @hmellor in #36435
  • Fix hub_retry by @ydshieh in #36449
  • [GroundingDino] Fix grounding dino loss 🚨 by @EduardoPach in #31828
  • Fix loading models with mismatched sizes by @qubvel in #36463
  • [docs] fix bug in deepspeed config by @faaany in #36081
  • Add Got-OCR 2 Fast image processor and refactor slow one by @yonigozlan in #36185
  • Fix couples of issues from #36335 by @SunMarc in #36453
  • Fix _load_state_dict_into_meta_model with device_map=None by @hlky in #36488
  • Fix loading zero3 weights by @muellerzr in #36455
  • Check TRUST_REMOTE_CODE for RealmRetriever for security by @ydshieh in #36511
  • Fix kwargs UserWarning in SamImageProcessor by @MSt-10 in #36479
  • fix torch_dtype, contiguous, and load_state_dict regression by @SunMarc in #36512
  • Fix some typos in docs by @co63oc in #36502
  • chore: fix message descriptions in arguments and comments by @threewebcode in #36504
  • Fix pipeline+peft interaction by @Rocketknight1 in #36480
  • Fix edge case for continue_final_message by @Rocketknight1 in #36404
  • [Style] fix E721 warnings by @kashif in #36474
  • Remove unused code by @Rocketknight1 in #36459
  • [docs] Redesign by @stevhliu in #31757
  • Add aya by @ArthurZucker in #36521
  • chore: Fix typos in docs and examples by @co63oc in #36524
  • Fix bamba tests amd by @ivarflakstad in #36535
  • Fix links in quantization doc by @MekkCyber in #36528
  • chore: enhance messages in docstrings by @threewebcode in #36525
  • guard torch version for uint16 by @SunMarc in #36520
  • Fix typos in tests by @co63oc in #36547
  • Fix typos . by @zhanluxianshen in #36551
  • chore: enhance message descriptions in parameters,comments,logs and docstrings by @threewebcode in #36554
  • Delete redundancy if case in model_utils by @zhanluxianshen in #36559
  • Modular Conversion --fix_and_overwrite on Windows by @hlky in #36583
  • Integrate SwanLab for offline/online experiment tracking and local visualization by @ShaohonChen in #36433
  • [bark] fix loading of generation config by @gante in #36587
  • [XGLM] tag tests as slow by @gante in #36592
  • fix: argument by @ariG23498 in #36558
  • Mention UltraScale Playbook 🌌 in docs by @NouamaneTazi in #36589
  • avoid errors when the size of input_ids passed to PrefixConstrainedLogitsProcessor is zero by @HiDolen in #36489
  • Export base streamer. by @AndreasAbdi in #36500
  • Github action for auto-assigning reviewers by @Rocketknight1 in #35846
  • Update chat_extras.md with content correction by @krishkkk in #36599
  • Update "who to tag" / "who can review" by @gante in #36394
  • Fixed datatype related issues in DataCollatorForLanguageModeling by @capemox in #36457
  • Fix check for XPU. PyTorch >= 2.6 no longer needs ipex. by @tripzero in #36593
  • [HybridCache] disable automatic compilation by @gante in #36620
  • Fix auto-assign reviewers by @Rocketknight1 in #36631
  • chore: fix typos in language models by @threewebcode in #36586
  • [docs] Serving LLMs by @stevhliu in #36522
  • Refactor some core stuff by @ArthurZucker in #36539
  • Fix bugs in mllama image processing by @tjohnson31415 in #36156
  • Proper_flex by @ArthurZucker in #36643
  • Fix AriaForConditionalGeneration flex attn test by @ivarflakstad in #36604
  • Remove remote code warning by @Rocketknight1 in #36285
  • Stop warnings from unnecessary torch.tensor() overuse by @Rocketknight1 in #36538
  • [docs] Update docs dependency by @stevhliu in #36635
  • Remove research projects by @Rocketknight1 in #36645
  • Fix gguf docs by @SunMarc in #36601
  • fix typos in the docs directory by @threewebcode in #36639
  • Gemma3 by @RyanMullins in #36658
  • HPU support by @IlyasMoutawwakil in #36424
  • fix block mask typing by @ArthurZucker in #36661
  • [CI] gemma 3 make fix-copies by @gante in #36664
  • Fix bnb regression due to empty state dict by @SunMarc in #36663
  • [core] Large/full refactor of from_pretrained by @Cyrilvallez in #36033
  • Don't accidentally mutate the base_model_tp_plan by @Rocketknight1 in #36677
  • Fix Failing GPTQ tests by @MekkCyber in #36666
  • Remove hardcoded slow image processor class in processors supporting fast ones by @yonigozlan in #36266
  • [quants] refactor logic for modules_to_not_convert by @SunMarc in #36672
  • Remove differences between init and preprocess kwargs for fast image processors by @yonigozlan in #36186
  • Refactor siglip2 fast image processor by @yonigozlan in #36406
  • Fix rescale normalize inconsistencies in fast image processors by @yonigozlan in #36388
  • [Cache] Don't initialize the cache on meta device by @gante in #36543
  • Update config.torch_dtype correctly by @SunMarc in #36679
  • Fix slicing for 0-dim param by @SunMarc in #36580
  • Changing the test model in Quanto kv cache by @MekkCyber in #36670
  • fix wandb hp search unable to resume from sweep_id by @bd793fcb in #35883
  • Upgrading torch version and cuda version in quantization docker by @MekkCyber in #36264
  • Change Qwen2_VL image processors to have init and call accept the same kwargs by @yonigozlan in #36207
  • fix type annotation for ALL_ATTENTION_FUNCTIONS by @WineChord in #36690
  • Fix dtype for params without tp_plan by @Cyrilvallez in #36681
  • chore: fix typos in utils module by @threewebcode in #36668
  • [CI] Automatic rerun of certain test failures by @gante in #36694
  • Add loading speed test by @Cyrilvallez in #36671
  • fix: fsdp sharded state dict wont work for save_only_model knob by @kmehant in #36627
  • Handling an exception related to HQQ quantization in modeling by @MekkCyber in #36702
  • Add GGUF support to T5-Encoder by @Isotr0py in #36700
  • Final CI cleanup by @Rocketknight1 in #36703
  • Add support for fast image processors in add-new-model-like CLI by @yonigozlan in #36313
  • Gemma3 processor typo by @Kuangdd01 in #36710
  • Make the flaky list a little more general by @Rocketknight1 in #36704
  • Cleanup the regex used for doc preprocessing by @Rocketknight1 in #36648
  • [model loading] don't gc.collect() if only 1 shard is used by @gante in #36721
  • Fix/best model checkpoint fix by @seanswyi in #35885
  • Try working around the processor registration bugs by @Rocketknight1 in #36184
  • [tests] Parameterized test_eager_matches_sdpa_inference by @gante in #36650
  • 🌐 [i18n-KO] Translated codegen.md to Korean by @maximizemaxwell in #36698
  • Fix post_init() code duplication by @Cyrilvallez in #36727
  • Fix grad accum arbitrary value by @IlyasMoutawwakil in #36691
  • [Generation, Gemma 3] When passing a custom generation_config, overwrite default values with the model's base generation_config by @gante in #36684
  • 🚨🚨🚨 Fix sdpa in SAM and refactor relative position embeddings by @geetu040 in #36422
  • enable/disable compile for quants methods by @SunMarc in #36519
  • fix can_generate by @jiqing-feng in #36570
  • Allow ray datasets to be used with trainer by @FredrikNoren in #36699
  • fix xpu tests by @jiqing-feng in #36656
  • Fix test isolation for clear_import_cache utility by @sambhavnoobcoder in #36345
  • Fix TrainingArguments.torch_empty_cache_steps post_init check by @pkuderov in #36734
  • [MINOR:TYPO] Update hubert.md by @cakiki in #36733
  • [CI] remove redundant checks in test_eager_matches_sdpa_inference by @gante in #36740
  • [docs] Update README by @stevhliu in #36265
  • doc: Clarify is_decoder usage in PretrainedConfig documentation by @d-kleine in #36724
  • fix typos in the tests directory by @threewebcode in #36717
  • chore: fix typos in tests directory by @threewebcode in #36785
  • Fixing typo in gemma3 image_processor_fast and adding a small test by @Zebz13 in #36776
  • Fix gemma3_text tokenizer in mapping by @LysandreJik in #36793
  • Add Mistral3 by @Cyrilvallez in #36790
  • fix hqq due to recent modeling changes by @SunMarc in #36771
  • Update SHA for tj-actions/changed-files by @ydshieh in #36795
  • Loading optimizations by @Cyrilvallez in #36742
  • Fix Mistral3 tests by @yonigozlan in #36797
  • Fix casting dtype for qunatization by @SunMarc in #36799
  • Fix chameleon's TypeError because inputs_embeds may None by @YenFuLin in #36673
  • Support custom dosctrings in modular by @yonigozlan in #36726
  • [generate] ✨ vectorized beam search ✨ by @gante in #35802
  • Expectations test utils by @ivarflakstad in #36569
  • fix "Cannot copy out of meta tensor; no data!" issue for BartForConditionalGeneration model by @yao-matrix in #36572
  • Remove dist": "loadfile" for pytest in CircleCI jobs by @ydshieh in #36811
  • Fix Device map for bitsandbytes tests by @MekkCyber in #36800
  • [Generation] remove leftover code from end-to-end compilation by @gante in #36685
  • Add attention visualization tool by @ArthurZucker in #36630
  • Add option for ao base configs by @drisspg in #36526
  • enable OffloadedCache on XPU from PyTorch 2.7 by @yao-matrix in #36654
  • [gemma 3] multimodal checkpoints + AutoModelForCausalLM by @gante in #36741
  • One more fix for reviewer assignment by @Rocketknight1 in #36829
  • Support tracable dynamicKVcache by @tugsbayasgalan in #36311
  • Add Space to Bitsandbytes doc by @MekkCyber in #36834
  • quick fix fast_image_processor register error by @JJJYmmm in #36716
  • Update configuration_qwen2.py by @michaelfeil in #36735
  • Just import torch AdamW instead by @Rocketknight1 in #36177
  • Move the warning to the documentation for DataCollatorWithFlattening by @qgallouedec in #36707
  • Fix swanlab global step by @Zeyi-Lin in #36728
  • Disable inductor config setter by default by @HDCharles in #36608
  • [ForCausalLMLoss] allow users to pass shifted labels by @stas00 in #36607
  • fix tiktoken convert to pass AddedToken to Tokenizer by @itazap in #36566
  • Saving Trainer.collator.tokenizer in when Trainer.processing_class is None by @innerNULL in #36552
  • Pass num_items_in_batch directly to loss computation by @eljandoubi in #36753
  • Fix fp16 ONNX export for RT-DETR and RT-DETRv2 by @qubvel in #36460
  • Update deprecated Jax calls by @rasmi in #35919
  • [qwen2 audio] remove redundant code and update docs by @gante in #36282
  • Pass state dict by @phos-phophy in #35234
  • [modular] Sort modular skips by @gante in #36304
  • [generate] clarify docstrings: when to inherit GenerationMixin by @gante in #36605
  • Update min safetensors bis by @SunMarc in #36823
  • Fix import for torch 2.0, 2.1 - guard typehint for "device_mesh" by @qubvel in #36768
  • Gemma 3: Adding explicit GenerationConfig and refactoring conversion … by @RyanMullins in #36833
  • Fix: remove the redundant snippet of _whole_word_mask by @HuangBugWei in #36759
  • Shieldgemma2 by @RyanMullins in #36678
  • Fix ONNX export for sequence classification head by @echarlaix in #36332
  • Fix hqq skipped modules and dynamic quant by @mobicham in #36821
  • Use pyupgrade --py39-plus to improve code by @cyyever in #36843
  • Support loading Quark quantized models in Transformers by @fxmarty-amd in #36372
  • DeepSpeed tensor parallel+ZeRO by @inkcherry in #36825
  • Refactor Attention implementation for ViT-based models by @qubvel in #36545
  • Add Prompt Depth Anything Model by @haotongl in #35401
  • Add model visual debugger by @molbap in #36798
  • [torchao] revert to get_apply_tensor_subclass by @SunMarc in #36849
  • Gemma3: fix test by @zucchini-nlp in #36820
  • [CI] fix update metadata job by @gante in #36850
  • Add support for seed in DataCollatorForLanguageModeling by @capemox in #36497
  • Refactor Aya Vision with modular by @yonigozlan in #36688
  • Mllama: raise better error by @zucchini-nlp in #35934
  • [CI] doc builder without custom image by @gante in #36862
  • FIX FSDP plugin update for QLoRA by @BenjaminBossan in #36720
  • Remove call to .item in get_batch_samples by @regisss in #36861
  • chore: fix typos in the tests directory by @threewebcode in #36813
  • Make ViTPooler configurable by @sebbaur in #36517
  • Revert "Update deprecated Jax calls" (#35919) by @ArthurZucker
  • [generate] model defaults being inherited only happens for newer models by @gante in #36881
  • 🔴 🔴 🔴 supersede paligemma forward to shift pos id indexing by @molbap in #36859
  • Gemma 3 tests expect greedy decoding by @molbap in #36882
  • Use deformable_detr kernel from the Hub by @danieldk in #36853
  • Minor Gemma 3 fixes by @molbap in #36884
  • Fix: dtype cannot be str by @zucchini-nlp in #36262

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @IlyasMoutawwakil
    • Make cache traceable (#35873)
    • HPU support (#36424)
    • Fix grad accum arbitrary value (#36691)
  • @orrzohar
    • SmolVLM2 (#36126)
  • @threewebcode
    • chore: fix function argument descriptions (#36392)
    • chore: fix message descriptions in arguments and comments (#36504)
    • chore: enhance messages in docstrings (#36525)
    • chore: enhance message descriptions in parameters,comments,logs and docstrings (#36554)
    • chore: fix typos in language models (#36586)
    • fix typos in the docs directory (#36639)
    • chore: fix typos in utils module (#36668)
    • fix typos in the tests directory (#36717)
    • chore: fix typos in tests directory (#36785)
    • chore: fix typos in the tests directory (#36813)
  • @aymeric-roucher
    • Deprecate transformers.agents (#36415)
  • @keyboardAnt
    • Universal Speculative Decoding CandidateGenerator (#35029)
  • @EduardoPach
    • [GroundingDino] Fix grounding dino loss 🚨 (#31828)
  • @co63oc
    • Fix some typos in docs (#36502)
    • chore: Fix typos in docs and examples (#36524)
    • Fix typos in tests (#36547)
  • @RyanMullins
    • Gemma3 (#36658)
    • Gemma 3: Adding explicit GenerationConfig and refactoring conversion … (#36833)
    • Shieldgemma2 (#36678)
  • @cyyever
    • Use pyupgrade --py39-plus to improve code (#36843)
  • @haotongl
    • Add Prompt Depth Anything Model (#35401)
  • @danieldk
    • Use deformable_detr kernel from the Hub (#36853)
Mar 18, 2025
Mistral 3 (Based on v4.49.0)

A new model is added to transformers: Mistral 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Mistral-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-Mistral-3

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

Mistral 3

The model is detailed in the following blog post. The models are available on the Hub with the following tag: mistral3

Overview

Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.

It is ideal for:

  • Fast-response conversational agents.
  • Low-latency function calling.
  • Subject matter experts via fine-tuning.
  • Local inference for hobbyists and organizations handling sensitive data.
  • Programming and math reasoning.
  • Long document understanding.
  • Visual understanding.

This model was contributed by cyrilvallez and yonigozlan.

The original code can be found here and here.

Usage example

Inference with Pipeline

Here is how you can use the image-text-to-text pipeline to perform inference with the Mistral3 models in just a few lines of code:

>>> import torch
>>> from transformers import pipeline

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {
...                 "type": "image",
...                 "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
...             },
...             {"type": "text", "text": "Describe this image."},
...         ],
...     },
... ]

>>> pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'

Inference on a single image

This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     {
...         "role": "user",
...         "content": [
...             {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
...             {"type": "text", "text": "Describe this image"},
...         ],
...     }
... ]

>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)

>>> decoded_output
"The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...

Text-only generation

This example shows how to generate text using the Mistral3 model without providing any image input.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
>>> user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."

>>> messages = [
...    {"role": "system", "content": SYSTEM_PROMPT},
...    {"role": "user", "content": user_prompt},
... ]

>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=text, return_tensors="pt").to(model.device)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
>>> decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]

>>> print(decoded_output)
"1. À plus tard!
2. Salut, à plus!
3. À toute!
4. À la prochaine!
5. Je me casse, à plus!

```
 /\_/\
( o.o )
 > ^ <
```"

Batched image and text inputs

Mistral3 models also support batched image and text inputs.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
...                 {"type": "text", "text": "Describe this image"},
...             ],
...         },
...     ],
... ]


>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path", "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]

Batched multi-image input and quantization with BitsAndBytes

This implementation of the Mistral3 models supports batched text-image inputs with a different number of images for each text. This example also shows how to use BitsAndBytes to load the model in 4-bit quantization.

>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch

>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> model = AutoModelForImageTextToText.from_pretrained(
...     model_checkpoint, quantization_config=quantization_config
... )

>>> messages = [
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
...                 {"type": "text", "text": "Write a haiku for this image"},
...             ],
...         },
...     ],
...     [
...         {
...             "role": "user",
...             "content": [
...                 {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
...                 {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
...                 {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
...             ],
...         },
...     ],
... ]

>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)

>>> output = model.generate(**inputs, max_new_tokens=25)

>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n", "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]
Gemma 3 (Based on v4.49.0)

A new model is added to transformers: Gemma 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Gemma-3.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

Gemma 3

The model is detailed in the following blog post. The models and demos using the model are available in the following collection.

A Space to play around with the 12B-it flavor is available here.

Overview

The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder, linked by a multimodal linear projection.

It cuts an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed a certain aspect ratio. For images that exceed this aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.

One particularity is that the model uses bidirectional attention on all the image tokens. The model also interleaves sliding-window local attention with full causal attention in the language backbone, where every sixth layer is a full causal attention layer.
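The interleaving described above can be sketched as a simple layer schedule. This is an illustrative sketch only, not the actual transformers internals; the function name and labels are made up:

```python
# Illustrative: Gemma 3's language backbone interleaves sliding-window local
# attention with full causal attention, using a full-attention layer every
# sixth layer.
def gemma3_attention_schedule(num_layers: int, full_attn_every: int = 6):
    """Return the assumed attention type for each decoder layer."""
    return [
        "full" if (i + 1) % full_attn_every == 0 else "sliding_window"
        for i in range(num_layers)
    ]

print(gemma3_attention_schedule(12))
```

For a 12-layer sketch, layers 6 and 12 come out as full causal attention and the rest use sliding-window attention.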

Usage tips

  • For image+text and image-only inputs use Gemma3ForConditionalGeneration.
  • For text-only inputs use Gemma3ForCausalLM for generation to avoid loading the vision tower.
  • Each sample can contain multiple images, and the number of images can vary between samples. However, make sure to pass correctly batched images to the processor, where each batch is a list of one or more images.
  • The text passed to the processor should have the "<start_of_image>" token wherever an image should be inserted.
  • The processor has its own apply_chat_template method to convert chat messages to text that can then be passed as text to the processor. You can also get a vectorized output from apply_chat_template. See the examples below for more details on how to use it.

Image cropping for high resolution images

The model supports cropping images into smaller patches when the image aspect ratio exceeds a certain value. By default the images are not cropped and only the base image is forwarded to the model. Users can set do_pan_and_scan=True to obtain several crops per image along with the base image to improve the quality in DocVQA or similar tasks requiring higher resolution images.

Pan and scan is an inference time optimization to handle images with skewed aspect ratios. When enabled, it improves performance on tasks related to document understanding, infographics, OCR, etc.

from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    do_pan_and_scan=True,
).to(model.device)

Usage Example

Single-image Inference

from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Multi-image Inference

from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
url_stop = "https://www.ilankelman.org/stopsigns/australia.jpg"
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url_cow},
            {"type": "image", "url": url_stop},
            {"type": "text", "text": "Are these two images identical?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Text-only inference

from transformers import AutoTokenizer, Gemma3ForCausalLM

model_id = "google/gemma-3-1b-it"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma3ForCausalLM.from_pretrained(model_id, device_map="auto")

input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)

outputs = model.generate(**input_ids, max_new_tokens=100)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)

print(text)

Mar 4, 2025
Aya Vision (Based on v4.49.0)

A new model is added to transformers: Aya Vision. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-AyaVision.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-AyaVision

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

Aya Vision

The model is detailed in the following blog post.

Overview

The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.

Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model, further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, meanwhile, uses Aya Expanse 32B as the language model.

Key features of Aya Vision include:

  • Multimodal capabilities in 23 languages
  • Strong text-only multilingual capabilities inherited from CommandR-7B post-trained with the Aya Expanse recipe and Aya Expanse 32B
  • High-quality visual understanding using the Siglip2-so400-384-14 vision encoder
  • Seamless integration of visual and textual information in 23 languages.

Usage Example

Here's an example usage of the Aya Vision model.

from transformers import AutoProcessor, AutoModelForImageTextToText
import torch

model_id = "CohereForAI/aya-vision-32b"

processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.float16
)

# Format message with the aya-vision chat template
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
            {"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
        ],
    },
]

inputs = processor.apply_chat_template(
    messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)

gen_tokens = model.generate(
    **inputs, 
    max_new_tokens=300, 
    do_sample=True, 
    temperature=0.3,
)

print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
Feb 21, 2025
SigLIP-2 (Based on v4.49.0)

A new model is added to transformers: SigLIP-2. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SigLIP-2.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-SigLIP-2

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

SigLIP2

The paper page for the model is available here. It is detailed in the following blog post.

The models and demos using the model are available in the following collection.

Overview

The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.

The model comes in two variants:

  1. FixRes - model works with fixed resolution images (backward compatible with SigLIP v1)
  2. NaFlex - model works with variable image aspect ratios and resolutions (SigLIP2 in transformers)

The abstract from the paper is the following:

We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe—this includes decoder-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification (best SigLIP 2 ViT-g/16 achieves 85.0% ImageNet zero-shot accuracy), image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To provide users with the ability to trade-off inference cost with performance, we release model checkpoints at four sizes (ViT-B/86M, L/303M, So400m/400M, and g/1B).

Usage tips

  • Usage of SigLIP2 is similar to SigLIP and CLIP. The main difference from CLIP is the training loss, which does not require a global view of all the pairwise similarities of images and texts within a batch. One needs to apply the sigmoid activation function to the logits, rather than the softmax.
  • Training is supported but does not use torch.distributed utilities, which may limit the scalability of batch size. However, DDP and FSDP work on a single-node multi-GPU setup.
  • When using the standalone [GemmaTokenizerFast] make sure to pass padding="max_length" and max_length=64 as that's how the model was trained.
  • The model was trained with lowercased text, so make sure to apply the same preprocessing to your text labels.
  • To get the same results as the pipeline, a prompt template of "this is a photo of {label}" should be used.
  • The NaFlex variant supports processing images at higher resolutions by adjusting the max_num_patches parameter in the Processor. The default value is max_num_patches=256. Increasing max_num_patches to 1024 (4x) will approximately double processed image height and width, while preserving the aspect ratio.
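The sigmoid-vs-softmax distinction in the first tip can be made concrete with a tiny sketch (illustrative only; the logits here are made-up numbers, not real model outputs):

```python
import torch

# SigLIP-style losses score each image-text pair independently, so
# probabilities come from a sigmoid over the logits; CLIP-style scoring
# instead applies a softmax across all candidate labels.
logits = torch.tensor([[2.0, -1.0, 0.5]])  # one image vs. three candidate texts

sigmoid_probs = torch.sigmoid(logits)          # independent per pair; need not sum to 1
softmax_probs = torch.softmax(logits, dim=-1)  # CLIP-style; sums to 1 across labels

print(sigmoid_probs.sum().item())  # generally != 1
print(softmax_probs.sum().item())  # == 1
```

With sigmoid scoring, several labels (or none) can receive a high probability at once, which is why the SigLIP2 examples below use `torch.sigmoid` on the logits.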

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip2_metrics_table.png" alt="drawing" width="600"/>

This model was contributed by qubvel. The original code can be found here.

Usage example

There are two main ways to use SigLIP2: either by using the pipeline API, which abstracts away all the complexity for you, or by using the Siglip2Model class yourself.

FixRes variant

Pipeline API

The pipeline allows you to use the model in a few lines of code:

>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests

>>> # load pipe
>>> image_classifier = pipeline(
...     task="zero-shot-image-classification",
...     model="google/siglip2-base-patch16-224",
... )

>>> # load image
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> # inference
>>> candidate_labels = ["2 cats", "a plane", "a remote"]
>>> outputs = image_classifier(image, candidate_labels=candidate_labels)
>>> outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
>>> print(outputs)
[{'score': 0.1499, 'label': '2 cats'}, {'score': 0.0008, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}]

Using the model yourself

If you want to do the pre- and postprocessing yourself, here's how to do that:

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch

>>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
>>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
>>> texts = [f"This is a photo of {label}." for label in candidate_labels]

# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
>>> inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
15.0% that image 0 is '2 cats'

NaFlex variant

NaFlex combines ideas from FlexiViT, i.e. supporting multiple, predefined sequence lengths with a single ViT model, and NaViT, namely processing images at their native aspect ratio. This enables processing different types of images at appropriate resolution, e.g. using a larger resolution to process document images, while at the same time minimizing the impact of aspect ratio distortion on certain inference tasks, e.g. on OCR.

Given a patch size and target sequence length, NaFlex preprocesses the data by first resizing the input image such that the height and width after resizing are multiples of the patch size, while

1. keeping the aspect ratio distortion as small as possible
2. producing a sequence length of at most the desired target sequence length (`max_num_patches`)

The resulting distortion in width and height is at most (patch_size - 1) / width and (patch_size - 1) / height, respectively, which tends to be small for common resolutions and aspect ratios. After resizing, the image is split into a sequence of patches, and a mask with padding information is added.
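The resizing rule above can be sketched roughly as follows. This is an illustrative approximation, not the exact transformers preprocessing; the function name and the rounding/shrinking strategy are assumptions:

```python
import math

def naflex_target_size(height, width, patch_size=16, max_num_patches=256):
    """Pick a new (h, w), both multiples of patch_size, such that
    (h // patch_size) * (w // patch_size) <= max_num_patches while roughly
    preserving the aspect ratio of the input image."""
    # Scale so that the patch budget is approximately filled.
    scale = math.sqrt(max_num_patches * patch_size**2 / (height * width))
    # Round each side to the nearest multiple of the patch size.
    h = max(patch_size, round(height * scale / patch_size) * patch_size)
    w = max(patch_size, round(width * scale / patch_size) * patch_size)
    # Shrink if rounding pushed us over the patch budget.
    while (h // patch_size) * (w // patch_size) > max_num_patches:
        if h >= w:
            h -= patch_size
        else:
            w -= patch_size
    return h, w

h, w = naflex_target_size(480, 640)
print(h, w)  # 224 288 -> 14 x 18 = 252 patches, under the 256 budget
```

A 480x640 image (aspect ratio 0.75) maps to 224x288 (aspect ratio ~0.78), illustrating how small the distortion typically stays for common resolutions.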

>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch

>>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex")
>>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex")

>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)

>>> candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
>>> texts = [f"This is a photo of {label}." for label in candidate_labels]

# the default value for `max_num_patches` is 256, but you can increase the resulting
# image resolution by providing a higher value, e.g. `max_num_patches=512`
>>> inputs = processor(text=texts, images=image, max_num_patches=256, return_tensors="pt")

>>> with torch.no_grad():
...     outputs = model(**inputs)

>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
21.1% that image 0 is '2 cats'
Feb 20, 2025
SmolVLM-2 (Based on v4.49.0)

A new model is added to transformers: SmolVLM-2. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SmolVLM-2.

In order to install this version, please install with the following command:

pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2

If fixes are needed, they will be applied to this release; this installation may therefore be considered stable and improving.

SmolVLM-2

SmolVLM-2 is detailed in the following blog post.

The models and demos using the model are available in the following collection.

Overview

SmolVLM2 is an adaptation of the Idefics3 model with two main differences:

  • It uses SmolLM2 for the text model.
  • It supports multi-image and video inputs.

Usage tips

Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: do_resize and size.

Videos should not be upsampled.

If do_resize is set to True, the model resizes images so that the longest edge is 4*512 pixels by default. The default resizing behavior can be customized by passing a dictionary to the size parameter. For example, {"longest_edge": 4 * 512} is the default, but you can change it to a different value if needed.

Here’s how to control resizing and set a custom size:

from transformers import SmolVLMImageProcessor

image_processor = SmolVLMImageProcessor(do_resize=True, size={"longest_edge": 2 * 512}, max_image_size=512)

Additionally, the max_image_size parameter, which controls the size of each square patch the image is decomposed into, is set to 512 by default but can be adjusted as needed. After resizing (if applicable), the image processor decomposes the images into square patches based on the max_image_size parameter.
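The resize-then-patch arithmetic described above can be sketched as follows. This is an illustrative approximation only, assuming a simple longest-edge resize followed by a square patch grid; the function name and rounding details are not the actual SmolVLM processor code:

```python
import math

def smolvlm_layout(height, width, longest_edge=4 * 512, max_image_size=512):
    """Resize so the longest edge equals `longest_edge`, then count the
    square patches of side `max_image_size` the image decomposes into."""
    scale = longest_edge / max(height, width)
    h, w = round(height * scale), round(width * scale)
    rows = math.ceil(h / max_image_size)
    cols = math.ceil(w / max_image_size)
    return (h, w), rows * cols  # resized size and number of square patches

size, num_patches = smolvlm_layout(1536, 3072)
print(size, num_patches)  # (1024, 2048) 8
```

With the defaults, a 1536x3072 image is resized to 1024x2048 and split into a 2x4 grid of 512-pixel patches.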

This model was contributed by orrzohar.

Usage example

Single Media inference

The model can accept both images and videos as input, but you should use only one of the modalities at a time. Here's an example code for that.

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Instruct")
model = AutoModelForImageTextToText.from_pretrained(
    "HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda"
)

conversation = [
    {
        "role": "user",
        "content":[
            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
            {"type": "text", "text": "Describe this image."}
        ]
    }
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

output_ids = model.generate(**inputs, max_new_tokens=128)
generated_texts = processor.batch_decode(output_ids, skip_special_tokens=True)
print(generated_texts)


# Video
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "video", "path": "/path/to/video.mp4"},
            {"type": "text", "text": "Describe this video in detail"}
        ]
    },
]

inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)

generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
Feb 17, 2025
v4.49.0: Helium, Qwen2.5-VL, SuperGlue, Granite Vision, Zamba2, GOT-OCR 2.0, DAB-DETR, Depth Pro, RT-DETRv2, GPTQModel

New models

Helium

Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.

<img width="860" alt="image" src="https://github.com/user-attachments/assets/52e91b74-5572-46a6-93e5-058730411675" />
  • Add-helium by @ArthurZucker in #35669

Qwen2.5-VL

The Qwen2.5-VL model is an update to Qwen2-VL from the Qwen team at Alibaba Group.

The abstract from this update is the following:

Qwen2.5-VL marks a major step forward from Qwen2-VL, built upon the latest Qwen2.5 LLM. We’ve accelerated training and testing through the strategic implementation of window attention within the ViT. The ViT architecture itself has been refined with SwiGLU and RMSNorm, aligning it more closely with the LLM’s structure. A key innovation is the expansion of native dynamic resolution to encompass the temporal dimension, in addition to spatial aspects. Furthermore, we’ve upgraded MRoPE, incorporating absolute time alignment on the time axis to allow the model to effectively capture temporal dynamics, regardless of frame rate, leading to superior video understanding.

  • add qwen2.5vl by @ShuaiBai623 in #35569

SuperGlue

The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.

This model consists of matching two sets of interest points detected in an image. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.

<img width="424" alt="image" src="https://github.com/user-attachments/assets/1d81983f-f9ce-4d82-adb7-e76098df543a" />
  • Add SuperGlue model by @sbucaille in #29886

Granite Vision Support

The Granite Vision model is a variant of LLaVA-NeXT, leveraging a Granite language model alongside a SigLIP visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to VipLlava. It also uses a larger set of image grid pinpoints than the original LLaVA-NeXT models to support additional aspect ratios.

  • Granite Vision Support by @alex-jw-brooks in #35579

Zamba2

Zamba2 is a large language model (LLM) trained by Zyphra, and made available under an Apache 2.0 license.

Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B are hybrid models combining state-space models (specifically Mamba) and transformer blocks, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 Mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively.
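The hybrid layout can be sketched as a block schedule. This is an illustrative sketch only (names are made up, not Zamba2 internals): a shared transformer block is inserted after every 6 Mamba blocks.

```python
# Illustrative: build the assumed block ordering for a Zamba2-style hybrid
# stack, inserting a shared transformer block after every `interval` Mamba blocks.
def zamba2_block_schedule(num_mamba_blocks: int, interval: int = 6):
    schedule = []
    for i in range(1, num_mamba_blocks + 1):
        schedule.append("mamba")
        if i % interval == 0:
            schedule.append("shared_transformer")
    return schedule

print(zamba2_block_schedule(12))
```

For 12 Mamba blocks this yields two shared transformer insertions, one after block 6 and one after block 12.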

  • Add Zamba2 by @pglorio in #34517
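A hedged generation sketch (the Zyphra checkpoint id and the prompt are illustrative assumptions):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Zyphra/Zamba2-2.7B"  # assumed checkpoint id
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

inputs = tokenizer("A hybrid of Mamba blocks and shared attention lets the model", return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=30)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```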

GOT-OCR 2.0

GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like pdftex, mathpix, matplotlib, tikz, verovio or pyecharts. The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region’s bounding box.

  • Add GOT-OCR 2.0 to Transformers by @yonigozlan in #34721

DAB-DETR

DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.

  • Add DAB-DETR for object detection by @conditionedstimulus in #30803

Depth PRO

DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.

  • Add Apple's Depth-Pro for depth estimation by @geetu040 in #34583
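Since the model plugs into the standard task pipelines, a one-call sketch looks like this (the checkpoint id is an assumption):

```python
from urllib.request import urlopen
from PIL import Image
from transformers import pipeline

# assumed checkpoint id
depth_estimator = pipeline("depth-estimation", model="apple/DepthPro-hf")

img = Image.open(urlopen(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"
))
result = depth_estimator(img)
result["depth"]  # PIL image visualizing the predicted depth map
```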

RT-DETRv2

An improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 refines RT-DETR by introducing selective multi-scale feature extraction and a discrete sampling operator for broader deployment compatibility. These improvements yield a 0.3 to 1.4 increase in mAP metrics on the COCO dataset, all while maintaining the same parameter count and frames-per-second (FPS) performance.

  • Adding RTDETRv2 by @jadechoghari in #34773

Transformers-CLI

Transformers' CLI welcomes a new command: chat. This command starts a conversation with the model of your choosing directly in your terminal.

This feature exists in TRL and has been migrated to transformers for easier usage.
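Starting a session is a one-liner in the terminal; the model id below is only an example:

```shell
# Start an interactive chat with a chat-tuned checkpoint of your choice.
transformers-cli chat --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct
```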

  • [Chat] Add Chat from TRL 🐈 by @gante in #35714

Processor Standardization

Ongoing work aims to standardize the image processors so that their APIs are equivalent. Additionally, each processor is being given a fast variant so that image processing is never a bottleneck in the pipeline.

In this release, several processors have been standardized and have had their fast versions contributed.

  • OwlViT/Owlv2 post processing standardization by @qubvel in #34929
  • OmDet Turbo processor standardization by @qubvel in #34937
  • Grounding DINO Processor standardization by @qubvel in #34853
  • Refactoring of ImageProcessorFast by @yonigozlan in #35069
  • add Qwen2-VL image processor fast by @yonigozlan in #35733
  • Remove Multi-threaded image conversion for fast image processors by @yonigozlan in #36105

Breaking changes

DPT segmentation maps

DPT image processors did not support segmentation_maps, instead only requiring images. This has been fixed. This adds an argument to the preprocess method, so users passing positional arguments to that method may see changed behavior. We recommend using keyword arguments with such methods so as to be unaffected by the addition of new arguments.

  • 🔴 🔴 🔴 Added segmentation maps support for DPT image processor by @simonreise in #34345
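A minimal sketch of the safer keyword-argument style (a default-constructed processor is used here only to illustrate; real code would load one with from_pretrained):

```python
import numpy as np
from transformers import DPTImageProcessor

# Default-constructed processor, enough to illustrate the new argument;
# real use would call DPTImageProcessor.from_pretrained(...).
processor = DPTImageProcessor()
image = np.zeros((64, 64, 3), dtype=np.uint8)
seg_map = np.zeros((64, 64), dtype=np.uint8)

# Keyword arguments keep the call unambiguous even as new parameters are added.
encoded = processor(images=image, segmentation_maps=seg_map, return_tensors="np")
print(sorted(encoded.keys()))
```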

Image classification pipeline and single vs multi-label

The problem_type in the config.json file was read incorrectly by the pipeline, which mapped single-label to multi-label losses, and vice-versa. This has been fixed.

  • 🚨🚨🚨 image-classification pipeline single-label and multi-label prob type squashing fns (sigmoid vs softmax) are backwards by @rwightman in #35848

Fixing the LayerNorm beta/gamma renames

The description of the pull request is the easiest way to understand the problem, why it exists, and how it is solved; please read the description below:

  • 🚨🚨🚨 An attempt to fix #29554. Include 'LayerNorm.' in gamma/beta rename scope, optimize string search. by @rwightman in #35615

VLM cleanup

The ignore_index property of the llava configuration has been removed as it was not serving a purpose.

  • 🔴 VLM: compile compatibility by @zucchini-nlp in #35724

Quantization

Quantization has received several improvements and fixes, including the contribution of FP8 quantization and the HIGGS quantization interface.

Additionally, we're replacing the AutoGPTQ implementation with GPTQModel from ModelCloud (see the repository here).

GPTQModel originated as a major refactor of AutoGPTQ but is now a full stand-in replacement, with a cleaner API, up-to-date model support, faster inference, and higher-quality quants.

  • Enable gptqmodel by @jiqing-feng in #35012
  • Split and clean up GGUF quantization tests by @Isotr0py in #35502
  • Display warning for unknown quants config instead of an error by @SunMarc in #35963
  • Adding FP8 Quantization to transformers by @MekkCyber in #36026
  • New HIGGS quantization interfaces, JIT kernel compilation support. by @BlackSamorez in #36148

Generate

  • [generate] revert change in Aria: the maximum cache length must match max_length by @gante in #36120
  • 🧹 remove generate-related objects and methods scheduled for removal in v4.48 by @gante in #35677
  • [generate] can instantiate GenerationConfig(cache_implementation="static") by @gante in #35679
  • [generate] return Cache object even if passed in a legacy format by @gante in #35673
  • [generate] update docstring of SequenceBiasLogitsProcessor by @gante in #35699
  • Test: generate with torch.compile(model.forward) as a fast test by @gante in #34544
  • [generate] move max time tests by @gante in #35962
  • [generate] shape checks in tests compatible with fixed-length caches (+ some minor fixes) by @gante in #35993

Pipelines

Pipelines have received several bug fixes and improvements which are detailed below.

  • Stop mutating input dicts in audio classification pipeline by @Rocketknight1 in #35754
  • fix document qa bf16 pipeline by @jiqing-feng in #35456
  • fix low-precision audio classification pipeline by @jiqing-feng in #35435
  • [pipeline] missing import regarding assisted generation by @gante in #35752
  • Output dicts support in text generation pipeline by @jonasrohw in #35092
  • Fix Audio Classification Pipeline top_k Documentation Mismatch and Bug #35736 by @sambhavnoobcoder in #35771

Bugfixes and improvements

  • Fix flaky test_custom_4d_attention_mask by @ydshieh in #35606
  • Use inherit tempdir makers for tests + fix failing DS tests by @muellerzr in #35600
  • Added error when sequence length is bigger than max_position_embeddings by @Taha1506 in #32156
  • Let EarlyStoppingCallback not require load_best_model_at_end by @muellerzr in #35101
  • Fix flaky test_beam_search_low_memory by @ydshieh in #35611
  • Skip MobileNetV1ModelTest::test_batching_equivalence for now by @ydshieh in #35614
  • Update codeowners with individual model owners by @Rocketknight1 in #35595
  • Fix device in rope module when using dynamic updates by @Cyrilvallez in #35608
  • Fix whisper compile by @jiqing-feng in #35413
  • Removed some duplicated code by @Sai-Suraj-27 in #35637
  • [Phi] bias should be True by @ArthurZucker in #35650
  • Enable different torch dtype in sub models by @zucchini-nlp in #34873
  • [Compile] Only test compiling model forward pass by @ArthurZucker in #35658
  • [tests] make cuda-only tests device-agnostic by @faaany in #35607
  • [i18n-ar] Translated file : docs/source/ar/tasks/token_classification.md into Arabic by @AhmedAlmaghz in #35193
  • Fix zero_shot_image_classification documentation guide link in SigLIP by @aretrace in #35671
  • Fix : adding einops lib in the CI docker for some bitsandbytes tests by @MekkCyber in #35652
  • Update torchao.md: use auto-compilation by @martin0258 in #35490
  • Fix : HQQ config when hqq not available by @MekkCyber in #35655
  • Fix expected output for ggml test by @MekkCyber in #35686
  • Fix : add require_read_token for gemma2 gated model by @MekkCyber in #35687
  • Enhanced Installation Section in README.md by @egojoseph in #35094
  • Enhance DataCollatorForLanguageModeling with Configurable Token Replacement Probabilities by @mahdibaghbanzadeh in #35251
  • Clean-up composite configs by @zucchini-nlp in #34603
  • Add future import for Py < 3.10 by @Rocketknight1 in #35666
  • Enable gptqmodel by @jiqing-feng in #35012
  • Fix : Nemotron Processor in GGUF conversion by @MekkCyber in #35708
  • Fix typo in /docs/source/ja/model_doc/decision_transformer.md URL by @hiroaki222 in #35705
  • Replace deprecated batch_size with max_batch_size when using HybridCache by @mtreinik in #35498
  • Fix: Falcon tie_word_embeddings in GGUF by @MekkCyber in #35715
  • Fix condition when GA loss bug fix is not performed by @techkang in #35651
  • Fix the bug that Trainer cannot correctly call torch_jit_model_eval by @Wanguy in #35722
  • [generation] fix type hint by @gante in #35725
  • Add proper jinja2 error by @Rocketknight1 in #35533
  • Optimize ForCausalLMLoss by removing unnecessary contiguous() call to reduce memory overhead by @efsotr in #35646
  • Modular: support for importing functions from any file by @Cyrilvallez in #35692
  • Remove batch size argument warning when unjustified by @quintenroets in #35519
  • [cache] add a test to confirm we can use cache at train time by @gante in #35709
  • Remove pt_to_tf by @gante in #35672
  • Added resource class configuration option for check_circleci_user job by @Sai-Suraj-27 in #32866
  • Fix some tests by @Cyrilvallez in #35682
  • Unable to use MimiModel with DeepSpeed ZeRO-3 by @anferico in #34735
  • check is added for the report_to variable in TrainingArguments by @alpertunga-bile in #35403
  • Added liger_kernel compatibility with PeftModel by @ambroser53 in #35680
  • Restore is_torch_greater_or_equal_than for backward compatibility by @tlrmchlsmth in #35734
  • Revert "Unable to use MimiModel with DeepSpeed ZeRO-3" by @eustlb in #35755
  • ci: fix xpu skip condition for test_model_parallel_beam_search by @dvrogozh in #35742
  • Use AMD CI workflow defined in hf-workflows by @ivarflakstad in #35058
  • Fix CI for VLMs by @zucchini-nlp in #35690
  • Security fix for self-comment-ci.yml by @ydshieh in #35548
  • [ViTPose] Convert more checkpoints by @NielsRogge in #35638
  • fix register_buffer in MimiEuclideanCodebook by @anferico in #35759
  • remove code owners as it was generating too much noise BUT by @ArthurZucker in #35784
  • Skip Falcon 7B GGML Test by @MekkCyber in #35783
  • [fix] cannot import name 'Pop2PianoFeatureExtractor' from 'transformers' by @faaany in #35604
  • transformers.image_transforms.normalize wrong types by @CalOmnie in #35773
  • Patch moonshine by @eustlb in #35731
  • Don't import torch.distributed when it's not available by @booxter in #35777
  • Fix vits low-precision dtype by @jiqing-feng in #35418
  • Tool calling: support more types by @aymeric-roucher in #35776
  • Fixes, improvements to timm import behaviour by @rwightman in #35800
  • modular_model_converter bugfix on assignments by @nikosanto13 in #35642
  • Deterministic sorting in modular converter when adding new functions by @Cyrilvallez in #35795
  • Fix "test_chat_template_dict" in video LLMs by @zucchini-nlp in #35660
  • Update AMD Docker image by @ivarflakstad in #35804
  • Add LlavaImageProcessor by @NielsRogge in #33191
  • Byebye test_batching_equivalence's flakiness by @ydshieh in #35729
  • [Doc] Adding blog post to model doc for TimmWrapper by @ariG23498 in #35744
  • add a new flax example for Bert model inference by @louie-tsai in #34794
  • Support adamw_torch_8bit by @fzyzcjy in #34993
  • Auto-add timm tag to timm-wrapper models. by @pcuenca in #35794
  • Fix : BLOOM tie_word_embeddings in GGUF by @MekkCyber in #35812
  • Fixed typo in autoawq version number in an error message for IPEX backend requirements. by @InfroLab in #35815
  • Remove deprecated get_cached_models by @Wauplin in #35809
  • Optimized set_initialized_submodules. by @LagPixelLOL in #35493
  • [i18n-ar] Translated file: docs/source/ar/tasks/masked_language_modeling.md into Arabic by @AhmedAlmaghz in #35198
  • move fastspeech to audio models by @eustlb in #35788
  • Improve modular documentation by @Cyrilvallez in #35737
  • [Mimi] update test expected values for t4 runners by @eustlb in #35696
  • Remove old benchmark code by @gante in #35730
  • Remove pyav pin to allow python 3.11 to be used by @CalOmnie in #35823
  • Another security patch for self-comment-ci.yml by @ydshieh in #35816
  • Init cache on meta device by @zucchini-nlp in #35164
  • Hotfix: missing working-directory in self-comment-ci.yml by @ydshieh in #35833
  • [gpt2] fix generation tests by @gante in #35822
  • Fix : Nemotron tokenizer for GGUF format by @MekkCyber in #35836
  • Fix head_dim in config extracted from Gemma2 GGUF model by @Isotr0py in #35818
  • [chat] docs fix by @gante in #35840
  • Fix compatibility issues when using auto_gptq with these older versions by @LRL-ModelCloud in #35830
  • Add PyTorch version check for FA backend on AMD GPUs by @mht-sharma in #35813
  • Fix NoneType type as it requires py>=3.10 by @SunMarc in #35843
  • [ tests] remove some flash attention class tests by @ArthurZucker in #35817
  • [Backend support] Allow num_logits_to_keep as Tensor + add flag by @Cyrilvallez in #35757
  • Fix GA loss for Deepspeed by @timjeffrey10 in #35808
  • Fix uploading processors/tokenizers to WandB on train end by @jack89roberts in #35701
  • Fix more CI tests by @ArthurZucker in #35661
  • [DOC] Fix contamination and missing paragraph in translation by @Yosshi999 in #35851
  • Fix typo by @SilverSoldier in #35854
  • fix apply_chat_template() padding choice by @baoyf4244 in #35828
  • Fix test_pipelines_video_classification that was always failing by @CalOmnie in #35842
  • Fix Llava-NeXT / Llava-NeXT Video / Llava-OneVision's token unpadding mismatch by @sheryc in #35779
  • use torch.testing.assertclose instead to get more details about error in cis by @ArthurZucker in #35659
  • add xpu device check in device_placement by @faaany in #35865
  • Add Rocketknight1 to self-comment-ci.yml by @ydshieh in #35881
  • [doctest] Fixes by @stevhliu in #35863
  • Fix fast image processor warnings in object detection examples by @sugendran in #35892
  • Update deepspeed amd image by @ivarflakstad in #35906
  • Fix typing in audio_utils.chroma_filter_bank by @CalOmnie in #35888
  • [docs] uv install by @stevhliu in #35821
  • Fix the config class comparison for remote code models by @Rocketknight1 in #35592
  • Close Zamba2Config code block by @Rocketknight1 in #35914
  • [docs] Fix Zamba2 by @stevhliu in #35916
  • Remove _supports_static_cache = True for some model classes by @ydshieh in #34975
  • Use rocm6.2 for AMD images by @ivarflakstad in #35930
  • Add default TP plan for all models with backend support by @Cyrilvallez in #35870
  • Fix: loading DBRX back from saved path by @zucchini-nlp in #35728
  • Fix mask slicing for models with HybridCache by @Cyrilvallez in #35681
  • Qwen-2-5-VL: fix CI by @zucchini-nlp in #35935
  • Fix TP initialization by @Cyrilvallez in #35860
  • fix(FA): QKV not being casted to target_dtype for FA with dpo lora by @NanoCode012 in #35834
  • Remove INC notebook reference in documentation by @echarlaix in #35936
  • use torch constraints to check if covariance is positive definite during mean resizing. by @abuelnasr0 in #35693
  • fix test_generated_length_assisted_generation by @keyboardAnt in #34935
  • Update unwrap_and_save_reload_schedule to use weights_only=False by @ydshieh in #35952
  • Update squad_convert_example_to_features to work with numpy v2 by @ydshieh in #35955
  • Fix flaky test_assisted_decoding_matches_greedy_search by @ydshieh in #35951
  • Trainer Refactor: Part 1 by @muellerzr in #35567
  • update docker file transformers-pytorch-deepspeed-latest-gpu by @ydshieh in #35940
  • [tests] further fix Tester object has no attribute '_testMethodName' by @faaany in #35781
  • Update README.md by @BlessedTatonka in #35958
  • fix iterator overflow when gradient accumulation is 1 by @winglian in #35960
  • Fix is_causal being a tensor by @IlyasMoutawwakil in #35791
  • [bart] minor test fixes by @gante in #35965
  • Pixtral: vectorize patch embeddings and enable tests by @zucchini-nlp in #35122
  • Whisper: fix static cache CI by @zucchini-nlp in #35852
  • Less flaky for TimmBackboneModelTest::test_batching_equivalence by @ydshieh in #35971
  • Support batching for UsefulSensors Moonshine by @njeffrie in #35922
  • not to use A100 for benchmark.yml by @ydshieh in #35974
  • Handle empty change indices in SAM's mask to rle conversion by @MSt-10 in #35665
  • Add support for nested images to LLava and VipLLava by @yonigozlan in #35558
  • [Moonshine] compute head_dim_padding at init by @eustlb in #35984
  • [Moshi] disable automatic compilation if the model can't compile by @gante in #35992
  • use torch 2.6 for daily CI by @ydshieh in #35985
  • Update-tp test by @ArthurZucker in #35844
  • Add mean_resizing for every VLMs' resizing_token_embeddings() by @YenFuLin in #35717
  • Update Granite Vision Model Path / Tests by @alex-jw-brooks in #35998
  • Qwen2-VL: fix rope delta calculation by @zucchini-nlp in #36013
  • Fix custom kernel for DeformableDetr, RT-Detr, GroindingDINO, OmDet-Turbo in Pytorch 2.6.0 by @qubvel in #35979
  • apply_chat_template: consistent behaviour for return_assistant_tokens_mask=True return_tensors=True by @mrsndmn in #35582
  • layernorm_decay_fix by @Ryoo72 in #35927
  • Update Mistral converter by @Cyrilvallez in #35967
  • Refactor (and fix) gpt_neox by @Cyrilvallez in #35610
  • Fix device mismatch error in Whisper model during feature extraction by @thedebugger in #35866
  • Fix RMSNormGated in Zamba2 by @pglorio in #35943
  • Commont bot CI for other jobs (generation / quantization) by @ydshieh in #35341
  • Hotfix for self-comment-ci.yml by @ydshieh in #36030
  • feat(ci): ignore trufflehog unverified results by @McPatate in #36031
  • CircleCI with python 3.9 by @ydshieh in #36027
  • Update tests regarding attention types after #35235 by @ydshieh in #36024
  • Fix Gemma2 synced multi-GPU generation by @ManukyanD in #35232
  • Fix synced multi-GPU generation with LLMs and VLMs by @ManukyanD in #35893
  • Add XPU type for work-around -inf mask causing sdpa NaN issue in modeling files by @Liangliang-Ma in #35647
  • add support for empty list as input to create_model_card by @ROZBEH in #36042
  • DeepSpeed github repo move sync by @stas00 in #36021
  • [docs] no hard coding cuda as bnb has multi-backend support by @faaany in #35867
  • [docs] fix bugs in the bitsandbytes documentation by @faaany in #35868
  • [docs] no hard-coding cuda by @faaany in #36043
  • Fix how we compute the final non-padding token for ForSequenceClassification models by @Rocketknight1 in #35911
  • Add Qwen2VLImageProcessorFast into Qwen2VLProcessor by @yeliudev in #35987
  • Iterative generation using Input embeds and past_key_values by @yaswanth19 in #35890
  • Fix usage of unpad_input function by @pavelgein in #35925
  • Fix repo consistency by @ydshieh in #36063
  • Update test_flash_attn_2_can_dispatch_composite_models by @ydshieh in #36050
  • Paligemma: fix generation with Gemma2 by @zucchini-nlp in #36044
  • Save checkpoint to temporary directory to handle partial saves during failures by @SilverSoldier in #35580
  • Nail in edge case of torch dtype being overriden permantly in the case of an error by @muellerzr in #35845
  • Fix words typos in ggml test. by @zhanluxianshen in #36060
  • Fix model kwargs by @muellerzr in #35875
  • Fix StopStringCriteria to handle tokens above len(tokenizer) by @Rocketknight1 in #35797
  • [docs] fix outdated example code in trainer.md by @faaany in #36066
  • Adding RT-DETRv2 for object detection by @jadechoghari in #34773
  • Fix bug in apply_rotary_pos_emb_flashatt: in Qwen2-5-VL by @DeepWaved in #36065
  • Move audio top_k tests to the right file and add slow decorator by @Rocketknight1 in #36072
  • Fix OS err by @muellerzr in #36094
  • [docs] fix model checkpoint name by @faaany in #36075
  • [docs] fix typo by @faaany in #36080
  • [docs] fix not-working example code in perf_infer_gpu_one.md by @faaany in #36087
  • fix MllamaVisionAttention typehint by @kylesayrs in #35975
  • Processors: allow tuples of images when checking by @zucchini-nlp in #36084
  • Chat template: update for processor by @zucchini-nlp in #35953
  • Paligemma: revert #36084 by @zucchini-nlp in #36113
  • Support constant lr with cooldown by @LoserCheems in #35453
  • Enable pytest live log and show warning logs on GitHub Actions CI runs by @ydshieh in #35912
  • Refactor OPT model by @jiqing-feng in #36101
  • Revert checkpoint tmp dir by @SunMarc in #36112
  • [Bugfix] fix file name of docstring in utils/check_table.py by @kkscilife in #36108
  • fix bnb warning by @SunMarc in #36116
  • AutoformerForPrediction test add atol by @ivarflakstad in #36017
  • Fix nighlty CIs: missing atols by @ArthurZucker in #35903
  • Add common test for torch.export and fix some vision models by @qubvel in #35124
  • fix: typos in documentation files by @maximevtush in #36122
  • update awesome-transformers.md. by @zhanluxianshen in #36115
  • Fix max size deprecated warning by @HichTala in #34998
  • Fix CI issues by @molbap in #35662
  • update tiktoken integ to use converted by @ArthurZucker in #36135
  • Make output_dir Optional in TrainingArguments #27866 by @sambhavnoobcoder in #35735
  • [docs] minor doc fix by @faaany in #36127
  • [docs] update awq doc by @faaany in #36079
  • Add pipeline parallel plan to PretrainedConfig and PreTrainedModel by @hmellor in #36091
  • add RAdamScheduleFree optimizer by @nhamanasu in #35313
  • added warning to Trainer when label_names is not specified for PeftModel by @MilkClouds in #32085
  • Whisper: remove redundant assisted generation tests by @gante in #34814
  • Add utility for Reload Transformers imports cache for development workflow #35508 by @sambhavnoobcoder in #35858
  • VLM: enable skipped tests by @zucchini-nlp in #35746
  • [commands] remove deprecated/inoperational commands by @gante in #35718
  • Fix Gradient Checkpointing for Deberta & Deberta-V2 using PEFT / Adapters by @lenglaender in #35898
  • 🚨 Remove cache migration script by @Wauplin in #35810
  • multi-gpu: fix tensor device placements for various models by @dvrogozh in #35763
  • Optim: APOLLO optimizer integration by @zhuhanqing in #36062
  • Fix multi gpu loss sync condition, add doc and test by @techkang in #35743
  • adding option to save/reload scaler by @hsilva664 in #34932
  • Update doc re list of models supporting TP by @kwen2501 in #35864
  • Add more rigerous non-slow grad accum tests by @muellerzr in #35668
  • Fix test fetcher by @ydshieh in #36129
  • skip test_initialization for VitPoseBackboneModelTest for now by @ydshieh in #36154
  • Add git LFS to AMD docker image by @ivarflakstad in #36016
  • Mllama fsdp by @blbadger in #36000
  • Fix PaliGemma Pad Token Masking During Training #35855 by @sambhavnoobcoder in #35859
  • Add reminder config to issue template and print DS version in env by @Ben-Schneider-code in #35156
  • Fix Gemma2 dtype issue when storing weights in float16 precision by @Nerogar in #35398
  • Replace deprecated update_repo_visibility by @Wauplin in #35970
  • Fix tests for vision models by @qubvel in #35654
  • qwen2.5vl: fix bugs when using flash2+bf16 or num_return_sequences>1 by @gewenbin0992 in #36083
  • docs: fix return type annotation of get_default_model_revision by @MarcoGorelli in #35982
  • Fix PretrainedTokenizerFast check => Fix PretrainedTokenizerFast Save by @CL-ModelCloud in #35835
  • Move DataCollatorForMultipleChoice from the docs to the package by @bauwenst in #34763
  • Helium documentation fixes by @LysandreJik in #36170
  • Remove loading custom kernel for RT-DETRv2 by @qubvel in #36098
  • [Modular] skip modular checks based on diff by @gante in #36130
  • Fix red CI by @ArthurZucker in #36174
  • Fix : fix doc fp8 by @MekkCyber in #36173
  • Efficient Inference Kernel for SpQR by @elvircrn in #34976
  • fix training issues by @ArthurZucker in #36158
  • add disable compile option by @ArthurZucker in #36161
  • CI: avoid human error, automatically infer generative models by @gante in #33212
  • Use tqdm auto by @SmartManoj in #35726
  • Optimize Qwen2VL vision model by precomputing cos/sin embeds before ViT blocks by @li-plus in #35837
  • Make check_repository_consistency run faster by MP by @ydshieh in #36175
  • Fix the key name for _load_rng_state under torch.cuda by @wizyoung in #36138
  • Follow up to SpQR integration by @MekkCyber in #36176
  • Fix a mistake in #36175 by @ydshieh in #36179
  • Fix make_batched_videos and add tests by @yonigozlan in #36143
  • Uniformize OwlViT and Owlv2 processors by @yonigozlan in #35700
  • Add support for partial rotary embeddings in Phi3 model by @garg-amit in #35947
  • CI: fix test-save-trainer by @zucchini-nlp in #36191
  • Chat template docs by @zucchini-nlp in #36163
  • Add ImageProcessorFast to Qwen2.5-VL processor by @Isotr0py in #36164
  • Prepare processors for VideoLLMs by @zucchini-nlp in #36149
  • Add require_read_token to fp8 tests by @MekkCyber in #36189
  • Revert qwen2 breaking changes related to attention refactor by @ArthurZucker in #36162
  • Guard against unset resolved_archive_file by @dmlap in #35628
  • [Bugfix] Fix reloading of pixtral/llava configs by @kylesayrs in #36077

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @jiqing-feng
    • Fix whisper compile (#35413)
    • Enable gptqmodel (#35012)
    • fix document qa bf16 pipeline (#35456)
    • Fix vits low-precision dtype (#35418)
    • fix low-precision audio classification pipeline (#35435)
    • Refactor OPT model (#36101)
  • @AhmedAlmaghz
    • [i18n-ar] Translated file : docs/source/ar/tasks/token_classification.md into Arabic (#35193)
    • [i18n-ar] Translated file: docs/source/ar/tasks/masked_language_modeling.md into Arabic (#35198)
  • @sbucaille
    • Add SuperGlue model (#29886)
  • @Isotr0py
    • Fix head_dim in config extracted from Gemma2 GGUF model (#35818)
    • Split and clean up GGUF quantization tests (#35502)
    • Add ImageProcessorFast to Qwen2.5-VL processor (#36164)
  • @ShuaiBai623
    • add qwen2.5vl (#35569)
  • @alex-jw-brooks
    • Granite Vision Support (#35579)
    • Update Granite Vision Model Path / Tests (#35998)
  • @pglorio
    • Add Zamba2 (#34517)
    • Fix RMSNormGated in Zamba2 (#35943)
  • @conditionedstimulus
    • Add DAB-DETR for object detection (#30803)
  • @jadechoghari
    • Adding RT-DETRv2 for object detection (#34773)
  • @geetu040
    • Add Apple's Depth-Pro for depth estimation (#34583)
  • @zhuhanqing
    • Optim: APOLLO optimizer integration (#36062)
  • @bauwenst
    • Move DataCollatorForMultipleChoice from the docs to the package (#34763)
  • @elvircrn
    • Efficient Inference Kernel for SpQR (#34976)
Feb 7, 2025
Patch release v4.48.3

This mostly puts an end to the Python 3.9 issues!

  • Add future import for Py < 3.10 (#35666) by @Rocketknight1

For some very niche cases, the new RoPE embedding introduced device failures:

  • Fix device in rope module when using dynamic updates (#35608) by @Cyrilvallez

Num items in batch

  • Fix model kwargs (#35875) by @muellerzr: this was long overdue, sorry it took so long. Some models were not compatible with num_items_in_batch

Finally the fix to Gemma2 is propagated to paligemma2!

  • Paligemma: fix generation with Gemma2 (#36044) by @zucchini-nlp
Jan 30, 2025
Patch release v4.48.2

Sorry, the fixes for num_items_in_batches are not done yet 😓 To follow along, see this PR; a new patch will be available soon!

Beyond that, we mostly had backward-compatibility issues with Python 3.9:

  • Restore is_torch_greater_or_equal_than for backward compatibility (#35734) by @tlrmchlsmth
  • Fix NoneType type as it requires py>=3.10 (#35843) by @SunMarc

Then we had a small regression for DBRX saving:

  • Fix: loading DBRX back from saved path (#35728) by @zucchini-nlp

Finally we have a fix for gemma and the hybrid attention architectures:

  • Fix mask slicing for models with HybridCache (#35681) by @Cyrilvallez

Miscellaneous:

  • Fix is_causal being a tensor (#35791) by @IlyasMoutawwakil
Jan 20, 2025
Patch release v4.48.1

Yet again we ship a gradient accumulation fix! There was also a refactoring of the attention that let a small typo slip in; we made sure Phi is no longer broken!

Moonshine had a small issue when wrapping generate, so we removed that!

  • [Phi] bias should be True (#35650) @ArthurZucker
  • Fix condition when GA loss bug fix is not performed (#35651) @techkang
  • Patch moonshine (#35731) @eustlb

🤗

Jan 10, 2025
v4.48.0: ModernBERT, Aria, TimmWrapper, ColPali, Falcon3, Bamba, VitPose, DinoV2 w/ Registers, Emu3, Cohere v2, TextNet, DiffLlama, PixtralLarge, Moonshine

New models

ModernBERT

The ModernBert model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.

It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.

It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:

  • Rotary Positional Embeddings to support sequences of up to 8192 tokens.
  • Unpadding to ensure no compute is wasted on padding tokens, speeding up processing time for batches with mixed-length sequences.
  • GeGLU Replacing the original MLP layers with GeGLU layers, shown to improve performance.
  • Alternating Attention where most attention layers employ a sliding window of 128 tokens, with Global Attention only used every 3 layers.
  • Flash Attention to speed up processing.
  • A model designed following the recent paper The Case for Co-Designing Model Architectures with Hardware, ensuring maximum efficiency across inference GPUs.
  • Modern training data scales (2 trillion tokens) and mixtures (including code and math data)

  • Add ModernBERT to Transformers by @warner-benjamin in #35158
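As a quick sketch, the encoder can be used through the fill-mask pipeline (the checkpoint id is an assumption):

```python
from transformers import pipeline

# assumed checkpoint id
fill_mask = pipeline("fill-mask", model="answerdotai/ModernBERT-base")
preds = fill_mask("The capital of France is [MASK].")
print(preds[0]["token_str"])
```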

Aria

The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.

Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.

  • Add Aria by @aymeric-roucher in #34157

TimmWrapper

We add a TimmWrapper set of classes such that timm models can be loaded into the library as transformers models.

Here's a general usage example:

import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoConfig, AutoModelForImageClassification, AutoImageProcessor

checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))

# The timm checkpoint loads through the usual Auto* classes.
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)

with torch.no_grad():
    logits = model(**inputs).logits

# Top-5 class probabilities (as percentages) and their class indices.
top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)

Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc:

import torch
from urllib.request import urlopen
from PIL import Image

from transformers import pipeline

img = Image.open(urlopen(
    'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))
  • Add TimmWrapper by @qubvel and @amyeroberts in #34564

Pixtral-Large

Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.

  • Update Pixtral conversion script to support large format! by @arthurzucker in #34801

ColPali

The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.

In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.

  • Add ColPali to 🤗 transformers by @tonywu71 and @yonigozlan in #33736

Falcon3

Falcon3 represents a natural evolution from previous releases, emphasizing an expansion of the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performance while reducing training costs:

  • One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPUs and 14 trillion tokens of web, code, STEM, and curated high-quality and multilingual data.
  • Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B-parameter model by duplicating the redundant layers and continuing pre-training with 2 trillion tokens of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
  • Knowledge distillation for better tiny models: To provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100 billion tokens of curated high-quality data, thereby redefining pre-training efficiency.

  • Add Falcon3 documentation by @mokeddembillel in #35307

Bamba

Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.

Check out all Bamba-9B model checkpoints here.

  • Add the Bamba Model by @fabianlim in #34982

VitPose

ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation".

The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.

  • Add VitPose by @SangbumChoi and @NielsRogge in #30530

DINOv2 with registers

The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.

The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.

Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.

The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:

  • no artifacts
  • interpretable attention maps
  • and improved performance.
  • Add DINOv2 with registers by @NielsRogge in #35348
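
Feature extraction with the register variant works like any other ViT backbone in the library. A minimal sketch, assuming the `facebook/dinov2-with-registers-base` checkpoint name; note that the token axis now includes the register tokens alongside the CLS and patch tokens:

```python
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint name for the base-size register variant.
checkpoint = "facebook/dinov2-with-registers-base"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint)

img = Image.open(urlopen(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"
))
inputs = processor(img, return_tensors="pt")
with torch.no_grad():
    # (batch, CLS + register tokens + patch tokens, hidden_size)
    features = model(**inputs).last_hidden_state
print(features.shape)
```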

Emu3

The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.

Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on the VQ-VAE model. Discretized visual tokens are later fused with text token IDs for image and text generation.

Emu3 outperforms leading models like SDXL and LLaVA-1.6 in both generation and perception tasks, without relying on diffusion or compositional methods.

  • Add Emu3 by @zucchini-nlp in #33770

Cohere2

A new Cohere update was added through a new "Cohere2" set of classes.

  • Add Cohere2 model by @alexrs-cohere in #35224

TextNet

TextNet is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants TextNet-T, TextNet-S, and TextNet-B (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.

  • Add TextNet by @jadechoghari in #34979

DiffLlama

Differential Transformer combines the Llama architecture with Differential Transformer's Attention.

  • Add DiffLllama by @weak-kajuma in #34083

PixtralLarge

The conversion script needed a few updates, while the modeling code was barely changed!

  • [PixtralLarge] Update Pixtral conversion script to support large format! (#34801)

Moonshine

Moonshine is an autoregressive speech recognition encoder-decoder model that improves upon Whisper's architecture. Namely, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper, which is restricted to fixed 30-second windows. It was introduced by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, and Pete Warden in Moonshine: Speech Recognition for Live Transcription and Voice Commands.

  • Add Moonshine by @eustlb in #34784
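
Moonshine plugs into the standard ASR pipeline. A minimal sketch, assuming the `UsefulSensors/moonshine-tiny` checkpoint name on the Hub; a silent buffer stands in for real audio here just to exercise the API:

```python
import numpy as np
from transformers import pipeline

# Assumed checkpoint name for the smallest Moonshine variant.
asr = pipeline("automatic-speech-recognition", model="UsefulSensors/moonshine-tiny")

# One second of silence at 16 kHz; replace with a real recording of any length,
# since Moonshine is not limited to Whisper's 30-second windows.
audio = np.zeros(16000, dtype=np.float32)
result = asr(audio)
print(result)
```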

Quantization methods

VPTQ Quantization

From the VPTQ contributors:

VPTQ is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B, and even 405B, models to 1-2 bits without retraining while maintaining high accuracy. More details here: https://github.com/microsoft/vptq

  • FEAT : Adding VPTQ quantization method to HFQuantizer by @wejoncy in #34770

HIGGS Quantization

From the contributors:

HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper.

Runtime support for HIGGS is implemented through the FLUTE library.

This PR adds support for HIGGS+FLUTE into transformers, allowing for low-error 0-shot quantization and fast LLM inference.

  • HIGGS Quantization Support by @BlackSamorez in #34997

Cleanup

We merged a cleanup for vision language models to ensure all models are standardized.

  • VLMs: major clean up 🧼 (#34502)

Breaking changes

Conversion scripts

Many models in Transformers include scripts to convert the original model checkpoints into a Transformers-compatible format. These scripts can be found in the repo using the glob pattern models/**/convert_*.py. They were a recurring source of vulnerability reports and CVEs because many models were originally released using insecure formats like older PyTorch .bin weights or pickle files. The conversion scripts had to open these formats, and this meant that they were vulnerable to maliciously crafted inputs.

In practice, we do not see this as a serious vulnerability. The conversion scripts are never imported or called by the rest of the library; each script is standalone, and so the only way to exploit the vulnerability is to create a malicious checkpoint, induce a user to download it, and then also induce them to manually call a specific conversion script on it.

However, even if there is little practical risk of an exploit, we are aware that open vulnerability reports create a compliance problem for users, and so beginning with this release we will be excluding these conversion scripts from release branches and wheels. They will remain accessible to developers on the main branch.

  • 🚨🚨🚨 Delete conversion scripts when making release wheels by @Rocketknight1 in #35296

Backtracking in Nougat

A regular expression used within the Nougat code has been modified to ensure it does not hang. The method should produce the same results, but we cannot guarantee it; we recommend upgrading to the latest transformers if you use this model to ensure your code is performance-optimized.

  • 🚨🚨🚨 Limit backtracking in Nougat regexp by @qubvel in #35264

Whisper decoding

This PR finalizes work that aims to enable short-form (< 30 secs) and long-form generation using temperature fallback. It is a significant improvement to the Whisper codebase, but it does result in the following breaking changes:

➡️ Previously:
• Short-form: Returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: Returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.

➡️ From now on:
Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.

Decoder input IDs and EOS token IDs are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True).

In this case, the output will be a ModelOutput, which is the result of the underlying call to GenerationMixin’s generate. Indeed, return_timestamps=False ensures no seeking occurs; only a single call to generate is made. Therefore, this output includes both decoder input IDs and the EOS token ID.

  • [Whisper] 🚨 Fix whisper decoding 🚨 by @eustlb in #34135
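
The new behavior can be seen in a small sketch. Assumptions: the small `openai/whisper-tiny` checkpoint is used, and a silent buffer stands in for real audio; `return_dict_in_generate=True` with `return_timestamps=False` is one of the two cases described above where a full ModelOutput (including decoder input IDs and the EOS token) is returned:

```python
import numpy as np
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

# One second of silence at 16 kHz as a stand-in for real audio.
inputs = processor(np.zeros(16000, dtype=np.float32),
                   sampling_rate=16000, return_tensors="pt")

# No seeking occurs here, so generate() is called exactly once and the
# resulting ModelOutput carries the full token sequences.
out = model.generate(**inputs, return_dict_in_generate=True, return_timestamps=False)
print(type(out).__name__, out.sequences.shape)
```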

Attention refactor

In order to have cleaner, isolated, future-proof code for the attention layers, they have been refactored to keep each model's attention code within its own file; attention definitions relating to SDPA, Flash Attention, and other attention types have been moved to a common file.

  • 🚨All attention refactor🚨 by @ArthurZucker in #35235

Bugfixes and improvements

  • Pipeline: simple API for assisted generation by @gante and @Rocketknight1 #34504
  • [tokenizers] Ensure that add_prefix_space is propagated to backend_tokenizer.pre_tokenizer (#35593)
  • Setup loss_type in config at model init time (#34616)
  • [docs] Update Python version in translations by @jla524 in #35096
  • [docs] top_p, top_k, temperature docstrings by @stevhliu in #35065
  • Fix private forked repo. CI by @ydshieh in #35114
  • Add feature dim attributes to BitLinear for easier PEFT integration by @agostinv in #34946
  • Update I-JEPA checkpoints path by @qubvel in #35120
  • Fix GA loss bugs and add unit test by @techkang in #35121
  • [I-JEPA] Update docs by @NielsRogge in #35148
  • Corrected typo in agent system prompts by @Uvi-12 in #35143
  • Option to set 'non_blocking' for to(device) in BatchEncoding and BatchFeature by @daniel-bogdoll in #34883
  • Fix typo in EETQ Tests by @MekkCyber in #35160
  • Cleanup: continue the init refactor by @LysandreJik in #35167
  • Super tiny fix logging message by @fzyzcjy in #35132
  • Fixed typo of 'avilable' in prompts.py by @Uvi-12 in #35145
  • [CI] Fix bnb quantization tests with accelerate>=1.2.0 by @matthewdouglas in #35172
  • Fix num_items_in_batch not being an integer by @xspirus in #35115
  • Assisted decoding multi-gpu by @zucchini-nlp in #35116
  • Fix file path for shard_num 1 with mllama converter by @strangiato in #35053
  • Support BatchNorm in Hubert pos_conv_emb as in fairseq by @gallilmaimon in #34389
  • Remove unnecessary masked_fill in deberta models by @xadupre in #35182
  • Fix DBRX LayerNorm init method by @hgt312 in #35177
  • Fixing GGUF support for StableLm by @MekkCyber in #35060
  • [i18n-ar] Translated file : docs/source/ar/community.md into Arabic by @AhmedAlmaghz in #33027
  • Multiple typo fixes in NLP, Audio docs by @henryhmko in #35181
  • Only import torch.distributed if it is available by @GaetanLepage in #35133
  • [i18n-<languageCode>] Translating Benchmarks.md to Chinese by @asdkfjsd in #35137
  • [docs] Fix FlashAttention link by @stevhliu in #35171
  • Update data collator docstrings to accurately reference Nvidia tensor core compute capability version by @johngrahamreynolds in #35188
  • [i18n-<languageCode>] Translating agents.md to Chinese by @HMJ0628 in #35139
  • BLIP: enable device map by @zucchini-nlp in #34850
  • 🧹 Remove deprecated RotaryEmbedding parts in the Attention layers by @Cyrilvallez in #34858
  • [PEFT] Better Trainer error when prompt learning with loading best model at the end by @BenjaminBossan in #35087
  • Cleanup: continue the init refactor by @LysandreJik in #35170
  • Fix CI by @Cyrilvallez in #35208
  • Fix seamless TTS generate by @ylacombe in #34968
  • docs: clarify initializer_range parameter description in Idefics3VisionConfig by @h3110Fr13nd in #35215
  • Fixed typo of 'indentifier' in audio_utils.py by @Uvi-12 in #35226
  • Fix type hints for apply_chat_template by @Rocketknight1 in #35216
  • Support Python 3.10+ Union style in chat template type hints parsing by @RezaRahemtola in #35103
  • Refactoring AssistedCandidateGenerator for Improved Modularity and Reusability by @keyboardAnt and @jmamou in #35009
  • Change back to Thread for SF conversion by @ydshieh in #35236
  • [Init refactor] Modular changes by @LysandreJik in #35240
  • Fix typo in chat template example by @EricWinsorDSIT in #35250
  • Run model as compressed/uncompressed mode by @horheynm in #34719
  • skip Fuyu from test_generate by @nhamanasu in #35246
  • [tests] fix "Tester object has no attribute '_testMethodName'" by @faaany in #34910
  • Use rsfE with pytest by @ydshieh in #35119
  • Update AMD docker image (rocm 6.1) by @ivarflakstad in #35259
  • Fixed typos in Audio Classification Documentation by @Uvi-12 in #35263
  • Translating agents_advanced.md to Chinese by @HMJ0628 in #35231
  • Fix FSDP no longer working by @muellerzr in #35212
  • don't use no_sync when deepspeed doesn't support it for certain zero stages by @winglian in #35157
  • [i18n-Chinese] Translating perf_train_cpu.md to Chinese by @asdkfjsd in #35242
  • Fall back to slow image processor in ImageProcessingAuto when no fast processor available by @yonigozlan in #34785
  • Aggeregate test summary files in CircleCI workflow runs by @ydshieh in #34989
  • Blip: fix offloading and MP tests by @zucchini-nlp in #35239
  • Fix : model used to test ggml conversion of Falcon-7b is incorrect by @MekkCyber in #35083
  • Temporarily disable amd push ci by @ivarflakstad in #35293
  • Delete redundancy for loop checks. by @zhanluxianshen in #35288
  • [Whisper] patch float type on mps by @eustlb in #35295
  • Fix typos in Translated Audio Classification Docs by @jla524 in #35287
  • Translating "translate perf_infer_gpu_multi.md" to Chinese by @HMJ0628 in #35271
  • Fix wrongs in quicktour[zh] by @zhanluxianshen in #35272
  • Improved documentation of Automatic speech recognition by @Uvi-12 in #35268
  • fix modular order by @ArthurZucker in #35297
  • Add sdpa for Beit by @OmarManzoor in #34941
  • Support for SDPA for SAM models by @MagnusS0 in #34110
  • remove benchmark job in push-important-models.yml by @ydshieh in #35292
  • Fix typos in translated quicktour docs by @jla524 in #35302
  • Fix image preview in multi-GPU inference docs by @jla524 in #35303
  • Fix remove unused parameter in docs by @zzzzzsa in #35306
  • Add Cohere2 docs details by @alexrs-cohere in #35294
  • Fixed typo in audio_classification.md by @Uvi-12 in #35305
  • [docs] Improve register_pipeline by @stevhliu in #35300
  • Fix loading with only state dict and low_cpu_mem_usage = True by @SunMarc in #35217
  • [tests] make cuda-only tests device-agnostic by @faaany in #35222
  • Trigger GitHub CI with a comment on PR by @ydshieh in #35211
  • change bnb tests by @jiqing-feng in #34713
  • [Whisper] fix docstrings typo by @eustlb in #35319
  • feat: add benchmarks_entrypoint.py by @McPatate in #34495
  • Fix documentation for ColPali by @tonywu71 in #35321
  • Update comment CI bot by @ydshieh in #35323
  • PaliGemma: Make sure to add <eos> to suffix if <image> is present in text by @probicheaux in #35201
  • Fix some fa2 tests by @ArthurZucker in #35340
  • Modernbert Release Fixes by @warner-benjamin in #35344
  • [docs] Add link to ModernBERT Text Classification GLUE finetuning script by @tomaarsen in #35347
  • fix onnx export of speech foundation models by @nikosanto13 in #34224
  • [Mamba2] Fix caching, slow path, and multi-gpu by @vasqu in #35154
  • Reduce CircleCI usage by @ydshieh in #35355
  • Implement AsyncTextIteratorStreamer for asynchronous streaming by @CISC in #34931
  • Cleaner attention interfaces by @Cyrilvallez in #35342
  • Add Tensor Parallel support for Qwen2VL by @jla524 in #35050
  • fix zoedepth initialization error under deepspeed zero3 by @Tavish9 in #35011
  • Aurevoir PyTorch 1 by @ydshieh in #35358
  • bugfix: torch.export failure caused by _make_causal_mask by @jiwoong-choi in #35291
  • update codecarbon by @nhamanasu in #35243
  • Update test fetcher when we want to test all by @ArthurZucker in #35364
  • Use weights_only=True with torch.load for transfo_xl by @ydshieh in #35241
  • Make test_generate_with_static_cache even less flaky by @ydshieh in #34995
  • Improve modular transformers documentation by @joelpaulkoch in #35322
  • Improved Documentation Of Audio Classification by @Uvi-12 in #35368
  • [docs] Follow up register_pipeline by @stevhliu in #35310
  • owlvit/2 dynamic input resolution by @bastrob in #34764
  • Fix new FA2 if is_causal is passed explicitly by @Cyrilvallez in #35390
  • bitsandbytes: simplify 8bit dequantization by @matthewdouglas in #35068
  • make LlamaModel._update_causal_mask torch compilable by @winglian in #35187
  • Patch GPTNeoX to use adequate FA2 if position_ids is provided by @taha-yassine in #35318
  • uniformize kwargs for SAM by @tibor-reiss in #34578
  • Deprecate _is_quantized_training_enabled by @MekkCyber in #34991
  • Scale loss before backward by @qgallouedec in #35207
  • Fix typing in docstring for PaliGemmaProcessor by @alvarobartt in #35278
  • Fix : VPTQ test by @MekkCyber in #35394
  • add bnb support for Ascend NPU by @statelesshz in #31512
  • bugfix Idefics3 processor - handle gracefully cases with text and no images by @mfarre in #35363
  • Adding logger.info about update_torch_dtype in some quantizers by @MekkCyber in #35046
  • Add compile test for fast image processor by @yonigozlan in #35184
  • Disable .github/workflows/self-comment-ci.yml for now by @ydshieh in #35366
  • enable non-cuda awq model support without modify version by @jiqing-feng in #35334
  • [GPTQ, CompressedTensors] Fix unsafe imports and metada check by @vasqu in #34815
  • Drop inplace operation for loss computation with gradient accumulation by @qgallouedec in #35416
  • Fix: Rename keyword argument in_channels to num_channels by @ningyuv in #35289
  • CLIP conversion script - Change fairseq to OpenAI by @gau-nernst in #35384
  • Fix f-string to show ACCELERATE_MIN_VERSION on error by @KSafran in #35189
  • Fix model_accepts_loss_kwargs for timm model by @qubvel in #35257
  • Update perf_infer_gpu_one.md: fix a typo by @martin0258 in #35441
  • Add compute_loss_func to Seq2SeqTrainer by @d223302 in #35136
  • Update docs for sdpa_kernel by @jla524 in #35410
  • [i18n-ar] Translated file: docs/source/ar/tasks/question_answering.md into Arabic by @AhmedAlmaghz in #35196
  • [i18n-ar] Translated file: docs/source/ar/tasks/summarization.md into Arabic by @AhmedAlmaghz in #35195
  • Update translated docs for sdpa_kernel by @jla524 in #35461
  • Reintroduce Python 3.9 support for ModernBERT by @tomaarsen in #35458
  • Fix new BNB test failures by @matthewdouglas in #35345
  • Fix docs typos. by @zhanluxianshen in #35465
  • Fix paligemma warning message by @hiyouga in #35486

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @ydshieh
    • Fix private forked repo. CI (#35114)
    • Change back to Thread for SF conversion (#35236)
    • Use rsfE with pytest (#35119)
    • Aggeregate test summary files in CircleCI workflow runs (#34989)
    • remove benchmark job in push-important-models.yml (#35292)
    • Trigger GitHub CI with a comment on PR (#35211)
    • Update comment CI bot (#35323)
    • Reduce CircleCI usage (#35355)
    • Aurevoir PyTorch 1 (#35358)
    • Use weights_only=True with torch.load for transfo_xl (#35241)
    • Make test_generate_with_static_cache even less flaky (#34995)
    • Disable .github/workflows/self-comment-ci.yml for now (#35366)
  • @aymeric-roucher
    • Add Aria (#34157)
  • @NielsRogge
    • [I-JEPA] Update docs (#35148)
    • Add DINOv2 with registers (#35348)
  • @HMJ0628
    • [i18n-<languageCode>] Translating agents.md to Chinese (#35139)
    • Translating agents_advanced.md to Chinese (#35231)
    • Translating "translate perf_infer_gpu_multi.md" to Chinese (#35271)
  • @alexrs-cohere
    • Add Cohere2 model (#35224)
    • Add Cohere2 docs details (#35294)
  • @ArthurZucker
    • fix modular order (#35297)
    • 🚨All attention refactor🚨 (#35235)
    • Fix some fa2 tests (#35340)
    • Update test fetcher when we want to test all (#35364)
  • @tonywu71
    • Add ColPali to 🤗 transformers (#33736)
    • Fix documentation for ColPali (#35321)
  • @OmarManzoor
    • Add sdpa for Beit (#34941)
  • @fabianlim
    • Add the Bamba Model (#34982)
  • @warner-benjamin
    • Add ModernBERT to Transformers (#35158)
    • Modernbert Release Fixes (#35344)
  • @wejoncy
    • FEAT : Adding VPTQ quantization method to HFQuantizer (#34770)
  • @bastrob
    • owlvit/2 dynamic input resolution (#34764)
  • @BlackSamorez
    • HIGGS Quantization Support (#34997)
Dec 17, 2024

Patch release v4.47.1

We waited a little bit to make sure it was stable, thanks @winglian for double checking and everyone for the fixes!

  • Fix GA loss bugs and add unit test (#35121) Contributed by @techkang and @ArthurZucker.

  • Fix num_items_in_batch not being an integer (#35115) Contributed by @xspirus.

  • Fix FSDP no longer working (#35212) Contributed by @muellerzr.

  • Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35212) Contributed by @winglian.

  • Only import torch.distributed if it is available (#35133) Contributed by @GaetanLepage.

  • [Whisper] Patch float type on MPS (#35295) Contributed by @eustlb. 🔜 we should probably have MPS CIs to avoid repeating this!

Dec 5, 2024
v4.47.0: PaliGemma-2, I-JEPA, OLMo-2, LayerSkip, Tensor Parallel

New models

PaliGemma-2

PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.

PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.

<img width="743" alt="image" src="https://github.com/user-attachments/assets/55cda8a6-b463-4a58-b7d3-f7d50ee2fa11">

I-JEPA

The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.

<img width="413" alt="image" src="https://github.com/user-attachments/assets/561ca9d7-0327-477a-96b8-61d2af0caf34">
  • Add I-JEPA by @jmtzt in #33125

OLMo 2

<img width="833" alt="image" src="https://github.com/user-attachments/assets/1abdde92-0aae-404a-b83e-77ec8bd13b7f">

The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.

The architectural changes from the original OLMo model to this model are:

  • RMSNorm is used instead of standard layer norm.
  • Norm is applied to attention queries and keys.
  • Norm is applied after attention/feedforward layers rather than before.

Commits:

  • Add OLMo November 2024 by @2015aroras in #34551
  • Rename OLMo November to OLMo2 by @2015aroras in #34864

Layer-Skip Llama

We add support for Meta's Layer-Skip Llama 3.2 1B model.

The Llama3.2 1B model was continually pretrained with LayerSkip recipe, early exit loss and layer dropout, as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding and is capable of performing self-speculative decoding: decode with earlier layers and verify with remaining layers.

<img width="854" alt="image" src="https://github.com/user-attachments/assets/4a9e3596-e44e-419f-804d-9f4d03f8f680">
  • Self-speculation (Layer-Skip Llama) by @ArthurZucker in #34240

Tensor Parallel implementation

This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example).

The motivation is multi-fold:

  1. to make the modeling code as simple as in the single-worker case:
    all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.

  2. to make tensor parallelism easily accessible by users:
    added a model.tensor_parallel(device_mesh) method that allows users to turn a single-proc model into a parallel model.

This is the first PR of many to simplify and enable Tensor Parallel across models.

  • Simplify Tensor Parallel implementation with PyTorch TP by @kwen2501 in #34184

Farewell, Python 3.8

Python 3.8 has reached its end of life and, as such, we drop it from our CI.

  • Drop support for Python 3.8 by @ydshieh in #34314

GGUF improvements

Several improvements have been made to the GGUF support in transformers, notably by adding new architectures to the list of supported architectures.

  • Add T5 GGUF loading support by @junejae in #33389
  • Add GGUF for Mamba by @VladOS95-cyber in #34200
  • Add Nemotron GGUF Loading Support by @farrosalferro in #34725
  • Improve gguf tensor processing by @VladOS95-cyber in #34515
  • Fix use_parallel_residual and qkv_bias for StableLM GGUF config extraction by @Isotr0py in #34450
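
For reference, GGUF checkpoints are loaded by passing a `gguf_file` to `from_pretrained`, which dequantizes the weights into a regular transformers model. A sketch under stated assumptions: the repo and filename below are illustrative picks of a small community GGUF file, and the `gguf` Python package must be installed:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed repo/filename of a small community GGUF checkpoint; requires `pip install gguf`.
model_id = "TheBloke/TinyLlama-1.1B-Chat-v1.0-GGUF"
gguf_file = "tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf"

tokenizer = AutoTokenizer.from_pretrained(model_id, gguf_file=gguf_file)
# The quantized GGUF tensors are dequantized to regular torch weights on load.
model = AutoModelForCausalLM.from_pretrained(model_id, gguf_file=gguf_file)
print(model.config.model_type)
```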

Fast processors

We continue the work to improve the speed of fast processors as detailed in this roadmap.

We contribute a fast processor to RT-DETR.

  • Add Image Processor Fast RT-DETR by @yonigozlan in #34354
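
Fast processors are opted into with `use_fast=True` on the auto class. A minimal sketch, assuming the `PekingU/rtdetr_r50vd` checkpoint name; the fast variants are torchvision-based, so torchvision must be installed:

```python
from transformers import AutoImageProcessor

# use_fast=True selects the fast (torchvision-based) processor when one exists
# for the architecture; "PekingU/rtdetr_r50vd" is an assumed RT-DETR checkpoint.
processor = AutoImageProcessor.from_pretrained("PekingU/rtdetr_r50vd", use_fast=True)
print(type(processor).__name__)
```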

New pipelines

A new pipeline has been added to transformers: image-text-to-text!

The pipeline supports the following inputs:

  • unbatched images and text - images=image, text=text
  • batched images and text - images = [image, image], text= [text, text]
  • several images per prompt (only for models supporting the use of an image token) - images = [[image, image], [image]] or images=[image, image, image], text = ["... <image>...<image>...", "...<image>..."]
  • Chat templates (for models supporting them).
  • Add image text to text pipeline by @yonigozlan in #34170
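
The unbatched case above can be sketched as follows. Assumptions: the small `llava-hf/llava-interleave-qwen-0.5b-hf` checkpoint is an illustrative pick of a model that uses an `<image>` token, and the output is expected to be a list of dicts carrying the generated text:

```python
from transformers import pipeline

# Assumed small checkpoint; any image-text-to-text model on the Hub works the same way.
pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

out = pipe(
    images="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png",
    text="<image> What do you see?",
    max_new_tokens=20,
)
print(out)
```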

Notable refactors

Separate chat templates into a single file

We have had several issues with chat templates because they're stored as single lines in the JSON config files:

  • Impossible to review diffs
  • Very hard to edit in the web UI (or in general)
  • Differences between processor templates in chat_template.json and tokenizer templates in tokenizer_config.json causing confusion
  • Some models use multiple templates, requiring a template dict, but we're trying to discourage that in future and move those models to single templates with conditional behaviour instead

The solution:

  • Just move chat templates to a single chat_template.jinja file in the repo
  • If multiple templates are required, then they should still be stored in the JSON file. This is not supported for Processor classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
  • If a chat_template.jinja file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any chat_template entry in tokenizer_config.json.

For now, we continue saving in the old format by default. I'll probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it's ready for deployment when we do the switch.

  • Separate chat templates into a single file by @Rocketknight1 in #33957
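
From the user side nothing changes: whether the template lives in `chat_template.jinja` or in the JSON config, `apply_chat_template` behaves identically. A small sketch, using the `HuggingFaceH4/zephyr-7b-beta` tokenizer as an arbitrary example of a repo with a chat template:

```python
from transformers import AutoTokenizer

# Arbitrary example tokenizer with a chat template. If the repo ships a
# chat_template.jinja file, it takes precedence over the chat_template entry
# in tokenizer_config.json, but this call is unchanged either way.
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
rendered = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(rendered)
```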

Large modular logic refactor

This PR largely reworks the logic we use in the modular converter. It is (hopefully) clearer and more maintainable. Instead of going in all directions, adding stuff, then deleting it if not needed, we now do the following:

  • visit the modular file (recording import/function/class/assignment nodes)
    • create function dependency mapping
  • for each import coming from another model:
    • visit the corresponding file
    • create function dependency mapping
    • update mapping with function/assignment from the modular (updated/new functions)
    • create the class dependency graph based on merged dependencies
  • update dependency graph of the modular with the functions and assignments imported from the other files
  • for each class recorded in the modular:
    • if inheriting from a class in another file:
      • replace call to super
      • find the dependencies after the node was replaced
      • follow (updated with modular defs) dependency mapping to add all nodes
    • else:
      • only add needed imported functions (and their dependencies)
  • determine the needed imports and add them
  • Large modular logic refactoring by @Cyrilvallez in #34487
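The first two steps above (visit a file, record its function nodes, and build a function dependency mapping) can be sketched with the standard `ast` module. This is a minimal illustration of the idea, not the converter's actual code (the real converter uses richer node tracking; `function_dependencies` is a hypothetical helper):

```python
import ast

def function_dependencies(source):
    """Parse a source file, record its top-level function definitions,
    and map each function to the other recorded functions it calls."""
    tree = ast.parse(source)
    # Record top-level function nodes, as the visit step does.
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}
    deps = {}
    for name, node in funcs.items():
        # Collect simple-name calls made anywhere inside the function body.
        called = {
            sub.func.id
            for sub in ast.walk(node)
            if isinstance(sub, ast.Call) and isinstance(sub.func, ast.Name)
        }
        # Keep only calls to functions recorded in this file.
        deps[name] = sorted(called & funcs.keys())
    return deps
```

In the converter, a mapping like this is what gets merged with the source model's mapping, so that when a modular class replaces `super()` calls, every function it transitively depends on can be pulled into the generated file.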

Community bugfixes and improvements

  • Remove graph breaks for torch.compile() in flash_attention_forward when Lllama Model is padding free tuned by @Abhishek-TAMU in #33932
  • Better defaults by @ArthurZucker in #34026
  • translated gguf.md into chinese by @blueingman in #34163
  • CI: fix failures by @zucchini-nlp in #34371
  • Zamba is an LM by @LysandreJik in #34342
  • add code generation to natural language processing section by @furtnerthomas in #34333
  • Fix pil_torch_interpolation_mapping import in image_processing_detr_fast by @yonigozlan in #34375
  • Add code sample docstrings and checkpoint reference for GLM models by @h3110Fr13nd in #34360
  • refactor: remove redundant if-condition and improve type correctness for convert_tokens_to_ids by @winstxnhdw in #34030
  • Ignore unsupported kwarg in ProcessorMixin call by @yonigozlan in #34285
  • [PEFT] Add warning for missing key in LoRA adapter by @BenjaminBossan in #34068
  • Fix torch.fx issue related to the new loss_kwargs keyword argument by @michaelbenayoun in #34380
  • Correct the new defaults by @Cyrilvallez in #34377
  • [auto. ping] Avoid sending empty info + add more team members by @ydshieh in #34383
  • Fix glm by @Cyrilvallez in #34388
  • Use non nested images and batched text Idefics2/3 by @yonigozlan in #34222
  • Fix onnx non-expotable inplace aten op by @IlyasMoutawwakil in #34376
  • Fix right padding in LLaVA models by @zucchini-nlp in #34305
  • no filter by @ydshieh in #34391
  • SynthID: better example by @gante in #34372
  • Tests: upgrade test_eager_matches_sdpa_generate by @gante in #34386
  • Fix bnb training test failure by @matthewdouglas in #34414
  • Avoid check expected exception when it is on CUDA by @ydshieh in #34408
  • Fix typos in agents_advanced.md by @rudydel in #34405
  • [docs] Cache implementations by @stevhliu in #34325
  • Fix pix2struct by @IlyasMoutawwakil in #34374
  • pin tensorflow_probability<0.22 in docker files by @ydshieh in #34381
  • Tiny update after #34383 by @ydshieh in #34404
  • Fix batch size handling in prediction_loop for DataLoaderShard by @zeus2611 in #34343
  • exclude fsdp from delay_optimizer_creation by @eljandoubi in #34140
  • New option called "best" for args.save_strategy. by @seanswyi in #31817
  • [docs] update input documentation for MAMBA2 and MISTRAL models to include cache_position and attention_mask details by @h3110Fr13nd in #34322
  • 🌐 [i18n-KO] Translated model_doc/barthez.md to Korean by @Jwaminju in #33980
  • Apply linting to the important code blocks to make it readable by @ShubhamJagtap2000 in #34449
  • Torchao weights only + prequantized compability by @SunMarc in #34355
  • [i18n-ar] Translated file : docs/source/ar/fast_tokenizers.md into Arabic by @AhmedAlmaghz in #33034
  • enable average tokens across devices by @techkang in #34373
  • feat: run benchmarks on A100 by @McPatate in #34287
  • Add post_process_depth_estimation for GLPN by @alex-bene in #34413
  • LLaVA: latency issues by @zucchini-nlp in #34460
  • Generation: fix test by @zucchini-nlp in #34369
  • Fix CI by @zucchini-nlp in #34458
  • use a tinymodel to test generation config which aviod timeout by @techkang in #34482
  • 🚨🚨🚨 [SuperPoint] Fix keypoint coordinate output and add post processing by @sbucaille in #33200
  • Simplify running tests in a subprocess by @ydshieh in #34213
  • Fix perplexity computation in perplexity.md by @Framartin in #34387
  • Fixes for Modular Converter on Windows by @hlky in #34266
  • Fix regression loading dtype by @SunMarc in #34409
  • Bert is ExecuTorch compatible by @guangy10 in #34424
  • manual head_dim for mixtral model by @wavy-jung in #34281
  • fix-qwen2vl-no-position_ids by @simonJJJ in #33487
  • Bug fix for drop path decay rate in swin transformer by @abhi-glitchhg in #34291
  • MobileBERT is ExecuTorch compatible by @guangy10 in #34473
  • Albert is ExecuTorch compatible by @guangy10 in #34476
  • Adding optimizer_cls_and_kwargs to Trainer.__init__ by @apoorvkh in #34358
  • Fix performance in get_imports regexp by @AlekseyLobanov in #34298
  • fix incorrect warning by @yonigozlan in #34416
  • Un-deprecate timeout arg in pipelines by @Rocketknight1 in #34382
  • Roberta is ExecuTorch compatible by @guangy10 in #34425
  • Fix format mistake in string repr of tokenizer objects by @gpetho in #34493
  • Mllama: update docs by @zucchini-nlp in #34334
  • VLMs: fix number of image tokens by @zucchini-nlp in #34332
  • Tests: move generate tests to the right mixin and delete redundant tests by @gante in #34464
  • fix pixtral processor by @molbap in #34486
  • Use torch 2.5 in scheduled CI by @ydshieh in #34465
  • Fix super tiny extra space typo by @fzyzcjy in #34440
  • UPDATE Documentation for #TRANSLATING.md Documentation into Multiple Languages.(Changes made) by @anshumangahlot in #34226
  • enable QA bf16 pipeline by @jiqing-feng in #34483
  • Fix: img size mismatch caused by incorrect unpadding in LLaVA-Next by @jp1924 in #34522
  • Fix step shifting when accumulate gradient by @kibitzing in #33673
  • avoid calling gc.collect and cuda.empty_cache by @ydshieh in #34514
  • Qwen2VL: skip base input_ids-inputs_embeds equivalence check by @gante in #34535
  • fix(DPT,Depth-Anything) Address expected_slice errors inside inference tests by @philkuz in #34518
  • feat: add benchmarks pg indexes by @McPatate in #34536
  • make test_eager_matches_sdpa_inference less flaky by @ydshieh in #34512
  • Bug Fix for issue #34294 by @fpgaminer in #34295
  • [CLIPSeg] Make interpolate_pos_encoding default to True by @NielsRogge in #34419
  • update doc by @jiqing-feng in #34478
  • [i18n-ar] Translated file : docs/source/ar/multilingual.md into Arabic by @AhmedAlmaghz in #33048
  • Blip: get/set input embeddings correctly by @zucchini-nlp in #34152
  • BLIP: enable generation tests by @zucchini-nlp in #34174
  • :red_circle: :red_circle: fix query_pre_attn_scalar different of num_heads in default gemma2 config by @molbap in #34540
  • [i18n-HI] Translated accelerate page to Hindi by @karthik-script in #34443
  • Update trainer for easier handling of accumulate, compile fixes, and proper reporting by @muellerzr in #34511
  • VLM: special multimodal Tokenizer by @zucchini-nlp in #34461
  • MPS: isin_mps_friendly can support 0D tensors by @gante in #34538
  • Add text support to the Trainer's TensorBoard integration by @JacobLinCool in #34418
  • [i18n-HI] Translated TFLite page to Hindi by @karthik-script in #34572
  • 🌐 [i18n-KO] Translated perf_train_special.md to Korean by @maximizemaxwell in #34590
  • 🌐 [i18n-KO] Update README_ko.md by @J4BEZ in #33098
  • fix TrainerState doc because num_input_tokens_seen is unused by defau… by @techkang in #34593
  • Fix Whisper CI by @ydshieh in #34541
  • Skip DeepSpeed ZeRO Stage 3 model initialization when bnb by @eljandoubi in #34395
  • FIX: Broken repr of TorchAoConfig by @BenjaminBossan in #34560
  • Load sub-configs from composite configs by @zucchini-nlp in #34410
  • DistilBERT is ExecuTorch compatible by @guangy10 in #34475
  • Remove unused test_dataset by @thisisiron in #34516
  • Revert "Fix Whisper CI" by @ydshieh in #34605
  • Fix #34494 assistant tokens when truncated by @yonigottesman in #34531
  • Remove @slow for test_eager_matches_sdpa_inference by @ydshieh in #34558
  • Changing repr in torchao to show quantized Linear by @MekkCyber in #34202
  • Fix torchvision interpolation CI by @yonigozlan in #34539
  • 🌐 [i18n-KO] Translated convbert.md to Korean by @ahnjj in #34599
  • fix(dvclive): pass fake dataset to avoid exception in trainer init by @shcheklein in #34455
  • 🌐 [i18n-KO] Translated timesformer.md to Korean by @mreraser in #33972
  • 🌐 [i18n-KO] Translated bert.md to Korean by @maximizemaxwell in #34627
  • [i18n-ar] Translated file : docs/source/ar/trainer.md into Arabic by @AhmedAlmaghz in #33080
  • Update llm_engine.py by @louisbrulenaudet in #33332
  • Agents: turn any Space into a Tool with Tool.from_space() by @aymeric-roucher in #34561
  • [docs] update not-working model revision by @faaany in #34682
  • [i18n-ar] Translated file : docs/source/ar/torchscript.md into Arabic by @AhmedAlmaghz in #33079
  • Agents: Small fixes in streaming to gradio + add tests by @aymeric-roucher in #34549
  • 🌐 [i18n-KO] Translated marian.md to Korean by @maximizemaxwell in #34698
  • [docs] Broken link in generation_strategies by @pcuenca in #34717
  • Fix example in EsmConfig docstring by @yuanx749 in #34653
  • [docs] add xpu device check by @faaany in #34684
  • Retain newlines in chat template when continue_final_message=True by @lewtun in #34253
  • Update llava.md by @LysandreJik in #34749
  • fix(wandb): pass fake dataset to avoid exception in trainer (see #34455) by @CezaPasc in #34720
  • add xpu path for awq by @jiqing-feng in #34712
  • FSDP grad accum fix by @winglian in #34645
  • Remove FSDP wrapping from sub-models. by @eljandoubi in #34452
  • 🧼 remove v4.44 deprecations by @gante in #34245
  • VLMs: patch_size -> num_image_tokens in processing by @zucchini-nlp in #33424
  • Fix broken link by @ofek in #34618
  • fix a typo bug where 'id2label' was incorrectly written as 'i2label' when reading config by @ZuoChenFttS in #34637
  • Fix skip of test_training_gradient_checkpointing by @dvrogozh in #34723
  • make sure to disable gradients for integer tensor by @winglian in #32943
  • [docs] make empty_cache device-agnostic by @faaany in #34774
  • [docs] add XPU besides CUDA, MPS etc. by @faaany in #34777
  • [tests] add XPU part to testing by @faaany in #34778
  • fix: Update pixel_values parameter in hf_model input by @thisisiron in #34782
  • Fix callback key name by @jung-hunsoo in #34762
  • fix: Wrong task mentioned in docs by @ecyht2 in #34757
  • Allow handling files as args for a tool created with Tool.from_space by @aymeric-roucher in #34687
  • Fix Whisper CI by @ydshieh in #34617
  • protect tensor parallel usage by @ArthurZucker in #34800
  • Trainer hyperparameter search kwargs docs update by @GuillemGSubies in #34459
  • feat: allow to use hf-hub models for timm backbone by @cgebbe in #34729
  • Support gradient checkpointing in Qwen2VL ViT by @li-plus in #34724
  • Fix: siglip image processor rgb_convert is not being applied correctly. by @jp1924 in #34301
  • fix cpu bnb path by @jiqing-feng in #34647
  • Gemma capping by @ArthurZucker in #34282
  • Fix cache_utils for optimum.quanto kvcache quantization by @SunMarc in #34750
  • Modular fix by @Cyrilvallez in #34802
  • MLU devices : Checks if mlu is available via an cndev-based check which won't trigger the drivers and leave mlu by @huismiling in #34326
  • 🚨🚨🚨 fix(Mask2Former): torch export 🚨🚨🚨 by @philkuz in #34393
  • Feature: print tokens per second during training by @tibor-reiss in #34507
  • Add do_convert_rgb to vit by @jp1924 in #34523
  • Fix post process function called in the instance segmentation example of mask2former by @OnTheThirdDay in #34588
  • fix crash in tiiuae/falcon-11B-vlm image-to-text generation by @sywangyi in #34728
  • Add support for OpenAI api "image_url" input in chat for image-text-to-text pipeline by @yonigozlan in #34562
  • Add Image Processor Fast Deformable DETR by @yonigozlan in #34353
  • Run test_medium_seamless_m4t_pt in subprocess to avoid many failures by @ydshieh in #34812
  • Fix check_training_gradient_checkpointing by @ydshieh in #34806
  • Added image-text-to-text pipeline to task guide by @merveenoyan in #34783
  • Translate attention.md into Chinese by @wwwbai in #34716
  • LLaVA OV: fix unpadding precision by @zucchini-nlp in #34779
  • Fix low memory beam search by @zucchini-nlp in #34746
  • Fix the memory usage issue of logits in generate() by @kjohew in #34813
  • fix(DPT,Depth-Anything) torch.export by @philkuz in #34103
  • Fix: take into account meta device by @tibor-reiss in #34134
  • Fix hyperparameter search when optuna+deepseed by @corentin-ryr in #34642
  • Fix CI by tweaking torchao tests by @SunMarc in #34832
  • Fix CI slack reporting issue by @ydshieh in #34833
  • VLMs: enable generation tests - last batch by @zucchini-nlp in #34484
  • Change logging level from warning to info for max_steps overriding num_train_epochs by @qgallouedec in #34810
  • Fix ds nvme by @eljandoubi in #34444
  • Fix heuristic scheduling for UAG by @jmamou in #34805
  • Refactor StarCoder2 using modular by @Cyrilvallez in #34015
  • Watermarking: fix order by @zucchini-nlp in #34849
  • Update checks for torch.distributed.tensor to require torch >= 2.5 by @loadams in #34816
  • Remove quantization related config from dequantized model by @konradkalita in #34856
  • Auto compile when static cache by @ArthurZucker in #34247
  • Speculative decoding: Test the target distribution (to prevent issues like #32867) by @keyboardAnt in #34553
  • smol improvements to support more flexible usage by @andimarafioti in #34857
  • [CI] Skip EETQ tests while package is broken with latest transformers by @BenjaminBossan in #34854
  • Bitnet test fix to avoid using gated model by @MekkCyber in #34863
  • Fix support for image processors modifications in modular by @yonigozlan in #34866
  • Fix: Enable prefill phase key value caching of nemotron/minitron models by @jeongin601 in #34742
  • Add safe_globals to resume training on PyTorch 2.6 by @dvrogozh in #34632
  • Cache: init empty cache when use_cache by @zucchini-nlp in #34274
  • BLIP: fix generation after hub update by @zucchini-nlp in #34876
  • [Deberta/Deberta-v2] Refactor code base to support compile, export, and fix LLM by @ArthurZucker in #22105
  • 🔴 Mllama: fix base prefix by @zucchini-nlp in #34874
  • Sum gathered input tokens by @techkang in #34554
  • allow unused input parameters passthrough when chunking in asr pipelines by @VictorAtIfInsurance in #33889
  • prepare_fa2_from_position_ids function bugfix by @meliksahturker in #33269
  • chore: fix some typos by @wanxiangchwng in #34891
  • Fix convert_tokens_to_string when decoder is None by @dszeto in #34569
  • [peft] Given that self.active_adapter is deprecated, avoid using it by @tomaarsen in #34804
  • Fix Qwen2 failing tests by @jla524 in #34819
  • Fix : BitNet tests by @MekkCyber in #34895
  • [AWQ, CI] Bump AWQ version used in docker image by @BenjaminBossan in #34922
  • fix static cache data type miss-match by @jiqing-feng in #34799
  • Fix test_auto_backbone_timm_model_from_pretrained by @ydshieh in #34877
  • Upgrade torch version to 2.5 in dockerfile for quantization CI by @MekkCyber in #34924
  • Fix failling GGML test by @MekkCyber in #34871
  • Updated documentation and added conversion utility by @ViktorooReps in #34319
  • making gpt2 fx traceable by @xuzifei-dmatrix in #34633
  • Fix import structure for Fast Image processors by @yonigozlan in #34859
  • VideoLLaVA: add default values by @zucchini-nlp in #34916
  • Skipping aqlm non working inference tests till fix merged by @MekkCyber in #34865
  • [Whisper] Fix whisper integration tests by @eustlb in #34111
  • Add Pytorch Tensor Parallel support for Mistral by @VladOS95-cyber in #34927
  • change apply_rotary_pos_emb of Glmmodel for GLM-Edge Series model by @zRzRzRzRzRzRzR in #34629
  • Fix torch.onnx.export of Qwen2-VL vision encoder by @xenova in #34852
  • Update the Python version in the Chinese README to match the English README. by @vansin in #34870
  • [i18n-ar] Translated file : docs/source/ar/benchmarks.md into Arabic by @AhmedAlmaghz in #33023
  • [docs] use device-agnostic API instead of cuda by @faaany in #34913
  • [doc] use full path for run_qa.py by @faaany in #34914
  • docs: HUGGINGFACE_HUB_CACHE -> HF_HUB_CACHE by @imba-tjd in #34904
  • [i18n-zh]Translated tiktoken.md into chinese by @blueingman in #34936
  • [FlexAttention] Update gemma2 by @ArthurZucker in #34942
  • Fix : Add PEFT from source to CI docker by @MekkCyber in #34969
  • Avoid calling get_max_length by @ydshieh in #34971
  • Fix flaky test execution caused by Thread by @ydshieh in #34966
  • 🌐 [i18n-KO] Translated encoder-decoder.md to Korean by @maximizemaxwell in #34880
  • [docs] add explanation to release_memory() by @faaany in #34911
  • [i18n-zh]Translated perf_train_special.md into Chinese by @blueingman in #34948
  • Fix typo in code block in vipllava.md by @yuanx749 in #34957
  • Fixed typo in VisitWebpageTool by @sergiopaniego in #34978
  • [PEFT] Set eval mode when loading PEFT adapter by @BenjaminBossan in #34509
  • Fix save_pretrained for partially offloaded models by @kylesayrs in #34890
  • 🚨🚨🚨 Changed DINOv2Config default patch size to 14 by @OFSkean in #34568
  • Refine the code of Universal Assisted Generation by @xinpengzz in #34823
  • Allow compressed-tensors quantized model to be trained by @horheynm in #34520
  • Offloaded cache: fix generate by @zucchini-nlp in #34921
  • Fix utils/check_bad_commit.py (for auto ping in CI) by @ydshieh in #34943
  • Add optimized PixtralImageProcessorFast by @mgoin in #34836
  • Improve .from_pretrained type annotations by @qubvel in #34973
  • Fix docker CI : install autogptq from source by @MekkCyber in #35000
  • Let server decide default repo visibility by @Wauplin in #34999
  • 🚨🚨🚨 Uniformize kwargs for TrOCR Processor by @tibor-reiss in #34587
  • Update timm version by @qubvel in #35005
  • fix: double verbs by @SamuelLarkin in #35008
  • Update FillMaskPipeline.__call__ signature and docstring by @alvarobartt in #35006
  • Only cast cu_seqlens when tracing by @xenova in #35016
  • fix variable undefined bug when return_tensors is not specified in llava processing by @chenweize1998 in #34953
  • Optimize memory usage of mllama encoder by @milesial in #34930
  • Typo in warning switching to optimum-quanto by @Bojun-Feng in #35028
  • Add type hints for forward functions in Gemma2 by @jla524 in #35034
  • Fix test_eager_matches_sdpa_inference for XPU backend by @dvrogozh in #34889
  • Multiple typo fixes in Tutorials docs by @henryhmko in #35035
  • add docstring example for compute_loss_func by @secrettoad in #35020
  • [i18n-ar] Translated file : docs/source/ar/notebooks.md into Arabic by @AhmedAlmaghz in #33049
  • [docs] add the missing import for Image and bug fix by @faaany in #34776
  • Translate bertlogy.md into Chinese by @wwwbai in #34908
  • Automatic compilation in generate: do not rely on inner function by @Cyrilvallez in #34923
  • Add token cost + runtime monitoring to Agent and HfEngine children by @aymeric-roucher in #34548
  • Fix BertGeneration by @ydshieh in #35043
  • fix speecht5 failure issue in test_peft_gradient_checkpointing_enable… by @sywangyi in #34454
  • [docs] fix example code bug by @faaany in #35054
  • Translate community.md into Chinese by @wwwbai in #35013
  • [docs] use device-agnostic instead of cuda by @faaany in #35047
  • [docs] use device-agnostic API instead of hard-coded cuda by @faaany in #35048
  • Fix pad_token_tensor is None in warning by @tshu-w in #34005
  • Add Pytorch Tensor Parallel support for Qwen2, Qwen2Moe, Starcoder2 by @VladOS95-cyber in #35007
  • [GPTNeoX] Flex Attention + Refactor by @vasqu in #34896
  • Support for easier multimodal use of modular by @Cyrilvallez in #35056
  • [docs] add a comment that offloading requires CUDA GPU by @faaany in #35055
  • [docs] Increase visibility of torch_dtype="auto" by @stevhliu in #35067
  • Informative by @ydshieh in #35059
  • [Whisper] Fix whisper tokenizer by @eustlb in #34537
  • [tokenizers] bump to 0.21 by @ArthurZucker in #34972
  • Update Mistral conversion script by @Cyrilvallez in #34829
  • Fix tie_word_embeddings handling for GGUF models by @Isotr0py in #35085
  • Deprecate quanto and switch to optimum-quanto by @MekkCyber in #35001
  • BLIP: this is correct now by @zucchini-nlp in #35081
  • [trainer] fix the GA model_accepts_loss_kwargs by @ArthurZucker in #34915
  • Fix flaky Hub CI (test_trainer.py) by @ydshieh in #35062
  • Adaptive dynamic number of speculative tokens by @jmamou in #34156

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @AhmedAlmaghz
    • [i18n-ar] Translated file : docs/source/ar/fast_tokenizers.md into Arabic (#33034)
    • [i18n-ar] Translated file : docs/source/ar/multilingual.md into Arabic (#33048)
    • [i18n-ar] Translated file : docs/source/ar/trainer.md into Arabic (#33080)
    • [i18n-ar] Translated file : docs/source/ar/torchscript.md into Arabic (#33079)
    • [i18n-ar] Translated file : docs/source/ar/benchmarks.md into Arabic (#33023)
  • @maximizemaxwell
    • 🌐 [i18n-KO] Translated perf_train_special.md to Korean (#34590)
    • 🌐 [i18n-KO] Translated bert.md to Korean (#34627)
    • 🌐 [i18n-KO] Translated marian.md to Korean (#34698)
    • 🌐 [i18n-KO] Translated encoder-decoder.md to Korean (#34880)
  • @2015aroras
    • Add OLMo November 2024 (#34551)
    • Rename OLMo November to OLMo2 (#34864)
  • @mgoin
    • Add optimized PixtralImageProcessorFast (#34836)
Nov 18, 2024
Patch release v4.46.3

One small fix for FSDP + gradient accumulation loss issue!

  • FSDP grad accum fix, #34645 by @winglian