Since the release of Llama 4, we have fixed a few issues that we are now releasing in patch v4.51.1.
Thanks to everyone for your patience!
Llama 4, developed by Meta, introduces a new auto-regressive Mixture-of-Experts (MoE) architecture. This generation includes two models:
Both models leverage early fusion for native multimodality, enabling them to process text and image inputs. Maverick and Scout are both trained on up to 40 trillion tokens of data encompassing 200 languages (with specific fine-tuning support for 12 languages including Arabic, Spanish, German, and Hindi).
For deployment, Llama 4 Scout is designed for accessibility, fitting on a single server-grade GPU via on-the-fly 4-bit or 8-bit quantization, while Maverick is available in BF16 and FP8 formats. These models are released under the custom Llama 4 Community License Agreement, available on the model repositories.
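If you want to try the on-the-fly quantized path for Scout, here is a minimal sketch assuming a bitsandbytes install and access to the gated Scout checkpoint; it is an illustration rather than an official recipe, and the exact settings needed to fit your GPU may differ.

# Hedged sketch: load Llama 4 Scout with on-the-fly 4-bit quantization via bitsandbytes.
# Assumes `pip install -U transformers bitsandbytes accelerate` and access to the gated checkpoint.
from transformers import AutoProcessor, BitsAndBytesConfig, Llama4ForConditionalGeneration
import torch

model_id = "meta-llama/Llama-4-Scout-17B-16E-Instruct"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,  # compute in bf16 while weights stay in 4-bit
)

processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",
)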
Getting started with Llama 4 using transformers is straightforward. Make sure you have transformers v4.51.0 or later installed:
pip install -U transformers[hf_xet]
Here's a quick example using the instruction-tuned Maverick model responding about two images, using tensor parallel for maximum speed. You need to run this script on an instance with 8 GPUs, using a command like:
torchrun --nproc-per-node=8 script.py
from transformers import AutoProcessor, Llama4ForConditionalGeneration
import torch
model_id = "meta-llama/Llama-4-Maverick-17B-128E-Instruct"
processor = AutoProcessor.from_pretrained(model_id)
model = Llama4ForConditionalGeneration.from_pretrained(
    model_id,
    attn_implementation="flex_attention",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
url1 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/0052a70beed5bf71b92610a43a52df6d286cd5f3/diffusers/rabbit.jpg"
url2 = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/datasets/cat_style_layout.png"
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": url1},
            {"type": "image", "url": url2},
            {"type": "text", "text": "Can you describe how these two images are similar, and how they differ?"},
        ]
    },
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=256,
)
response = processor.batch_decode(outputs[:, inputs["input_ids"].shape[-1]:])[0]
print(response)
print(outputs[0])
Make sure to check the model cards on the repos (Llama 4 Maverick (~400B) and Llama 4 Scout (~109B)) for detailed usage instructions, including multimodal examples, specific prompt formats (like system prompts), quantization details, and advanced configuration options!
Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for the Phi-3.5 and Phi-4.0 models. The model processes text, image, and audio inputs, generates text outputs, and comes with a 128K-token context length. The model underwent an enhancement process, incorporating supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modality supports are the following:
DeepSeek-v3 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
The model is detailed in the following paper.
The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.
The abstract from the paper is the following:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
The Qwen3 architecture has been contributed to transformers and is available in v4.51.0. At the time of release, the models themselves had not yet been published; stay tuned for a release from the Qwen team!
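Since only the architecture is in place for now, any checkpoint name is necessarily a placeholder; the sketch below simply shows how a Qwen3 checkpoint would load through the standard auto classes once the Qwen team publishes weights.

# Hedged sketch: the checkpoint id below is hypothetical at the time of writing.
from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "Qwen/Qwen3-8B"  # placeholder id; replace with a released checkpoint

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint, torch_dtype="auto", device_map="auto")

inputs = tokenizer("Give me a short introduction to large language models.", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))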
Model docs are getting a significant overhaul, providing much-needed, ready-to-use examples that can be copy-pasted into your modules or consoles. We will adapt these examples to each model, with the goal of providing relevant examples on a per-model basis.
A very large PR was provided by @nikosanto13 that helped add modular files to all speech models in the library; seeing the differences between them is now much simpler, as are maintenance and eventual refactors.
- original_max_position_embeddings to YARN rope_scaling optional keys by @JustinTong0323 in #36877
- trainer_pt_utils.py docstrings for consistency by @ethanknights in #36912
- DataCollatorForWholeWordMask by @capemox in #36903
- uv for installing packages by @Sai-Suraj-27 in #36957
- networkx==3.2.1 manually in some CircleCI jobs after #36957 by @ydshieh in #37000
- to_py_obj for python-native numeric lists and scalars by @n0gu-furiosa in #36885
- qwen2_vl.md to Korean by @MinJu-Ha in #36750
- AwqConfigTest by @faaany in #37032
- test_assisted_decoding_in_different_gpu test on XPU by @yao-matrix in #37120
- _VALID_DICT_FIELDS to class attribute for shared dict parsing in subclasses by @Tavish9 in #36736
- [ModernBERT] Never save 'reference_compile' config; should be set based on end user by @tomaarsen in #36305
- 307 in RequestCounter by @ydshieh in #36953
- TASK_MAPPING by @saattrupdan in #37107
- min_new_tokens to prevent flaky length checks by @gante in #37175
- num_items_in_batch if necessary by @regisss in #36967
- utils/check_bad_commit.py by @ydshieh in #37272
- return_tensors in audio chat templates by @zucchini-nlp in #34601
- 0.11.2 by @ydshieh in #36962
- lru_cache for tokenization tests by @ydshieh in #36818
- return_dict logic to remove complicated if/else paths by @qubvel in #36794

The following contributors have made significant changes to the library over the last release:
A new model is added to transformers: DeepSeek 3 (also known as DeepSeek R1). It is added on top of the v4.50.3 release, and can be installed from the following tag: v4.50.3-DeepSeek-3.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.50.3-DeepSeek-3
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The model is detailed in the following paper.
The DeepSeek-V3 model was proposed in DeepSeek-V3 Technical Report by DeepSeek-AI Team.
The abstract from the paper is the following:
We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks. The model checkpoints are available at https://github.com/deepseek-ai/DeepSeek-V3.
We are super happy to make this code community-powered, and would love to see how you can help optimize the following:
- get_packed_weights from integrations/tensor_parallel

The model uses Multi-head Latent Attention (MLA) and DeepSeekMoE architectures for efficient inference and cost-effective training. It employs an auxiliary-loss-free strategy for load balancing and a multi-token prediction training objective. The model can be used for various language tasks after being pre-trained on 14.8 trillion tokens and going through Supervised Fine-Tuning and Reinforcement Learning stages.
You can run the model in FP8 automatically; two nodes of 8 H100s each should be more than enough!
# `run_deepseek_r1.py`
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
torch.manual_seed(30)
tokenizer = AutoTokenizer.from_pretrained("deepseek-r1")
chat = [
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
{"role": "user", "content": "I'd like to show off how chat templating works!"},
]
model = AutoModelForCausalLM.from_pretrained("deepseek-r1", device_map="auto", torch_dtype=torch.bfloat16)
inputs = tokenizer.apply_chat_template(chat, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
outputs = model.generate(inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs))
This generated:
<|Assistant|><think>
Okay, the user wants to demonstrate how chat templating works. Let me break down what that means. Chat templating is about structuring the conversation data, especially for models that need specific input formats. Maybe they're referring to something like how messages are formatted with roles (user, assistant, system) in APIs like OpenAI.
First, I should explain what chat templating is. It's the process of formatting conversation data into a structured format that the model can understand. This usually includes roles and content. For example, user messages, assistant responses, and system messages each have their own role tags.
They might want an example. Let me think of a simple conversation. The user says "Hello, how are you?" and the assistant responds "I'm doing great. How can I help you today?" Then the user follows up with wanting to show off chat templating. So the example should include the history and the new message.
In some frameworks, like Hugging Face's Transformers, chat templates are applied using Jinja2 templates. The template might look something like combining system messages, then looping through user and assistant messages with appropriate tags. For instance, using {% for message in messages %} and assigning roles like <|user|>, <|assistant|>, etc.
I should structure the example with the messages array, showing each role and content. Then apply a hypothetical template to convert that into a formatted string the model uses. Also, mention that different models have different templating requirements, like using special tokens or varying role labels.
Wait, the user mentioned "chat templating" in the context of showing off. Maybe they want a practical example they can present. So providing a code snippet or a structured data example would be helpful. Let me outline a typical messages array and then the templated output.
Also, it's important to note that proper templating ensures the model knows the conversation flow, which is crucial for generating coherent responses. Maybe include a note about why it's important, like maintaining context and role-specific processing.
Let me check if there are any common mistakes or things to avoid. For example, not closing tags properly, or mismatching roles. But maybe that's too detailed unless the user asks. Focus on the positive example first.
Putting it all together, the response should have an example messages array, the applied template, and the final formatted string. Maybe use angle brackets or special tokens as placeholders. Also, mention that this helps in training or fine-tuning models with structured data.
I think that's a solid approach. Let me structure it step by step to make it clear.
</think>
Chat templating is a way to structure conversation data (e.g., user/assistant interactions) into a format that language models understand. This is especially important for models trained to handle multi-turn dialogues, where the input must explicitly separate roles (user, assistant, system, etc.) and messages. Let’s break this down with an example!
---
### **Step 1: Raw Conversation History**
Suppose we have this conversation:
- **User**: "Hello, how are you?"
- **Assistant**: "I'm doing great. How can I help you today?"
- **User**: "I'd like to show off how chat templating works!"
---
### **Step 2: Structured Messages**
In frameworks like Hugging Face Transformers or OpenAI, conversations are often formatted as a list of dictionaries with `role` and `content`:
```python
messages = [
{"role": "user", "content": "Hello, how are you?"},
{"role": "assistant", "content": "I'm doing great. How can I help you today?"},
{"role": "user", "content": "I'd like to show off how chat templating works!"},
]
```
---
### **Step 3: Apply a Chat Template**
A **chat template** converts this structured data into a single string formatted for the model. For example, using a Jinja-style template (common in Hugging Face):
```jinja
{% for message in messages %}
{% if message['role'] == 'user' %}
<|user|>{{ message['content'] }}<|end|>
{% elif message['role'] == 'assistant' %}
<|assistant|>{{ message['content'] }}<|end|>
{% endif %}
{% endfor %}
<|assistant|>
```
---
### **Step 4: Final Templated Output**
Applying the template to our `messages` list would produce:
```text
<|user|>Hello, how are you?<|end|>
<|assistant|>I'm doing great. How can I help you today?<|end|>
<|user|>I'd like to show off how chat templating works!<|end|>
<|assistant|>
```
This tells the model:
1. The conversation history (user/assistant turns).
2. The model’s turn to generate a response (`<|assistant|>` at the end).
---
### **Key Notes**:
- **Role Separation**: Tags like `<|user|>` and `<|assistant|>` help the model distinguish speakers.
- **Special Tokens**: Models often use unique tokens (e.g., `<|end|>`) to mark message boundaries.
- **Flexibility**: Templates vary by model (e.g., OpenAI uses `{"role": "user", "content": "..."}` instead of tags).
---
### **Why This Matters**:
- **Consistency**: Ensures the model understands dialogue structure.
- **Context Preservation**: Maintains the flow of multi-turn conversations.
- **Alignment**: Matches the format the model was trained on for better performance.
Want to dive deeper or see a specific framework’s implementation (e.g., OpenAI, Llama, Mistral)? Let me know! 😊<|end▁of▁sentence|>
Use the following command to run it:
torchrun --nproc_per_node=8 --nnodes=2 --node_rank=0|1 --rdzv-id an_id --rdzv-backend c10d --rdzv-endpoint master_addr:master_port run_deepseek_r1.py
If you see the following error:

[rank0]: ncclInternalError: Internal check failed.
[rank0]: Last error:
[rank0]: Bootstrap : no socket interface found

it means NCCL was probably not loaded.
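As a quick sanity check (a diagnostic sketch, not part of the original notes), you can verify that your PyTorch build ships NCCL before launching torchrun:

# Hedged diagnostic: confirms NCCL is available in the local PyTorch build.
import torch
import torch.distributed as dist

print("CUDA available:", torch.cuda.is_available())
print("NCCL available:", dist.is_nccl_available())
# If NCCL cannot find a usable socket interface, pointing it at one explicitly often helps, e.g.:
#   export NCCL_SOCKET_IFNAME=eth0
#   export NCCL_DEBUG=INFO  # prints which interfaces NCCL tries to use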
Thanks to the vLLM team, we caught a few more bugs that had slipped in!
[generate] beam search -- fix output cropping (#37080) by @gante
[blip-2] Fix dtype mismatch when keep in fp32 (#37068) by @zucchini-nlp
Fix PixtralProcessor patch_size when spatial_merge_size is used (#37019)
I completely forgot to put these in the previous patch, sorry! They should put the transformers backend in a good spot!
[Utils] torch version checks optionally accept dev versions (#36847) by @gante
Fix processor kwargs qwen2 vl (#36890) by @yonigozlan
Fix Pan and Scan on batched images Gemma3 (#36864) by @yonigozlan
There were some very minor bugs with the new hub kernels and with remote code that we had to fix.
Deprecate #36741 and map Causal to Conditional (#36917) by @zucchini-nlp
Fix pytorch deform attn path (#36923) by @qubvel
[chameleon] fix num image token check (#36918) by @zucchini-nlp
Fix torch version guard at import (#36907) by @zucchini-nlp
Starting with version v4.49.0, we have been doing model-based releases, in addition to our traditional, software-based monthly releases. These model-based releases provide a tag from which models may be installed.
Contrary to our software releases, these are not pushed to PyPI and are kept on our GitHub. Each release has a tag attributed to it, such as:

- v4.49.0-Gemma-3
- v4.49.0-AyaVision

⚠️ As bugs are identified and fixed on each model, the release tags are updated so that installing from that tag always gives the best experience possible with that model.
Each new model release will always be based on the current state of the main branch at the time of its creation. This ensures that new models start with the latest features and fixes available.
For example, if two models—Gemma-3 and AyaVision—are released from main, and then a fix for gemma3 is merged, it will look something like this:
                                       o---- v4.49.0-Gemma-3 (includes AyaVision, plus main fixes)
                                      /
---o--o--o--o--o-- (fix for gemma3) --o--o--o main
      \
       o---- v4.49.0-AyaVision
We strive to merge model specific fixes on their respective branches as fast as possible!
Gemma 3 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder, linked by a multimodal linear projection.
It cuts an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed a certain aspect ratio. For images that exceed this aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.
One particularity is that the model uses bidirectional attention on all the image tokens. In addition, the model interleaves sliding-window local attention with full causal attention in the language backbone, where every sixth layer is a full causal attention layer.
ShieldGemma 2, built on Gemma 3, is a 4 billion (4B) parameter model that checks the safety of both synthetic and natural images against key categories to help you build robust datasets and models. With this addition to the Gemma family of models, researchers and developers can now easily minimize the risk of harmful content in their models across key areas of harm as defined below:
We recommend using ShieldGemma 2 as an input filter to vision language models, or as an output filter of image generation systems. To train a robust image safety model, we curated training datasets of natural and synthetic images and instruction-tuned Gemma 3 to demonstrate strong performance.
AyaVision is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.
Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model, further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, in contrast, uses Aya Expanse 32B as its language model.
Key features of Aya Vision include:
Mistral 3.1 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
It is ideal for:
SmolVLM-2 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
SmolVLM2 is an adaptation of the Idefics3 model with two main differences:
SigLIP-2 is heavily referenced in the following model-based release, and we recommend reading it if you want all the information relating to that model.
The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.
The model comes in two variants:

- FixRes: works with fixed-resolution images (backward compatible with SigLIP v1)
- NaFlex: works with variable image aspect ratios and resolutions (SigLIP2 in transformers)

PromptDepthAnything is a high-resolution, accurate metric depth estimation model that leverages prompting, inspired by its success in vision-language models (VLMs) and large language models (LLMs). Using iPhone LiDAR as a prompt, the model generates precise depth maps at up to 4K resolution, unlocking the potential of depth foundation models.
We add a new tool to transformers to visualize the attention layout of a given model. It only requires a model ID as input, and will load the relevant tokenizer/model and display what the attention mask looks like. Some examples:
from transformers.utils.attention_visualizer import AttentionMaskVisualizer
visualizer = AttentionMaskVisualizer("meta-llama/Llama-3.2-3B-Instruct")
visualizer("A normal attention mask")
visualizer = AttentionMaskVisualizer("mistralai/Mistral-Small-24B-Instruct-2501")
visualizer("A normal attention mask with a long text to see how it is displayed, and if it is displayed correctly")
visualizer = AttentionMaskVisualizer("google/paligemma2-3b-mix-224")
visualizer("<img> You are an assistant.", suffix = "What is on the image?")
visualizer = AttentionMaskVisualizer("google/gemma-2b")
visualizer("You are an assistant. Make sure you print me") # we should have slidiing on non sliding side by side
visualizer = AttentionMaskVisualizer("google/gemma-3-27b-it")
visualizer("<img>You are an assistant. Make sure you print me") # we should have slidiing on non sliding side by side
We are deprecating transformers.agents in favour of the smolagents library. Read more about smolagents here.
We support adding custom quantization methods by using the @register_quantization_config and @register_quantizer decorators:

# Note: the import paths below are indicative of where these helpers live; adjust them to your transformers version if needed.
from transformers import AutoModelForCausalLM
from transformers.quantizers import HfQuantizer
from transformers.quantizers.auto import register_quantization_config, register_quantizer
from transformers.utils.quantization_config import QuantizationConfigMixin


@register_quantization_config("custom")
class CustomConfig(QuantizationConfigMixin):
    pass


@register_quantizer("custom")
class CustomQuantizer(HfQuantizer):
    pass


quantized_model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m", quantization_config=CustomConfig(), torch_dtype="auto"
)
AMD is developing its in-house quantizer named Quark, released under the MIT license, which supports a broad range of quantization pre-processing, algorithms, dtypes and target hardware. You can now load a model quantized with the Quark library:

# pip install amd-quark
from transformers import AutoModelForCausalLM

model_id = "EmbeddedLLM/Llama-3.1-8B-Instruct-w_fp8_per_channel_sym"
model = AutoModelForCausalLM.from_pretrained(model_id)
model = model.to("cuda")
Torchao is augmented with autoquant support, CPU-quantization, as well as new AOBaseConfig object instances for more advanced configuration.
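For reference, here is a minimal sketch of what torchao quantization looks like through the transformers integration; the quant type and checkpoint are illustrative assumptions, so check the torchao docs for the options supported by your versions.

# Hedged sketch: quantize a model with torchao through transformers (requires `pip install torchao`).
from transformers import AutoModelForCausalLM, TorchAoConfig

quantization_config = TorchAoConfig("int4_weight_only", group_size=128)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B-Instruct",  # illustrative checkpoint
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)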
At loading time, the parallelization is now applied module-by-module, so that no memory overhead is required compared to what the final weight distribution will be!
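If you want to see the loading path this refers to, here is a short sketch; it assumes a multi-GPU node, a torchrun launch, and the tp_plan="auto" argument from the distributed inference docs.

# Hedged sketch: tensor-parallel loading, launched with e.g. `torchrun --nproc-per-node=4 tp_example.py`.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    tp_plan="auto",  # shards supported modules across the GPUs in the process group
)

inputs = tokenizer("The secret to baking a good cake is ", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))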
This release includes two speed upgrades to generate:
- do_sample=True;

from transformers import pipeline
import torch
prompt = "Alice and Bob"
checkpoint = "google/gemma-2-9b"
assistant_checkpoint = "double7/vicuna-68m"
pipe = pipeline(
"text-generation",
model=checkpoint,
assistant_model=assistant_checkpoint,
do_sample=True
)
pipe_output = pipe(prompt, max_new_tokens=50, do_sample=True)
print(pipe_output[0]["generated_text"])
- num_beams. The speedup is more visible on smaller models, where model.forward doesn't dominate the total run time.
- CandidateGenerator by @keyboardAnt, @jmamou, and @gauravjain14 in #35029

A significant redesign of our documentation has wrapped up. The goal was to greatly simplify the transformers documentation, making it much easier to navigate. Let us know what you think!
The research examples folder that was hosted in transformers is no more. We have moved it out of transformers and into the following repo: github.com/huggingface/transformers-research-projects/
We have updated our flex attention support to bring it on par with our Flash Attention 2 support.
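As a reminder of how this is enabled (a short sketch using the same attention-implementation flag shown elsewhere in these notes), you opt into flex attention per model at load time:

# Hedged sketch: opt into flex attention at load time (requires a PyTorch version that ships FlexAttention).
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2-2b-it",  # illustrative checkpoint
    torch_dtype=torch.bfloat16,
    attn_implementation="flex_attention",
    device_map="auto",
)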
- EsmModelIntegrationTest::test_inference_bitsandbytes by @faaany in #36225
- LlavaForConditionalGenerationModelTest::test_config after #36077 by @ydshieh in #36230
- /generation by @gante in #36235
- test_export_to_onnx by @gante in #36241
- test_fast_is_faster_than_slow by @ydshieh in #36240
- Speech2TextFeatureExtractor API. by @KarelVesely84 in #34638
- pt_tf equivalence tests by @gante in #36253
- test_from_pretrained_low_cpu_mem_usage_equal less flaky by @gante in #36255
- GenerationTesterMixin inheritance is correct 🐛 🔫 by @gante in #36180
- main by @ydshieh in #36375
- is_causal fail with compile by @Cyrilvallez in #36374
- benchmark.yml by @ydshieh in #36402
- CandidateGenerator by @keyboardAnt in #35029
- contents: write by @ydshieh in #36445
- torch.distributed-compatible DynamicCache by @gante in #36373
- src/transformers/image_utils.py by @hmellor in #36435
- hub_retry by @ydshieh in #36449
- TRUST_REMOTE_CODE for RealmRetriever for security by @ydshieh in #36511
- input_ids passed to PrefixConstrainedLogitsProcessor is zero by @HiDolen in #36489
- DataCollatorForLanguageModeling by @capemox in #36457
- [HybridCache] disable automatic compilation by @gante in #36620
- make fix-copies by @gante in #36664
- from_pretrained by @Cyrilvallez in #36033
- meta device by @gante in #36543
- gc.collect() if only 1 shard is used by @gante in #36721
- test_eager_matches_sdpa_inference by @gante in #36650
- generation_config, overwrite default values with the model's base generation_config by @gante in #36684
- TrainingArguments.torch_empty_cache_steps post_init check by @pkuderov in #36734
- test_eager_matches_sdpa_inference by @gante in #36740
- is_decoder usage in PretrainedConfig documentation by @d-kleine in #36724
- tj-actions/changed-files by @ydshieh in #36795
- "dist": "loadfile" for pytest in CircleCI jobs by @ydshieh in #36811
- Trainer.collator.tokenizer in when Trainer.processing_class is None by @innerNULL in #36552
- GenerationMixin by @gante in #36605
- DataCollatorForLanguageModeling by @capemox in #36497
- .item in get_batch_samples by @regisss in #36861
- deformable_detr kernel from the Hub by @danieldk in #36853

The following contributors have made significant changes to the library over the last release:
- CandidateGenerator (#35029)
- deformable_detr kernel from the Hub (#36853)

A new model is added to transformers: Mistral 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Mistral-3.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-Mistral-3
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The model is detailed in the following blog post.
The models are available on the Hub with the following tag: mistral3
Building upon Mistral Small 3 (2501), Mistral Small 3.1 (2503) adds state-of-the-art vision understanding and enhances long context capabilities up to 128k tokens without compromising text performance. With 24 billion parameters, this model achieves top-tier capabilities in both text and vision tasks.
It is ideal for:
This model was contributed by cyrilvallez and yonigozlan.
The original code can be found here and here.
Here is how you can use the image-text-to-text pipeline to perform inference with the Mistral3 models in just a few lines of code:
>>> from transformers import pipeline
>>> import torch
>>> messages = [
... {
... "role": "user",
... "content": [
... {
... "type": "image",
... "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg",
... },
... {"type": "text", "text": "Describe this image."},
... ],
... },
... ]
>>> pipe = pipeline("image-text-to-text", model="mistralai/Mistral-Small-3.1-24B-Instruct-2503", torch_dtype=torch.bfloat16)
>>> outputs = pipe(text=messages, max_new_tokens=50, return_full_text=False)
>>> outputs[0]["generated_text"]
'The image depicts a vibrant and lush garden scene featuring a variety of wildflowers and plants. The central focus is on a large, pinkish-purple flower, likely a Greater Celandine (Chelidonium majus), with a'
This example demonstrates how to perform inference on a single image with the Mistral3 models using chat templates.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... }
... ]
>>> inputs = processor.apply_chat_template(messages, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=20)
>>> decoded_output = processor.decode(generate_ids[0, inputs["input_ids"].shape[1] :], skip_special_tokens=True)
>>> decoded_output
"The image depicts two cats lying on a pink blanket. The larger cat, which appears to be an"...
This example shows how to generate text using the Mistral3 model without providing any image input.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = ".mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> SYSTEM_PROMPT = "You are a conversational agent that always answers straight to the point, always end your accurate response with an ASCII drawing of a cat."
>>> user_prompt = "Give me 5 non-formal ways to say 'See you later' in French."
>>> messages = [
... {"role": "system", "content": SYSTEM_PROMPT},
... {"role": "user", "content": user_prompt},
... ]
>>> text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
>>> inputs = processor(text=text, return_tensors="pt").to(0, dtype=torch.float16)
>>> generate_ids = model.generate(**inputs, max_new_tokens=50, do_sample=False)
>>> decoded_output = processor.batch_decode(generate_ids[:, inputs["input_ids"].shape[1] :], skip_special_tokens=True)[0]
>>> print(decoded_output)
"1. À plus tard!
2. Salut, à plus!
3. À toute!
4. À la prochaine!
5. Je me casse, à plus!
```
/\_/\
( o.o )
> ^ <
```"
Mistral3 models also support batched image and text inputs.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> model = AutoModelForImageTextToText.from_pretrained(model_checkpoint, device_map=torch_device, torch_dtype=torch.bfloat16)
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://www.ilankelman.org/stopsigns/australia.jpg"},
... {"type": "text", "text": "Describe this image"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageCalm waters reflect\nWhispers of the forest's breath\nPeace on wooden path"
, "Describe this imageThe image depicts a vibrant street scene in what appears to be a Chinatown district. The focal point is a traditional Chinese"]
This implementation of the Mistral3 models supports batched text and image inputs with a different number of images for each text.
This example also shows how to use BitsAndBytes to load the model in 4-bit quantization.
>>> from transformers import AutoProcessor, AutoModelForImageTextToText, BitsAndBytesConfig
>>> import torch
>>> torch_device = "cuda"
>>> model_checkpoint = "mistralai/Mistral-Small-3.1-24B-Instruct-2503"
>>> processor = AutoProcessor.from_pretrained(model_checkpoint)
>>> quantization_config = BitsAndBytesConfig(load_in_4bit=True)
>>> model = AutoModelForImageTextToText.from_pretrained(
... model_checkpoint, quantization_config=quantization_config
... )
>>> messages = [
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://llava-vl.github.io/static/images/view.jpg"},
... {"type": "text", "text": "Write a haiku for this image"},
... ],
... },
... ],
... [
... {
... "role": "user",
... "content": [
... {"type": "image", "url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"},
... {"type": "image", "url": "https://thumbs.dreamstime.com/b/golden-gate-bridge-san-francisco-purple-flowers-california-echium-candicans-36805947.jpg"},
... {"type": "text", "text": "These images depict two different landmarks. Can you identify them?"},
... ],
... },
... ],
... ]
>>> inputs = processor.apply_chat_template(messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt").to(model.device, dtype=torch.bfloat16)
>>> output = model.generate(**inputs, max_new_tokens=25)
>>> decoded_outputs = processor.batch_decode(output, skip_special_tokens=True)
>>> decoded_outputs
["Write a haiku for this imageSure, here is a haiku inspired by the image:\n\nCalm lake's wooden path\nSilent forest stands guard\n", "These images depict two different landmarks. Can you identify them? Certainly! The images depict two iconic landmarks:\n\n1. The first image shows the Statue of Liberty in New York City."]
A new model is added to transformers: Gemma 3. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-Gemma-3.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-Gemma-3
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The model is detailed in the following blog post. The models and demos using the model are available in the following collection.
A Space to play around with the 12B-it flavor is available here.
The Gemma 3 model was proposed by Google. It is a vision-language model composed of a SigLIP vision encoder and a Gemma 2 language decoder, linked by a multimodal linear projection.
It cuts an image into a fixed number of tokens, in the same way as SigLIP, as long as the image does not exceed a certain aspect ratio. For images that exceed this aspect ratio, it crops the image into multiple smaller patches and concatenates them with the base image embedding.
One particularity is that the model uses bidirectional attention on all the image tokens. In addition, the model interleaves sliding-window local attention with full causal attention in the language backbone, where every sixth layer is a full causal attention layer.
- Gemma3ForConditionalGeneration.
- Gemma3ForCausalLM for generation to avoid loading the vision tower.
- "<start_of_image>" token where the images should be inserted.
- apply_chat_template method to convert chat messages to text that can then be passed as text to the processor. You can also get a vectorized output from apply_chat_template. See the examples below for more details on how to use it.

The model supports cropping images into smaller patches when the image aspect ratio exceeds a certain value. By default the images are not cropped and only the base image is forwarded to the model. Users can set do_pan_and_scan=True to obtain several crops per image along with the base image to improve the quality in DocVQA or similar tasks requiring higher-resolution images.
Pan and scan is an inference time optimization to handle images with skewed aspect ratios. When enabled, it improves performance on tasks related to document understanding, infographics, OCR, etc.
from transformers import AutoProcessor, Gemma3ForConditionalGeneration

model_id = "google/gemma-3-4b-it"
# load the model as well so that `.to(model.device)` below has a target
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")

url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    do_pan_and_scan=True,
).to(model.device)
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
url = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url},
            {"type": "text", "text": "What is shown in this image?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# decode only the newly generated tokens (skip the prompt)
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
from transformers import AutoProcessor, Gemma3ForConditionalGeneration
model_id = "google/gemma-3-4b-it"
model = Gemma3ForConditionalGeneration.from_pretrained(model_id, device_map="auto")
processor = AutoProcessor.from_pretrained(model_id, padding_side="left")
url_cow = "https://media.istockphoto.com/id/1192867753/photo/cow-in-berchida-beach-siniscola.jpg?s=612x612&w=0&k=20&c=v0hjjniwsMNfJSuKWZuIn8pssmD5h5bSN1peBd1CmH4="
url_stop = "https://www.ilankelman.org/stopsigns/australia.jpg"
messages = [
    {
        "role": "system",
        "content": [
            {"type": "text", "text": "You are a helpful assistant."}
        ]
    },
    {
        "role": "user", "content": [
            {"type": "image", "url": url_cow},
            {"type": "image", "url": url_stop},
            {"type": "text", "text": "Are these two images identical?"},
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
).to(model.device)

output = model.generate(**inputs, max_new_tokens=50)
# decode only the newly generated tokens (skip the prompt)
print(processor.decode(output[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
from transformers import AutoTokenizer, Gemma3ForCausalLM
model_id = "google/gemma-3-1b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = Gemma3ForCausalLM.from_pretrained(model_id, device_map="auto")
input_ids = tokenizer("Write me a poem about Machine Learning.", return_tensors="pt").to(model.device)
outputs = model.generate(**input_ids, max_new_tokens=100)
text = tokenizer.batch_decode(outputs, skip_special_tokens=True)
print(text)
A new model is added to transformers: Aya Vision. It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-AyaVision.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-AyaVision
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The model is detailed in the following blog post.
The Aya Vision 8B and 32B models are state-of-the-art multilingual multimodal models developed by Cohere For AI. They build on the Aya Expanse recipe to handle both visual and textual information without compromising on the strong multilingual textual performance of the original model.
Aya Vision 8B combines the Siglip2-so400-384-14 vision encoder with the Cohere CommandR-7B language model, further post-trained with the Aya Expanse recipe, creating a powerful vision-language model capable of understanding images and generating text across 23 languages. Aya Vision 32B, in contrast, uses Aya Expanse 32B as its language model.
Key features of Aya Vision include:
Here's an example usage of the Aya Vision model.
from transformers import AutoProcessor, AutoModelForImageTextToText
import torch
model_id = "CohereForAI/aya-vision-32b"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
model_id, device_map="auto", torch_dtype=torch.float16
)
# Format message with the aya-vision chat template
messages = [
{"role": "user",
"content": [
{"type": "image", "url": "https://pbs.twimg.com/media/Fx7YvfQWYAIp6rZ?format=jpg&name=medium"},
{"type": "text", "text": "चित्र में लिखा पाठ क्या कहता है?"},
]},
]
inputs = processor.apply_chat_template(
messages, padding=True, add_generation_prompt=True, tokenize=True, return_dict=True, return_tensors="pt"
).to(model.device)
gen_tokens = model.generate(
**inputs,
max_new_tokens=300,
do_sample=True,
temperature=0.3,
)
print(processor.tokenizer.decode(gen_tokens[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
A new model is added to transformers: SigLIP-2.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SigLIP-2.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-SigLIP-2
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
The paper page for the model is available here. It is detailed in the following blog post.
The models and demos using the model are available in the following collection.
The SigLIP2 model was proposed in SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features by Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner and Xiaohua Zhai.
The model comes in two variants:

- FixRes: works with fixed-resolution images (backward compatible with SigLIP v1)
- NaFlex: works with variable image aspect ratios and resolutions (SigLIP2 in transformers)

The abstract from the paper is the following:
We introduce SigLIP 2, a family of new multilingual vision-language encoders that build on the success of the original SigLIP. In this second iteration, we extend the original image-text training objective with several prior, independently developed techniques into a unified recipe—this includes decoder-based pretraining, self-supervised losses (self-distillation, masked prediction) and online data curation. With these changes, SigLIP 2 models outperform their SigLIP counterparts at all model scales in core capabilities, including zero-shot classification (best SigLIP 2 ViT-g/16 achieves 85.0% ImageNet zero-shot accuracy), image-text retrieval, and transfer performance when extracting visual representations for Vision-Language Models (VLMs). Furthermore, the new training recipe leads to significant improvements on localization and dense prediction tasks. We also train variants which support multiple resolutions and preserve the input’s native aspect ratio. Finally, we train on a more diverse data-mixture that includes de-biasing techniques, leading to much better multilingual understanding and improved fairness. To provide users with the ability to trade-off inference cost with performance, we release model checkpoints at four sizes (ViT-B/86M, L/303M, So400m/400M, and g/1B).
- torch.distributed utilities, which may limit the scalability of batch size. However, DDP and FSDP work on single-node multi-GPU setups.
- [GemmaTokenizerFast] make sure to pass padding="max_length" and max_length=64, as that's how the model was trained.
- max_num_patches parameter in the Processor. The default value is max_num_patches=256. Increasing max_num_patches to 1024 (4x) will approximately double the processed image height and width, while preserving the aspect ratio.

<img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/model_doc/siglip2_metrics_table.png" alt="drawing" width="600"/>
This model was contributed by qubvel. The original code can be found here.
There are two main ways to use SigLIP2: either using the pipeline API, which abstracts away all the complexity for you, or by using the Siglip2Model class yourself.

Pipeline API

The pipeline allows you to use the model in a few lines of code:
>>> from transformers import pipeline
>>> from PIL import Image
>>> import requests
>>> # load pipe
>>> image_classifier = pipeline(
... task="zero-shot-image-classification",
... model="google/siglip2-base-patch16-224",
... )
>>> # load image
>>> url = 'http://images.cocodataset.org/val2017/000000039769.jpg'
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> # inference
>>> candidate_labels = ["2 cats", "a plane", "a remote"]
>>> outputs = image_classifier(image, candidate_labels=candidate_labels)
>>> outputs = [{"score": round(output["score"], 4), "label": output["label"] } for output in outputs]
>>> print(outputs)
[{'score': 0.1499, 'label': '2 cats'}, {'score': 0.0008, 'label': 'a remote'}, {'score': 0.0, 'label': 'a plane'}]
Using the model yourself
If you want to do the pre- and postprocessing yourself, here's how to do that:
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch
>>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-224")
>>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-224")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
>>> texts = [f"This is a photo of {label}." for label in candidate_labels]
# IMPORTANT: we pass `padding=max_length` and `max_length=64` since the model was trained with this
>>> inputs = processor(text=texts, images=image, padding="max_length", max_length=64, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
15.0% that image 0 is '2 cats'
NaFlex combines ideas from FlexiViT, i.e. supporting multiple, predefined sequence lengths with a single ViT model, and NaViT, namely processing images at their native aspect ratio. This enables processing different types of images at appropriate resolution, e.g. using a larger resolution to process document images, while at the same time minimizing the impact of aspect ratio distortion on certain inference tasks, e.g. on OCR.
Given a patch size and target sequence length, NaFlex preprocesses the data by first resizing the input image such that the height and width after resizing are multiples of the patch size, while
1. keeping the aspect ratio distortion as small as possible
2. producing a sequence length of at most the desired target sequence length (`max_num_patches`)
The resulting distortion in width and height is at most (patch_size - 1) / width and
(patch_size - 1) / height, respectively, which tends to be small for common resolutions and aspect ratios.
After resizing, the image is split into a sequence of patches, and a mask with padding information is added.
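To make the resizing rule concrete, here is a small illustrative computation (not the library's actual implementation) of target dimensions that are multiples of the patch size and keep the patch count within max_num_patches, under the simplifying assumption of uniform scaling:

# Illustrative sketch of the resize rule described above; the real processor may differ in details.
import math

def naflex_target_size(height, width, patch_size=16, max_num_patches=256):
    """Return (new_height, new_width): multiples of patch_size whose patch grid fits max_num_patches."""
    # Uniform scale that would make the patch grid use roughly max_num_patches patches.
    scale = math.sqrt(max_num_patches * patch_size**2 / (height * width))
    # Snap each side to a multiple of the patch size (rounding keeps aspect-ratio distortion small).
    new_height = max(patch_size, round(height * scale / patch_size) * patch_size)
    new_width = max(patch_size, round(width * scale / patch_size) * patch_size)
    # Rounding can overshoot the budget; shrink the longer side until the grid fits.
    while (new_height // patch_size) * (new_width // patch_size) > max_num_patches:
        if new_height >= new_width:
            new_height -= patch_size
        else:
            new_width -= patch_size
    return new_height, new_width

# Example: a 768x1024 image with 16x16 patches and a budget of 256 patches.
print(naflex_target_size(768, 1024))  # (224, 288) -> a 14 x 18 grid of 252 patches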
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, AutoModel
>>> import torch
>>> model = AutoModel.from_pretrained("google/siglip2-base-patch16-naflex")
>>> processor = AutoProcessor.from_pretrained("google/siglip2-base-patch16-naflex")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> candidate_labels = ["2 cats", "2 dogs"]
# follows the pipeline prompt template to get same results
>>> texts = [f"This is a photo of {label}." for label in candidate_labels]
# default value for `max_num_patches` is 256, but you can increase resulted image resolution providing
# higher values e.g. `max_num_patches=512`
>>> inputs = processor(text=texts, images=image, max_num_patches=256, return_tensors="pt")
>>> with torch.no_grad():
... outputs = model(**inputs)
>>> logits_per_image = outputs.logits_per_image
>>> probs = torch.sigmoid(logits_per_image) # these are the probabilities
>>> print(f"{probs[0][0]:.1%} that image 0 is '{candidate_labels[0]}'")
21.1% that image 0 is '2 cats'
A new model is added to transformers: SmolVLM-2.
It is added on top of the v4.49.0 release, and can be installed from the following tag: v4.49.0-SmolVLM-2.
In order to install this version, please install with the following command:
pip install git+https://github.com/huggingface/transformers@v4.49.0-SmolVLM-2
If fixes are needed, they will be applied to this release; this installation may therefore be considered as stable and improving.
SmolVLM-2 is detailed in the following blog post.
The models and demos using the model are available in the following collection.
SmolVLM2 is an adaptation of the Idefics3 model with two main differences:
Input images are processed either by upsampling (if resizing is enabled) or at their original resolution. The resizing behavior depends on two parameters: do_resize and size.
Videos should not be upsampled.
If do_resize is set to True, the model resizes images so that the longest edge is 4*512 pixels by default.
The default resizing behavior can be customized by passing a dictionary to the size parameter. For example, {"longest_edge": 4 * 512} is the default, but you can change it to a different value if needed.
Here’s how to control resizing and set a custom size:
from transformers import SmolVLMImageProcessor

image_processor = SmolVLMImageProcessor(do_resize=True, size={"longest_edge": 2 * 512}, max_image_size=512)
Additionally, the max_image_size parameter, which controls the size of each square patch the image is decomposed into, is set to 512 by default but can be adjusted as needed. After resizing (if applicable), the image processor decomposes the images into square patches based on the max_image_size parameter.
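As a purely illustrative bit of arithmetic (not the actual processor code), this is how the square-patch decomposition plays out for a concrete image under the defaults described above:

# Illustrative arithmetic for the patch decomposition described above.
import math

max_image_size = 512        # side length of each square patch (default)
height, width = 1536, 2048  # example image, already within the default longest-edge limit of 4 * 512

rows = math.ceil(height / max_image_size)  # 3 rows of patches
cols = math.ceil(width / max_image_size)   # 4 columns of patches
print(f"{rows * cols} patches, plus the downscaled base image")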
This model was contributed by orrzohar.
The model can accept both images and videos as input, but you should use only one of the modalities at a time. Here's an example code for that.
import torch
from transformers import AutoProcessor, AutoModelForImageTextToText
processor = AutoProcessor.from_pretrained("HuggingFaceTB/SmolVLM2-256M-Video-Instruct")
model = AutoModelForImageTextToText.from_pretrained(
"HuggingFaceTB/SmolVLM2-256M-Video-Instruct",
torch_dtype=torch.bfloat16,
device_map="cuda"
)
conversation = [
{
"role": "user",
"content":[
{"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
{"type": "text", "text": "Describe this image."}
]
}
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
output_ids = model.generate(**inputs, max_new_tokens=128)
generated_texts = processor.batch_decode(output_ids, skip_special_tokens=True)
print(generated_texts)
# Video
conversation = [
{
"role": "user",
"content": [
{"type": "video", "path": "/path/to/video.mp4"},
{"type": "text", "text": "Describe this video in detail"}
]
},
]
inputs = processor.apply_chat_template(
conversation,
add_generation_prompt=True,
tokenize=True,
return_dict=True,
return_tensors="pt",
).to(model.device, dtype=torch.bfloat16)
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=100)
generated_texts = processor.batch_decode(generated_ids, skip_special_tokens=True)
print(generated_texts[0])
Helium-1 preview is a lightweight language model with 2B parameters, targeting edge and mobile devices. It supports the following languages: English, French, German, Italian, Portuguese, Spanish.
<img width="860" alt="image" src="https://github.com/user-attachments/assets/52e91b74-5572-46a6-93e5-058730411675" />The Qwen2.5-VL model is an update to Qwen2-VL from Qwen team, Alibaba Group.
The abstract from this update is the following:
Qwen2.5-VL marks a major step forward from Qwen2-VL, built upon the latest Qwen2.5 LLM. We’ve accelerated training and testing through the strategic implementation of window attention within the ViT. The ViT architecture itself has been refined with SwiGLU and RMSNorm, aligning it more closely with the LLM’s structure. A key innovation is the expansion of native dynamic resolution to encompass the temporal dimension, in addition to spatial aspects. Furthermore, we’ve upgraded MRoPE, incorporating absolute time alignment on the time axis to allow the model to effectively capture temporal dynamics, regardless of frame rate, leading to superior video understanding.
The SuperGlue model was proposed in SuperGlue: Learning Feature Matching with Graph Neural Networks by Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz and Andrew Rabinovich.
This model consists of matching two sets of interest points detected in an image. Paired with the SuperPoint model, it can be used to match two images and estimate the pose between them. This model is useful for tasks such as image matching, homography estimation, etc.
<img width="424" alt="image" src="https://github.com/user-attachments/assets/1d81983f-f9ce-4d82-adb7-e76098df543a" />The Granite Vision model is a variant of LLaVA-NeXT, leveraging a Granite language model alongside a SigLIP visual encoder. It utilizes multiple concatenated vision hidden states as its image features, similar to VipLlava. It also uses a larger set of image grid pinpoints than the original LlaVa-NeXT models to support additional aspect ratios.
Zamba2 is a large language model (LLM) trained by Zyphra, and made available under an Apache 2.0 license.
Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B are hybrid models combining state-space models (specifically Mamba) and transformer blocks, and were trained using next-token prediction. Zamba2 uses shared transformer layers after every 6 Mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba2-1.2B, Zamba2-2.7B and Zamba2-7B were pre-trained on 2T and 3T tokens, respectively.
GOT-OCR2 works on a wide range of tasks, including plain document OCR, scene text OCR, formatted document OCR, and even OCR for tables, charts, mathematical formulas, geometric shapes, molecular formulas and sheet music. While this implementation of the model will only output plain text, the outputs can be further processed to render the desired format, with packages like pdftex, mathpix, matplotlib, tikz, verovio or pyecharts. The model can also be used for interactive OCR, where the user can specify the region to be recognized by providing the coordinates or the color of the region’s bounding box.
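For a quick look at plain-text OCR, here is a hedged sketch; the checkpoint id follows the Hub naming at the time of writing and the generation settings are illustrative, so double-check both against the model card.

# Hedged sketch: plain document OCR with GOT-OCR2 through the image-text-to-text classes.
from PIL import Image
from transformers import AutoProcessor, AutoModelForImageTextToText

model_id = "stepfun-ai/GOT-OCR-2.0-hf"  # assumed checkpoint id; verify on the Hub
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")

image = Image.open("path/to/document.png")  # any image containing text
inputs = processor(image, return_tensors="pt").to(model.device)
generated_ids = model.generate(**inputs, do_sample=False, max_new_tokens=256)
print(processor.decode(generated_ids[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True))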
DAB-DETR is an enhanced variant of Conditional DETR. It utilizes dynamically updated anchor boxes to provide both a reference query point (x, y) and a reference anchor size (w, h), improving cross-attention computation. This new approach achieves 45.7% AP when trained for 50 epochs with a single ResNet-50 model as the backbone.
DepthPro is a foundation model for zero-shot metric monocular depth estimation, designed to generate high-resolution depth maps with remarkable sharpness and fine-grained details. It employs a multi-scale Vision Transformer (ViT)-based architecture, where images are downsampled, divided into patches, and processed using a shared Dinov2 encoder. The extracted patch-level features are merged, upsampled, and refined using a DPT-like fusion stage, enabling precise depth estimation.
An improved Real-Time DEtection TRansformer (RT-DETR). RT-DETRv2 refines RT-DETR by introducing selective multi-scale feature extraction and a discrete sampling operator for broader deployment compatibility. These improvements yield a 0.3 to 1.4 increase in mAP on the COCO dataset, all while maintaining the same parameter count and frames-per-second (FPS) performance.
Transformers' CLI welcomes a new command: chat. This command starts a conversation with the model of your choosing directly in your terminal.
This feature exists in TRL and has been migrated to transformers for easier usage.
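For example, a session can be started against a Hub model like this (the model id is just an illustration, and the exact entry point may differ between versions):

```bash
transformers-cli chat --model_name_or_path Qwen/Qwen2.5-0.5B-Instruct
```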
Ongoing work aims to standardize the image processors so that their APIs are equivalent. Additionally, the processors are being given fast variants so that they are never a bottleneck in image processing pipelines.
In this release, several processors have been standardized and have received their fast versions.
DPT image processors did not support segmentation_maps, accepting only images. This has been fixed.
This adds an argument to the preprocess method, so users passing arguments positionally to that method may see changed behavior. We recommend using keyword arguments for such methods so that new features can be added without affecting your code.
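A minimal sketch of the keyword-argument style recommended above, assuming an illustrative DPT checkpoint and dummy data:

```python
import numpy as np
from PIL import Image
from transformers import DPTImageProcessor

processor = DPTImageProcessor.from_pretrained("Intel/dpt-large")  # illustrative checkpoint
image = Image.new("RGB", (384, 384))                      # dummy image
segmentation_map = np.zeros((384, 384), dtype=np.uint8)   # dummy segmentation map

# Pass both arguments by keyword so future additions to preprocess() cannot shift them.
inputs = processor(images=image, segmentation_maps=segmentation_map, return_tensors="pt")
print(inputs.keys())
```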
- segmentation maps support for DPT image processor by @simonreise in #34345

The problem_type in the config.json file was read incorrectly by the pipeline, which mapped single-label to multi-label losses, and vice-versa. This has been fixed.
The description of the pull request is the easiest way to understand the problem, why it exists, and how it is solved; please read the description below:
The ignore_index property of the llava configuration has been removed as it was not serving a purpose.
Quantization has received several improvements and fixes, including the contribution of FP8 quantization and the HIGGS quantization interface.
Additionally, we're replacing the AutoGPTQ implementation with GPTQModel from ModelCloud (see the repository here).
GPTQModel originated as a major refactor of AutoGPTQ, but it is now a full stand-in replacement with a cleaner API, up-to-date model support, faster inference, and higher-quality quants.
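The user-facing API is unchanged: quantization still goes through GPTQConfig, with GPTQModel picked up as the backend when it is installed. A minimal sketch (the small checkpoint is chosen only for illustration):

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # small model, chosen only for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# 4-bit GPTQ quantization calibrated on the c4 dataset.
quantization_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=quantization_config,
)
```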
- max_length by @gante in #36120
- generate-related objects and methods scheduled for removal in v4.48 by @gante in #35677
- GenerationConfig(cache_implementation="static") by @gante in #35679
- SequenceBiasLogitsProcessor by @gante in #35699
- torch.compile(model.forward) as a fast test by @gante in #34544

Pipelines have received several bug fixes and improvements which are detailed below.
- test_custom_4d_attention_mask by @ydshieh in #35606
- EarlyStoppingCallback not require load_best_model_at_end by @muellerzr in #35101
- test_beam_search_low_memory by @ydshieh in #35611
- MobileNetV1ModelTest::test_batching_equivalence for now by @ydshieh in #35614
- Phi] bias should be True by @ArthurZucker in #35650
- Compile] Only test compiling model forward pass by @ArthurZucker in #35658
- zero_shot_image_classification documentation guide link in SigLIP by @aretrace in #35671
- Trainer cannot correctly call torch_jit_model_eval by @Wanguy in #35722
- pt_to_tf by @gante in #35672
- check_circleci_user job by @Sai-Suraj-27 in #32866
- MimiModel with DeepSpeed ZeRO-3 by @anferico in #34735
- PeftModel by @ambroser53 in #35680
- MimiModel with DeepSpeed ZeRO-3" by @eustlb in #35755
- self-comment-ci.yml by @ydshieh in #35548
- timm import behaviour by @rwightman in #35800
- test_batching_equivalence's flakiness by @ydshieh in #35729
- TimmWrapper by @ariG23498 in #35744
- timm tag to timm-wrapper models. by @pcuenca in #35794
- get_cached_models by @Wauplin in #35809
- docs/source/ar/tasks/masked_language_modeling.md into Arabic by @AhmedAlmaghz in #35198
- benchmark code by @gante in #35730
- self-comment-ci.yml by @ydshieh in #35816
- working-directory in self-comment-ci.yml by @ydshieh in #35833
- head_dim in config extracted from Gemma2 GGUF model by @Isotr0py in #35818
- tests] remove some flash attention class tests by @ArthurZucker in #35817
- num_logits_to_keep as Tensor + add flag by @Cyrilvallez in #35757
- test_pipelines_video_classification that was always failing by @CalOmnie in #35842
- Rocketknight1 to self-comment-ci.yml by @ydshieh in #35881
- _supports_static_cache = True for some model classes by @ydshieh in #34975
- test_generated_length_assisted_generation by @keyboardAnt in #34935
- unwrap_and_save_reload_schedule to use weights_only=False by @ydshieh in #35952
- squad_convert_example_to_features to work with numpy v2 by @ydshieh in #35955
- test_assisted_decoding_matches_greedy_search by @ydshieh in #35951
- transformers-pytorch-deepspeed-latest-gpu by @ydshieh in #35940
- Tester object has no attribute '_testMethodName' by @faaany in #35781
- TimmBackboneModelTest::test_batching_equivalence by @ydshieh in #35971
- benchmark.yml by @ydshieh in #35974
- generation / quantization) by @ydshieh in #35341
- self-comment-ci.yml by @ydshieh in #36030
- Qwen2VLImageProcessorFast into Qwen2VLProcessor by @yeliudev in #35987
- past_key_values by @yaswanth19 in #35890
- test_flash_attn_2_can_dispatch_composite_models by @ydshieh in #36050
- trainer.md by @faaany in #36066
- perf_infer_gpu_one.md by @faaany in #36087
- torch.export and fix some vision models by @qubvel in #35124
- output_dir Optional in TrainingArguments #27866 by @sambhavnoobcoder in #35735
- PretrainedConfig and PreTrainedModel by @hmellor in #36091
- test_initialization for VitPoseBackboneModelTest for now by @ydshieh in #36154
- get_default_model_revision by @MarcoGorelli in #35982
- DataCollatorForMultipleChoice from the docs to the package by @bauwenst in #34763
- check_repository_consistency run faster by MP by @ydshieh in #36175
- test-save-trainer by @zucchini-nlp in #36191

The following contributors have made significant changes to the library over the last release:
- docs/source/ar/tasks/masked_language_modeling.md into Arabic (#35198)
- head_dim in config extracted from Gemma2 GGUF model (#35818)
- DataCollatorForMultipleChoice from the docs to the package (#34763)

This mostly ends the Python 3.9 issues!
For some very niche cases, the new RoPE embeddings introduced device failures.
- num_items_in_batch

Finally, the fix to Gemma2 is propagated to PaliGemma2!
Sorry, the fixes for num_items_in_batch are not done yet 😓 To follow along, see this PR; a new patch will be available soon!
Now, we mostly had BC issues with Python 3.9:
Then we had a small regression for DBRX saving:
Finally, we have a fix for Gemma and the hybrid attention architectures:
Miscellaneous:
Yet again we ship a gradient accumulation fix! The attention refactoring also let a small typo slip in; we made sure Phi is no longer broken!
Moonshine had a small issue when wrapping generate, so we removed that!
🤗
The ModernBert model was proposed in Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference by Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard and Iacopo Poli.
It is a refresh of the traditional encoder architecture, as used in previous models such as BERT and RoBERTa.
It builds on BERT and implements many modern architectural improvements which have been developed since its original release, such as:
The Aria model was proposed in Aria: An Open Multimodal Native Mixture-of-Experts Model by Li et al. from the Rhymes.AI team.
Aria is an open multimodal-native model with best-in-class performance across a wide range of multimodal, language, and coding tasks. It has a Mixture-of-Experts architecture, with respectively 3.9B and 3.5B activated parameters per visual token and text token.
We add a TimmWrapper set of classes such that timm models can be loaded as transformers models into the library.
Here's a general usage example:
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import AutoConfig, AutoModelForImageClassification, AutoImageProcessor
checkpoint = "timm/resnet50.a1_in1k"
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
image_processor = AutoImageProcessor.from_pretrained(checkpoint)
inputs = image_processor(img, return_tensors="pt")
model = AutoModelForImageClassification.from_pretrained(checkpoint)
with torch.no_grad():
    logits = model(**inputs).logits
top5_probabilities, top5_class_indices = torch.topk(logits.softmax(dim=1) * 100, k=5)
Thanks to this, timm models now have access to pipelines, as well as Trainer, accelerate device maps, quantization, etc:
import torch
from urllib.request import urlopen
from PIL import Image
from transformers import pipeline
img = Image.open(urlopen(
'https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png'
))
pipe = pipeline("image-classification", model="timm/resnet18.a1_in1k")
print(pipe(img))
Pixtral modeling and checkpoint conversion code has been updated to support the new Pixtral-Large model.
The ColPali model was proposed in ColPali: Efficient Document Retrieval with Vision Language Models by Manuel Faysse*, Hugues Sibille*, Tony Wu*, Bilel Omrani, Gautier Viaud, Céline Hudelot, Pierre Colombo (* denotes equal contribution). Work led by ILLUIN Technology.
In the proposed ColPali approach, the authors leverage VLMs to construct efficient multi-vector embeddings directly from document images (“screenshots”) for document retrieval. They train the model to maximize the similarity between these document embeddings and the corresponding query embeddings, using the late interaction method introduced in ColBERT.
Falcon3 represents a natural evolution from previous releases, emphasizing expanding the models’ science, math, and code capabilities. This iteration includes five base models: Falcon3-1B-Base, Falcon3-3B-Base, Falcon3-Mamba-7B-Base, Falcon3-7B-Base, and Falcon3-10B-Base. In developing these models, the authors incorporated several key innovations aimed at improving the models’ performances while reducing training costs:
- One pre-training: They conducted a single large-scale pretraining run on the 7B model, using 2048 H100 GPU chips, leveraging 14 trillion tokens featuring web, code, STEM, and curated high-quality and multilingual data.
- Depth up-scaling for improved reasoning: Building on recent studies on the effects of model depth, they upscaled the 7B model to a 10B parameter model by duplicating the redundant layers and continuing pre-training with 2TT of high-quality data. This yielded Falcon3-10B-Base, which achieves state-of-the-art zero-shot and few-shot performance for models under 13B parameters.
- Knowledge distillation for better tiny models: To provide compact and efficient alternatives, they developed Falcon3-1B-Base and Falcon3-3B-Base by leveraging pruning and knowledge distillation techniques, using less than 100GT of curated high-quality data, thereby redefining pre-training efficiency.
Bamba-9B is a decoder-only language model based on the Mamba-2 architecture and is designed to handle a wide range of text generation tasks. It is trained from scratch using a two-stage training approach. In the first stage, the model is trained on 2 trillion tokens from the Dolma v1.7 dataset. In the second stage, it undergoes additional training on 200 billion tokens, leveraging a carefully curated blend of high-quality data to further refine its performance and enhance output quality.
Check out all Bamba-9B model checkpoints here.
ViTPose is a state-of-the-art vision transformer-based model for human pose estimation, introduced by Yufei Xu, Jing Zhang, Qiming Zhang, and Dacheng Tao in "ViTPose: Simple Vision Transformer Baselines for Human Pose Estimation”.
The model leverages the capabilities of vision transformers to accurately predict 2D human keypoints. Adopting a top-down approach, ViTPose estimates keypoint locations for each detected person, allowing it to be easily used with any object detection model.
The DINOv2 with Registers model was proposed in Vision Transformers Need Registers by Timothée Darcet, Maxime Oquab, Julien Mairal, Piotr Bojanowski.
The Vision Transformer (ViT) is a transformer encoder model (BERT-like) originally introduced to do supervised image classification on ImageNet.
Next, people figured out ways to make ViT work really well on self-supervised image feature extraction (i.e. learning meaningful features, also called embeddings) on images without requiring any labels. Some example papers here include DINOv2 and MAE.
The authors of DINOv2 noticed that ViTs have artifacts in attention maps. It’s due to the model using some image patches as “registers”. The authors propose a fix: just add some new tokens (called “register” tokens), which you only use during pre-training (and throw away afterwards). This results in:
The Emu3 model was proposed in Emu3: Next-Token Prediction is All You Need by Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, Yingli Zhao, Yulong Ao, Xuebin Min, Tao Li, Boya Wu, Bo Zhao, Bowen Zhang, Liangdong Wang, Guang Liu, Zheqi He, Xi Yang, Jingjing Liu, Yonghua Lin, Tiejun Huang, Zhongyuan Wang.
Emu3 sets a new standard in multimodal AI by using next-token prediction to handle images, text, and videos. It simplifies multimodal modeling by tokenizing all data into a unified format and training a single transformer. Visual data is tokenized using vector quantization methods based on VQ-VAE model. Discretized visual tokens are later fused with text token ids for image and text generation.
Emu3 outperforms leading models like SDXL and LLaVA-1.6 in both generation and perception tasks, without relying on diffusion or compositional methods.
A new Cohere update was added through a new "Cohere2" set of classes.
TextNet is a lightweight and efficient architecture designed specifically for text detection, offering superior performance compared to traditional models like MobileNetV3. With variants TextNet-T, TextNet-S, and TextNet-B (6.8M, 8.0M, and 8.9M parameters respectively), it achieves an excellent balance between accuracy and inference speed.
Differential Transformer combines the Llama architecture with Differential Transformer's Attention.
The conversion script needed a few updates, while the modeling code was barely changed!
Moonshine is an autoregressive speech recognition encoder-decoder model that improves upon Whisper's architecture. Namely, it replaces absolute position embeddings with Rotary Position Embeddings (RoPE). This allows Moonshine to handle audio inputs of any length, unlike Whisper, which is restricted to fixed 30-second windows. It was introduced by Nat Jeffries, Evan King, Manjunath Kudlur, Guy Nicholson, James Wang, and Pete Warden in Moonshine: Speech Recognition for Live Transcription and Voice Commands.
From the VPTQ contributors:
VPTQ is a novel Post-Training Quantization method that leverages Vector Quantization to achieve high accuracy on LLMs at an extremely low bit-width (<2-bit). VPTQ can compress 70B, and even 405B, models to 1-2 bits without retraining while maintaining high accuracy. More details here: https://github.com/microsoft/vptq
From the contributors:
HIGGS is a new 0-shot quantization algorithm that combines Hadamard preprocessing with MSE-Optimal quantization grids to achieve lower quantization error and SOTA performance. You can find more information in the paper.
Runtime support for HIGGS is implemented through the FLUTE library.
This PR adds support for HIGGS+FLUTE into transformers allowing for low-error 0-shot quantization and fast LLM inference.
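As a hedged sketch only (we assume the HiggsConfig quantization config exposed in this release and the FLUTE kernels being installed; the checkpoint name is illustrative and the exact options should be checked against the quantization docs):

```python
import torch
from transformers import AutoModelForCausalLM, HiggsConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # illustrative checkpoint
quantization_config = HiggsConfig(bits=4)       # 0-shot: no calibration data needed

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    quantization_config=quantization_config,
)
```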
We merged a cleanup for vision-language models to make sure all models are standardized.
Many models in Transformers include scripts to convert the original model checkpoints into a Transformers-compatible format. These scripts can be found in the repo using the glob pattern models/**/convert_*.py. They were a recurring source of vulnerability reports and CVEs because many models were originally released using insecure formats like older PyTorch .bin weights or pickle files. The conversion scripts had to open these formats, and this meant that they were vulnerable to maliciously crafted inputs.
In practice, we do not see this as a serious vulnerability. The conversion scripts are never imported or called by the rest of the library; each script is standalone, and so the only way to exploit the vulnerability is to create a malicious checkpoint, induce a user to download it, and then also induce them to manually call a specific conversion script on it.
However, even if there is little practical risk of an exploit, we are aware that open vulnerability reports create a compliance problem for users, and so beginning with this release we will be excluding these conversion scripts from release branches and wheels. They will remain accessible to developers on the main branch.
A regular expression used within the Nougat code has been modified to ensure it does not hang. The method should output the same results but we cannot guarantee it; we recommend upgrading to the latest transformers if you use this model to ensure your code is performance-optimized.
This PR finalizes work that aims to enable short-form (< 30 secs) and long-form generation using temperature fallback. It is a significant improvement to the Whisper codebase, but it does result in the following breaking changes:
➡️ Previously:
• Short-form: Returned a ModelOutput or torch.LongTensor, including decoder input IDs and the EOS token ID.
• Long-form: Returned a Dict or torch.LongTensor, excluding decoder input IDs and the EOS token ID.
➡️ From now on:
Short-form and long-form generation are now treated identically, meaning output differentiation based on these modes is no longer applicable.
Decoder input IDs and EOS token IDs are never returned, except in two specific cases: when return_dict_in_generate=True and (return_timestamps=False or force_unique_generate_call=True).
In this case, the output will be a ModelOutput, which is the result of the underlying call to GenerationMixin’s generate. Indeed, return_timestamps=False ensures no seeking occurs; only a single call to generate is made. Therefore, this output includes both decoder input IDs and the EOS token ID.
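A minimal sketch of the new behavior for a short-form input (the checkpoint and dummy dataset are illustrative choices, not part of the release notes):

```python
from datasets import load_dataset
from transformers import WhisperForConditionalGeneration, WhisperProcessor

processor = WhisperProcessor.from_pretrained("openai/whisper-tiny")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-tiny")

ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
sample = ds[0]["audio"]
inputs = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt")

# return_dict_in_generate=True with return_timestamps=False: a single underlying call
# to generate is made, so the ModelOutput keeps decoder input IDs and the EOS token.
out = model.generate(**inputs, return_dict_in_generate=True, return_timestamps=False)
print(processor.batch_decode(out.sequences, skip_special_tokens=True)[0])
```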
In order to have cleaner, isolated, future-proof code for the attention layers, they have been refactored so that model-specific attention code stays within each model's files, while attention definitions relating to SDPA, Flash Attention, and other types of attention have been moved to a common file.
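The user-facing way of selecting an attention backend remains the attn_implementation argument at load time; a minimal illustration (the model name is just an example):

```python
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.2-1B",
    attn_implementation="sdpa",  # or "eager", "flash_attention_2"
    torch_dtype=torch.bfloat16,
)
```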
- num_items_in_batch not being an integer by @xspirus in #35115
- docs/source/ar/community.md into Arabic by @AhmedAlmaghz in #33027
- AssistedCandidateGenerator for Improved Modularity and Reusability by @keyboardAnt and @jmamou in #35009
- Thread for SF conversion by @ydshieh in #35236
- rsfE with pytest by @ydshieh in #35119
- benchmark job in push-important-models.yml by @ydshieh in #35292
- benchmarks_entrypoint.py by @McPatate in #34495
- text by @probicheaux in #35201
- docs] Add link to ModernBERT Text Classification GLUE finetuning script by @tomaarsen in #35347
- Mamba2] Fix caching, slow path, and multi-gpu by @vasqu in #35154
- _make_causal_mask by @jiwoong-choi in #35291
- weights_only=True with torch.load for transfo_xl by @ydshieh in #35241
- test_generate_with_static_cache even less flaky by @ydshieh in #34995
- is_causal is passed explicitly by @Cyrilvallez in #35390
- PaliGemmaProcessor by @alvarobartt in #35278
- .github/workflows/self-comment-ci.yml for now by @ydshieh in #35366
- GPTQ, CompressedTensors] Fix unsafe imports and metada check by @vasqu in #34815
- ACCELERATE_MIN_VERSION on error by @KSafran in #35189
- model_accepts_loss_kwargs for timm model by @qubvel in #35257
- sdpa_kernel by @jla524 in #35410
- docs/source/ar/tasks/question_answering.md into Arabic by @AhmedAlmaghz in #35196
- docs/source/ar/tasks/summarization.md into Arabic by @AhmedAlmaghz in #35195
- sdpa_kernel by @jla524 in #35461

The following contributors have made significant changes to the library over the last release:
- Thread for SF conversion (#35236)
- rsfE with pytest (#35119)
- benchmark job in push-important-models.yml (#35292)
- weights_only=True with torch.load for transfo_xl (#35241)
- test_generate_with_static_cache even less flaky (#34995)
- .github/workflows/self-comment-ci.yml for now (#35366)

We waited a little bit to make sure it was stable; thanks @winglian for double checking and everyone for the fixes!
Fix GA loss bugs and add unit test (#35121) Contributed by @techkang and @ArthurZucker.
Fix num_items_in_batch not being an integer (#35115) Contributed by @xspirus.
Fix FSDP no longer working (#35212) Contributed by @muellerzr.
Don't use no_sync when DeepSpeed doesn't support it for certain ZeRO configurations (#35212) Contributed by @winglian.
Only import torch.distributed if it is available (#35133) Contributed by @GaetanLepage.
[Whisper] Patch float type on MPS (#35295) Contributed by @eustlb. 🔜 we should probably have MPS CIs to avoid repeating this!
PaliGemma 2 and PaliGemma are lightweight open vision-language models (VLM) inspired by PaLI-3, and based on open components like the SigLIP vision model and the Gemma language model. PaliGemma takes both images and text as inputs and can answer questions about images with detail and context, meaning that PaliGemma can perform deeper analysis of images and provide useful insights, such as captioning for images and short videos, object detection, and reading text embedded within images.
PaliGemma 2 is available in 3B, 10B, and 28B parameter sizes, which are based on Gemma 2 2B, 9B, and 27B models, respectively. The original PaliGemma models are available in the 3B size. For more information on Gemma model variants, see the Gemma models list. PaliGemma model variants support different pixel resolutions for image inputs, including 224 x 224, 448 x 448, and 896 x 896 pixels.
<img width="743" alt="image" src="https://github.com/user-attachments/assets/55cda8a6-b463-4a58-b7d3-f7d50ee2fa11">The I-JEPA model was proposed in Image-based Joint-Embedding Predictive Architecture by Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, Nicolas Ballas. I-JEPA is a self-supervised learning method that predicts the representations of one part of an image based on other parts of the same image. This approach focuses on learning semantic features without relying on pre-defined invariances from hand-crafted data transformations, which can bias specific tasks, or on filling in pixel-level details, which often leads to less meaningful representations.
<img width="413" alt="image" src="https://github.com/user-attachments/assets/561ca9d7-0327-477a-96b8-61d2af0caf34">The OLMo2 model is the successor of the OLMo model, which was proposed in OLMo: Accelerating the Science of Language Models.
The architectural changes from the original OLMo model to this model are:
Commits:
We add support for Meta's Layer-Skip Llama 3.2 1B model.
The Llama 3.2 1B model was continually pretrained with the LayerSkip recipe, early exit loss and layer dropout, as presented in Layer Skip: Enabling Early Exit Inference and Self-Speculative Decoding, and is capable of performing self-speculative decoding: decode with earlier layers and verify with remaining layers.
<img width="854" alt="image" src="https://github.com/user-attachments/assets/4a9e3596-e44e-419f-804d-9f4d03f8f680">This PR uses the torch.distributed.tensor.parallel subpackage to implement Tensor Parallel for Llama (as an example).
The motivation is multi-fold:
to make the modeling code as simple as in the single-worker case:
all manual TP implementations under if self.config.pretraining_tp > 1 can be removed.
to make tensor parallelism easily accessible by users:
added a model.tensor_parallel(device_mesh) method that allows users to turn a single-proc model into a parallel model.
This is the first PR of many to simplify and enable Tensor Parallel across models.
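A minimal sketch of the user-facing flow under these assumptions: the script is launched with one process per GPU (e.g. via torchrun), the model name is illustrative, and model.tensor_parallel(device_mesh) is the method described in this PR:

```python
import os

import torch
from torch.distributed.device_mesh import init_device_mesh
from transformers import AutoModelForCausalLM

# Launched with: torchrun --nproc-per-node=8 tp_example.py
world_size = int(os.environ["WORLD_SIZE"])
device_mesh = init_device_mesh("cuda", (world_size,))

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",  # illustrative checkpoint
    torch_dtype=torch.bfloat16,
)
# Turn the single-process model into a tensor-parallel one (method added by this PR).
model.tensor_parallel(device_mesh)
```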
Python 3.8 reaches end of life, and, as such, we drop it from our CI.
Several improvements have been done to the GGUF support in transformers; notably by adding new architectures to the list of supported architectures.
- use_parallel_residual and qkv_bias for StableLM GGUF config extraction by @Isotr0py in #34450

We continue the work to improve the speed of fast processors as detailed in this roadmap.
We contribute a fast processor to RT-DETR.
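The fast variant can be requested through use_fast=True when loading the image processor; a small sketch with an illustrative checkpoint and dummy image:

```python
from PIL import Image
from transformers import AutoImageProcessor

# use_fast=True requests the torchvision-backed fast processor when one is available.
processor = AutoImageProcessor.from_pretrained("PekingU/rtdetr_r50vd", use_fast=True)

image = Image.new("RGB", (640, 480))  # dummy image, replace with a real one
inputs = processor(images=image, return_tensors="pt")
print(inputs["pixel_values"].shape)
```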
A new pipeline has been added to transformers: image-text-to-text!
The pipeline supports the following inputs:
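A short usage sketch (the model and image URL are illustrative; any image-text-to-text model on the Hub works the same way):

```python
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="llava-hf/llava-interleave-qwen-0.5b-hf")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/beignets-task-guide.png"},
            {"type": "text", "text": "Describe this image."},
        ],
    },
]
out = pipe(text=messages, max_new_tokens=40, return_full_text=False)
print(out[0]["generated_text"])
```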
We have had several issues with chat templates because they're stored as single lines in the JSON config files:
- processor templates in chat_template.json and tokenizer templates in tokenizer_config.json causing confusion

The solution:
- chat_template.jinja file in the repo
- Processor classes, so processors should always be able to save their template as a raw Jinja file. In general, we'll be gently deprecating multiple templates in future.
- If a chat_template.jinja file is present, it overrides the JSON files. If a tokenizer is loaded with both Jinja and JSON chat templates and resaved, it should save only the Jinja file, and not have any chat_template entry in tokenizer_config.json.

For now, we continue saving in the old format by default. We'll probably keep it this way for several versions before making the new format the default, to ensure that most users are able to load the new format before it becomes common. Until then, the new format should mostly be used for testing, to make sure it's ready for deployment when we do the switch.
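Nothing changes in how templates are used; the gain is readability, since a template like the toy one below can live as a plain multi-line chat_template.jinja file instead of a single escaped JSON line (the template and tokenizer here are purely illustrative):

```python
from transformers import AutoTokenizer

# A toy, readable chat template; as a standalone Jinja file it keeps its line breaks.
template = """{% for message in messages %}
<|{{ message['role'] }}|>{{ message['content'] }}<|end|>
{% endfor %}{% if add_generation_prompt %}<|assistant|>{% endif %}"""

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # illustrative tokenizer
tokenizer.chat_template = template
print(tokenizer.apply_chat_template(
    [{"role": "user", "content": "Hello!"}],
    add_generation_prompt=True,
    tokenize=False,
))
```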
This PR largely reworks the logic we use in the modular converter. It is (hopefully) clearer and more maintainable. Instead of going in all directions, adding stuff, then deleting it if not needed, we now do the following:
- convert_tokens_to_ids by @winstxnhdw in #34030
- torch.fx issue related to the new loss_kwargs keyword argument by @michaelbenayoun in #34380
- test_eager_matches_sdpa_generate by @gante in #34386
- tensorflow_probability<0.22 in docker files by @ydshieh in #34381
- "best" for args.save_strategy. by @seanswyi in #31817
- model_doc/barthez.md to Korean by @Jwaminju in #33980
- docs/source/ar/fast_tokenizers.md into Arabic by @AhmedAlmaghz in #33034
- post_process_depth_estimation for GLPN by @alex-bene in #34413
- head_dim for mixtral model by @wavy-jung in #34281
- optimizer_cls_and_kwargs to Trainer.__init__ by @apoorvkh in #34358
- generate tests to the right mixin and delete redundant tests by @gante in #34464
- gc.collect and cuda.empty_cache by @ydshieh in #34514
- input_ids-inputs_embeds equivalence check by @gante in #34535
- test_eager_matches_sdpa_inference less flaky by @ydshieh in #34512
- docs/source/ar/multilingual.md into Arabic by @AhmedAlmaghz in #33048
- query_pre_attn_scalar different of num_heads in default gemma2 config by @molbap in #34540
- isin_mps_friendly can support 0D tensors by @gante in #34538
- @slow for test_eager_matches_sdpa_inference by @ydshieh in #34558
- convbert.md to Korean by @ahnjj in #34599
- timesformer.md to Korean by @mreraser in #33972
- docs/source/ar/trainer.md into Arabic by @AhmedAlmaghz in #33080
- Tool.from_space() by @aymeric-roucher in #34561
- docs/source/ar/torchscript.md into Arabic by @AhmedAlmaghz in #33079
- continue_final_message=True by @lewtun in #34253
- patch_size -> num_image_tokens in processing by @zucchini-nlp in #33424
- empty_cache device-agnostic by @faaany in #34774
- test_medium_seamless_m4t_pt in subprocess to avoid many failures by @ydshieh in #34812
- check_training_gradient_checkpointing by @ydshieh in #34806
- torch.export by @philkuz in #34103
- max_steps overriding num_train_epochs by @qgallouedec in #34810
- use_cache by @zucchini-nlp in #34274
- Deberta/Deberta-v2] Refactor code base to support compile, export, and fix LLM by @ArthurZucker in #22105
- peft] Given that self.active_adapter is deprecated, avoid using it by @tomaarsen in #34804
- test_auto_backbone_timm_model_from_pretrained by @ydshieh in #34877
- docs/source/ar/benchmarks.md into Arabic by @AhmedAlmaghz in #33023
- FlexAttention] Update gemma2 by @ArthurZucker in #34942
- get_max_length by @ydshieh in #34971
- Thread by @ydshieh in #34966
- release_memory() by @faaany in #34911
- VisitWebpageTool by @sergiopaniego in #34978
- save_pretrained for partially offloaded models by @kylesayrs in #34890
- utils/check_bad_commit.py (for auto ping in CI) by @ydshieh in #34943
- PixtralImageProcessorFast by @mgoin in #34836
- .from_pretrained type annotations by @qubvel in #34973
- FillMaskPipeline.__call__ signature and docstring by @alvarobartt in #35006
- cu_seqlens when tracing by @xenova in #35016
- test_eager_matches_sdpa_inference for XPU backend by @dvrogozh in #34889
- docs/source/ar/notebooks.md into Arabic by @AhmedAlmaghz in #33049
- BertGeneration by @ydshieh in #35043
- cuda by @faaany in #35047
- pad_token_tensor is None in warning by @tshu-w in #34005
- GPTNeoX] Flex Attention + Refactor by @vasqu in #34896
- tokenizers] bump to 0.21 by @ArthurZucker in #34972
- tie_word_embeddings handling for GGUF models by @Isotr0py in #35085
- trainer] fix the GA model_accepts_loss_kwargs by @ArthurZucker in #34915
- test_trainer.py) by @ydshieh in #35062

The following contributors have made significant changes to the library over the last release:
- docs/source/ar/fast_tokenizers.md into Arabic (#33034)
- docs/source/ar/multilingual.md into Arabic (#33048)
- docs/source/ar/trainer.md into Arabic (#33080)
- docs/source/ar/torchscript.md into Arabic (#33079)
- docs/source/ar/benchmarks.md into Arabic (#33023)
- PixtralImageProcessorFast (#34836)

One small fix for the FSDP + gradient accumulation loss issue!