This patch mostly finishes the gradient accumulation fixes! Thanks to @techkang and @Ryukijano 🤗
This is mostly for fx and onnx issues!
* Fix regression loading dtype #34409 by @SunMarc
* LLaVa: latency issues #34460 by @zucchini-nlp
* Fix pix2struct #34374 by @IlyasMoutawwakil
* Fix onnx non-exposable inplace aten op #34376 by @IlyasMoutawwakil
* Fix torch.fx issue related to the new loss_kwargs keyword argument #34380 by @michaelbenayoun
The Moshi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.
Moshi is a speech-text foundation model that casts spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice.
Zamba-7B-v1 is a hybrid between state-space models (Specifically Mamba) and transformer, and was trained using next-token prediction. Zamba uses a shared transformer layer after every 6 mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba-7B-v1 was pre-trained on 1T tokens of text and code data.
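The interleaving described above (a single shared transformer layer reused after every 6 Mamba blocks) can be sketched in a few lines. This is a hypothetical illustration of the layer plan, not the actual modeling code; the function name and block labels are made up.

```python
# Hypothetical sketch of Zamba's layer interleaving: one shared transformer
# layer (same weights every time) is inserted after every 6 Mamba blocks.
def zamba_layer_plan(n_mamba_blocks: int, period: int = 6) -> list:
    plan = []
    for i in range(1, n_mamba_blocks + 1):
        plan.append(f"mamba_{i}")
        if i % period == 0:
            plan.append("shared_transformer")  # reused, not a fresh layer
    return plan

plan = zamba_layer_plan(12)
print(plan)  # 12 mamba blocks with the shared layer appearing twice
```

Because the transformer layer is shared, its parameters are counted once regardless of how many times it appears in the plan.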
<img width="400" alt="zamba" src="https://github.com/user-attachments/assets/a86428b8-4d24-4e5a-bf78-222312693bb2">

The GLM Model was proposed in ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools by GLM Team, THUDM & ZhipuAI.
The abstract from the paper starts with the following:
We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B.
The Idefics3 model was proposed in Building and better understanding vision-language models: insights and future directions by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.
Idefics3 is an adaptation of the Idefics2 model with three main differences:
The PhiMoE model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.
This model is very similar to Mixtral with the main difference of Phi3LongRoPEScaledRotaryEmbedding, where they are used to extend the context of the rotary embeddings. The query, key and values are fused, and the MLP’s up and gate projection layers are also fused.
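The fused projections mentioned above can be illustrated with a toy example: one matrix multiply produces the concatenated query/key/value output, which is then sliced into three equal chunks. This is an illustrative sketch, not the library's modeling code; the names are made up.

```python
# Toy illustration of fused QKV: the query, key and value projections share
# one weight matrix, so a single matmul yields all three, and the fused
# output is sliced into equal chunks afterwards.
hidden_size = 4
fused_out = list(range(3 * hidden_size))  # stand-in for the fused projection output

q = fused_out[0 * hidden_size : 1 * hidden_size]
k = fused_out[1 * hidden_size : 2 * hidden_size]
v = fused_out[2 * hidden_size : 3 * hidden_size]

print(q, k, v)
```

Fusing the projections replaces three kernel launches with one, which is why it helps latency.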
This release adds SynthID, a novel state-of-the-art watermarking technique by Google DeepMind. SynthID has a low generation-time computational cost and can be configured to be nearly imperceptible (at the cost of harder watermarking detection). The release also comes with the code to train and run the corresponding detector, which is a machine learning model itself.
from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')

# SynthID Text configuration
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57],
    ngram_len=5,
)

# Generation with watermarking
tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
output_sequences = model.generate(
    **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
)
watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
print(watermarked_text)
Docs for applying SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkLogitsProcessor

Docs for detecting SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkDetector
<img width="750" alt="how-synthid-works-high-level" src="https://github.com/user-attachments/assets/c5702b21-e7e6-490d-8fe6-b73783e78e6b">

BitNet is an architecture introduced by Microsoft Research that uses extreme quantization, representing each parameter with only three values: -1, 0, and 1. This results in a model that uses just 1.58 bits per parameter, significantly reducing computational and memory requirements. It replaces traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized layers called BitLinears that use ternary precision (or even binary, in the initial version).
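A minimal sketch of the idea behind ternary quantization, assuming per-tensor absmean scaling (as described for BitNet b1.58); this is not the library's BitLinear implementation, just an illustration of mapping weights to {-1, 0, 1}:

```python
# Minimal sketch (not the library code) of absmean ternary quantization:
# every weight is mapped to {-1, 0, 1} using a single per-tensor scale.
def ternary_quantize(weights):
    scale = sum(abs(w) for w in weights) / len(weights)  # mean |w|
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

q, scale = ternary_quantize([0.9, -0.05, 0.4, -1.2])
print(q)  # every entry is -1, 0, or 1
```

At inference time, a matmul against ternary weights reduces to additions and subtractions, which is where the compute savings come from.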
More architectures are now supported in our GGUF loader; GGUF files saved with these architectures can now be loaded directly in transformers to be fine-tuned. We recommend using tooling from llama.cpp to requantize the models after further training has been done.
We are pushing for a unified inference API across multiple libraries. As part of this, we are cleaning up the input and output signatures for our pipeline classes and deprecating some rarely-used arguments. This is still a work-in-progress, but when it's finished, transformers pipelines should exactly match workflows in deployment libraries like transformers.js or TGI, allowing you to seamlessly move from development to production.
Also, pipelines now fully support the Processor class, used by vision-language models. Expect full pipeline support for chatting with VLMs in the very near future!
pipeline able to load processor by @qubvel in #32514

ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.
We are collaborating with the executorch team so that 🤗 Transformers models can be exported using torch.export. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in ExecuTorch, particularly for mobile and edge use cases.
* [MllamaProcessor] Update errors and API with multiple image by @ArthurZucker in #33715
* can_generate() recursive check by @gante in #33718
* image_size in Convnextv2 config by @lucianosrp in #33734
* [clean_up_tokenization_spaces] Pl bart was failing, updating by @ArthurZucker in #33735
* [MllamaImageProcessing] Update doc by @ArthurZucker in #33747
* load_balancing_loss_func function of modeling_mixtral.py by @PhilipMay in #33641
* [modular] fixes! by @ArthurZucker in #33820
* prepare_inputs_for_generation to GenerationMixin by @gante in #33677
* accelerate dependency error in case of defaulting low_cpu_mem_usage=True by @kylesayrs in #33830
* compressed_tensors by @kylesayrs in #33828
* tokenizer kwarg deprecation with decorator by @qubvel in #33887
* test_static_cache_matches_dynamic as flaky by @gante in #33630
* SplinterTokenizer unit test by @ariepratama in #32652
* weights_only flag when loading state_dict by @jerryzh168 in #32481
* save_pretrained exception to warning by @gante in #33906
* logits.float() by @ringohoffman in #33902
* validate_rope by @zucchini-nlp in #33753
* [PR run-slow] by @ArthurZucker in #33939
* self.position_embeddings->self.position_embedding by @ArthurZucker in #33958
* char_to_token documentation to note behaviour when trim_offsets is True by @Craigacp in #33919
* [TF] Fix Tensorflow XLA Generation on limited seq_len models by @vasqu in #33903
* [Red CIs] Fix hub failures by @ArthurZucker in #34001
* [pytes collection] Fix flax test collection by @ArthurZucker in #34004
* gguf.md to Korean by @yijun-lee in #33764
* swinv2.md to Korean by @mreraser in #33566
* audio_utils.md to Korean by @yijun-lee in #33802
* esm.md to Korean by @yijun-lee in #33796
* time_series_utils.md to Korean by @yijun-lee in #33806
* pipelines_utils.md to Korean by @yijun-lee in #33809
* trainer.md to Korean by @yijun-lee in #33797
* chameleon.md to Korean by @yijun-lee in #33799
* logging.md to Korean by @chhaewxn in #33543
* auto.md to Korean by @boyunJang in #33590
* swin2sr.md to Korean by @mreraser in #33795
* vit.md to Korean by @mreraser in #33884
* gemma.md to Korean by @yijun-lee in #33936
* decoder_config=None by @SunMarc in #34014
* trainer_seq2seq.py's __init__ type annotations by @benglewis in #34021
* feature_extractor.md to Korean by @yijun-lee in #33775
* bertweet.md to Korean by @ahnjj in #33891
* gpt_neox_japanese.md to Korean by @ahnjj in #33894
* rag.md to Korean by @chhaewxn in #33989
* main_classes/quantization.md to Korean by @fabxoe in #33959
* main_classes/configuration.md to Korean by @fabxoe in #33952
* model_doc/mamba.md to Korean by @fabxoe in #33626
* model_doc/autoformer.md to Korean by @fabxoe in #33574
* model_doc/patchtsmixer.md to Korean by @fabxoe in #33587
* model_doc/clip.md to Korean by @fabxoe in #33610
* model_doc/paligemma.md to Korean by @fabxoe in #33612
* model_doc/llama3.md to Korean by @fabxoe in #33635
* model_doc/mistral.md to Korean by @fabxoe in #33648
* model_doc/cohere.md to Korean by @fabxoe in #33885
* model_doc/dbrx.md to Korean by @fabxoe in #33951
* model_doc/deberta-v2.md to Korean by @fabxoe in #33968
* main_classes/onnx.md to Korean by @fabxoe in #33601
* tokenization_utils.md to Korean by @yijun-lee in #33813
* swin.md to Korean by @mreraser in #33510
* file_utils.md to Korean by @yijun-lee in #33803
* openai-gpt.md to Korean by @yijun-lee in #33801
* biogpt.md to Korean by @yijun-lee in #33773
* blip.md to Korean by @cjfghk5697 in #33515
* image_processing_utils.md to Korean by @yijun-lee in #33804
* modular_transformers.md to Korean by @yijun-lee in #33772
* [Patch helper] update to not have to checkout main by @ArthurZucker in #34006
* prepare_inputs_for_generation by @gante in #33870
* model_doc/bart.md to Korean by @fabxoe in #33893
* model_doc/deberta.md to Korean by @fabxoe in #33967
* main_classes/keras_callbacks.md to Korean by @fabxoe in #33955
* model_doc/mamba2.md to Korean by @fabxoe in #33629
* main_classes/model.md to Korean by @fabxoe in #33606
* model_doc/trajectory_transformer.md to Korean by @fabxoe in #33597
* model_doc/time_series_transformer.md to Korean by @fabxoe in #33596
* model_doc/informer.md to Korean by @fabxoe in #33585
* model_doc/graphormer.md to Korean by @fabxoe in #33569
* modeling_utils.md to Korean by @yijun-lee in #33808
* main_classes/data_collator.md to Korean by @fabxoe in #33954
* model_doc/patchtst.md to Korean by @fabxoe in #33589
* text_generation.md to Korean by @yijun-lee in #33777
* main_classes/callback.md to Korean by @Jwaminju in #33572
* generation_utils.md to Korean by @yijun-lee in #33818
* is_pipeline_test_to_skip method signature by @qubvel in #34067
* synced_gpus to True when using FullyShardedDataParallel by @ringohoffman in #33483
* logits to float() by @gante in #34042
* prepare_inputs_for_generation in encoder-decoder llms by @gante in #34048
* LlavaNextVideoForConditionalGeneration by @ydshieh in #34070
* generate calls with synced_gpus by @gante in #34095
* logits to same device as input_ids by @gante in #34076
* vivit.md to Korean by @mreraser in #33935
* gemma2.md to Korean by @yijun-lee in #33937
* trainer_utils.md to Korean by @yijun-lee in #33817
* blip-2.md to Korean by @cjfghk5697 in #33516
* accelerate error caused by 46d09af by @steveepreston in #34197
* trainer._get_eval_sampler() to support group_by_length arg by @larin92 in #33514
* prepare_inputs_for_generation by @gante in #34199
* require_torch_up_to_2_accelerators by @byi8220 in #34201
* MLFLOW_MAX_LOG_PARAMS to MLflowCallback by @cecheta in #34279
* executorch.md to Korean by @ahnjj in #33888
* bert japanese.md to Korean by @ahnjj in #33890
* model_doc/bartpho.md to Korean by @Jwaminju in #33981

The following contributors have made significant changes to the library over the last release:
* [MllamaProcessor] Update errors and API with multiple image (#33715)
* [clean_up_tokenization_spaces] Pl bart was failing, updating (#33735)
* [MllamaImageProcessing] Update doc (#33747)
* [modular] fixes! (#33820)
* [PR run-slow] (#33939)
* self.position_embeddings->self.position_embedding (#33958)
* [Red CIs] Fix hub failures (#34001)
* [pytes collection] Fix flax test collection (#34004)
* [Patch helper] update to not have to checkout main (#34006)
* [TF] Fix Tensorflow XLA Generation on limited seq_len models (#33903)
* LlavaNextVideoForConditionalGeneration (#34070)
* logits.float() (#33902)
* synced_gpus to True when using FullyShardedDataParallel (#33483)
* gguf.md to Korean (#33764)
* audio_utils.md to Korean (#33802)
* esm.md to Korean (#33796)
* time_series_utils.md to Korean (#33806)
* pipelines_utils.md to Korean (#33809)
* trainer.md to Korean (#33797)
* chameleon.md to Korean (#33799)
* gemma.md to Korean (#33936)
* feature_extractor.md to Korean (#33775)
* tokenization_utils.md to Korean (#33813)
* file_utils.md to Korean (#33803)
* openai-gpt.md to Korean (#33801)
* biogpt.md to Korean (#33773)
* image_processing_utils.md to Korean (#33804)
* modular_transformers.md to Korean (#33772)
* modeling_utils.md to Korean (#33808)
* text_generation.md to Korean (#33777)
* generation_utils.md to Korean (#33818)
* gemma2.md to Korean (#33937)
* trainer_utils.md to Korean (#33817)
* main_classes/quantization.md to Korean (#33959)
* main_classes/configuration.md to Korean (#33952)
* model_doc/mamba.md to Korean (#33626)
* model_doc/autoformer.md to Korean (#33574)
* model_doc/patchtsmixer.md to Korean (#33587)
* model_doc/clip.md to Korean (#33610)
* model_doc/paligemma.md to Korean (#33612)
* model_doc/llama3.md to Korean (#33635)
* model_doc/mistral.md to Korean (#33648)
* model_doc/cohere.md to Korean (#33885)
* model_doc/dbrx.md to Korean (#33951)
* model_doc/deberta-v2.md to Korean (#33968)
* main_classes/onnx.md to Korean (#33601)
* model_doc/bart.md to Korean (#33893)
* model_doc/deberta.md to Korean (#33967)
* main_classes/keras_callbacks.md to Korean (#33955)
* model_doc/mamba2.md to Korean (#33629)
* main_classes/model.md to Korean (#33606)
* model_doc/trajectory_transformer.md to Korean (#33597)
* model_doc/time_series_transformer.md to Korean (#33596)
* model_doc/informer.md to Korean (#33585)
* model_doc/graphormer.md to Korean (#33569)
* main_classes/data_collator.md to Korean (#33954)
* model_doc/patchtst.md to Korean (#33589)

Mostly some warnings that were not properly removed ⚠️ :
🔴 Had a small regression with dynamic Cache 🔴

* Cache: revert DynamicCache init for BC #33861 by @gante
A small fix for Idefics 🐩 :
And a fix for Siglip 🤧 !
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
Qwen2-VL is a major update over the previous Qwen-VL by the Qwen team.
An extract from the Qwen2-VL blogpost available here is as follows:
Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model family. Compared with Qwen-VL, Qwen2-VL adds the following capabilities:
The Qwen2-Audio is the new model series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.
They introduce two distinct audio interaction modes:
OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.
LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone. Images are processed with the anyres-9 technique, where the image is split into 9 patches to better handle high-resolution images and capture as much detail as possible. Videos, however, are pooled so that each frame contributes a total sequence length of 196 tokens, for more memory-efficient computation. LLaVA-Onevision is available in three sizes: 0.5B, 7B and 72B, and achieves remarkable performance on benchmark evaluations.
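The anyres-9 splitting can be sketched as follows. This is an illustrative sketch only, with made-up sizes; the real processor also keeps a resized copy of the full image alongside the patches.

```python
# Hedged sketch of the anyres-9 idea: split an image into a 3x3 grid of
# equally sized patches, each described as (top, left, height, width).
def split_into_patches(height, width, grid=3):
    ph, pw = height // grid, width // grid
    return [(r * ph, c * pw, ph, pw)
            for r in range(grid) for c in range(grid)]

patches = split_into_patches(384, 384)
print(len(patches))  # 9
```

Each patch is then encoded by the vision tower separately, which is how high-resolution detail survives a fixed-resolution encoder.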
The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.
The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefineWeb, Cosmopedia and Math data.
The team releases an accompanying blog post.
The Granite model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in its size category across various benchmarks, including natural language multiple-choice tasks, code generation, and math reasoning.
The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to other dense models with 2x the active parameters across various benchmarks, including natural language multiple-choice tasks, code generation, and math reasoning.
The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.
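A back-of-the-envelope calculation shows how large that compression is, assuming 16-bit mono PCM as the uncompressed baseline (the paper's exact accounting may differ):

```python
# Rough compression ratio for DAC: 44.1 kHz, 16-bit mono PCM vs an
# 8 kbps token stream (baseline assumptions are illustrative).
sample_rate_hz = 44_100
bits_per_sample = 16
codec_bitrate_bps = 8_000

pcm_bitrate_bps = sample_rate_hz * bits_per_sample  # 705,600 bits/s
ratio = pcm_bitrate_bps / codec_bitrate_bps
print(f"{ratio:.1f}x")  # 88.2x
```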
The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders.
The model uses PixtralVisionModel for its vision encoder, and MistralForCausalLM for its language decoder. The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are neither padded together nor resized).
The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.
The OmDet-Turbo model was proposed in Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection capabilities while maintaining high accuracy. The base model achieves performance of up to 100.2 FPS and 53.4 AP on COCO zero-shot.
GGUF support continues to be enhanced in the library by offering a way to load GGUF models within transformers by unquantizing them, before re-quantizing them for re-use within the GGUF/GGML ecosystem.
An ongoing effort is to add the ability to use torchao as a quantization backend. Future PRs will enable saving and fine-tuning with peft.
The Liger kernel is now supported in the Trainer class.
This PR introduces modularity for transformers through inheritance, something that has always been prohibited when working with transformers (see the blog post for the accompanying design philosophy).
The core idea behind this PR is to facilitate model additions by enabling Pythonic inheritance while staying true to our single-file policy, in which models/processors must be contained within a single file, so that users can work on the object without going through 10 layers of abstraction.
It is heavily recommended to read the PR description in order to understand the depth of the change: https://github.com/huggingface/transformers/pull/33248
transformers: modularity and inheritance for new model additions by @ArthurZucker in #33248

Agents continue being improved at each release; this time making it much simpler to leverage a local engine through a local Transformers Engine.
This PR adds to all decoder-only models (except for XLNet) support for dynamic cache.
The documentation for the Dynamic cache can be found here, and documentation related to the KV cache in transformers in general can be found here.
We've made several updates to our handling of chat models and chat templates. The most noticeable change is that assistant prefill is now supported. This means you can end a chat with an assistant message, and the model will continue that message instead of starting a new one, allowing you to guide the model's response:
pipe = pipeline("text-generation", model_checkpoint)
chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'}
]
output = pipe(chat) # The model will continue outputting JSON!
We've also enabled several new functionalities in Jinja that will allow more powerful templates in future, including Loop Controls and a strftime_now function that can get the current date and time, which is commonly used in system messages. For more details, see the updated chat template docs.
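As a sketch of how strftime_now behaves inside a template, the helper is essentially a thin wrapper over datetime.strftime. The wiring below is our own stdlib mock, shown only to illustrate the semantics; in transformers the function is made available to Jinja chat templates automatically.

```python
from datetime import datetime

def strftime_now(fmt):
    # Mimics the strftime_now helper exposed to Jinja chat templates:
    # formats the current date/time with the given strftime format string.
    return datetime.now().strftime(fmt)

# A chat template containing {{ strftime_now('%Y-%m-%d') }} would render
# today's date into the system message, e.g.:
system_message = f"Today is {strftime_now('%Y-%m-%d')}. Be concise."
print(system_message)
```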
* mask_generation.md to Korean by @jeongiin in #32257
* idefics.md to Korean by @boyunJang in #32258
* image_to_image.md to Korean by @shinhyunji36 in #32327
* gptq.md to Korean by @1kmmk1 in #32293
* prompting.md to Korean by @chhaewxn in #32294
* quantization/quanto.md to Korean by @fabxoe in #32281
* image_feature_extraction.md to Korean by @mreraser in #32239
* chat_templating.md to Korean by @enchantee00 in #32362
* _supports_sdpa to True by @pocca2048 in #32457
* ko-llm_tutorial_optimization.md to Korean by @010kim in #32372
* trainer.md to Korean by @cjfghk5697 in #32260
* eetq.md to Korean by @jun048098 in #32352
* fsdp.md to Korean by @win2dvp21 in #32261
* bitsandbytes.md to Korean by @SeungAhSon in #32408
* inputs_embeds as input by @molbap in #32493
* test_static_cache_exportability with torch 2.4.0 by @guangy10 in #32516
* agent.md to Korean by @Jwaminju in #32351
* encodec model names by @Sai-Suraj-27 in #32581
* .push_to_hub(..., create_pr=True, revision="my-branch") when creating PR on not-owned repo by @Wauplin in #32094
* deepspeed.md to Korean by @4N3MONE in #32431
* awq.md to Korean by @ahnjj in #32324
* test_find_base_model_checkpoint by @Sai-Suraj-27 in #32638
* is_torch_mps_available() function to include min_version argument by @Sai-Suraj-27 in #32545
* transformers tag to the modelcard by @LysandreJik in #32623
* WhisperGenerationMixin by @faaany in #32316
* test_tokenization_utils.py by @Sai-Suraj-27 in #32601
* tests/utils/test_add_new_model_like.py by @Sai-Suraj-27 in #32678
* JetMoeIntegrationTest by @ydshieh in #32332
* doctest_glob by @Sai-Suraj-27 in #32475
* falcon-mamba-7b model checkpoint name by @Sai-Suraj-27 in #32837
* LogitsWarper and LogitsProcessor by @gante in #32626
* batch_size instead of max_batch_size by @gante in #32657
* to in DoLa body, causing exceptions in multi-gpu generation by @gante in #32856
* test_sdpa_can_compile_dynamic device-agnostic by @faaany in #32519
* whisper-large-v2 model link in docs by @Sai-Suraj-27 in #32871
* norm_before_gate usage by @vasqu in #32686
* tensor.norm() with decomposed version for CLIP executorch export by @qubvel in #32887
* return_timestamps when return_timestamps is not passed to generate function by @hrl in #31296
* huggingface_hub installation to workflows by @Sai-Suraj-27 in #32891
* exceptions.ConnectionError by @younesbelkada in #31469
* AttributeError raised when using Trainer with eval_on_start=True in Jupyter Notebook by @fshp971 in #32849
* Processor.save_pretrained caused by #31691 by @leloykun in #32921
* use_cache=False by @gante in #32863
* PretrainedConfig from saving generate parameters; Update deprecations in generate-related code 🧹 by @gante in #32659
* atol in test_forward_with_num_logits_to_keep by @gante in #33093
* isin_mps_friendly, a wrapper function for torch.isin by @gante in #33099
* pydantic required version in dockerfiles to make it compatible with DeepSpeed by @Sai-Suraj-27 in #33105
* efficientnet pipeline timeout and prevent future similar issues due to large image size by @gante in #33123
* conversations.md to Korean by @newfull5 in #32468
* llm_optims.md to Korean by @yijun-lee in #32325
* return_dict_in_generate is False but should be True by @gante in #33146
* bitsandbytes) in docstrings by @rapsealk in #33230
* torch.from_numpy() to create tensors for np.ndarrays by @shinyano in #33201
* num_logits_to_keep in composite models by @zucchini-nlp in #33168
* FalconMamba training issues due to incompatible kernels by @younesbelkada in #33195
* torch.jit.trace for interpolate_pos_encoding in all vision models by @xenova in #33226
* inputs_embeds by @zucchini-nlp in #32932
* transformers[en] Documentation by @nnilayy in #33350
* FalconMambaForCausalLM by @younesbelkada in #33381
* FbgemmFp8Linear not preserving tensor shape by @vgel in #33239
* Zero-shot object detection documentation by @sergiopaniego in #33430
* SSH into runner info. to DM by @ydshieh in #33346
* train with a script by @faaany in #33423
* padding_side as call time kwargs by @zucchini-nlp in #33385
* Agents and tools documentation links typos by @sergiopaniego in #33471
* Agents, supercharged - Multi-agents, External tools, and more docs typo fixed by @sergiopaniego in #33478
* docs/source/ar/_toctree.yml by @AhmedAlmaghz in #32696
* accelerator.use_fp16 in examples by @hlky in #33513
* sequences_scores in the Whisper beam search output by @Nik-Kras in #32970
* model.config and model.generation_config 🔫 by @gante in #33480
* past_key_values is None by @gante in #33541
* attention_mask is 2D by @gante in #33575
* [Mamba2] Move dt calculations to kernel by @vasqu in #33520
* gemma2 when instantiating a new cache by @gante in #33595
* test_generate_from_inputs_embeds_decoder_only by @gante in #33602
* torch_job by @ydshieh in #33593
* PreTrainedModel inheriting from GenerationMixin by @gante in #33203
* cache_implementation) by @gante in #33684

The following contributors have made significant changes to the library over the last release:
* chat_templating.md to Korean (#32362)
* ko-llm_tutorial_optimization.md to Korean (#32372)
* trainer.md to Korean (#32260)
* exceptions.ConnectionError (#31469)
* FalconMamba training issues due to incompatible kernels (#33195)
* FalconMambaForCausalLM (#33381)
* deepspeed.md to Korean (#32431)
* docs/source/ar/_toctree.yml (#32696)

Patch release v4.44.2: mostly two regressions that were not caught, for Jamba and for processors!
Full Changelog: https://github.com/huggingface/transformers/compare/v4.44.0...v4.44.1
This release comes a bit early in our cycle because we wanted to ship important and requested models along with improved performances for everyone!
All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to @gante, @sanchit-gandhi and @xenova
Generate: end-to-end compilation #30788 by @gante: model.generate now supports compiling! There are a few limitations, but here is a small snippet:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")
# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id
model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

You can pass cache_implementation="offloaded" when calling from_pretrained, or use this:

from transformers import GenerationConfig

gen_config = GenerationConfig(
    cache_implementation="offloaded",
    # other generation options such as
    num_beams=4,
    num_beam_groups=2,
    num_return_sequences=4,
    diversity_penalty=1.0,
    max_new_tokens=50,
    early_stopping=True,
)
outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
The PyTorch team gave us a great gift: you can now use torch.export in a way that is directly compatible with ExecuTorch! Find examples here.
This also unlocks support for prompt reuse:
import os, torch, copy
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values
prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
Gemma 2: support assisted generation #32357 by @gante
We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for gemma 2), so you might observe some output mismatch on long sequences. Read more about it here.
# transformers assisted generation reference:
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# we DON’T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'
tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
    reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
"assistant_model": assistant_model,
"do_sample": True,
"temperature": 0.7,
"max_new_tokens": 64,
}
outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.
The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to @suiyoubi. See:
Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.
Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.
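Fill-in-the-middle works by reordering the prompt so the model can condition on both the code before and the code after the gap, then generate the missing span. A minimal sketch of the prompt construction is below; the `[PREFIX]`/`[SUFFIX]` sentinel strings are illustrative placeholders, not Codestral's actual control tokens.

```python
def build_fim_prompt(prefix: str, suffix: str,
                     pre_tok: str = "[PREFIX]", suf_tok: str = "[SUFFIX]") -> str:
    """Arrange a fill-in-the-middle prompt: the model generates the span
    that belongs between `prefix` and `suffix`.

    The sentinel tokens here are hypothetical placeholders for illustration.
    """
    # Suffix-first ordering lets the model see both sides of the gap
    # before it starts generating the middle.
    return f"{suf_tok}{suffix}{pre_tok}{prefix}"

prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))")
```

The model's completion is then stitched back between the original prefix and suffix.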
It's the Mamba 2 architecture. It was a bit of a pain to remove all the einops, but we hope we made it better for everyone!
We removed the chat templates from the code; they should all be on the Hub!
Our great @sanchit-gandhi worked on porting the recent compile upgrades to long form decoding in
- ruff to the latest version by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/31926
- unittest method with the correct one by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32198
- eos for assisted decoding by @zucchini-nlp in https://github.com/huggingface/transformers/pull/31301
- object base class by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32230
- target_sizes is None in post_process_image_guided_detection for owlv2 by @catalys1 in https://github.com/huggingface/transformers/pull/31934
- static cache implementation is not compatible with attn_implementation==flash_attention_2 by @faaany in https://github.com/huggingface/transformers/pull/32039
- convert_blip_checkpoint function call by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32262
- check_docstrings by @gante in https://github.com/huggingface/transformers/pull/32259
- p_mask a numpy array before passing to select_starts_ends by @faaany in https://github.com/huggingface/transformers/pull/32076
- gguf==0.9.1 by @Isotr0py in https://github.com/huggingface/transformers/pull/32298
- fetch-depth: 0 in trufflehog checkout step by @McPatate in https://github.com/huggingface/transformers/pull/31663
- inv_freq assignment by @gante in https://github.com/huggingface/transformers/pull/32330
- 3-5x faster torch.compile forward compilation for autoregressive decoder models by @fxmarty in https://github.com/huggingface/transformers/pull/32227
- staticmethods with self as first argument by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32361
- speech dep to the consistency docker image by @gante in https://github.com/huggingface/transformers/pull/32374
- transformers/examples/flax/language-modeling/t5_tokenizer_model.py by @fshp971 in https://github.com/huggingface/transformers/pull/32157
- test_embeded_special_tokens for luke and mluke models by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32413
- preprocess with decorator by @qubvel in https://github.com/huggingface/transformers/pull/32024

Full Changelog: https://github.com/huggingface/transformers/compare/v4.43.4...v4.44.0
There was a mix-up; the DeepSpeed issues are now properly fixed with:
🤗 Enjoy holidays
Patch release v4.43.3:
We still saw some bugs so @zucchini-nlp added:
- Resize embeds with DeepSpeed #32214
Other fixes:
The Llama 3.1 models are released by Meta and come in three flavours: 8B, 70B, and 405B.
To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.
We release a repository of llama recipes to showcase usage for inference, as well as full and partial fine-tuning of the different variants.
The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by the META AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including an interleaved format, and generates textual responses.
The ZoeDepth model was proposed in ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller. ZoeDepth extends the DPT framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head is used with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.
Hiera was proposed in Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer.
The paper introduces “Hiera,” a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising on accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed “bells-and-whistles,” are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed both in inference and training across various image and video recognition tasks. The approach suggests that spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity.
Our ReactAgent has a specific way to return its final output: it calls the tool final_answer, added to the user-defined toolbox upon agent initialization, with the answer as the tool argument. We found that even for a one-shot agent like CodeAgent, using a specific final_answer tool helps the llm_engine find what to return: so we generalized the final_answer tool for all agents.
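The mechanic described above can be sketched in a few lines: the final_answer tool both records the answer and signals the run loop to stop. The class and loop below are a toy illustration of the idea, not the agents implementation itself.

```python
class FinalAnswerTool:
    """Toy stand-in for the agent's final_answer tool: calling it records
    the answer so the run loop knows the episode is finished."""
    name = "final_answer"

    def __init__(self):
        self.answer = None

    def __call__(self, answer):
        self.answer = answer
        return answer

def run_agent(steps, toolbox):
    """Execute (tool_name, argument) steps until final_answer is called."""
    final = toolbox["final_answer"]
    for tool_name, arg in steps:
        toolbox[tool_name](arg)
        if final.answer is not None:
            return final.answer

toolbox = {"final_answer": FinalAnswerTool()}
result = run_agent([("final_answer", "42")], toolbox)
```

Because the stop condition lives in a tool rather than in the prompt, the same convention works for both ReAct-style and one-shot code agents.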
Now if your code-based agent (like ReactCodeAgent) defines a function at step 1, it will remember the function definition indefinitely. This means your agent can create its own tools for later re-use!
This is a transformative PR: it allows the agent to regularly run a specific step for planning its actions in advance. This gets activated if you set an int for planning_interval upon agent initialization. At step 0, a first plan will be made. At later steps (like steps 3, 6, 9 if you set planning_interval=3), this plan will be updated by the agent depending on the history of previous steps. More details soon!
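The scheduling rule is simple: plan at step 0, then re-plan every planning_interval steps. A minimal sketch of which steps trigger (re)planning, assuming this straightforward modulo rule:

```python
def planning_steps(total_steps: int, planning_interval: int) -> list[int]:
    """Return the step indices at which the agent (re)plans: an initial
    plan at step 0, then an update every `planning_interval` steps."""
    return [s for s in range(total_steps) if s % planning_interval == 0]

# With planning_interval=3 over 10 steps, planning happens at steps 0, 3, 6, 9.
```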
A significant RoPE refactor was done to make it model agnostic and more easily adaptable to any architecture. It is only applied to Llama for now but will be applied to all models using RoPE over the coming days.
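At its core, RoPE rotates each (even, odd) feature pair of the query/key vectors by an angle that depends on the token position and the feature index, which is what the refactor centralizes across architectures. A minimal pure-Python sketch of the standard rotation (real model code operates on whole tensors, not single pairs):

```python
import math

def rope_rotate(pair, position, dim_index, dim, base=10000.0):
    """Rotate one (even, odd) feature pair by the RoPE angle for this
    position and feature pair index. `dim` is the head dimension and
    `base` the conventional frequency base."""
    x1, x2 = pair
    theta = position * base ** (-2.0 * dim_index / dim)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Standard 2D rotation; it preserves the pair's norm.
    return (x1 * cos_t - x2 * sin_t, x1 * sin_t + x2 * cos_t)
```

Because the rotation angle grows linearly with position, the dot product of two rotated vectors depends only on their relative distance, which is what makes the embedding "rotary".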
🚨🚨 This PR changes the code to rely on the tokenizer's defaults when these flags are unset. Some models using TextGenerationPipeline previously did not add a <bos> by default, which (negatively) impacted their performance; this is now fixed, but in practice it is a breaking change.
Example of a script changed as a result of this PR:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Foo bar"))
- get_seq_length method by @sanchit-gandhi in #31661
- keras-nlp<0.14 pin by @gante in #31684
- tests/test_xxx_utils.py to tests/utils by @ydshieh in #31730
- pytest_num_workers=4 for some CircleCI jobs by @ydshieh in #31764
- sdpa support for SigLIP by @qubvel in #31499
- TFBlipModelTest::test_pipeline_image_to_text by @ydshieh in #31827
- TrainingArguments by @andstor in #31812
- vocab_size in other two VLMs by @zucchini-nlp in #31681
- .generate() by @voidism in #29619
- _init_weights for ResNetPreTrainedModel by @ydshieh in #31851
- _init_weights for ResNetPreTrainedModel by @ydshieh in #31868
- duplicate field definitions in some classes by @Sai-Suraj-27 in #31888
- push_to_hub=True in TrainingArguments by @SunMarc in #31808
- warnings in a with block to avoid flaky tests by @ydshieh in #31893
- [ConvertSlow] make sure the order is preserved for addedtokens by @ArthurZucker in #31902
- [Gemma2] Support FA2 softcapping by @ArthurZucker in #31887
- 1st argument name in classmethods by @Sai-Suraj-27 in #31907
- SlidingWindowCache.reset() by @gante in #31917
- Trainer.get_optimizer_cls_and_kwargs to be overridden by @apoorvkh in #31875
- GenerationMixin.generate compatibility with pytorch profiler by @fxmarty in #31935
- Cache and cache_position being default by @gante in #31898
- sigmoid_focal_loss() function call by @Sai-Suraj-27 in #31951
- logits_warper update in models with custom generate fn by @gante in #31957
- create_repo() function call by @Sai-Suraj-27 in #31947
- test_stage3_nvme_offload by @faaany in #31881
- src/transformers/__init__.py by @Sai-Suraj-27 in #31993
- log messages that are resulting in TypeError due to too many arguments by @Sai-Suraj-27 in #32017
- SeamlessM4Tv2ConformerEncoderLayer.forward() when gradient checkpointing is enabled by @anferico in #31945
- sdpa and FA2 for CLIP by @qubvel in #31940
- numpy<2.0 by @ydshieh in #32018
- head_dim through config (and do not require head_dim * num_heads == hidden_size) by @xenova in #32050
- duplicate entries in a dictionary by @Sai-Suraj-27 in #32041
- huggingface_hub 0.24 by @Wauplin in #32054
- mktemp() function by @Sai-Suraj-27 in #32123
- ko/_toctree.yml and remove custom_tools.md to reflect latest changes by @jungnerd in #31969
- TypeError instead of ValueError for invalid type by @Sai-Suraj-27 in #32111
- trust_remote_code when loading Libri Dummy by @sanchit-gandhi in #31748
- GPTNeoX and GPT2 by @vasqu in #31944

The following contributors have made significant changes to the library over the last release:
.generate() (#29619), but also fixes the sliding window for long context and other typos.
I was off last week and could not get this out; thanks all for your patience 🥳
After experimenting, we noticed that softcapping is a must, mostly for the 27b model. So we're adding it back (it should have been there, but an error on my side made it disappear). Sorry all! 😭
Thanks to our 2 contributors for their prompt fixes; this mostly applies to training and FA2!
Patch release for commit:
The Gemma2 model was proposed in Gemma2: Open Models Based on Gemini Technology and Research by Gemma2 Team, Google. Gemma2 models are trained on 6T tokens, and released with 2 versions, 2b and 7b.
The abstract from the paper is the following:
This work introduces Gemma2, a new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma2 outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of our model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations
The RT-DETR model was proposed in DETRs Beat YOLOs on Real-time Object Detection by Wenyu Lv, Yian Zhao, Shangliang Xu, Jinman Wei, Guanzhong Wang, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu.
RT-DETR is an object detection model that stands for “Real-Time DEtection Transformer.” This model is designed to perform object detection tasks with a focus on achieving real-time performance while maintaining high accuracy. Leveraging the transformer architecture, which has gained significant popularity in various fields of deep learning, RT-DETR processes images to identify and locate multiple objects within them.
The InstructBLIP model was proposed in InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. InstructBLIP leverages the BLIP-2 architecture for visual instruction tuning.
InstructBLIP uses the same architecture as BLIP-2 with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.
The LLaVa-NeXT-Video model was proposed in LLaVA-NeXT: A Strong Zero-shot Video Understanding Model by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. LLaVa-NeXT-Video improves upon LLaVa-NeXT by fine-tuning on a mix of video and image data, thus increasing the model's performance on videos.
LLaVA-NeXT surprisingly has strong performance in understanding video content in zero-shot fashion thanks to the AnyRes technique that it uses. The AnyRes technique naturally represents a high-resolution image as multiple images. This technique is naturally generalizable to videos, because a video can be considered as a set of frames (similar to the set of images in LLaVa-NeXT). The current version of LLaVA-NeXT makes use of AnyRes and trains with supervised fine-tuning (SFT) on top of LLaVA-NeXT on video data to achieve better video understanding capabilities. The model is currently SOTA among open-source models on the VideoMME benchmark.
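The core of AnyRes is just a tiling step: one high-resolution image is split into a grid of sub-images that the vision encoder can handle, and a video is treated analogously with each frame acting as one image. A minimal sketch of the tiling, assuming a fixed square tile size:

```python
def anyres_tiles(width: int, height: int, tile: int) -> list[tuple]:
    """Split an image's pixel grid into (left, top, right, bottom) tile
    boxes, the way AnyRes represents one high-resolution image as
    several sub-images. Edge tiles may be smaller than `tile`."""
    tiles = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            tiles.append((left, top,
                          min(left + tile, width), min(top + tile, height)))
    return tiles

# A 672x672 image with 336-pixel tiles yields a 2x2 grid of sub-images.
grid = anyres_tiles(672, 672, 336)
```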
A very significant change makes its way within the transformers codebase, introducing a new way to add models to transformers. We recommend reading the description of the PR below, but here is the gist of it:
The diff_converter tool is here to replace our old Copied from statements, while keeping our core transformers philosophy:
- single model single file
- explicit code
- standardization of modeling code
- readable and educative code
- simple code
- least amount of modularity
This additionally unlocks the ability to very quickly see the differences between new architectures that get developed. While many architectures are similar, the "single model, single file" policy can obfuscate the changes. With this diff converter, we want to make the changes between architectures very explicit.
We've made major updates to our support for tool-use and RAG models. We can now automatically generate JSON schema descriptions for Python functions which are suitable for passing to tool models, and we've defined a standard API for tool models which should allow the same tool inputs to be used with many different models. Models will need updates to their chat templates to support the new API, and we're targeting the Nous-Hermes, Command-R and Mistral/Mixtral model families for support in the very near future. Please see the updated chat template docs for more information.
If you are the owner of a model that supports tool use, but you're not sure how to update its chat template to support the new API, feel free to reach out to us for assistance with the update, for example on the Hugging Face Discord server. Ping Matt and yell key phrases like "chat templates" and "Jinja" and your issue will probably get resolved.
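The idea behind the automatic schema generation is that a Python function's signature and docstring already contain most of what a tool model needs. The snippet below is a simplified illustration of that idea using the standard library's inspect module, not the transformers implementation (which handles types, docstring parsing, and nesting far more thoroughly):

```python
import inspect

# Map a few Python annotations to JSON-schema type names.
TYPE_MAP = {int: "integer", float: "number", str: "string", bool: "boolean"}

def describe_tool(fn) -> dict:
    """Build a JSON-schema-style tool description from a function's
    signature and docstring. Simplified sketch for illustration only."""
    sig = inspect.signature(fn)
    params = {
        name: {"type": TYPE_MAP.get(p.annotation, "string")}
        for name, p in sig.parameters.items()
    }
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {"type": "object", "properties": params},
    }

def get_weather(city: str, days: int):
    """Get the weather forecast for a city."""

schema = describe_tool(get_weather)
```

The resulting dict is the kind of structure that gets rendered into the model's chat template so the model knows which tools it may call and with which arguments.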
We have extended support for GGUF files to offer fine-tuning within the Python/HF ecosystem, before converting models back to the GGUF/GGML/llama.cpp libraries.
A new optimizer is added in the Trainer.
Several improvements are done related to quantization: a new cache (the quantized KV cache) is added, offering the ability to convert the cache of generative models, further reducing the memory requirements.
Additionally, the documentation related to quantization is entirely redone with the aim of helping users choose which is the best quantization method.
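The memory savings of a quantized KV cache come from storing cache tensors in a low-bit integer format and dequantizing them on the fly. The snippet below sketches the basic affine int8 quantize/dequantize round trip on plain Python floats; the actual cache works on tensors with per-channel or per-group parameters, so treat this purely as an illustration of the arithmetic:

```python
def quantize_int8(values):
    """Affine quantization of a list of floats to integers in [0, 255],
    returning the quantized values plus the (scale, zero-point) needed
    to reconstruct them. Sketch of the idea behind a quantized KV cache."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0        # avoid div-by-zero on constant input
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_int8(q, scale, lo):
    """Reconstruct approximate floats from the quantized representation."""
    return [x * scale + lo for x in q]

q, scale, lo = quantize_int8([-1.0, 0.0, 0.5, 1.0])
restored = dequantize_int8(q, scale, lo)
```

Each stored value costs one byte instead of four (fp32), at the price of a reconstruction error bounded by the scale.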
New instance segmentation examples are added by @qubvel
As a notable improvement to the HF vision models that leverage backbones, we enable leveraging HF pretrained model weights as backbones, with the following API:
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation
config = MaskFormerConfig(backbone="microsoft/resnet-50", use_pretrained_backbone=True)
model = MaskFormerForInstanceSegmentation(config)
Additionally, we thank @Cyrilvallez for diving into our generate method and greatly reducing the memory requirements.
generate() 🔥🔥🔥 by @Cyrilvallez in #30536

Both the ConversationalPipeline and the Conversation object have been deprecated for a while, and are due for removal in 4.42, which is the upcoming version.
The TextGenerationPipeline is recommended for this use-case, and now accepts inputs in the form of the OpenAI API.
Removes a duplicate softmax application in FLAVA attention. This is likely to change the outputs slightly, so we're flagging it with 🚨.
- ignore_index attribute of the loss is updated to -100

timm being updated

Recent updates to timm changed the type of the attribute model.feature_info.out_indices. Previously, out_indices would reflect the input type of out_indices on the create_model call, i.e. either tuple or list. Now, this value is always a tuple.
As lists are more useful and consistent for us -- we cannot save tuples in configs; they must be converted to lists first -- we instead choose to cast out_indices to always be a list.
This is potentially a slight breaking change if users are creating models and relying on out_indices being a tuple. As this conversion only happens when a new model is created, and not when it's saved and reloaded (because of the config), I think it has a low chance of having much of an impact.
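The cast itself is trivial, and accepting either container type makes the behavior independent of which timm version produced the value. A minimal sketch of the normalization described above:

```python
def normalize_out_indices(out_indices):
    """Cast timm's out_indices to a list so it round-trips through a
    JSON config, regardless of whether timm returned a tuple or a list."""
    return list(out_indices)

# Both timm's old (list) and new (tuple) return types normalize the same way.
```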
- mamba slow forward by @vasqu in #30691
- tokenizer_class = "AutoTokenizer" Llava Family by @ArthurZucker in #30912
- optimum-benchmark by @ydshieh in #30615
- torch.use_deterministic_algorithms for XPU by @faaany in #30774
- MptIntegrationTests expected outputs by @ydshieh in #30989
- uv==0.1.45 by @ydshieh in #31006
- test_model_parallelism device-agnostic by @faaany in #30844
- test_model_parallelism for 2 model test classes by @ydshieh in #31067
- @main by @ydshieh in #31065
- ninja from docker image build by @ydshieh in #31080
- accelerate as a hard requirement by @younesbelkada in #31090
- OPTForQuestionAnswering by @younesbelkada in #31092
- test_multi_gpu_data_parallel_forward for vit and deit by @ydshieh in #31086
- HF_HUB_OFFLINE + fix has_file in offline mode by @Wauplin in #31016
- transformers-cli env reporting by @statelesshz in #31003
- load_in_8bit with bnb config by @younesbelkada in #31136
- IS_GITHUB_CI by @younesbelkada in #31147
- [GemmaModel] fix small typo by @ArthurZucker in #31202
- test_compile_static_cache by @ydshieh in #30991
- mistral.py::Mask4DTestHard by @ydshieh in #31212
- MistralIntegrationTest by @ydshieh in #31231
- BlipModel by @younesbelkada in #31235
- name 'torch' is not defined in bitsandbytes integration by @jamesbraza in #31243
- benchmark job in push-important-models.yml by @ydshieh in #31259
- [SwitchTransformer] Significant performance improvement on MoE blocks by @ranggihwang in #31173
- cached_download to hf_hub_download in remaining occurrences by @Wauplin in #31284
- str should be used not int when setting env variables by @statelesshz in #31272
- decoder_attention_mask shape by @ylacombe in #28071
- inputs_embeds padding logger.warning to logger.warning_once by @naimenz in #31411
- tokenizer being popped twice by @gante in #31427
- TestDeepSpeedModelZoo device-agnostic by @faaany in #31402
- dataloader_persistent_workers=True by @bastienlc in #30627
- Qwen2ForTokenClassification by @kevinhu in #31440
- generate call from local path by @gante in #31470
- PreTrainedTokenizerFast loading time when there are many added tokens by @ydshieh in #31404
- metric_for_best_model errors by @tomaarsen in #31450
- [GPT2] Add SDPA support by @vasqu in #31172
- test_config_object to test_ds_config_object by @faaany in #31403
- torch.compile support for AQLM by @younesbelkada in #31473
- wandb integration with SetFit model by @timothepearce in #30021
- tokenization_utils_base.py's docstring by @sadra-barikbin in #31510
- spectrogram_batch by @ravenouse in #27159
- TrainingArguments by @qgallouedec in #31503
- _no_split_module by @zucchini-nlp in #31566
- i18n by @SauravMaheshkar in #31584
- self.projection call in VivitTubeletEmbeddings by @v-iashin in #31632
- [GPT-NeoX] Add SDPA support by @vasqu in #31031
- past_key_values passed as kwargs by @gante in #31644

The following contributors have made significant changes to the library over the last release:
- mamba slow forward (#30691)
- [GPT2] Add SDPA support (#31172)
- [GPT-NeoX] Add SDPA support (#31031)
- generate() 🔥🔥🔥 (#30536)
- spectrogram_batch (#27159)

Mostly fixing some stuff related to trust_remote_code=True and from_pretrained
The local_files_only flag was having a hard time when a .safetensors file did not exist. This is not expected; instead of trying to convert, we should just fall back to loading the .bin files.