Hugging Face / Transformers

Releases: 13 · Avg: 4/mo · Versions: v4.57.4 → v5.5.3
Nov 5, 2024
Patch release v4.46.2

This release mostly finishes the gradient accumulation fixes! Thanks to @techkang and @Ryukijano 🤗

  • VLMs: fix number of image tokens (#34332) by @zucchini-nlp
  • fix pixtral processor (#34486) by @molbap
  • enable average tokens across devices (#34373) by @techkang and @muellerzr
  • Update trainer for easier handling of accumulate, compile fixes, and … by @muellerzr and @Ryukijano
  • MPS: isin_mps_friendly can support 0D tensors (#34538) by @gante
Oct 29, 2024
Patch release v4.46.1

This is mostly for fx and onnx issues!

  • Fix regression loading dtype #34409 by @SunMarc
  • LLaVa: latency issues #34460 by @zucchini-nlp
  • Fix pix2struct #34374 by @IlyasMoutawwakil
  • Fix onnx non-exposable inplace aten op #34376 by @IlyasMoutawwakil
  • Fix torch.fx issue related to the new loss_kwargs keyword argument #34380 by @michaelbenayoun

Oct 24, 2024
Release v4.46.0

New model additions

Moshi

The Moshi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.

Moshi is a speech-text foundation model that casts spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice.

  • Moshi integration by @ylacombe in #33624

Zamba

Zamba-7B-v1 is a hybrid between state-space models (specifically Mamba) and transformers, trained using next-token prediction. Zamba uses a shared transformer layer after every 6 Mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba-7B-v1 was pre-trained on 1T tokens of text and code data.

<img width="400" alt="zamba" src="https://github.com/user-attachments/assets/a86428b8-4d24-4e5a-bf78-222312693bb2">
  • Add Zamba by @pglorio in #30950
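The alternating pattern described above can be sketched in a few lines. This is purely illustrative (the function name and return format are not from the actual Zamba implementation): it shows the block order that results from inserting a shared transformer layer after every 6 Mamba blocks.

```python
def zamba_layer_schedule(num_mamba_blocks: int, period: int = 6) -> list[str]:
    """Illustrative sketch of Zamba's hybrid layer ordering: a shared
    transformer layer is inserted after every `period` Mamba blocks."""
    schedule = []
    for i in range(1, num_mamba_blocks + 1):
        schedule.append("mamba")
        if i % period == 0:
            schedule.append("shared_transformer")
    return schedule

print(zamba_layer_schedule(12))
```

With 12 Mamba blocks and a period of 6, the shared transformer layer appears twice, after blocks 6 and 12.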

GLM

The GLM Model was proposed in ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools by GLM Team, THUDM & ZhipuAI.

The abstract from the paper begins as follows:

We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B.

  • add Glm by @Cyrilvallez in #33823

Idefics 3

The Idefics3 model was proposed in Building and better understanding vision-language models: insights and future directions by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.

Idefics3 is an adaptation of the Idefics2 model with three main differences:

  • It uses Llama3 for the text model.
  • It uses an updated processing logic for the images.
  • It removes the perceiver.

  • Add Idefics 3! by @andimarafioti in #32473

PhiMoE

The PhiMoE model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.

This model is very similar to Mixtral, with the main difference being Phi3LongRoPEScaledRotaryEmbedding, which is used to extend the context of the rotary embeddings. The query, key and value projections are fused, and the MLP’s up and gate projection layers are also fused.

  • PhiMoE by @garg-amit in #33363
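The fused QKV projection mentioned above can be illustrated with a minimal sketch (the module layout and names here are illustrative, not the actual PhiMoE code): one Linear produces query, key and value in a single matmul, and the result is split afterwards.

```python
import torch
import torch.nn as nn

# Illustrative dimensions only.
hidden, num_heads, head_dim = 64, 4, 16

# One fused projection instead of three separate q/k/v Linears.
qkv_proj = nn.Linear(hidden, 3 * num_heads * head_dim, bias=False)

x = torch.randn(2, 10, hidden)      # (batch, seq, hidden)
qkv = qkv_proj(x)                   # single fused matmul
q, k, v = qkv.chunk(3, dim=-1)      # split back into query, key, value
print(q.shape, k.shape, v.shape)
```

Fusing the three projections trades three small matmuls for one larger one, which is generally friendlier to GPU kernels.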

Watermarking

This release adds SynthID, a novel state-of-the-art watermarking technique by Google DeepMind. SynthID has a low generation-time computational cost and can be configured to be nearly imperceptible (at the cost of harder watermarking detection). The release also comes with the code to train and run the corresponding detector, which is a machine learning model itself.

from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')

# SynthID Text configuration
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57],
    ngram_len=5,
)

# Generation with watermarking
tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
output_sequences = model.generate(
    **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
)
watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
print(watermarked_text)

Docs for applying SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkLogitsProcessor
Docs for detecting SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkDetector

<img width="750" alt="how-synthid-works-high-level" src="https://github.com/user-attachments/assets/c5702b21-e7e6-490d-8fe6-b73783e78e6b">
  • Add SynthID (watermarking by Google DeepMind) by @gante in #34350

Quantization

BitNet

BitNet is an architecture introduced by Microsoft Research that uses extreme quantization, representing each parameter with only three values: -1, 0, and 1. This results in a model that uses just 1.58 bits per parameter, significantly reducing computational and memory requirements. It replaces traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized layers called BitLinears that use ternary precision (or even binary, in the initial version).

  • FEAT : Adding BitNet quantization method to HFQuantizer by @MekkCyber in #33410
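To make the ternary idea concrete, here is a minimal sketch of "absmean" ternary quantization in the spirit of the BitNet b1.58 paper: scale a weight matrix by its mean absolute value, then round and clip to {-1, 0, 1}. This illustrates the quantization idea only; it is not the HFQuantizer implementation added in the PR.

```python
import torch

def ternary_quantize(w: torch.Tensor, eps: float = 1e-6):
    """Quantize weights to {-1, 0, 1} with a single per-tensor scale
    (absmean scheme, as an illustration of ternary precision)."""
    scale = w.abs().mean().clamp(min=eps)
    w_q = (w / scale).round().clamp(-1, 1)
    return w_q, scale

w = torch.randn(4, 4)
w_q, scale = ternary_quantize(w)
print(sorted(w_q.unique().tolist()))  # a subset of [-1.0, 0.0, 1.0]
```

At inference time the dequantized weight is approximated as `w_q * scale`, so matmuls against ternary weights reduce to additions and subtractions plus one rescale.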

GGUF loading in transformers

More architectures are now supported by our GGUF loader; GGUF files saved with these architectures can now be loaded directly in transformers to be fine-tuned. We recommend using tooling from llama.cpp to requantize the models after further training has been done.

  • Add gguf support for bloom by @VladOS95-cyber in #33473
  • Add falcon gguf by @g-prz in #33437
  • Add gguf support for StableLM by @VladOS95-cyber in #33793
  • Add gguf support for gpt2 by @VladOS95-cyber in #34044
  • Add GGUF for starcoder2 by @VladOS95-cyber in #34094

Notable improvements and additions

Pipeline API synchronisation

We are pushing for a unified inference API across multiple libraries. As part of this, we are cleaning up the input and output signatures for our pipeline classes and deprecating some rarely-used arguments. This is still a work-in-progress, but when it's finished, transformers pipelines should exactly match workflows in deployment libraries like transformers.js or TGI, allowing you to seamlessly move from development to production.

  • Sync video classification pipeline with huggingface_hub spec by @Rocketknight1 in #34288
  • Image pipelines spec compliance by @Rocketknight1 in #33899
  • Make ASR pipeline compliant with Hub spec + add tests by @Rocketknight1 in #33769
  • Cleanup return_text and return_full_text options in TextGenerationPipeline by @Rocketknight1 in #33542
  • Make audio classification pipeline spec-compliant and add test by @Rocketknight1 in #33730
  • Sync QuestionAnsweringPipeline by @Rocketknight1 in #34039

Also, pipelines now fully support the Processor class, used by vision-language models. Expect full pipeline support for chatting with VLMs in the very near future!

  • Make pipeline able to load processor by @qubvel in #32514

Executorch compatibility

ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.

We are collaborating with the executorch team so that 🤗 Transformers models can be exported using torch.export. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in ExecuTorch, particularly for mobile and edge use cases.

<img width="750" alt="how-executorch-works-high-level" src="https://github.com/user-attachments/assets/e353f9c9-b3e8-4172-86e0-c9b0b1bdd17a">
  • Generate using exported model and enable gemma2-2b in ExecuTorch by @guangy10 in #33707
  • Qwen2.5 is ExecuTorch Compatible by @guangy10 in #34102
  • Olmo is ExecuTorch Compatible by @guangy10 in #34181
  • Llama3 and Llama2 are ExecuTorch compatible by @guangy10 in #34101

Gradient accumulation bugfix

  • Fix Gradient Accumulation issue by @ArthurZucker in #34191
  • Enable users to use their own loss functions + deal with prefetching for grad accum by @muellerzr in #34198
  • Enable Gradient Accumulation fix across all models + trainer fully in forward() by @muellerzr in #34283
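The core of the bug these PRs address can be shown with plain arithmetic (the numbers below are made up for illustration): averaging per-microbatch mean losses is not the same as taking one mean over all tokens when microbatches contain different numbers of non-padded tokens.

```python
# Two microbatches with uneven token counts (illustrative values).
token_losses = [[2.0, 2.0, 2.0, 2.0], [8.0]]

# Naive accumulation: mean per microbatch, then average the means.
naive = sum(sum(b) / len(b) for b in token_losses) / len(token_losses)

# Correct: a single mean over all tokens in the full batch.
all_tokens = [t for b in token_losses for t in b]
correct = sum(all_tokens) / len(all_tokens)

print(naive, correct)  # 5.0 vs 3.2
```

The naive scheme over-weights tokens in small microbatches, so gradient accumulation did not exactly match training with the equivalent large batch; the fix normalizes the loss by the total token count across the accumulated microbatches.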

Bugfixes and improvements

  • adding positional encoder changes and tests by @manuelsh in #32600
  • Uniformize kwargs for chameleon processor by @leloykun in #32181
  • [MllamaProcessor] Update errors and API with multiple image by @ArthurZucker in #33715
  • fix: use correct var names for check_tokenizers script by @niqodea in #33702
  • Fix docs and docstrings Omdet-Turbo by @yonigozlan in #33726
  • Fix position embeddings singular/plural by @molbap in #33678
  • Generate: can_generate() recursive check by @gante in #33718
  • clean_up_tokenization_spaces=False if unset by @itazap in #31938
  • fix: add docstring for image_size in Convnextv2 config by @lucianosrp in #33734
  • Fix modular model converter unable to generate Processor classes by @tonywu71 in #33737
  • fix trainer tr_loss add error by @Wang-Xiaodong1899 in #33651
  • Update Albumentations Versions by @vasqu in #33704
  • Doc and config mismatch for DeBERTa by @fkrasnov2 in #33713
  • [clean_up_tokenization_spaces] Pl bart was failing, updating by @ArthurZucker in #33735
  • [MllamaImageProcessing] Update doc by @ArthurZucker in #33747
  • Make siglip examples clearer and error free by @jbn in #33667
  • Paligemma support for multi-image by @zucchini-nlp in #33447
  • remove warning v2 by @itazap in #33761
  • Model addition timeline by @LysandreJik in #33762
  • Fix typing in load_balancing_loss_func function of modeling_mixtral.py. by @PhilipMay in #33641
  • Enable non-safetensor ser/deser for TorchAoConfig quantized model 🔴 by @jerryzh168 in #33456
  • Fix typo in documentation by @qgallouedec in #33805
  • Hqq serialization by @mobicham in #33141
  • Add Slow CI reminder bot by @ydshieh in #33506
  • [modular] fixes! by @ArthurZucker in #33820
  • Fix ViT-MAE decoder interpolate by @xenova in #33330
  • Fixes for issue #33763 in idefics2 model by @aroun-coumar in #33766
  • Fix link in gguf.md by @pogpog in #33768
  • minor typo fix by @a-r-r-o-w in #33784
  • Fix Mamba slow path bug with dtype mismatch. by @Adibvafa in #32691
  • Fix passing str dtype to static cache by @guangy10 in #33741
  • fix check for hidden size in text model for deepspeed zero3 auto entries by @winglian in #33829
  • post reminder comment only once by @ydshieh in #33848
  • Generate: move llama prepare_inputs_for_generation to GenerationMixin by @gante in #33677
  • Refactor image features selection in LlaVa by @kenza-bouzid in #33696
  • fix: skip dropout in eval for flash_attn in various models by @fdschmidt93 in #33844
  • add attention weight up-cast to float32 in chameleon by @francescortu in #33822
  • Workaround for bark issue in pipelines by @Rocketknight1 in #33824
  • Fix device mismatch errors by @zucchini-nlp in #33851
  • This PR contains additional changes for #33143 by @aroun-coumar in #33581
  • Raise accelerate dependency error in case of defaulting low_cpu_mem_usage=True by @kylesayrs in #33830
  • Validate the eval dataset in advance. by @jackyjinjing in #33743
  • Add include_loss_for_metrics by @Manalelaidouni in #33088
  • Avoid using context that is not accessable from external contributors by @ydshieh in #33866
  • fix: repair depth estimation multiprocessing by @niqodea in #33759
  • Move weight initilization deformabledetr by @g-prz in #33339
  • [Fix] ViViT interpolate_pos_encoding by @RUFFY-369 in #33815
  • Repo consistency fix after #33339 by @amyeroberts in #33873
  • Add support for custom inputs and batched inputs in ProcessorTesterMixin by @yonigozlan in #33711
  • Fix: typo by @TrickEye in #33880
  • Uniformize model processors by @molbap in #31368
  • Don't run reminder bot for now by @ydshieh in #33883
  • populate quantization_config for kv-cache-scheme only configs by @horheynm in #33874
  • Allow for nightly packages of compressed_tensors by @kylesayrs in #33828
  • Fix kwargs passed by AutoQuantizationConfig.from_pretrained by @kylesayrs in #33798
  • Add sdpa for DistilBert by @OmarManzoor in #33724
  • Trainer - deprecate tokenizer for processing_class by @amyeroberts in #32385
  • [Quantization] Switch to optimum-quanto by @SunMarc in #31732
  • Optim deformable detr by @yonigozlan in #33600
  • Handle Trainer tokenizer kwarg deprecation with decorator by @qubvel in #33887
  • rename all test_processing_.py to test_processor_.py by @yonigozlan in #33878
  • uniformize processor Mllama by @yonigozlan in #33876
  • Fix dt proj bias reassigned by @HofitBata in #33314
  • Update an keyerror on _save_check_point prevent confusion of missing … by @fadingNA in #33832
  • VLM Generate: tag test_static_cache_matches_dynamic as flaky by @gante in #33630
  • Migrate the CI runners to the new clusters by @glegendre01 in #33849
  • Fix module initialization for root module under Zero3 by @Ben-Schneider-code in #33632
  • Add SplinterTokenizer unit test by @ariepratama in #32652
  • Generate tests: modality-agnostic input preparation by @gante in #33685
  • Fix: use unidic-lite instead of ipadic as the tokenizer dictionary for Japanese by @KanTakahiro in #33372
  • [Tests] Diverse Whisper fixes by @ylacombe in #33665
  • [PEFT] Support low_cpu_mem_usage option for PEFT loading adapters by @BenjaminBossan in #33725
  • add setter for trainer processor by @ArthurZucker in #33911
  • Add support for weights_only flag when loading state_dict by @jerryzh168 in #32481
  • Config: lower save_pretrained exception to warning by @gante in #33906
  • Uniformize kwargs for Idefics/2 processors by @yonigozlan in #32568
  • Remove logits.float() by @ringohoffman in #33902
  • Minor error condition bug fix by @htahboub in #33781
  • Fix distil whisper segment computation by @ylacombe in #33920
  • [Doc]: Broken link in Kubernetes doc by @saldanhad in #33879
  • [i18n-ru] Fixes typo in the README_ru.md by @Artanias in #33882
  • Ignore keys on validate_rope by @zucchini-nlp in #33753
  • [PR run-slow] by @ArthurZucker in #33939
  • Add a section on writing tool templates to the chat template docs by @Rocketknight1 in #33924
  • Enables CPU AWQ model with IPEX version. by @jiqing-feng in #33460
  • 🔴 🚨 Resizing tokens embeddings: initialize from old embeddings' normal distribution. by @abuelnasr0 in #33325
  • Removed unnecessary transpose in Switch Transformer Routing by @karan-uppal3 in #33582
  • Fix attn mask ignore logic in training-time trace by @zhenglongjiepheonix in #32613
  • hot fix self.position_embeddings->self.position_embedding by @ArthurZucker in #33958
  • fix red check-copies by @ArthurZucker in #33964
  • Cache: revert DynamicCache init for BC by @gante in #33861
  • Paligemma: fix static cache test by @zucchini-nlp in #33941
  • Updating char_to_token documentation to note behaviour when trim_offsets is True by @Craigacp in #33919
  • add test for Jamba with new model jamba-tiny-dev by @yecohn in #33863
  • Bug fix gguf qwen2moe by @VladOS95-cyber in #33940
  • [TF] Fix Tensorflow XLA Generation on limited seq_len models by @vasqu in #33903
  • [WIP] Add Tokenizer for MyT5 Model by @tomlimi in #31286
  • Add position ids in forward pass to opt model by @avishaiElmakies in #33121
  • Flash-attn performance: remove cuda sync during inference by @Cyrilvallez in #33570
  • [Docs] Improve VLM docs by @NielsRogge in #33393
  • [Docs] Add Developer Guide: How to Hack Any Transformers Model by @MagnusS0 in #33979
  • [Red CIs] Fix hub failures by @ArthurZucker in #34001
  • Fix Tensor + Embedding error in some cases when using SiglipVisionModel by @kaitolucifer in #33994
  • properly fix and RUN_SLOW by @ArthurZucker in #33965
  • Enable customized optimizer for DeepSpeed by @dataKim1201 in #32049
  • [pytes collection] Fix flax test collection by @ArthurZucker in #34004
  • Fix undefined default_config in configuration_utils.py by @mgoin in #33934
  • 🌐 [i18n-KO] Translated gguf.md to Korean by @yijun-lee in #33764
  • 🌐 [i18n-KO] Translated swinv2.md to Korean by @mreraser in #33566
  • 🌐 [i18n-KO] Translated audio_utils.md to Korean by @yijun-lee in #33802
  • 🌐 [i18n-KO] Translated esm.md to Korean by @yijun-lee in #33796
  • 🌐 [i18n-KO] Translated time_series_utils.md to Korean by @yijun-lee in #33806
  • 🌐 [i18n-KO] Translated pipelines_utils.md to Korean by @yijun-lee in #33809
  • 🌐 [i18n-KO] Translated trainer.md to Korean by @yijun-lee in #33797
  • 🌐 [i18n-KO] Translated chameleon.md to Korean by @yijun-lee in #33799
  • 🌐 [i18n-KO] Translated logging.md to Korean by @chhaewxn in #33543
  • 🌐 [i18n-KO] Translated auto.md to Korean by @boyunJang in #33590
  • 🌐 [i18n-KO] Translated swin2sr.md to Korean by @mreraser in #33795
  • 🌐 [i18n-KO] Translated vit.md to Korean by @mreraser in #33884
  • 🌐 [i18n-KO] Translated gemma.md to Korean by @yijun-lee in #33936
  • Cache: slight change in naming by @zucchini-nlp in #32421
  • Add support for all and potentilly deleting functions by @ArthurZucker in #33859
  • Processors: don't default padding side by @zucchini-nlp in #33942
  • Add auto model for image-text-to-text by @yonigozlan in #32472
  • BatchFeature.to() supports non-tensor keys by @Rocketknight1 in #33918
  • Improve modular converter by @Cyrilvallez in #33991
  • Fixup DeepSpeed things by @muellerzr in #34007
  • Fix typing issue by @SunMarc in #34012
  • fix awq tests due to ipex backend by @SunMarc in #34011
  • Remove decoder_config=None by @SunMarc in #34014
  • Fix trainer_seq2seq.py's __init__ type annotations by @benglewis in #34021
  • 🌐 [i18n-KO] Translated feature_extractor.md to Korean by @yijun-lee in #33775
  • 🌐 [i18n-KO] Translated bertweet.md to Korean by @ahnjj in #33891
  • 🌐 [i18n-KO] Translated gpt_neox_japanese.md to Korean by @ahnjj in #33894
  • 🌐 [i18n-KO] Translated rag.md to Korean by @chhaewxn in #33989
  • 🌐 [i18n-KO] Translated main_classes/quantization.md to Korean by @fabxoe in #33959
  • 🌐 [i18n-KO] Translated main_classes/configuration.md to Korean by @fabxoe in #33952
  • 🌐 [i18n-KO] Translated model_doc/mamba.md to Korean by @fabxoe in #33626
  • 🌐 [i18n-KO] Translated model_doc/autoformer.md to Korean by @fabxoe in #33574
  • 🌐 [i18n-KO] Translated model_doc/patchtsmixer.md to Korean by @fabxoe in #33587
  • 🌐 [i18n-KO] Translated model_doc/clip.md to Korean by @fabxoe in #33610
  • 🌐 [i18n-KO] Translated model_doc/paligemma.md to Korean by @fabxoe in #33612
  • 🌐 [i18n-KO] Translated model_doc/llama3.md to Korean by @fabxoe in #33635
  • 🌐 [i18n-KO] Translated model_doc/mistral.md to Korean by @fabxoe in #33648
  • 🌐 [i18n-KO] Translated model_doc/cohere.md to Korean by @fabxoe in #33885
  • 🌐 [i18n-KO] Translated model_doc/dbrx.md to Korean by @fabxoe in #33951
  • 🌐 [i18n-KO] Translated model_doc/deberta-v2.md to Korean by @fabxoe in #33968
  • 🌐 [i18n-KO] Translated main_classes/onnx.md to Korean by @fabxoe in #33601
  • 🌐 [i18n-KO] Translated tokenization_utils.md to Korean by @yijun-lee in #33813
  • 🌐 [i18n-KO] Translated swin.md to Korean by @mreraser in #33510
  • 🌐 [i18n-KO] Translated file_utils.md to Korean by @yijun-lee in #33803
  • 🌐 [i18n-KO] Translated openai-gpt.md to Korean by @yijun-lee in #33801
  • 🌐 [i18n-KO] Translated biogpt.md to Korean by @yijun-lee in #33773
  • 🌐 [i18n-KO] Translated blip.md to Korean by @cjfghk5697 in #33515
  • 🌐 [i18n-KO] Translated output.md to Korean by @4N3MONE in #33607
  • 🌐 [i18n-KO] Translated image_processing_utils.md to Korean by @yijun-lee in #33804
  • 🌐 [i18n-KO] Translated modular_transformers.md to Korean by @yijun-lee in #33772
  • [Patch helper] update to not have to checkout main by @ArthurZucker in #34006
  • Fix Failed tests with mobile bert resize tokens embedding by @abuelnasr0 in #33950
  • Generate: remove most decoder-only LLMs prepare_inputs_for_generation by @gante in #33870
  • Mllama: fix tests by @zucchini-nlp in #34000
  • Fix PIL dep for tests by @muellerzr in #34028
  • 🌐 [i18n-KO] Translated model_doc/bart.md to Korean by @fabxoe in #33893
  • 🌐 [i18n-KO] Translated model_doc/deberta.md to Korean by @fabxoe in #33967
  • 🌐 [i18n-KO] Translated main_classes/keras_callbacks.md to Korean by @fabxoe in #33955
  • 🌐 [i18n-KO] Translated model_doc/mamba2.md to Korean by @fabxoe in #33629
  • 🌐 [i18n-KO] Translated main_classes/model.md to Korean by @fabxoe in #33606
  • 🌐 [i18n-KO] Translated model_doc/trajectory_transformer.md to Korean by @fabxoe in #33597
  • 🌐 [i18n-KO] Translated model_doc/time_series_transformer.md to Korean by @fabxoe in #33596
  • 🌐 [i18n-KO] Translated model_doc/informer.md to Korean by @fabxoe in #33585
  • 🌐 [i18n-KO] Translated model_doc/graphormer.md to Korean by @fabxoe in #33569
  • 🌐 [i18n-KO] Translated modeling_utils.md to Korean by @yijun-lee in #33808
  • 🌐 [i18n-KO] Translated main_classes/data_collator.md to Korean by @fabxoe in #33954
  • 🌐 [i18n-KO] Translated model_doc/patchtst.md to Korean by @fabxoe in #33589
  • 🌐 [i18n-KO] Translated text_generation.md to Korean by @yijun-lee in #33777
  • 🌐 [i18n-KO] Translated main_classes/callback.md to Korean by @Jwaminju in #33572
  • 🌐 [i18n-KO] Translated generation_utils.md to Korean by @yijun-lee in #33818
  • Add Translate docs into Arabic - section files CONCEPTUAL GUIDES by @AhmedAlmaghz in #33982
  • add sdpa to OPT by @avishaiElmakies in #33298
  • Phi3: fix attn for sliding window by @zucchini-nlp in #33586
  • HfArgumentParser: allow for hyhenated field names in long-options by @djmarti in #33990
  • Fix pipelines tests by @qubvel in #34049
  • Specifying torch dtype in Qwen2VLForConditionalGeneration by @htahboub in #33953
  • Universal Assisted Generation: Assisted generation with any assistant model (by Intel Labs) by @danielkorat in #33383
  • check if eigenvalues of covariance matrix are complex. by @abuelnasr0 in #34037
  • [Docs] Update compressed_tensors.md by @mgoin in #33961
  • Fix data_seed unused by @MekkCyber in #33731
  • [TESTS] ASR pipeline by @ylacombe in #33925
  • Update Blip2 is_pipeline_test_to_skip method signature by @qubvel in #34067
  • provide trust_remote_code for search feat extractor in model config by @eaidova in #34036
  • Small Fix to modular converter by @MekkCyber in #34051
  • Default synced_gpus to True when using FullyShardedDataParallel by @ringohoffman in #33483
  • Idefics: fix position ids by @zucchini-nlp in #33907
  • Update SSH workflow file by @ydshieh in #34084
  • Tests: upcast logits to float() by @gante in #34042
  • Fix flax failures by @LysandreJik in #33912
  • Fix DAC slow tests by @ylacombe in #34088
  • Fix failing conversion by @LysandreJik in #34010
  • Fix PushToHubMixin when pusing to a PR revision by @Wauplin in #34090
  • avoid many failures for ImageGPT by @ydshieh in #34071
  • Fix NaNs in cost_matrix for mask2former by @ducha-aiki in #34074
  • Fix flaky tests by @zucchini-nlp in #34069
  • Generate: move prepare_inputs_for_generation in encoder-decoder llms by @gante in #34048
  • Avoid many test failures for LlavaNextVideoForConditionalGeneration by @ydshieh in #34070
  • refactor: benchmarks by @McPatate in #33896
  • fix(ci): benchmarks dashboard was failing due to missing quotations by @McPatate in #34100
  • Generate: Fix modern llm generate calls with synced_gpus by @gante in #34095
  • Mistral-related models for QnA by @vasqu in #34045
  • Fix a typo by @PengWeixuan in #34148
  • Fixed error message in mllama by @dmgcsilva in #34106
  • Specify that users should be careful with their own files by @LysandreJik in #34153
  • Add documentation for docker by @ArthurZucker in #33156
  • Update README.md with Enterprise Hub by @gary149 in #34150
  • Idefics: enable generation tests by @zucchini-nlp in #34062
  • Add sdpa for Vivit by @RUFFY-369 in #33757
  • Fix FSDP resume Initialization issue by @Itssshikhar in #34032
  • Fix default behaviour in TextClassificationPipeline for regression problem type by @subhalingamd in #34066
  • Generate: move logits to same device as input_ids by @gante in #34076
  • Add support for inheritance from class with different suffix in modular by @yonigozlan in #34077
  • Fix optuna ddp hp search by @SunMarc in #34073
  • [feat] LlavaNext add feature size check to avoid CUDA Runtime Error by @laurentd-lunit in #33608
  • 🌐 [i18n-KO] Translated vivit.md to Korean by @mreraser in #33935
  • 🌐 [i18n-KO] Translated gemma2.md to Korean by @yijun-lee in #33937
  • 🌐 [i18n-KO] Translated trainer_utils.md to Korean by @yijun-lee in #33817
  • 🌐 [i18n-KO] Translated blip-2.md to Korean by @cjfghk5697 in #33516
  • IDEFICS: support inputs embeds by @zucchini-nlp in #34043
  • [fix] fix token healing tests and usage errors by @alpertunga-bile in #33931
  • Revert accelerate error caused by 46d09af by @steveepreston in #34197
  • Fix wrong name for llava onevision and qwen2_vl in tokenization auto by @yonigozlan in #34177
  • Avoid using torch's Tensor or PIL's Image in chat template utils if not available by @RezaRahemtola in #34165
  • Revert "Fix FSDP resume Initialization issue" by @SunMarc in #34193
  • Update trainer._get_eval_sampler() to support group_by_length arg by @larin92 in #33514
  • Fix warning message for fp32_cpu_offloading in bitsandbytes configs by @amosyou in #34079
  • Ping team members for new failed tests in daily CI by @ydshieh in #34171
  • fix(Wav2Vec2ForCTC): torch export by @chrsmcgrr in #34023
  • Fix for tokenizer.apply_chat_template with continue_final_message=True by @schoennenbeck in #34214
  • removes decord by @vrnvu in #33987
  • Fix bus error when using GPT2 on M1 macs by @chanind in #34031
  • Generate: visit non-llm prepare_inputs_for_generation by @gante in #34199
  • Support Llama 3.2 conversion (text models) by @pcuenca in #33778
  • Fix-red-ci by @ArthurZucker in #34230
  • BLIP: fix input expansion logic by @zucchini-nlp in #34225
  • Fix broken test decorator require_torch_up_to_2_accelerators by @byi8220 in #34201
  • Informative 2 by @LysandreJik in #34154
  • Fix UDOP dtype issue by @Rocketknight1 in #34180
  • Only cast logits to float when computing loss by @ringohoffman in #34147
  • Generation tests: don't rely on main input name by @zucchini-nlp in #34228
  • Change Paligemma import logging to work with modular by @yonigozlan in #34211
  • Add DetrImageProcessorFast by @yonigozlan in #34063
  • Add a doc section on writing generation prompts by @Rocketknight1 in #34248
  • Fix method name which changes in tutorial by @andimarafioti in #34252
  • Attn implementation for composite models by @zucchini-nlp in #32238
  • VLM: add more modularity by @zucchini-nlp in #34175
  • T5 compile compatibilty by @zucchini-nlp in #34089
  • [docs] Fix GenerationConfig params by @stevhliu in #34299
  • Fix Korean doc _toctree.yml by @regisss in #34293
  • Update PR templates by @SunMarc in #34065
  • [RT-DETR] Fix onnx inference bug for Optype (Where) by @YHallouard in #33877
  • Fix FA2 attention for models supporting sliding window by @Cyrilvallez in #34093
  • Fix: tensor of examples of the same length triggers invalid stacking by @pbelcak in #34166
  • Add post_process_depth_estimation to image processors and support ZoeDepth's inference intricacies by @alex-bene in #32550
  • Add option for running ffmpeg_microphone_live as a background process by @mikamerath in #32838
  • Feature: Add MLFLOW_MAX_LOG_PARAMS to MLflowCallback by @cecheta in #34279
  • Fix continue_final_message for image-text-to-text chat templates by @yonigozlan in #34236
  • fix error in _get_eval_sampler when group_by_length enabled by @akakakakakaa in #34237
  • [docs] fix typo by @faaany in #34235
  • 🌐 [i18n-KO] Translated executorch.md to Korean by @ahnjj in #33888
  • 🌐 [i18n-KO] Translated bert japanese.md to Korean by @ahnjj in #33890
  • 🌐 [i18n-KO] Translated model_doc/bartpho.md to Korean by @Jwaminju in #33981
  • Example doc for token classification of Llama and Dependent/Copied Models by @h3110Fr13nd in #34139
  • [docs] Fix Korean toctree by @stevhliu in #34324
  • Added Deberta model type support by @FilipposVentirozos in #34308

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @manuelsh
    • adding positional encoder changes and tests (#32600)
  • @ArthurZucker
    • [MllamaProcessor] Update errors and API with multiple image (#33715)
    • [clean_up_tokenization_spaces] Pl bart was failing, updating (#33735)
    • [MllamaImageProcessing] Update doc (#33747)
    • [modular] fixes! (#33820)
    • add setter for trainer processor (#33911)
    • [PR run-slow] (#33939)
    • hot fix self.position_embeddings->self.position_embedding (#33958)
    • fix red check-copies (#33964)
    • [Red CIs] Fix hub failures (#34001)
    • properly fix and RUN_SLOW (#33965)
    • [pytes collection] Fix flax test collection (#34004)
    • Add support for all and potentilly deleting functions (#33859)
    • [Patch helper] update to not have to checkout main (#34006)
    • Add documentation for docker (#33156)
    • Fix Gradient Accumulation issue (#34191)
    • Fix-red-ci (#34230)
  • @molbap
    • Fix position embeddings singular/plural (#33678)
    • Uniformize model processors (#31368)
  • @vasqu
    • Update Albumentations Versions (#33704)
    • [TF] Fix Tensorflow XLA Generation on limited seq_len models (#33903)
    • Mistral-related models for QnA (#34045)
  • @VladOS95-cyber
    • Add gguf support for bloom (#33473)
    • Bug fix gguf qwen2moe (#33940)
    • Add gguf support for StableLM (#33793)
    • Add gguf support for gpt2 (#34044)
    • Add GGUF for starcoder2 (#34094)
  • @ydshieh
    • Add Slow CI reminder bot (#33506)
    • post reminder comment only once (#33848)
    • Avoid using context that is not accessable from external contributors (#33866)
    • Don't run reminder bot for now (#33883)
    • Update SSH workflow file (#34084)
    • avoid many failures for ImageGPT (#34071)
    • Avoid many test failures for LlavaNextVideoForConditionalGeneration (#34070)
    • Ping team members for new failed tests in daily CI (#34171)
  • @amyeroberts
    • Repo consistency fix after #33339 (#33873)
    • Trainer - deprecate tokenizer for processing_class (#32385)
  • @ylacombe
    • [Tests] Diverse Whisper fixes (#33665)
    • Fix distil whisper segment computation (#33920)
    • [TESTS] ASR pipeline (#33925)
    • Fix DAC slow tests (#34088)
    • Moshi integration (#33624)
  • @ringohoffman
    • Remove logits.float() (#33902)
    • Default synced_gpus to True when using FullyShardedDataParallel (#33483)
    • Only cast logits to float when computing loss (#34147)
  • @garg-amit
    • PhiMoE (#33363)
  • @pglorio
    • Add Zamba (#30950)
  • @tomlimi
    • [WIP] Add Tokenizer for MyT5 Model (#31286)
  • @yijun-lee
    • 🌐 [i18n-KO] Translated gguf.md to Korean (#33764)
    • 🌐 [i18n-KO] Translated audio_utils.md to Korean (#33802)
    • 🌐 [i18n-KO] Translated esm.md to Korean (#33796)
    • 🌐 [i18n-KO] Translated time_series_utils.md to Korean (#33806)
    • 🌐 [i18n-KO] Translated pipelines_utils.md to Korean (#33809)
    • 🌐 [i18n-KO] Translated trainer.md to Korean (#33797)
    • 🌐 [i18n-KO] Translated chameleon.md to Korean (#33799)
    • 🌐 [i18n-KO] Translated gemma.md to Korean (#33936)
    • 🌐 [i18n-KO] Translated feature_extractor.md to Korean (#33775)
    • 🌐 [i18n-KO] Translated tokenization_utils.md to Korean (#33813)
    • 🌐 [i18n-KO] Translated file_utils.md to Korean (#33803)
    • 🌐 [i18n-KO] Translated openai-gpt.md to Korean (#33801)
    • 🌐 [i18n-KO] Translated biogpt.md to Korean (#33773)
    • 🌐 [i18n-KO] Translated image_processing_utils.md to Korean (#33804)
    • 🌐 [i18n-KO] Translated modular_transformers.md to Korean (#33772)
    • 🌐 [i18n-KO] Translated modeling_utils.md to Korean (#33808)
    • 🌐 [i18n-KO] Translated text_generation.md to Korean (#33777)
    • 🌐 [i18n-KO] Translated generation_utils.md to Korean (#33818)
    • 🌐 [i18n-KO] Translated gemma2.md to Korean (#33937)
    • 🌐 [i18n-KO] Translated trainer_utils.md to Korean (#33817)
  • @fabxoe
    • 🌐 [i18n-KO] Translated main_classes/quantization.md to Korean (#33959)
    • 🌐 [i18n-KO] Translated main_classes/configuration.md to Korean (#33952)
    • 🌐 [i18n-KO] Translated model_doc/mamba.md to Korean (#33626)
    • 🌐 [i18n-KO] Translated model_doc/autoformer.md to Korean (#33574)
    • 🌐 [i18n-KO] Translated model_doc/patchtsmixer.md to Korean (#33587)
    • 🌐 [i18n-KO] Translated model_doc/clip.md to Korean (#33610)
    • 🌐 [i18n-KO] Translated model_doc/paligemma.md to Korean (#33612)
    • 🌐 [i18n-KO] Translated model_doc/llama3.md to Korean (#33635)
    • 🌐 [i18n-KO] Translated model_doc/mistral.md to Korean (#33648)
    • 🌐 [i18n-KO] Translated model_doc/cohere.md to Korean (#33885)
    • 🌐 [i18n-KO] Translated model_doc/dbrx.md to Korean (#33951)
    • 🌐 [i18n-KO] Translated model_doc/deberta-v2.md to Korean (#33968)
    • 🌐 [i18n-KO] Translated main_classes/onnx.md to Korean (#33601)
    • 🌐 [i18n-KO] Translated model_doc/bart.md to Korean (#33893)
    • 🌐 [i18n-KO] Translated model_doc/deberta.md to Korean (#33967)
    • 🌐 [i18n-KO] Translated main_classes/keras_callbacks.md to Korean (#33955)
    • 🌐 [i18n-KO] Translated model_doc/mamba2.md to Korean (#33629)
    • 🌐 [i18n-KO] Translated main_classes/model.md to Korean (#33606)
    • 🌐 [i18n-KO] Translated model_doc/trajectory_transformer.md to Korean (#33597)
    • 🌐 [i18n-KO] Translated model_doc/time_series_transformer.md to Korean (#33596)
    • 🌐 [i18n-KO] Translated model_doc/informer.md to Korean (#33585)
    • 🌐 [i18n-KO] Translated model_doc/graphormer.md to Korean (#33569)
    • 🌐 [i18n-KO] Translated main_classes/data_collator.md to Korean (#33954)
    • 🌐 [i18n-KO] Translated model_doc/patchtst.md to Korean (#33589)
  • @MekkCyber
    • FEAT : Adding BitNet quantization method to HFQuantizer (#33410)
    • Fix data_seed unused (#33731)
    • Small Fix to modular converter (#34051)
  • @AhmedAlmaghz
    • Add Translate docs into Arabic - section files CONCEPTUAL GUIDES (#33982)
  • @alex-bene
    • Add post_process_depth_estimation to image processors and support ZoeDepth's inference intricacies (#32550)
Oct 7, 2024
Release v4.45.2

Patch release v4.45.2

Mostly fixes some warnings that were not properly removed ⚠️:

  • Ignore keys on validate_rope #33753 by @zucchini-nlp
  • remove warning v2 #33761 by @itazap
  • Config: lower save_pretrained exception to warning #33906 by @gante

🔴 Had a small regression with the dynamic Cache 🔴:

  • Cache: revert DynamicCache init for BC #33861 by @gante

A small fix for Idefics2 🐩:

  • Fixes for issue #33763 in idefics2 model #33766 by @aroun-coumar

And a fix for SigLIP 🤧!

  • hot fix self.position_embeddings->self.position_embedding #33958 and properly fix and RUN_SLOW #33965 thanks to @mranzinger
Sep 26, 2024
Patch Release v4.45.1

Patches for v4.45.1

  • [MllamaProcessor] Update errors and API with multiple image (#33715) by @ArthurZucker
  • Generate: can_generate() recursive check (#33718) by @gante
  • clean_up_tokenization_spaces=False if unset (#31938) by @itazap
Sep 25, 2024
Llama 3.2, mllama, Qwen2-Audio, Qwen2-VL, OLMoE, Llava Onevision, Pixtral, FalconMamba, Modular Transformers

New model additions

mllama

The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.

  • Add MLLama #33703, by @qubvel, @zucchini-nlp, @ArthurZucker

Qwen2-VL

Qwen2-VL is a major update to the previous Qwen-VL by the Qwen team.

An extract from the Qwen2-VL blog post, available here, follows:

Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model families. Compared with Qwen-VL, Qwen2-VL has the capabilities of:

  • SoTA understanding of images of various resolution & ratio: Qwen2-VL achieves state-of-the-art performance on visual understanding benchmarks, including MathVista, DocVQA, RealWorldQA, MTVQA, etc.
  • Understanding videos of 20min+: Qwen2-VL can understand videos over 20 minutes for high-quality video-based question answering, dialog, content creation, etc.
  • Agent that can operate your mobiles, robots, etc.: with the abilities of complex reasoning and decision making, Qwen2-VL can be integrated with devices like mobile phones, robots, etc., for automatic operation based on visual environment and text instructions.
  • Multilingual Support: to serve global users, besides English and Chinese, Qwen2-VL now supports the understanding of texts in different languages inside images, including most European languages, Japanese, Korean, Arabic, Vietnamese, etc.

  • support qwen2-vl by @simonJJJ in #32318

Qwen2-Audio

Qwen2-Audio is a new series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.

They introduce two distinct audio interaction modes:

  • voice chat: users can freely engage in voice interactions with Qwen2-Audio without text input
  • audio analysis: users can provide audio and text instructions for analysis during the interaction

  • Add Qwen2-Audio by @faychu in #32137

OLMoE

OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.

  • Add OLMoE by @Muennighoff in #32406

Llava Onevision

LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone. Images are processed with the anyres-9 technique, in which the image is split into 9 patches to better process high-resolution images and capture as much detail as possible. Videos, however, are pooled to a sequence length of 196 tokens per frame for more memory-efficient computation. LLaVA-Onevision is available in three sizes, 0.5B, 7B and 72B, and achieves remarkable performance on benchmark evaluations.

  • Llava Onevision: add model by @zucchini-nlp in #32673
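
The patch-splitting idea can be sketched in plain Python. This is an illustrative toy, not the actual LLaVA-Onevision preprocessing (the real processor also chooses the grid from the image's aspect ratio and appends a downscaled global view):

```python
# Toy sketch of the "anyres-9" idea: tile an image into a 3x3 grid so each
# high-resolution region is encoded separately by the vision tower.
def split_into_grid(width, height, rows=3, cols=3):
    tile_w, tile_h = width // cols, height // rows
    return [
        (col * tile_w, row * tile_h, tile_w, tile_h)  # (x, y, w, h) per tile
        for row in range(rows)
        for col in range(cols)
    ]

tiles = split_into_grid(672, 672)
print(len(tiles))  # 9 tiles, each 224x224
```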

FalconMamba

The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.

The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefineWeb, Cosmopedia and Math data.

The team releases an accompanying blog post.

  • Add new model by @younesbelkada in #32615

Granite Language Models

The Granite model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in the size categories across various benchmarks, including natural language multi-choices, code generation, and math reasoning.

  • Granite language models by @mayank31398 in #31502

Granite MOE

The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.

PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to other dense models with 2x the active parameters across various benchmarks, including natural language multi-choices, code generation, and math reasoning.

  • Granitemoe by @mayank31398 in #33207

Descript-Audio-Codec

The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.

  • Add Descript-Audio-Codec model by @kamilakesbi in #31494
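
The bandwidth figures above imply a large compression ratio; a quick back-of-the-envelope check, assuming 16-bit mono PCM as the uncompressed baseline:

```python
# Rough compression arithmetic for the DAC figures quoted above,
# assuming 16-bit mono PCM as the uncompressed baseline.
sample_rate_hz = 44_100
bits_per_sample = 16

pcm_kbps = sample_rate_hz * bits_per_sample / 1000  # 705.6 kbps uncompressed
codec_kbps = 8                                      # DAC token bandwidth

ratio = pcm_kbps / codec_kbps
print(f"~{ratio:.1f}x compression")  # ~88.2x compression
```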

Pixtral

The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders.

The model uses PixtralVisionModel for its vision encoder, and MistralForCausalLM for its language decoder. The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are not padded together nor are they resized).

  • Add support for Pixtral by @ArthurZucker in #33449
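
Because Pixtral applies rotary embeddings over two axes, each image patch carries a (row, column) position rather than a single index, which is what makes arbitrary image sizes work without padding or resizing. A minimal sketch of that indexing (a hypothetical helper, not the actual implementation):

```python
# Hypothetical sketch: 2D positions for image patches. A 2D RoPE rotates part
# of each head dimension by the row index and the rest by the column index;
# here we only build the (row, col) grid such an embedding would consume.
def patch_positions(num_rows, num_cols):
    return [(r, c) for r in range(num_rows) for c in range(num_cols)]

print(patch_positions(2, 3))
# [(0, 0), (0, 1), (0, 2), (1, 0), (1, 1), (1, 2)]
```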

Mimi

The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.

  • Codec integration by @ylacombe in #33565

OmDet-Turbo

The OmDet-Turbo model was proposed in Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection capabilities while maintaining high accuracy. The base model achieves performance of up to 100.2 FPS and 53.4 AP on COCO zero-shot.

  • Add OmDet-Turbo by @yonigozlan in #31843

Quantization

GGUF

GGUF support continues to be enhanced in the library, offering a way to load GGUF models within transformers by dequantizing them, before re-quantizing them for re-use within the GGUF/GGML ecosystem.

  • Add Qwen2Moe GGUF loading support by @VladOS95-cyber in #33264
  • Fix incorrect vocab size retrieval in GGUF config by @Isotr0py in #32551
  • Add chat_template for tokenizer extracted from GGUF model by @Isotr0py in #32908
  • 🚨 Support dequantization for most GGML types by @Isotr0py in #32625
  • Add support for GGUF Phi-3 by @a8nova in #31844
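
The dequantization step is conceptually simple: GGML-style block quantization stores groups of weights as low-bit integers plus a per-block scale, and dequantizing multiplies each integer back by its block's scale. A toy sketch of that core idea (the real Q4_K/Q8_0 layouts also pack bits and add offsets):

```python
# Toy sketch of block dequantization as used by GGML-style formats: each
# block of quantized integers shares one float scale.
def dequantize_blocks(qweights, scales, block_size):
    return [q * scales[i // block_size] for i, q in enumerate(qweights)]

print(dequantize_blocks([1, 2, 3, 4], [0.5, 0.25], block_size=2))
# [0.5, 1.0, 0.75, 1.0]
```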

Torch AO

An ongoing effort is to add the ability to use torchao as a quantization backend. Future PRs will enable saving and fine-tuning with peft.

  • Add TorchAOHfQuantizer by @jerryzh168 in #32306

Liger Kernel

The Liger kernel is now supported in the Trainer class.

  • Integrate Liger (Linkedin GPU Efficient Runtime) Kernel to Trainer by @JasonZhu1313 in #32860

Modular Transformers

This PR introduces modularity for transformers, something that had previously been prohibited when working with the library (see the blog post for the accompanying design philosophy).

The core idea behind this PR is to facilitate model addition by enabling Pythonic inheritance while staying true to our single-file policy, in which models/processors must be contained within a single file, so that users can work with the object without going through 10 layers of abstraction.

It is strongly recommended to read the PR description in order to understand the depth of the change: https://github.com/huggingface/transformers/pull/33248

  • Modular transformers: modularity and inheritance for new model additions by @ArthurZucker in #33248

Agents

Agents continue being improved at each release; this time making it much simpler to leverage a local engine through a local Transformers Engine.

  • Multi agents with manager by @aymeric-roucher in #32687
  • Add new documentation page for advanced agent usage by @aymeric-roucher in #33265
  • Create local Transformers Engine by @aymeric-roucher in #33218
  • Agents use grammar by @aymeric-roucher in #31735

Dynamic cache for decoder-only models

This PR adds dynamic cache support to all decoder-only models (except for XLNet).

The documentation for the Dynamic cache can be found here, and documentation related to the KV cache in transformers in general can be found here.

  • Cache: new Cache format in decoder-only models by @zucchini-nlp in #31421
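
The idea behind the dynamic cache can be sketched in a few lines: per-layer key/value buffers that grow one step at a time instead of being pre-allocated to a fixed maximum length. This is an illustrative toy with plain lists standing in for tensors, not the transformers Cache API:

```python
# Toy sketch of a dynamic KV cache: buffers grow with each decoding step
# instead of being pre-allocated to a maximum sequence length.
class ToyDynamicCache:
    def __init__(self, num_layers):
        self.keys = [[] for _ in range(num_layers)]
        self.values = [[] for _ in range(num_layers)]

    def update(self, layer_idx, new_keys, new_values):
        # Append this step's states and return the full history for attention.
        self.keys[layer_idx].extend(new_keys)
        self.values[layer_idx].extend(new_values)
        return self.keys[layer_idx], self.values[layer_idx]

    def seq_len(self):
        return len(self.keys[0])

cache = ToyDynamicCache(num_layers=2)
cache.update(0, ["k0"], ["v0"])
cache.update(0, ["k1"], ["v1"])
print(cache.seq_len())  # 2
```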

Chat templates updates

We've made several updates to our handling of chat models and chat templates. The most noticeable change is that assistant prefill is now supported. This means you can end a chat with an assistant message, and the model will continue that message instead of starting a new one, allowing you to guide the model's response:

from transformers import pipeline

# model_checkpoint is the id of any chat model on the Hub
pipe = pipeline("text-generation", model_checkpoint)

chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'}
]

output = pipe(chat)   # The model will continue outputting JSON!

We've also enabled several new functionalities in Jinja that will allow more powerful templates in the future, including Loop Controls and a strftime_now function that can get the current date and time, which is commonly used in system messages. For more details, see the updated chat template docs.
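
What strftime_now provides can be mimicked with the standard library; the sketch below reimplements the behaviour for illustration only (the real function is exposed to Jinja templates by transformers, not written by users):

```python
from datetime import datetime

# Stdlib sketch of what the strftime_now template function returns: the
# current date/time rendered with a strftime pattern, handy for system
# messages like "Today's date is ...".
def strftime_now(fmt):
    return datetime.now().strftime(fmt)

system_message = f"Today's date is {strftime_now('%d %B %Y')}."
print(system_message)
```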

  • Enable some Jinja extensions and add datetime capabilities by @Rocketknight1 in #32684
  • Update Jinja docs with new functions and general cleanup by @Rocketknight1 in #33097
  • Add assistant prefill for chat templates and TextGenerationPipeline by @Rocketknight1 in #33198
  • Add a warning to the chat template docs about the tool_calls format by @Rocketknight1 in #33277
  • Add tip to clarify tool calling by @Rocketknight1 in #32883

Bugfixes and improvements

  • 🌐 [i18n-KO] Translated mask_generation.md to Korean by @jeongiin in #32257
  • 🌐 [i18n-KO] Translated idefics.md to Korean by @boyunJang in #32258
  • 🌐 [i18n-KO] Translated image_to_image.md to Korean by @shinhyunji36 in #32327
  • Gemma2: add cache warning by @zucchini-nlp in #32279
  • enable xla fsdp by @hanwen-sun in #32048
  • Fix typo in tokenization_utils_base.py by @blubitz in #32484
  • fix broken link in docs by @jorahn in #32491
  • Docs: alert for the possibility of manipulating logits by @gante in #32467
  • 🌐 [i18n-KO] Translated gptq.md to Korean by @1kmmk1 in #32293
  • 🌐 [i18n-KO] Translated prompting.md to Korean by @chhaewxn in #32294
  • 🌐 [i18n-KO] Translated quantization/quanto.md to Korean by @fabxoe in #32281
  • 🌐 [i18n-KO] Translated image_feature_extraction.md to Korean by @mreraser in #32239
  • Fix references to model google mt5 small by @JuanFKurucz in #32497
  • Docs: Fixed WhisperModel.forward’s docstring link by @Sai-Suraj-27 in #32498
  • 🌐 [i18n-KO] Translated chat_templating.md to Korean by @enchantee00 in #32362
  • Fix link to autoclass_tutorial.md in i18n.md by @JuanFKurucz in #32501
  • Fix typo: depracted -> deprecated by @tomaarsen in #32489
  • Fix issue #32518: Update llm_tutorial.md by @doomdagadiggiedahdah in #32523
  • Change Phi3 _supports_sdpa to True by @pocca2048 in #32457
  • Uniformize kwargs for processors - GroundingDINO by @SangbumChoi in #31964
  • Fix add-new-model-like by @molbap in #31773
  • filter flash_attn optional imports loading remote code by @eaidova in #30954
  • 🌐 [i18n-KO] Translated ko-llm_tutorial_optimization.md to Korean by @010kim in #32372
  • 🌐 [i18n-KO] Translated trainer.md to Korean by @cjfghk5697 in #32260
  • 🌐 [i18n-KO] Translated eetq.md to Korean by @jun048098 in #32352
  • 🌐 [i18n-KO] Translated fsdp.md to Korean by @win2dvp21 in #32261
  • 🌐 [i18n-KO] Translated bitsandbytes.md to Korean by @SeungAhSon in #32408
  • Fix generate with inputs_embeds as input by @molbap in #32493
  • Fixed test test_static_cache_exportability with torch 2.4.0 by @guangy10 in #32516
  • Fix code example to load bigcode starcoder2 7b by @JuanFKurucz in #32474
  • [docs] Translation guide by @stevhliu in #32547
  • Gemma2: fix FA2 generation by @zucchini-nlp in #32553
  • Fix a bug in Qwen2Audio by @faychu in #32552
  • fix slow integration gemma2 test by @ArthurZucker in #32534
  • fix non contiguous tensor value error in save_pretrained by @congcongke in #32422
  • 🌐 [i18n-KO] Translated agent.md to Korean by @Jwaminju in #32351
  • Fix: FA2 with packed training by @zucchini-nlp in #32487
  • Fix sliding window attention used in Gemma2FlashAttention2 by @brcps12 in #32522
  • fix: Fixed conditional check for encodec model names by @Sai-Suraj-27 in #32581
  • Fix .push_to_hub(..., create_pr=True, revision="my-branch") when creating PR on not-owned repo by @Wauplin in #32094
  • Cleanup tool calling documentation and rename doc by @Rocketknight1 in #32337
  • 🌐 [i18n-KO] Translated deepspeed.md to Korean by @4N3MONE in #32431
  • 🌐 [i18n-KO] Translated awq.md to Korean by @ahnjj in #32324
  • fix: Fixed failing test_find_base_model_checkpoint by @Sai-Suraj-27 in #32638
  • "to be not" -> "not to be" by @qgallouedec in #32636
  • fix: Updated the is_torch_mps_available() function to include min_version argument by @Sai-Suraj-27 in #32545
  • Expand inputs in processors for VLMs by @zucchini-nlp in #30962
  • Automatically add transformers tag to the modelcard by @LysandreJik in #32623
  • Fix tests by @molbap in #32649
  • fix tensors on different devices in WhisperGenerationMixin by @faaany in #32316
  • Add support for GrokAdamW optimizer by @ehartford in #32521
  • Add Depth Anything V2 Metric models by @bt2513 in #32126
  • Fix: Fixed directory path for utils folder in test_tokenization_utils.py by @Sai-Suraj-27 in #32601
  • Modify ProcessorTesterMixin for better generalization by @yonigozlan in #32637
  • TF_Deberta supporting mixed precision by @pinesnow72 in #32618
  • Fix tests recurrent by @molbap in #32651
  • Support MUSA (Moore Threads GPU) backend in transformers by @fmo-mt in #31913
  • fix: Fixed failing tests in tests/utils/test_add_new_model_like.py by @Sai-Suraj-27 in #32678
  • Update translation docs review by @stevhliu in #32662
  • Fix JetMoeIntegrationTest by @ydshieh in #32332
  • Update the distributed CPU training on Kubernetes documentation by @dmsuehir in #32669
  • fix: Fixed unknown pytest config option doctest_glob by @Sai-Suraj-27 in #32475
  • Unpin deepspeed in Docker image/tests by @muellerzr in #32572
  • Updated workflows to the latest versions by @Sai-Suraj-27 in #32405
  • reopen: llava-next fails to consider padding_side during Training by @jp1924 in #32679
  • fix: Corrected falcon-mamba-7b model checkpoint name by @Sai-Suraj-27 in #32837
  • fix: update doc link for runhouse in README.md by @muddlebee in #32664
  • VLMs: small clean-up for cache class by @zucchini-nlp in #32417
  • add back the position ids by @ArthurZucker in #32554
  • Use head_dim if in config for RoPE by @suiyoubi in #32495
  • Generate: unify LogitsWarper and LogitsProcessor by @gante in #32626
  • [tests] make test_sdpa_equivalence device-agnostic by @faaany in #32520
  • Cache: use batch_size instead of max_batch_size by @gante in #32657
  • Fix AutoConfig and AutoModel support for Llava-Next-Video by @TKONIY in #32844
  • improve _get_is_as_tensor_fns by @zrr1999 in #32596
  • Revert PR 32299, flag users when Zero-3 was missed by @muellerzr in #32851
  • fix multi-gpu with static cache by @SunMarc in #32543
  • Reduce the error log when using core models that need their weights renamed, and provide a step forward by @muellerzr in #32656
  • Make beam_constraints.Constraint.advance() docstring more accurate by @alex-calderwood in #32674
  • generate: missing to in DoLa body, causing exceptions in multi-gpu generation by @gante in #32856
  • Add Flax Dinov2 by @MHRDYN7 in #31960
  • support torch-speech by @itazap in #32537
  • [tests] make test_sdpa_can_compile_dynamic device-agnostic by @faaany in #32519
  • Add repr for Conv1D by @AaronZLT in #32425
  • Support save/load ckpt for XLA FSDP by @yitongh in #32311
  • RT-DETR parameterized batchnorm freezing by @AlanBlanchet in #32631
  • Mamba / FalconMamba: Fix mamba left padding by @younesbelkada in #32677
  • Fix: Mamba2 generation mismatch between input_ids and inputs_embeds by @vasqu in #32694
  • Docs: Fixed whisper-large-v2 model link in docs by @Sai-Suraj-27 in #32871
  • Allow-head-dim by @ArthurZucker in #32857
  • 🚨🚨🚨 Update min version of accelerate to 0.26.0 by @SunMarc in #32627
  • Fix repr for conv by @ArthurZucker in #32897
  • fix: jamba cache fails to use torch.nn.module by @xgal in #32894
  • Fix: Mamba2 norm_before_gate usage by @vasqu in #32686
  • Replace tensor.norm() with decomposed version for CLIP executorch export by @qubvel in #32887
  • link for optimizer names by @nbroad1881 in #32400
  • [i18n-ar] add README_ar.md to README.md by @AhmedAlmaghz in #32583
  • fix: [whisper] don't overwrite GenerationConfig's return_timestamps when return_timestamps is not passed to generate function by @hrl in #31296
  • Update docker image building by @ArthurZucker in #32918
  • Jamba: update integration tests by @gante in #32250
  • fix: Added missing huggingface_hub installation to workflows by @Sai-Suraj-27 in #32891
  • fix: no need to dtype A in jamba by @xgal in #32924
  • FEAT / Trainer: Add adamw 4bit optimizer by @SunMarc in #31865
  • CI: separate step to download nltk files by @gante in #32935
  • FIX / Hub: Also catch for exceptions.ConnectionError by @younesbelkada in #31469
  • Add SynCode to llm_tutorial by @shubhamugare in #32884
  • Fix benchmark script by @ydshieh in #32635
  • Improve greedy search memory usage by @regisss in #32895
  • fix: (issue #32689) AttributeError raised when using Trainer with eval_on_start=True in Jupyter Notebook. by @fshp971 in #32849
  • Gemma2: eager attention by default by @gante in #32865
  • [run_slow] idefics2 by @andimarafioti in #32840
  • Fix regression on Processor.save_pretrained caused by #31691 by @leloykun in #32921
  • 🌐 [i18n-KO] Translated knowledge_distillation_for_image_classification.md to Korean by @JinukHong in #32334
  • Generate: Deprecate returning legacy cache by default; Handle use_cache=False by @gante in #32863
  • docs: fix outdated link to TF32 explanation by @anakin87 in #32947
  • Reducing memory usage: removing useless logits computation in generate() by @Cyrilvallez in #31292
  • Forbid PretrainedConfig from saving generate parameters; Update deprecations in generate-related code 🧹 by @gante in #32659
  • DeviceGuard added to use Deformable Attention more safely on multi-GPU by @DonggeunYu in #32910
  • added doctring to SchedulerType class by @Arunprakash-A in #32898
  • Updated the custom_models.md changed cross_entropy code by @S-M-J-I in #33118
  • CI: add torchvision to the consistency image by @gante in #32941
  • Test: add higher atol in test_forward_with_num_logits_to_keep by @gante in #33093
  • mps: add isin_mps_friendly, a wrapper function for torch.isin by @gante in #33099
  • Add changes for uroman package to handle non-Roman characters by @nandwalritik in #32404
  • fix: Fixed pydantic required version in dockerfiles to make it compatible with DeepSpeed by @Sai-Suraj-27 in #33105
  • quickfix documentation by @molbap in #32566
  • Fixup py 38 type hints for mps friendly by @muellerzr in #33128
  • fix: Fixed CodeGenTokenizationTest::test_truncation failing test by @Sai-Suraj-27 in #32850
  • fix: multilingual midel convert to tflite get wrong token by @Ayaa17 in #32079
  • disable scheduled daily CI temporarily by @ydshieh in #33136
  • CI: fix efficientnet pipeline timeout and prevent future similar issues due to large image size by @gante in #33123
  • Log additional test metrics with the CometCallback by @Lothiraldan in #33124
  • [docs] add quick usage snippet to Whisper. by @Vaibhavs10 in #31289
  • Update stateful_callbacks state before saving checkpoint by @pedrobrs in #32115
  • fix Idefics2VisionConfig type annotation by @chenzizhao in #33103
  • Add a fix for custom code tokenizers in pipelines by @Rocketknight1 in #32300
  • Llama: make slow tests green 🟢 by @gante in #33138
  • fix redundant checkpointing in example training scripts by @eminorhan in #33131
  • update torch req for 4-bit optimizer by @SunMarc in #33144
  • 🌐 [i18n-KO] Translated conversations.md to Korean by @newfull5 in #32468
  • Very small change to one of the function parameters by @alisalamatian1 in #32548
  • 🚨 Add Blip2ForImageTextRetrieval by @jpizarrom in #29261
  • fix model name and copyright by @mayank31398 in #33152
  • Fix: Jamba batched generation by @vasqu in #32914
  • [whisper] pass attention_mask to generate_with_fallback() by @benniekiss in #33145
  • [RoBERTa-based] Add support for sdpa by @hackyon in #30510
  • Fix import paths for test_module by @rasmi in #32888
  • Zero-shot pipelines: minor doc changes by @pcuenca in #33127
  • Customise the separator used for splicing in DataCollatorWithFlattening by @beep-bebop in #33114
  • Fix spell mistakes by @matsuo1234567 in #33149
  • update push CI workflow files for security by @ydshieh in #33142
  • added quick clarification by @DuyguA in #33166
  • pass module to Params4bit.from_prequantized to ensure quant_state by @winglian in #32524
  • Mamba2 conversion script for original models by @vasqu in #32580
  • Add a static cache that offloads to the CPU or other device by @gerbenvv in #32161
  • use a single for loop by @ArthurZucker in #33148
  • Pipeline: fix bad generation kwargs docs by @gante in #33205
  • Add missing quotes in modeling_llava_next_video.py by @juliendenize in #33214
  • Add warning for stop string edge case by @Rocketknight1 in #33169
  • Fix local repos with remote code not registering for pipelines by @Rocketknight1 in #33100
  • Refactor CI: more explicit by @ArthurZucker in #30674
  • 🌐 [i18n-KO] Translated llm_optims.md to Korean by @yijun-lee in #32325
  • Fix red amin by @ArthurZucker in #33220
  • Test fetcher: missing return on filtered tests; don't write empty files by @gante in #33224
  • Generate: throw warning when return_dict_in_generate is False but should be True by @gante in #33146
  • Add video text to text docs by @merveenoyan in #33164
  • Add GraniteRMSNorm by @NielsRogge in #33177
  • Add duckduckgo search tool by @aymeric-roucher in #32882
  • Fix: Suppressed 'use_reentrant=False' warning by @ankush13r in #33208
  • docs: Replace package abbreviations with full name(bitsandbytes) in docstrings by @rapsealk in #33230
  • Generate: fix assistant in different device by @gante in #33257
  • remove to restriction for 4-bit model by @SunMarc in #33122
  • Fixed typo repeated word in DETR docs by @sergiopaniego in #33250
  • Fix: use torch.from_numpy() to create tensors for np.ndarrays by @shinyano in #33201
  • remove torch input dependant control flow by @ArthurZucker in #33245
  • Fix: num_logits_to_keep in composite models by @zucchini-nlp in #33168
  • Fix Bark saving by @ylacombe in #33266
  • Update chat template docs to remove Blenderbot by @Rocketknight1 in #33254
  • Add sdpa support for Albert by @OmarManzoor in #32092
  • Only disallow DeepSpeed Zero-3 for auto bs finder by @muellerzr in #31731
  • fix the parallel number of CI nodes when it is smaller than number of tests by @ArthurZucker in #33276
  • Repo checks: check documented methods exist by @gante in #32320
  • Fix: multigpu training by @zucchini-nlp in #33271
  • Cache docs: update by @zucchini-nlp in #32929
  • Config: unified logic to retrieve text config by @gante in #33219
  • [fix] LlavaNextProcessor '_get_unpadded_features' method by @laurentd-lunit in #33263
  • wait 15m before SSH into runner workflow stops by @ydshieh in #33300
  • Bugfix/alexsherstinsky/fix none check for attention factor in rope scaling 2024 08 28 0 by @alexsherstinsky in #33188
  • [InstructBLIP] qformer_tokenizer is required input by @amyeroberts in #33222
  • [BUG] fix upper nltk version by @ylacombe in #33301
  • Fix excessive CPU memory usage with FSDP and cpu_ram_efficient_loading by @matthewdouglas in #33154
  • Add validate images and text inputs order util for processors and test_processing_utils by @yonigozlan in #33285
  • Fix: Fix FalconMamba training issues due to incompatible kernels by @younesbelkada in #33195
  • Add paper link by @Muennighoff in #33305
  • 🚨 Fix torch.jit.trace for interpolate_pos_encoding in all vision models by @xenova in #33226
  • Update SECURITY.md by @Michellehbn in #32680
  • simple align qwen2vl kv_seq_len calculation with qwen2 by @simonJJJ in #33161
  • Add a community notebook for fine-tuning with QLoRA, PEFT, and MLflow by @daniellok-db in #33319
  • Fix: StaticCache & inputs_embeds by @zucchini-nlp in #32932
  • Docs: add more cross-references to the KV cache docs by @gante in #33323
  • [whisper] alternative fix for long-form timestamps by @sanchit-gandhi in #32131
  • fix qwen2vl vision eager-attention by @simonJJJ in #33213
  • Load dynamic module (remote code) only once if code isn't change by @XuehaiPan in #33162
  • support loading model without config.json file by @itazap in #32356
  • Add validation for maximum sequence length in modeling_whisper.py by @AmirMohammadFakhimi in #33196
  • add self.head_dim for VisionAttention in Qwen2-VL by @GeLee-Q in #33211
  • support 3D attention mask in bert by @gathierry in #32105
  • Support reading tiktoken tokenizer.model file by @itazap in #31656
  • red-ci on main, fix copies by @ArthurZucker in #33356
  • RoPE: fix BC warning by @gante in #33331
  • Fix Prefill docs by @Rocketknight1 in #33352
  • Update author for QLorA/PEFT community notebook by @daniellok-db in #33338
  • add sdpa mbart by @nbroad1881 in #32033
  • Fix quantized cache tests by @zucchini-nlp in #33351
  • schedulefree optimizers by @winglian in #30079
  • Add visit webpage tool by @aymeric-roucher in #33353
  • Fixed Majority of the Typos in transformers[en] Documentation by @nnilayy in #33350
  • Compile compatibility for decoder-only models by @zucchini-nlp in #32617
  • Adjust templates by @LysandreJik in #33384
  • Remove repeated prepare_images in processor tests by @amyeroberts in #33163
  • Fix import of FalconMambaForCausalLM by @younesbelkada in #33381
  • Import structure & first three model refactors by @LysandreJik in #31329
  • VLM: fixes after refactor by @zucchini-nlp in #32907
  • fixed Mask2Former image processor segmentation maps handling by @maciej-adamiak in #33364
  • Bug Fix: Update hub.py to fix NoneType error by @rishiraj in #33315
  • Update WhisperTokenizer Doc: Timestamps and Previous Tokens Behaviour by @bruno-hays in #33390
  • Make StaticCache configurable at model construct time by @guangy10 in #32830
  • use diff internal model in tests by @itazap in #33387
  • Fix FbgemmFp8Linear not preserving tensor shape by @vgel in #33239
  • Fix failing windows by @LysandreJik in #33436
  • Remove deprecated task in load_dataset by @albertvillanova in #33433
  • Dynamic number of speculative tokens in order to accelerate speculative decoding by @jmamou in #33258
  • Fix: Cast prefetch_bucket_size to integer for deepspeed >= 0.15 by @kiddj in #33402
  • [docs] add the missing huggingface hub username by @faaany in #33431
  • [docs] add the missing tokenizer when pushing models to huggingface hub by @faaany in #33428
  • Update stale.yml by @LysandreJik in #33434
  • Docs - update formatting of llama3 model card by @MichaelCurrin in #33438
  • Fix incomplete sentence in Zero-shot object detection documentation by @sergiopaniego in #33430
  • Fix flax whisper tokenizer bug by @hannan72 in #33151
  • Clean-up deprecated code by @zucchini-nlp in #33446
  • Fix default revision for pipelines by @ankane in #33395
  • Revive AMD scheduled CI by @ydshieh in #33448
  • Allow send SSH into runner info. to DM by @ydshieh in #33346
  • Correct Whisper's beam search scores computation by @ylacombe in #32336
  • Qwen2-VL: clean-up and add more tests by @zucchini-nlp in #33354
  • [whisper] Clarify error message when setting max_new_tokens by @benniekiss in #33324
  • [docs] refine the doc for train with a script by @faaany in #33423
  • Return image hidden states by @zucchini-nlp in #33426
  • add a callback hook right before the optimizer step by @winglian in #33444
  • Enable padding_side as call time kwargs by @zucchini-nlp in #33385
  • Mitigate a conflict when using sentencepiece by @tengomucho in #33327
  • [Phi-3] Bug on stale kv cache by @garg-amit in #33129
  • Fix the initialization of the cache when we have multi gpu by @SunMarc in #33303
  • Enable finetuning with torchao quantized model by @SunMarc in #33361
  • Corrected Agents and tools documentation links typos by @sergiopaniego in #33471
  • chore: fix typo in comment in tokenization_utils_base.py by @DavidLemayian in #33466
  • Cohere: update RoPE structure by @gante in #33408
  • Fix SSH workflow by @ydshieh in #33451
  • Add keypoint-detection task guide by @merveenoyan in #33274
  • Uniformize kwargs for LLaVa processor and update docs by @yonigozlan in #32858
  • Agents, supercharged - Multi-agents, External tools, and more docs typo fixed by @sergiopaniego in #33478
  • [i18n-ar] Add File : docs/source/ar/_toctree.yml by @AhmedAlmaghz in #32696
  • [Whisper test] Fix some failing tests by @ylacombe in #33450
  • Fix: Qwen2-VL training on video datasets by @hiyouga in #33307
  • Updated Trainer's liger-kernel integration to call correct patching API by @shimizust in #33502
  • Replace accelerator.use_fp16 in examples by @hlky in #33513
  • Fix parametrization-based weight norm by @ylacombe in #33275
  • Fix number of patch check for different vision feature select strategy by @insujang in #32494
  • chore: migrate coverage cfg to pyproject.toml by @SauravMaheshkar in #32650
  • idefics2 enable_input_require_grads not aligned with disable_input_re… by @sywangyi in #33194
  • Update chameleon.md — fix runtime type error by @maxwbuckley in #33494
  • Add explicit example for RAG chat templating by @A-Duss in #33503
  • CI Build image - move runners by @glegendre01 in #33530
  • fix to jamba config, asserting attention and expert offset by @ErezSC42 in #33316
  • Fix missing sequences_scores in the Whisper beam search output by @Nik-Kras in #32970
  • Uniformize kwargs for Pixtral processor by @yonigozlan in #33521
  • Add revision to trainer push_to_hub by @teamclouday in #33482
  • fix patch_attention_mask incorrect setting which leads to the differe… by @sywangyi in #33499
  • Support LLaVa-OV-Chat by @zucchini-nlp in #33532
  • Decorator for easier tool building by @aymeric-roucher in #33439
  • Fix for the slow tokenizer bug adding spaces to single id decodes by @DuyguA in #32564
  • Chat template: save and load correctly for processors by @zucchini-nlp in #33462
  • Fix missing head_dim in llama config from gguf model by @Isotr0py in #33526
  • [i18n-ur] Added README_ur.md file by @akkefa in #33461
  • fix the wandb logging issue by @ZIYU-DEEP in #33464
  • Fix tests in ASR pipeline by @ylacombe in #33545
  • Added support for bfloat16 to zero-shot classification pipeline by @umarbutler in #33554
  • Pipeline: no side-effects on model.config and model.generation_config 🔫 by @gante in #33480
  • Return attention mask in ASR pipeline to avoid warnings by @Rocketknight1 in #33509
  • enforce original size to be a list by @dom-dziela in #33564
  • Improve compiled RT-DETR inference speed by @yonigozlan in #33412
  • Fix bnb dequantization by @SunMarc in #33546
  • Load and save video-processor from separate folder by @zucchini-nlp in #33562
  • VLMs: enable generation tests by @zucchini-nlp in #33533
  • rag: fix CI by @gante in #33578
  • Cache: don't show warning in forward passes when past_key_values is None by @gante in #33541
  • fix tests with main revision and read token by @molbap in #33560
  • add uniform processors for altclip + chinese_clip by @molbap in #31198
  • Generate: check that attention_mask is 2D by @gante in #33575
  • change sequence_bias type of SequenceBiasLogitsProcessor to list, add… by @VladOS95-cyber in #33375
  • [Mamba2] Move dt calculations to kernel by @vasqu in #33520
  • Cache: don't throw warnings on gemma2 when instantiating a new cache by @gante in #33595
  • Uniformize kwargs for Paligemma processor and update docs by @yonigozlan in #33571
  • [tests] skip tests for xpu by @faaany in #33553
  • [tests] enable GemmaIntegrationTest on XPU by @faaany in #33555
  • Fix Llama 3 TikToken conversion by @pcuenca in #33538
  • Docs: add the ability to manually trigger jobs by @gante in #33598
  • Fix CircleCI nightly run by @ydshieh in #33558
  • Allow CI could be run on private forked repositories (e.g. new model additions) by @ydshieh in #33594
  • [tests] make more tests device-agnostic by @faaany in #33580
  • Update modeling_mamba2.py, fix pad size by @klae01 in #32599
  • Generate: remove flakyness in test_generate_from_inputs_embeds_decoder_only by @gante in #33602
  • Remove unnecessary CPM model tests by @amyeroberts in #33621
  • Add sdpa for BioGpt by @OmarManzoor in #33592
  • VLM generate: tests can't generate image/video tokens by @gante in #33623
  • Fix missing test in torch_job by @ydshieh in #33593
  • Add support for args to ProcessorMixin for backward compatibility by @yonigozlan in #33479
  • Fix contrastive search to correctly handle input with padding by @ducviet00 in #33507
  • Generate: assistant should sample when the main model samples by @gante in #33534
  • Fix some missing tests in circleci by @ydshieh in #33559
  • Update daily ci to use new cluster by @ydshieh in #33627
  • Fix qwen2vl float16 inference bug by @GeLee-Q in #33312
  • Fix typos by @litianjian in #33583
  • enable low-precision pipeline by @jiqing-feng in #31625
  • Pixtral update example checkpoint by @amyeroberts in #33633
  • Sdpa dino v2 by @avishaiElmakies in #33403
  • Clean up Unpack imports by @molbap in #33631
  • Fix DPT /Dinov2 sdpa regression on main by @molbap in #33660
  • handle dependency errors in check_imports by @molbap in #33622
  • add back self.max_position_embeddings = config.max_position_embeddings by @chengchengpei in #33550
  • Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower by @Isotr0py in #33613
  • Uniformize kwargs for Udop processor and update docs by @yonigozlan in #33628
  • Generation: deprecate PreTrainedModel inheriting from GenerationMixin by @gante in #33203
  • Enable BNB multi-backend support by @jiqing-feng in #31098
  • Fix error string after refactoring into get_chat_template by @tibor-reiss in #33652
  • uniformize git processor by @yonigozlan in #33668
  • Fix CIs post merging modular transformers by @ArthurZucker in #33681
  • Fixed docstring for cohere model regarding unavailability of prune_he… by @mnauf in #33253
  • Generation tests: update imagegpt input name, remove unused functions by @gante in #33663
  • Improve Error Messaging for Flash Attention 2 on CPU by @sizhky in #33655
  • Gemma2: fix config initialization (cache_implementation) by @gante in #33684
  • Fix ByteLevel alphabet missing when Sequence pretokenizer is used by @umarbutler in #33556
  • Uniformize kwargs for image-text-to-text processors by @yonigozlan in #32544
  • 🚨🚨 Setting default behavior of assisted decoding by @jmamou in #33657
  • tests: fix pytorch tensor placement errors by @dvrogozh in #33485
  • bump tokenizers, fix added tokens fast by @ArthurZucker in #32535
  • [Pixtral] Improve docs, rename model by @NielsRogge in #33491

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @enchantee00
    • 🌐 [i18n-KO] Translated chat_templating.md to Korean (#32362)
  • @faychu
    • Add Qwen2-Audio (#32137)
    • Fix a bug in Qwen2Audio (#32552)
  • @010kim
    • 🌐 [i18n-KO] Translated ko-llm_tutorial_optimization.md to Korean (#32372)
  • @cjfghk5697
    • 🌐 [i18n-KO] Translated trainer.md to Korean (#32260)
  • @younesbelkada
    • Add new model (#32615)
    • Mamba / FalconMamba: Fix mamba left padding (#32677)
    • FIX / Hub: Also catch for exceptions.ConnectionError (#31469)
    • Fix: Fix FalconMamba training issues due to incompatible kernels (#33195)
    • Fix import of FalconMambaForCausalLM (#33381)
  • @4N3MONE
    • 🌐 [i18n-KO] Translated deepspeed.md to Korean (#32431)
  • @jerryzh168
    • Add TorchAOHfQuantizer (#32306)
  • @MHRDYN7
    • Add Flax Dinov2 (#31960)
  • @kamilakesbi
    • Add Descript-Audio-Codec model (#31494)
  • @Isotr0py
    • Fix incorrect vocab size retrieval in GGUF config (#32551)
    • Add chat_template for tokenizer extracted from GGUF model (#32908)
    • 🚨 Support dequantization for most GGML types (#32625)
    • Fix missing head_dim in llama config from gguf model (#33526)
    • Fix Llava conversion for LlavaQwen2ForCausalLM with Clip vision tower (#33613)
  • @AhmedAlmaghz
    • [i18n-ar] add README_ar.md to README.md (#32583)
    • [i18n-ar] Add File : docs/source/ar/_toctree.yml (#32696)
  • @simonJJJ
    • support qwen2-vl (#32318)
    • simple align qwen2vl kv_seq_len calculation with qwen2 (#33161)
    • fix qwen2vl vision eager-attention (#33213)
  • @jpizarrom
    • 🚨 Add Blip2ForImageTextRetrieval (#29261)
  • @mayank31398
    • Granite language models (#31502)
    • fix model name and copyright (#33152)
    • Granitemoe (#33207)
  • @hackyon
    • [RoBERTa-based] Add support for sdpa (#30510)
  • @Muennighoff
    • Add OLMoE (#32406)
    • Add paper link (#33305)
  • @VladOS95-cyber
    • Add Qwen2Moe GGUF loading support (#33264)
    • change sequence_bias type of SequenceBiasLogitsProcessor to list, add… (#33375)
  • @jiqing-feng
    • enable low-precision pipeline (#31625)
    • Enable BNB multi-backend support (#31098)
Aug 22, 2024
Release v4.44.2

Patch release v4.44.2: mostly two regressions that were not caught, for Jamba and for processors!

  • Fix: Jamba cache fails to use torch.nn.module (#32894) by @xgal
  • Fix: No need to dtype A in Jamba (#32924) by @xgal
  • Fix: Regression on Processor.save_pretrained caused by #31691 (#32921) by @leloykun
Aug 20, 2024
Patch release v4.44.1

Here are the different fixes: mostly Gemma2 context length, generation issues, and assorted nits here and there.

  • is_torchdynamo_compiling -- cast a wide exception net (#32476) by @gante
  • Revert "fixes to properly shard FSDP across cpu and meta for cpu_effcient_loading for prequantized 4bit (#32276)" (#32477) by @gante and @matthewdouglas
  • Gemma2: fix FA2 generation (#32553) by @zucchini-nlp
  • Fix: FA2 with packed training (#32487) by @zucchini-nlp
  • Fix sliding window attention used in Gemma2FlashAttention2 (#32522) by @brcps12
  • Automatically add transformers tag to the modelcard (#32623) by @LysandreJik
  • add back the position ids (#32554) by @ArthurZucker
  • Use head_dim if in config for RoPE (#32495) by @suiyoubi and @ArthurZucker
  • Revert PR 32299, flag users when Zero-3 was missed (#32851) by @muellerzr
  • fix multi-gpu with static cache (#32543) by @SunMarc
  • Reduce the error log when using core models that need their weights r… (#32656) by @muellerzr
  • Fix VLM generation issues (#32836) by @zucchini-nlp
  • Fix generate with inputs_embeds as input (#32493) (this PR includes some cherry-picks)

Full Changelog: https://github.com/huggingface/transformers/compare/v4.44.0...v4.44.1

Aug 6, 2024
Release v4.44.0

Release v4.44.0: End-to-end compiled generation!!! Gemma2 (with assisted decoding), Codestral (Mistral for code), Nemotron, efficient SFT training, CPU-offloaded KV cache, torch export for static cache

This release comes a bit early in our cycle because we wanted to ship important and requested models along with improved performances for everyone!

All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to @gante, @sanchit-gandhi and @xenova

💥 End-to-end generation compile

Generate: end-to-end compilation #30788 by @gante: model.generate now supports compiling! There are a few limitations, but here is a small snippet:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")

# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")

# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id

model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)

⚡ 3 to 5x compile speedup (compilation time 👀 not runtime)

  • 3-5x faster torch.compile forward compilation for autoregressive decoder models #32227 by @fxmarty. As documented on the PR, this makes the whole generation a lot faster when you re-use the cache! You can see this when you run model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

🪶 Offloaded KV cache: offload the cache to CPU when you are GPU poooooor 🚀

  • Offloaded KV Cache #31325 by @n17s: you just have to set cache_implementation="offloaded" when calling from_pretrained, or use this:

from transformers import GenerationConfig

gen_config = GenerationConfig(
    cache_implementation="offloaded",
    # other generation options, for example:
    num_beams=4,
    num_beam_groups=2,
    num_return_sequences=4,
    diversity_penalty=1.0,
    max_new_tokens=50,
    early_stopping=True,
)
outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
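For intuition, the offloading strategy only needs the cache of the layer currently being computed to sit on the accelerator; every other layer's KV tensors can live in CPU RAM and be fetched just in time. A toy pure-Python sketch of that bookkeeping (class and method names are ours, not the transformers implementation; device moves are simulated with a tag):

```python
class ToyOffloadedCache:
    """Keeps at most one layer's KV tensors 'on device'; the rest are offloaded."""

    def __init__(self, num_layers):
        self.store = {i: {"kv": None, "device": "cpu"} for i in range(num_layers)}

    def fetch(self, layer):
        # Bring the requested layer onto the accelerator...
        self.store[layer]["device"] = "gpu"
        # ...and evict every other layer back to CPU RAM.
        for i, entry in self.store.items():
            if i != layer:
                entry["device"] = "cpu"
        return self.store[layer]

    def update(self, layer, kv):
        self.store[layer]["kv"] = kv


cache = ToyOffloadedCache(num_layers=4)
cache.update(0, "k0/v0")
entry = cache.fetch(0)
print(entry)
```

The real cache also prefetches the next layer asynchronously so the copies overlap with compute, at the price of some extra latency.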

📦 Torch export for static cache

The PyTorch team gave us a great gift: you can now use torch.export in a way that is directly compatible with ExecuTorch! Find examples here.

  • Make static cache compatible with torch.export #32168 by @guangy10

This also unlocks support for prompt reuse:

import os, torch, copy
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"

INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."

model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)

prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values = prompt_cache).past_key_values

prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)

prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)

Gemma2: assisted decoding

Gemma 2: support assisted generation #32357 by @gante

We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for gemma 2), so you might observe some output mismatch on long sequences. Read more about it here.

# transformers assisted generation reference: 
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding 
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# we DON’T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'

tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
   reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
   assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)

model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
   "assistant_model": assistant_model,
   "do_sample": True,
   "temperature": 0.7,
   "max_new_tokens": 64,
}

outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

Nemotron support

Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.

The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to @suiyoubi. See:

  • Add Nemotron HF Support #31699

Codestral support

Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.

Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.

It's a Mamba2 architecture; it was a bit of a pain to remove all the einops, but we hope we made it better for everyone!

  • Add codestral mamba2 #32080 by @molbap and @vasqu

Breaking changes:

We removed the default chat templates from the code; they should all be on the Hub!

  • 🚨 No more default chat templates #31733 by @Rocketknight1

Long-form decoding for whisper, even faster:

Our great @sanchit-gandhi worked on porting the recent compile upgrades to long form decoding in

  • [whisper] compile compatibility with long-form decoding #31772

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/transformers/compare/v4.43.4...v4.44.0

Aug 5, 2024
v4.43.4 Patch Release

Patch Release v4.43.4

There was a mix-up; the DeepSpeed fixes are now properly shipped.

🤗 Enjoy holidays

Jul 26, 2024
v4.43.3 Patch deepspeed

Patch release v4.43.3: We still saw some bugs, so @zucchini-nlp added:

  • Resize embeds with DeepSpeed #32214

  • don't log base model architecture in wandb if log model is false #32143

Other fixes:

  • [whisper] fix short-form output type #32178, by @sanchit-gandhi which fixes the short audio temperature fallback!
  • [BigBird Pegasus] set _supports_param_buffer_assignment to False #32222 by @kashif, mostly related to the new super fast init, some models have to get this set to False. If you see a weird behavior look for that 😉
Jul 24, 2024
v4.43.2: Patch release
  • Fix float8_e4m3fn in modeling_utils (#32193)
  • Fix resize embedding with Deepspeed (#32192)
  • let's not warn when someone is running a forward (#32176)
  • RoPE: relaxed rope validation (#32182)
Jul 23, 2024
v4.43.1: Patch release
  • fix (#32162)
v4.43.0: Llama 3.1, Chameleon, ZoeDepth, Hiera

Llama

The Llama 3.1 models are released by Meta and come in three flavours: 8B, 70B, and 405B.

To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.

We release a repository of llama recipes to showcase usage for inference, and full and partial fine-tuning of the different variants.

Chameleon

The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by the Meta AI Chameleon Team. Chameleon is a vision-language model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including in an interleaved format, and generates textual responses.

  • Chameleon: add model by @zucchini-nlp in #31534

ZoeDepth

The ZoeDepth model was proposed in ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller. ZoeDepth extends the DPT framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head is used with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.

  • Add ZoeDepth by @NielsRogge in #30136

Hiera

Hiera was proposed in Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, Christoph Feichtenhofer.

The paper introduces “Hiera,” a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising on accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed “bells-and-whistles,” are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed both in inference and training across various image and video recognition tasks. The approach suggests that spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity.

  • Adding hiera by @Namangarg110 in #30356

Agents

Our ReactAgent has a specific way to return its final output: it calls the tool final_answer, added to the user-defined toolbox upon agent initialization, with the answer as the tool argument. We found that even for a one-shot agent like CodeAgent, using a specific final_answer tool helps the llm_engine find what to return, so we generalized the final_answer tool to all agents.

  • Adds final answer tool for all agents by @aymeric-roucher in #31703
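Conceptually, the agent loop keeps executing the tool calls proposed by the model until the injected final_answer tool fires, then returns its argument. A stripped-down sketch of that control flow (all names here are hypothetical, not the actual transformers.agents API):

```python
def run_agent(llm_engine, toolbox, task, max_steps=5):
    """Loop until the model calls final_answer, then return its argument."""
    toolbox = dict(toolbox)
    result = {}
    # final_answer is injected into the user-defined toolbox at init time.
    toolbox["final_answer"] = lambda answer: result.update(answer=answer)
    for _ in range(max_steps):
        tool_name, tool_arg = llm_engine(task)   # model picks the next action
        toolbox[tool_name](tool_arg)
        if "answer" in result:                   # final_answer was called
            return result["answer"]
    raise RuntimeError("agent did not produce a final answer")


# Toy engine that immediately answers:
answer = run_agent(lambda task: ("final_answer", "42"), {}, "what is 6*7?")
print(answer)  # 42
```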

Now if your code-based agent (like ReactCodeAgent) defines a function at step 1, it will remember the function definition indefinitely. This means your agent can create its own tools for later re-use!

  • Code agent: allow function persistence between steps by @aymeric-roucher in #31769
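Function persistence boils down to executing each step's generated code in one long-lived namespace instead of a fresh one, so a function defined at step 1 is still callable at step 2. A toy sketch (class and variable names are ours):

```python
class ToyCodeAgent:
    """Executes generated code snippets in a single persistent namespace."""

    def __init__(self):
        self.namespace = {}          # survives across steps

    def step(self, code: str):
        # Definitions from earlier steps (functions, variables) stay visible.
        exec(code, self.namespace)
        return self.namespace.get("_out")


agent = ToyCodeAgent()
agent.step("def double(x):\n    return 2 * x")   # step 1: agent writes a tool
out = agent.step("_out = double(21)")            # step 2: re-uses it
print(out)  # 42
```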

This is a transformative PR: it allows the agent to regularly run a specific step for planning its actions in advance. This gets activated if you set an int for planning_interval upon agent initialization. At step 0, an initial plan is made. At later steps (like steps 3, 6, 9 if you set planning_interval=3), this plan is updated by the agent depending on the history of previous steps. More detail soon!

  • Agents planning by @aymeric-roucher in #31702
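The scheduling itself is simple: produce a plan at step 0 and re-plan whenever the step index is a multiple of planning_interval. A minimal sketch of that loop (hypothetical function names, not the agents internals):

```python
def run_with_planning(num_steps, planning_interval, plan_fn, act_fn):
    """Interleave planning and acting: (re)plan at steps 0, k, 2k, ..."""
    log = []
    for step in range(num_steps):
        if step % planning_interval == 0:
            log.append(plan_fn(step))   # update the plan from history so far
        log.append(act_fn(step))
    return log


log = run_with_planning(
    num_steps=6, planning_interval=3,
    plan_fn=lambda s: f"plan@{s}", act_fn=lambda s: f"act@{s}",
)
print(log)
# ['plan@0', 'act@0', 'act@1', 'act@2', 'plan@3', 'act@3', 'act@4', 'act@5']
```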

Notable changes to the codebase

A significant RoPE refactor was done to make it model agnostic and more easily adaptable to any architecture. It is only applied to Llama for now but will be applied to all models using RoPE over the coming days.

  • Llama: RoPE refactor by @gante in #32135
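As a refresher on what the refactor standardizes: RoPE rotates each pair of feature dimensions of a query/key vector by a position-dependent angle, so relative positions fall out of the dot product. A dependency-free sketch of the core formula (not the refactored transformers code):

```python
import math


def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to a flat query/key vector."""
    dim = len(vec)
    out = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)   # per-pair rotation angle
        x, y = vec[i], vec[i + 1]
        out += [x * math.cos(theta) - y * math.sin(theta),
                x * math.sin(theta) + y * math.cos(theta)]
    return out


q = rope([1.0, 0.0, 1.0, 0.0], position=2)
print(q)
```

Being a pure rotation, it preserves the norm of each feature pair, which is why it composes cleanly with attention.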

Breaking changes

TextGenerationPipeline and tokenizer kwargs

🚨🚨 This PR changes TextGenerationPipeline to rely on the tokenizer's defaults when these flags are unset. Previously, some models using TextGenerationPipeline did not add a <bos> token by default, which (negatively) impacted their performance; they now do. In practice, this is a breaking change.

Example of a script changed as a result of this PR:

from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Foo bar"))
  • 🚨🚨 TextGenerationPipeline: rely on the tokenizer default kwargs by @gante in #31747

Bugfixes and improvements

  • Fix post gemma merge by @ArthurZucker in #31660
  • Fix float out of range in owlvit and owlv2 when using FP16 or lower precision by @aliencaocao in #31657
  • [docs] Llama3 by @stevhliu in #31662
  • [HybridCache] Fix get_seq_length method by @sanchit-gandhi in #31661
  • don't zero out the attention_mask when using sliding window with flash attention by @winglian in #31670
  • Fix Gemma2 4d attention mask by @hiyouga in #31674
  • Fix return_dict in encodec by @jla524 in #31646
  • add gather_use_object arguments by @SangbumChoi in #31514
  • Gemma capping is a must for big models by @ArthurZucker in #31698
  • Add French version of run scripts tutorial by @jadechoghari in #31483
  • dependencies: keras-nlp<0.14 pin by @gante in #31684
  • remove incorrect urls pointing to the llava repository by @BiliBraker in #31107
  • Move some test files (tests/test_xxx_utils.py) to tests/utils by @ydshieh in #31730
  • Fix mistral ONNX export by @fxmarty in #31696
  • [whisper] static kv cache by @sanchit-gandhi in #31166
  • Make tool JSON schemas consistent by @Rocketknight1 in #31756
  • Fix documentation for Gemma2. by @jbornschein in #31682
  • fix assisted decoding by @jiqing-feng in #31401
  • Requires for torch.tensor before casting by @echarlaix in #31755
  • handle (processor_class, None) returned by ModelPatterns by @molbap in #31753
  • Gemma 2: Update slow tests by @gante in #31759
  • Add ignore_errors=True to trainer.py rmtree in _inner_training_loop by @njbrake in #31668
  • [fix bug] logits's shape different from label's shape in preprocess_logits_for_metrics by @wiserxin in #31447
  • Fix RT-DETR cache for generate_anchors by @qubvel in #31671
  • Fix RT-DETR weights initialization by @qubvel in #31724
  • pytest_num_workers=4 for some CircleCI jobs by @ydshieh in #31764
  • Fix Gemma2 types by @hiyouga in #31779
  • Add torch_empty_cache_steps to TrainingArguments by @aliencaocao in #31546
  • Fix ClapProcessor to merge feature_extractor output into the returned BatchEncoding by @mxkopy in #31767
  • Fix serialization for offloaded model by @SunMarc in #31727
  • Make tensor device correct when ACCELERATE_TORCH_DEVICE is defined by @kiszk in #31751
  • Exclude torch.compile time from metrics computation by @zxd1997066 in #31443
  • Update CometCallback to allow reusing of the running experiment by @Lothiraldan in #31366
  • Fix gemma tests by @ydshieh in #31794
  • Add training support for SigLIP by @aliencaocao in #31495
  • Repeating an important warning in the chat template docs by @Rocketknight1 in #31796
  • Allow FP16 or other precision inference for Pipelines by @aliencaocao in #31342
  • Fix galore lr display with schedulers by @vasqu in #31710
  • Fix Wav2Vec2 Fairseq conversion (weight norm state dict keys) by @gau-nernst in #31714
  • Depth Anything: update conversion script for V2 by @pcuenca in #31522
  • Fix Seq2SeqTrainer crash when BatchEncoding data is None by @iohub in #31418
  • Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31813
  • Add FA2 and sdpa support for SigLIP by @qubvel in #31499
  • Bump transformers from 4.26.1 to 4.38.0 in /examples/tensorflow/language-modeling-tpu by @dependabot[bot] in #31837
  • Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/lxmert by @dependabot[bot] in #31838
  • Fix typos by @omahs in #31819
  • transformers.fx.symbolic_trace supports inputs_embeds by @fxmarty in #31574
  • Avoid failure TFBlipModelTest::test_pipeline_image_to_text by @ydshieh in #31827
  • Fix incorrect accelerator device handling for MPS in TrainingArguments by @andstor in #31812
  • Mamba & RecurrentGemma: enable strict signature by @gante in #31549
  • Deprecate vocab_size in other two VLMs by @zucchini-nlp in #31681
  • FX symbolic_trace: do not test decoder_inputs_embeds by @fxmarty in #31840
  • [Grounding DINO] Add processor to auto mapping by @NielsRogge in #31845
  • chore: remove duplicate words by @hattizai in #31853
  • save_pretrained: use tqdm when saving checkpoint shards from offloaded params by @kallewoof in #31856
  • Test loading generation config with safetensor weights by @gante in #31550
  • docs: typo in tf qa example by @chen-keinan in #31864
  • Generate: Add new decoding strategy "DoLa" in .generate() by @voidism in #29619
  • Fix _init_weights for ResNetPreTrainedModel by @ydshieh in #31851
  • Update depth estimation task guide by @merveenoyan in #31860
  • Bump zipp from 3.7.0 to 3.19.1 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31871
  • Add return type annotation to PreTrainedModel.from_pretrained by @mauvilsa in #31869
  • Revert "Fix _init_weights for ResNetPreTrainedModel" by @ydshieh in #31868
  • Bump certifi from 2023.7.22 to 2024.7.4 in /examples/research_projects/visual_bert by @dependabot[bot] in #31872
  • add warning when using gradient_checkpointing with FSDP full shard by @yundai424 in #31578
  • Add conversion for interleave llava by @zucchini-nlp in #31858
  • remove duplicate words in msg by @yukionfire in #31876
  • Fix file type checks in data splits for contrastive training example script by @npyoung in #31720
  • Fix failed tests in #31851 by @ydshieh in #31879
  • fix: Removed duplicate field definitions in some classes by @Sai-Suraj-27 in #31888
  • Push sharded checkpoint to hub when push_to_hub=True in TrainingArguments by @SunMarc in #31808
  • [RT-DETR] Add resources by @NielsRogge in #31815
  • Modify warnings in a with block to avoid flaky tests by @ydshieh in #31893
  • Add a condition for nested_detach by @haikuoxin in #31855
  • InstructBlipVideo: Update docstring by @zucchini-nlp in #31886
  • Fixes to alternating SWA layers in Gemma2 by @turboderp in #31775
  • Processor accepts any kwargs by @zucchini-nlp in #31889
  • [ConvertSlow] make sure the order is preserved for addedtokens by @ArthurZucker in #31902
  • [Gemma2] Support FA2 softcapping by @ArthurZucker in #31887
  • Fix missing methods for Fuyu by @Isotr0py in #31880
  • fix: Fixed the 1st argument name in classmethods by @Sai-Suraj-27 in #31907
  • add gather_use_object arguments II by @SangbumChoi in #31799
  • Add warning message for beta and gamma parameters by @OmarManzoor in #31654
  • Fix fx tests with inputs_embeds by @fxmarty in #31862
  • Refactor flash attention implementation in transformers by @ArthurZucker in #31446
  • Generate: fix SlidingWindowCache.reset() by @gante in #31917
  • 🚨 fix(SigLip): remove spurious exclusion of first vision output token by @transmissions11 in #30952
  • Allow Trainer.get_optimizer_cls_and_kwargs to be overridden by @apoorvkh in #31875
  • [Bug Fix] fix qa pipeline tensor to numpy by @jiqing-feng in #31585
  • Docker: TF pin on the consistency job by @gante in #31928
  • fix prompt strip to support tensors and np arrays by @AvivSham in #27818
  • Fix GenerationMixin.generate compatibility with pytorch profiler by @fxmarty in #31935
  • Generate: remove deprecated code due to Cache and cache_position being default by @gante in #31898
  • Generate: v4.42 deprecations 🧹🧹 by @gante in #31956
  • Whisper: move to tensor cpu before converting to np array at decode time by @gante in #31954
  • fix: Removed a wrong key-word argument in sigmoid_focal_loss() function call by @Sai-Suraj-27 in #31951
  • Generate: handle logits_warper update in models with custom generate fn by @gante in #31957
  • fix: Fixed the arguments in create_repo() function call by @Sai-Suraj-27 in #31947
  • Notify new docker images built for circleci by @ydshieh in #31701
  • Avoid race condition by @ydshieh in #31973
  • Masking: remove flakiness from test by @gante in #31939
  • Generate: doc nits by @gante in #31982
  • Fix the incorrect permutation of gguf by @PenutChen in #31788
  • Cambricon MLUs support SDPA and flash_attn by @huismiling in #31102
  • Speedup model init on CPU (by 10x+ for llama-3-8B as one example) by @muellerzr in #31771
  • [tests] fix deepspeed zero3 config for test_stage3_nvme_offload by @faaany in #31881
  • Fix bad test about slower init by @muellerzr in #32002
  • Tests: remove cuda versions when the result is the same 🧹🧹 by @gante in #31955
  • Bug report update by @gante in #31983
  • add flash-attn deterministic option to flash-attn>=2.4.1 by @junrae6454 in #31961
  • fix: Fixed incorrect dictionary assignment in src/transformers/__init__.py by @Sai-Suraj-27 in #31993
  • Bug report update -- round 2 by @gante in #32006
  • Fix gather when collecting 'num_input_tokens_seen' by @CodeCreator in #31974
  • Fix if else and actually enable superfast init by @muellerzr in #32007
  • SpeechEncoderDecoder doesn't support param buffer assignments by @muellerzr in #32009
  • Fix tests skip by @qubvel in #32012
  • Fixed log messages that are resulting in TypeError due to too many arguments by @Sai-Suraj-27 in #32017
  • Fix typo in classification function selection logic to improve code consistency by @moses in #32031
  • doc: fix broken BEiT and DiNAT model links on Backbone page by @dvrogozh in #32029
  • Pass missing arguments to SeamlessM4Tv2ConformerEncoderLayer.forward() when gradient checkpointing is enabled by @anferico in #31945
  • Add language to word timestamps for Whisper by @robinderat in #31572
  • Add sdpa and FA2 for CLIP by @qubvel in #31940
  • unpin numpy<2.0 by @ydshieh in #32018
  • Chameleon: minor fixes after shipping by @zucchini-nlp in #32037
  • Bump scikit-learn from 1.0.2 to 1.5.0 in /examples/research_projects/decision_transformer by @dependabot[bot] in #31458
  • Bump scikit-learn from 1.1.2 to 1.5.0 in /examples/research_projects/codeparrot/examples by @dependabot[bot] in #32052
  • [mistral] Support passing head_dim through config (and do not require head_dim * num_heads == hidden_size) by @xenova in #32050
  • Add torch.compile Support For Mamba by @zhenglongjiepheonix in #31247
  • fix: Removed duplicate entries in a dictionary by @Sai-Suraj-27 in #32041
  • docs: Fixed 2 links in the docs along with some minor fixes by @Sai-Suraj-27 in #32058
  • Llava: add default chat templates by @zucchini-nlp in #31691
  • [Chameleon, Hiera] Improve docs by @NielsRogge in #32038
  • Incorrect Whisper long-form decoding timestamps by @kamilakesbi in #32003
  • [mistral] Fix FA2 attention reshape for Mistral Nemo by @xenova in #32065
  • VideoLLaVa: fix chat format in docs by @zucchini-nlp in #32083
  • Fix progress callback deepcopy by @fozziethebeat in #32070
  • Fixes to chameleon docs by @merveenoyan in #32078
  • Add image-text-to-text task guide by @merveenoyan in #31777
  • Support generating with fallback for short form audio in Whisper by @kamilakesbi in #30984
  • Disable quick init for deepspeed by @muellerzr in #32066
  • Chameleon: not supported with fast load by @zucchini-nlp in #32091
  • Fix tests after huggingface_hub 0.24 by @Wauplin in #32054
  • Fix shard order by @b-chu in #32023
  • Generate: store special token tensors under a unique variable name by @gante in #31980
  • fix: Replaced deprecated mktemp() function by @Sai-Suraj-27 in #32123
  • Mention model_info.id instead of model_info.modelId by @Wauplin in #32106
  • [generate] fix eos/pad id check on mps devices by @sanchit-gandhi in #31695
  • Fix failing test with race condition by @Rocketknight1 in #32140
  • Update ko/_toctree.yml and remove custom_tools.md to reflect latest changes by @jungnerd in #31969
  • fix: Fixed raising TypeError instead of ValueError for invalid type by @Sai-Suraj-27 in #32111
  • [RoBERTa] Minor clarifications to model doc by @bt2513 in #31949
  • Return assistant generated tokens mask in apply_chat_template by @yonigottesman in #30650
  • Don't default to other weights file when use_safetensors=True by @amyeroberts in #31874
  • set warning level to info for special tokens have been added by @ArthurZucker in #32138
  • Add new quant method by @SunMarc in #32047
  • Add llama3-llava-next-8b to llava_next conversion script by @jamt9000 in #31395
  • LLaVaNeXT: pad on right if training by @zucchini-nlp in #32134
  • Remove trust_remote_code when loading Libri Dummy by @sanchit-gandhi in #31748
  • [modelling] remove un-necessary transpose for fa2 attention by @sanchit-gandhi in #31749
  • Fix mask creations of GPTNeoX and GPT2 by @vasqu in #31944
  • Add method to retrieve used chat template by @KonradSzafer in #32032
  • Add YaRN and Dynamic-YaRN RoPE Scaling Methods by @mig-mfreitas in #30910
  • Disable quick init for TapasPreTrainedModel by @daniellok-db in #32149
  • Modify resize_token_embeddings to ensure output type is same as input by @bayllama in #31979
  • gguf conversion add_prefix_space=None for llama3 by @itazap in #31937
  • Fix flash attention speed issue by @Cyrilvallez in #32028
  • Fix video batching to videollava by @merveenoyan in #32139
  • Added mamba.py backend by @alxndrTL in #30139
  • Rename Phi-3 rope scaling type by @garg-amit in #31436
  • Revert "Incorrect Whisper long-form decoding timestamps " by @sanchit-gandhi in #32148
  • Fix typing to be compatible with later py versions by @amyeroberts in #32155
  • feat(cache): StaticCache uses index_copy_ to avoid useless copy by @tengomucho in #31857
  • Added additional kwarg for successful running of optuna hyperparameter search by @DeF0017 in #31924
  • Enhancing SFT Training Efficiency Using Packing and FlashAttention2 with Position IDs by @RhuiDih in #31629

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @aliencaocao
    • Fix float out of range in owlvit and owlv2 when using FP16 or lower precision (#31657)
    • Add torch_empty_cache_steps to TrainingArguments (#31546)
    • Add training support for SigLIP (#31495)
    • Allow FP16 or other precision inference for Pipelines (#31342)
  • @voidism
    • Generate: Add new decoding strategy "DoLa" in .generate() (#29619)
  • @Namangarg110
    • Adding hiera (#30356)
Jul 11, 2024
Patch release v4.42.4

Mostly Gemma2 support for FA2 softcapping!

but also fixes to the sliding window for long contexts, and other typos.

  • [Gemma2] Support FA2 softcapping (#31887) by @ArthurZucker
  • [ConvertSlow] make sure the order is preserved for addedtokens (#31902) by @ArthurZucker
  • Fixes to alternating SWA layers in Gemma2 (#31775) by @turboderp
  • Requires for torch.tensor before casting (#31755) by @echarlaix

I was off last week and could not get this out; thanks all for your patience 🥳

Jun 28, 2024
Patch release v4.42.3

Make sure we have attention softcapping for the "eager" Gemma2 model

After experimenting, we noticed that softcapping is a must, mostly for the 27b model. So we're adding it back (it should have been there, but an error on my side made it disappear). Sorry all! 😭

  • Gemma capping is a must for big models (#31698)
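For readers unfamiliar with the technique: softcapping squashes attention logits into (-cap, cap) with a scaled tanh, so large scores saturate instead of growing unboundedly. A minimal sketch of the idea (illustrative only, not the transformers implementation):

```python
import math

def softcap(logits, cap=50.0):
    """Squash each logit into (-cap, cap) via cap * tanh(logit / cap)."""
    return [cap * math.tanh(x / cap) for x in logits]

scores = [0.0, 10.0, 200.0, -500.0]
capped = softcap(scores)
# Small scores pass through nearly unchanged; large ones saturate near +/-cap.
```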
Patch release v4.42.2

Patch release

Thanks to our 2 contributors for their prompt fixes! These mostly apply to training and FA2.

  • Fix Gemma2 4d attention mask (#31674) by @hiyouga
  • don't zero out the attention_mask when using sliding window with flash attention (#31670) by @winglian
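For context, a sliding-window causal mask only lets position i attend to the last few positions rather than the full prefix. A rough illustration of the assumed semantics (not the transformers code):

```python
def sliding_window_mask(seq_len, window):
    """True where attention is allowed: causal, limited to the last `window` positions."""
    return [
        [(0 <= i - j < window) for j in range(seq_len)]
        for i in range(seq_len)
    ]

# With window=3, position 4 may attend only to positions 2, 3 and 4.
mask = sliding_window_mask(5, window=3)
```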
Jun 27, 2024
v4.42.1: Patch release

Patch release for commit:

  • [HybridCache] Fix get_seq_length method (#31661)
v4.42.0: Gemma 2, RTDETR, InstructBLIP, LLAVa Next, New Model Adder

New model additions

Gemma-2

The Gemma2 model was proposed in Gemma2: Open Models Based on Gemini Technology and Research by the Gemma2 team at Google. Gemma2 models are trained on 6T tokens and released in two versions, 2b and 7b.

The abstract from the paper is the following:

This work introduces Gemma2, a new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma2 outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of our model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations

  • Add gemma 2 by @ArthurZucker in #31659

RTDETR

The RT-DETR model was proposed in DETRs Beat YOLOs on Real-time Object Detection by Wenyu Lv, Yian Zhao, Shangliang Xu, Jinman Wei, Guanzhong Wang, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu.

RT-DETR is an object detection model that stands for “Real-Time DEtection Transformer.” This model is designed to perform object detection tasks with a focus on achieving real-time performance while maintaining high accuracy. Leveraging the transformer architecture, which has gained significant popularity in various fields of deep learning, RT-DETR processes images to identify and locate multiple objects within them.

  • New model support RTDETR by @SangbumChoi in #29077

InstructBlip

The InstructBLIP model was proposed in InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. InstructBLIP leverages the BLIP-2 architecture for visual instruction tuning.

InstructBLIP uses the same architecture as BLIP-2 with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.

  • Add video modality for InstrucBLIP by @zucchini-nlp in #30182

LlaVa NeXT Video

The LLaVa-NeXT-Video model was proposed in LLaVA-NeXT: A Strong Zero-shot Video Understanding Model by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. LLaVa-NeXT-Video improves upon LLaVa-NeXT by fine-tuning on a mix of video and image data, thus increasing the model’s performance on videos.

LLaVA-NeXT surprisingly has strong performance in understanding video content in a zero-shot fashion with the AnyRes technique that it uses. The AnyRes technique naturally represents a high-resolution image as multiple images. This technique generalizes naturally to videos, because a video can be considered as a set of frames (similar to a set of images in LLaVa-NeXT). The current version of LLaVA-NeXT makes use of AnyRes and trains with supervised fine-tuning (SFT) on top of LLaVA-NeXT on video data to achieve better video understanding capabilities. The model is the current SOTA among open-source models on the VideoMME benchmark.

  • Add LLaVa NeXT Video by @zucchini-nlp in #31252

New model adder

A very significant change makes its way within the transformers codebase, introducing a new way to add models to transformers. We recommend reading the description of the PR below, but here is the gist of it:

The diff_converter tool is here to replace our old Copied from statements, while keeping our core transformers philosophy:

  • single model single file
  • explicit code
  • standardization of modeling code
  • readable and educative code
  • simple code
  • least amount of modularity

This additionally unlocks the ability to very quickly see the differences between new architectures that get developed. While many architectures are similar, the "single model, single file" policy can obfuscate the changes. With this diff converter, we want to make the changes between architectures very explicit.

  • Diff converter v2 by @ArthurZucker in #30868

Tool-use and RAG model support

We've made major updates to our support for tool-use and RAG models. We can now automatically generate JSON schema descriptions for Python functions which are suitable for passing to tool models, and we've defined a standard API for tool models which should allow the same tool inputs to be used with many different models. Models will need updates to their chat templates to support the new API, and we're targeting the Nous-Hermes, Command-R and Mistral/Mixtral model families for support in the very near future. Please see the updated chat template docs for more information.

If you are the owner of a model that supports tool use, but you're not sure how to update its chat template to support the new API, feel free to reach out to us for assistance with the update, for example on the Hugging Face Discord server. Ping Matt and yell key phrases like "chat templates" and "Jinja" and your issue will probably get resolved.

  • Chat Template support for function calling and RAG by @Rocketknight1 in #30621
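To illustrate the idea of generating a JSON schema description from a Python function, here is a simplified, hypothetical sketch; the helper name and the exact output format of the real transformers utility may differ:

```python
import inspect

# Hypothetical sketch of deriving an OpenAI-style tool schema from a Python
# signature and docstring; not the actual transformers utility.
_TYPE_MAP = {int: "integer", float: "number", str: "string", bool: "boolean"}

def to_json_schema(func):
    """Build a tool schema from a function's signature and docstring."""
    sig = inspect.signature(func)
    props = {}
    required = []
    for name, param in sig.parameters.items():
        props[name] = {"type": _TYPE_MAP.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # parameters without defaults are required
    return {
        "type": "function",
        "function": {
            "name": func.__name__,
            "description": (func.__doc__ or "").strip(),
            "parameters": {"type": "object", "properties": props, "required": required},
        },
    }

def get_weather(city: str, celsius: bool = False):
    """Return the current weather in a city."""

schema = to_json_schema(get_weather)
```

A schema like this can then be passed to a tool-capable model via its chat template.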

GGUF support

We extend support for GGUF files to allow fine-tuning within the Python/HF ecosystem, before converting the models back for use with the GGUF/GGML/llama.cpp libraries.

  • Add Qwen2 GGUF loading support by @Isotr0py in #31175
  • GGUF: Fix llama 3 GGUF by @younesbelkada in #31358
  • Fix llama gguf converter by @SunMarc in #31575

Trainer improvements

A new optimizer is added in the Trainer.

  • FEAT / Trainer: LOMO optimizer support by @younesbelkada in #30178

Quantization improvements

Several quantization-related improvements land in this release: a new cache (the quantized KV cache) is added, offering the ability to quantize the cache of generative models and further reduce memory requirements.

Additionally, the documentation related to quantization is entirely redone with the aim of helping users choose which is the best quantization method.

  • Quantized KV Cache by @zucchini-nlp in #30483
  • Docs / Quantization: refactor quantization documentation by @younesbelkada in #30942
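The idea behind a quantized KV cache can be sketched in miniature: store cached values in int8 with a scale factor, and dequantize on read. This illustrates the concept only; the actual implementation (built on quanto) differs:

```python
# Illustrative sketch of KV-cache quantization: int8 storage with a
# per-tensor scale, dequantized on read, at a small accuracy cost.
def quantize(values):
    scale = max(abs(v) for v in values) / 127 or 1.0
    return [round(v / scale) for v in values], scale

def dequantize(q, scale):
    return [x * scale for x in q]

cache = [0.5, -1.25, 3.0, 0.0]
q, scale = quantize(cache)       # each entry now fits in [-127, 127]
restored = dequantize(q, scale)  # approximately recovers the original values
```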

Examples

New instance segmentation examples are added by @qubvel

  • Instance segmentation examples by @qubvel in #31084

Notable improvements

As a notable improvement to the HF vision models that leverage backbones, we enable leveraging HF pretrained model weights as backbones, with the following API:

from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation

config = MaskFormerConfig(backbone="microsoft/resnet-50", use_pretrained_backbone=True)
model = MaskFormerForInstanceSegmentation(config)
  • Enable HF pretrained backbones by @amyeroberts in #31145

Additionally, we thank @Cyrilvallez for diving into our generate method and greatly reducing the memory requirements.

  • Reduce by 2 the memory requirement in generate() 🔥🔥🔥 by @Cyrilvallez in #30536

Breaking changes

Remove ConversationalPipeline and Conversation object

Both the ConversationalPipeline and the Conversation object have been deprecated for a while, and are removed in v4.42, this version.

The TextGenerationPipeline is recommended for this use case, and now accepts inputs in the form of the OpenAI API.

  • 🚨 Remove ConversationalPipeline and Conversation object by @Rocketknight1 in #31165
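The OpenAI API input format referred to above is a list of role/content dictionaries. A sketch (the pipeline call is commented out and the model name is a placeholder):

```python
# OpenAI-style chat input now accepted by TextGenerationPipeline.
chat = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
]

# Illustrative usage; the model name below is a placeholder:
# from transformers import pipeline
# pipe = pipeline("text-generation", model="some/chat-model")
# out = pipe(chat)  # returns the chat continued with an assistant turn
```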

Remove an accidental duplicate softmax application in FLAVA's attention

This removes the duplicate softmax application in FLAVA attention. It is likely to cause a small change in outputs, hence the 🚨 flag.

  • 🚨 FLAVA: Remove double softmax by @amyeroberts in #31322
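The effect of the bug is easy to see numerically: applying softmax to already-normalized probabilities flattens the distribution, so removing the second application shifts outputs slightly:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    exps = [math.exp(x - max(xs)) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]
once = softmax(logits)
twice = softmax(once)  # the accidental second application
# `twice` is noticeably flatter than `once`.
```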

The ignore_index attribute of Idefics2's loss is updated to -100

  • 🚨 [Idefics2] Update ignore index by @NielsRogge in #30898

out_indices from timm being updated

Recent updates to timm changed the type of the attribute model.feature_info.out_indices. Previously, out_indices would reflect the input type passed on the create_model call, i.e. either tuple or list. Now, this value is always a tuple.

As lists are more useful and consistent for us -- we cannot save tuples in configs, they must be converted to lists first -- we instead choose to cast out_indices to always be a list.

This is potentially a slight breaking change if users create models and rely on out_indices being a tuple. As this only applies when a new model is created, and not when it is saved and reloaded (because of the config), the impact should be low.

  • 🚨 out_indices always a list by @amyeroberts in #30941
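User code that relied on out_indices being a tuple can normalize the value defensively; a hypothetical user-side sketch:

```python
def normalize_out_indices(out_indices):
    """Accept a tuple or list (as returned by timm/transformers over time)
    and always return a list, which is also what configs can serialize."""
    return list(out_indices)

indices = normalize_out_indices((0, 1, 2))
```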

The datasets referenced in the quantization config are updated to remove references to datasets with restrictive licenses.

  • 🚨 Remove dataset with restrictive license by @echarlaix in #31452

Bugfixes and improvements

  • Add fixed resize and pad strategy for object detection by @qubvel in #30742
  • Enable dynamic resolution input for Swin Transformer and variants by @the-neural-networker in #30656
  • Add TokenClassification for Mistral, Mixtral and Qwen2 by @josephenguehard in #29878
  • FIX / Quantization: Fix Dockerfile build by @younesbelkada in #30890
  • Add support for torch.compile dynamic shapes by @warner-benjamin in #30560
  • LLaVa-Next: Update docs with batched inference by @zucchini-nlp in #30857
  • DeformableDETR two stage support bfloat16 by @DonggeunYu in #30907
  • add return_token_timestamps to WhisperProcessor by @kamilakesbi in #30812
  • Fix num_hidden_layers in initialization of new model in Mamba by @SrGonao in #30403
  • separate kwargs in processor (similar to #30193) by @Eric2i in #30905
  • fix for custom pipeline configuration by @not-lain in #29004
  • Add AutoFeatureExtractor support to Wav2Vec2ProcessorWithLM by @ylacombe in #28706
  • Fix a shape annotation and typos in mamba slow forward by @vasqu in #30691
  • tokenizer_class = "AutoTokenizer" Llava Family by @ArthurZucker in #30912
  • Introduce configured_state arg for accelerator_config by @muellerzr in #29781
  • Add torch.compile for Mistral by @zhenglongjiepheonix in #30642
  • [docs] Spanish translation of model_memory_anatomy.md by @aaronjimv in #30885
  • FIX / TST: Fix expected results on Mistral slow test (A10) by @younesbelkada in #30909
  • PaliGemma - fix processor with no input text by @hiyouga in #30916
  • CI: AMD MI300 tests fix by @mht-sharma in #30797
  • Enforce saving at end of training if saving option chosen by @muellerzr in #30160
  • fix: center_crop occasionally outputs off-by-one dimension matrix by @mattlbeck in #30934
  • [Benchmark] Reuse optimum-benchmark by @ydshieh in #30615
  • TST / Workflows: Get slack notifications for docker image build by @younesbelkada in #30891
  • Fix swin embeddings interpolation by @amyeroberts in #30936
  • Fix inhomogeneous shape error in example by @Zantares in #30434
  • update ruff version by @ArthurZucker in #30932
  • Update build ci image [push-ci-image] by @ArthurZucker in #30933
  • Update video-llava docs by @zucchini-nlp in #30935
  • Fix low cpu mem usage tests by @SunMarc in #30808
  • [doc] Add references to the fine-tuning blog and distil-whisper to Whisper. by @Vaibhavs10 in #30938
  • Avoid extra chunk in speech recognition by @jonatanklosko in #29539
  • [whisper] only trigger forced ids warning once by @sanchit-gandhi in #30966
  • Paligemma - fix slow tests, add bf16 and f16 slow tests by @molbap in #30851
  • Finally fix the missing new model failure CI report by @ydshieh in #30968
  • legacy to init the slow tokenizer when converting from slow was wrong by @ArthurZucker in #30972
  • Generation: get special tokens from model config by @zucchini-nlp in #30899
  • [Whisper] Strip prompt before finding common subsequence by @sanchit-gandhi in #27836
  • Fix link in Pipeline documentation by @junhl in #30948
  • [Mistral and friends] Update MLP by @NielsRogge in #31057
  • Paligemma causal attention mask by @molbap in #30967
  • Update object detection with latest resize and pad strategies by @qubvel in #30955
  • Using assistant in AutomaticSpeechRecognitionPipeline with different encoder size by @kamilakesbi in #30637
  • Push ci image by @ArthurZucker in #30982
  • test_custom_4d_attention_mask skip with sliding window attn by @poedator in #30833
  • Finish adding support for torch.compile dynamic shapes by @warner-benjamin in #30919
  • FIX / Docs: Minor changes in quantization docs by @younesbelkada in #30985
  • Fix accelerate failing tests by @SunMarc in #30836
  • [tests] add torch.use_deterministic_algorithms for XPU by @faaany in #30774
  • Add a check that warmup_setps is either 0 or >= 1 by @ymoslem in #30764
  • Update 4 MptIntegrationTests expected outputs by @ydshieh in #30989
  • [Port] TensorFlow implementation of Mistral by @ariG23498 in #29708
  • Remove deprecated properties in tokenization_nllb.py and tokenization_nllb_fast.py by @ymoslem in #29834
  • Bugfix: WandbCallback uploads initial model checkpoint by @mgerstgrasser in #30897
  • add prefix space ignored in llama #29625 by @itazap in #30964
  • Fix training speed regression introduced by "optimize VRAM for calculating pos_bias in LayoutLM v2, v3 (#26139)" by @kkoehncke
  • Do not trigger autoconversion if local_files_only by @Wauplin in #31004
  • pin uv==0.1.45 by @ydshieh in #31006
  • Perceiver interpolate position embedding by @g1y5x3 in #30979
  • [tests] make test_model_parallelism device-agnostic by @faaany in #30844
  • FIX / TST: Fix expected results on Mistral AWQ test by @SunMarc in #30971
  • allow multi-gpu by @ydshieh in #31011
  • Fix resume_download future warning by @Wauplin in #31007
  • Quantization / TST: Fix remaining quantization tests by @younesbelkada in #31000
  • save the list of new model failures by @ydshieh in #31013
  • added interpolation for vitmae model in pytorch as well as tf. by @bhuvanmdev in #30732
  • Add split special tokens by @itazap in #30772
  • Paligemma- fix devices and dtype assignments by @molbap in #31008
  • Redirect transformers_agents doc to agents by @aymeric-roucher in #31054
  • unpin uv by @ydshieh in #31055
  • Follow up: Fix link in dbrx.md by @eitanturok in #30514
  • Update feature request label in template by @amyeroberts in #30940
  • Fix quanto tests by @SunMarc in #31062
  • Fix pad_to_max_length Whisper by @ylacombe in #30787
  • skip test_model_parallelism for 2 model test classes by @ydshieh in #31067
  • use @main by @ydshieh in #31065
  • Remove ninja from docker image build by @ydshieh in #31080
  • fix "piano" typo by @clinty in #31027
  • Update quicktour.md to fix broken link to Glossary by @apalkk in #31072
  • Remove redundant backend checks in training_args.py by @kevint324 in #30999
  • fix from_pretrained in offline mode when model is preloaded in cache by @oOraph in #31010
  • Remove float64 cast for OwlVit and OwlV2 to support MPS device by @qubvel in #31071
  • Fix OWLv2 post_process_object_detection for multiple images by @qubvel in #31082
  • Fix typo in trainer.py by @taslimisina in #31048
  • [SuperPoint, PaliGemma] Update docs by @NielsRogge in #31025
  • Fix failing tokenizer tests by @LysandreJik in #31083
  • Watermark: fix tests by @zucchini-nlp in #30961
  • Docs / PEFT: Add PEFT API documentation by @younesbelkada in #31078
  • Render chat template tojson filter as unicode by @CISC in #31041
  • FIX: Add accelerate as a hard requirement by @younesbelkada in #31090
  • FIX / OPT: Fix OPT multi-GPU training for OPTForQuestionAnswering by @younesbelkada in #31092
  • skip test_multi_gpu_data_parallel_forward for vit and deit by @ydshieh in #31086
  • Fix PretrainedConfig docstring with deprecated resume_download by @albertvillanova in #31014
  • Fix DeepSpeed compatibility with weight_norm by @jonnyli1125 in #30881
  • TST: Fix instruct-blip tests by @younesbelkada in #31088
  • Docs / Quantization: Redirect deleted page by @younesbelkada in #31063
  • Deprecate low use models by @amyeroberts in #30781
  • Quantized KV cache: update quanto by @zucchini-nlp in #31052
  • FEAT: Add mistral v3 conversion script by @younesbelkada in #30981
  • Use HF_HUB_OFFLINE + fix has_file in offline mode by @Wauplin in #31016
  • Improve transformers-cli env reporting by @statelesshz in #31003
  • Fix env.py in cases where torch is not present by @Rocketknight1 in #31113
  • Fix faulty rstrip in module loading by @Rocketknight1 in #31108
  • Rm maintainer + migrate by @muellerzr in #31089
  • Fix nightly circleci by @ydshieh in #31114
  • FIX / Docs: Fix GPTQ expected number of bits by @younesbelkada in #31111
  • Add VLM generation default contributor by @gante in #31115
  • Add on_optimizer_step to callback options by @dhruvbpai in #31095
  • Cleanup docker build by @ydshieh in #31119
  • FIX / Quantization: Add extra validation for bnb config by @younesbelkada in #31135
  • fix get_scheduler when name is warmup_stable_decay by @zspo in #31128
  • Docs / Quantization: Replace all occurences of load_in_8bit with bnb config by @younesbelkada in #31136
  • Workflow: Remove IS_GITHUB_CI by @younesbelkada in #31147
  • helper by @ArthurZucker in #31152
  • pytest -rsfE by @ydshieh in #31140
  • Fix quantized cache output by @SunMarc in #31143
  • Update sam.md by @asifajrof in #31130
  • Quantization: Enhance bnb error message by @younesbelkada in #31160
  • [trainer] add sanity evaluation option by @SunMarc in #31146
  • Add streaming, various fixes by @aymeric-roucher in #30838
  • Added description of quantization_config by @vamsivallepu in #31133
  • Fix typo: use_safetenstors to use_safetensors by @CharlesCNorton in #31184
  • Remove copied froms for deprecated models by @amyeroberts in #31153
  • Token healing by @ahmed-moubtahij in #30081
  • [GemmaModel] fix small typo by @ArthurZucker in #31202
  • Fix Cannot convert [array()] to EagerTensor of dtype int64 by @pavi-ninjaac in #31109
  • Ignore non-causal mask in more cases with SDPA by @fxmarty in #30138
  • SlidingWindowCache: reduce differences to other Cache classes by @gante in #30970
  • Fix test_compile_static_cache by @ydshieh in #30991
  • fix the get_size_with_aspect_ratio in max_size situation by @SangbumChoi in #30902
  • Fix typo in utils by @Bojun-Feng in #31169
  • Rename sanity_evaluation to eval_on_start by @Qubitium in #31192
  • Wrong translation FR : Contents = Contenu by @jadechoghari in #31186
  • Cohere: Fix copied from by @younesbelkada in #31213
  • Set greater_is_better to False if metric_for_best_model ends with "loss" by @miivanov90 in #31142
  • Fix GPU OOM for mistral.py::Mask4DTestHard by @ydshieh in #31212
  • [docs] Spanish translation of tokenizer_summary.md by @aaronjimv in #31154
  • Pass device in Logits Processor's init by @zucchini-nlp in #29804
  • Fix sentence fragment within test comments by @DomHudson in #31218
  • fix(PatchTST): Wrong dropout used for PretainHead by @maxstrobel in #31117
  • Video-LLaVa: handle any number of frames by @zucchini-nlp in #31221
  • Add dynamic resolution input/interpolate position embedding to deit by @p-kris10 in #31131
  • fix bf16 issue in text classification pipeline by @chujiezheng in #30996
  • Fix pipeline tests - torch imports by @amyeroberts in #31227
  • Add new line switch before logging ***** Running {description} ***** by @jacklanda in #31225
  • add no split modules for xlmrobertaxl by @ManuelFay in #31223
  • Fix MistralIntegrationTest by @ydshieh in #31231
  • Blip: Deprecate BlipModel by @younesbelkada in #31235
  • Move out common backbone config param validation by @amyeroberts in #31144
  • Upload (daily) CI results to Hub by @ydshieh in #31168
  • Specify dtype=torch.bool to avoid xla error by @ysulsky in #31191
  • Fixing name 'torch' is not defined in bitsandbytes integration by @jamesbraza in #31243
  • Benchmark GitHub Actions workflow by @ydshieh in #31163
  • Early labels validation by @amyeroberts in #31240
  • doc: add info about wav2vec2 bert in older wav2vec2 models. by @Vaibhavs10 in #31120
  • enable deterministic mode for npu by @statelesshz in #31253
  • Add missing Flaubert tokenizer tests by @bastrob in #30492
  • Fix circular reference issue in CLIPTokenizerFast by @dhaivat1729 in #31075
  • Add condition to benchmark job in push-important-models.yml by @ydshieh in #31259
  • Skip failing JetMOE generation tests by @amyeroberts in #31266
  • no need for explicit EXTRA_TOKENS in processing_paligemma.py by @grahamannett in #31022
  • [SwitchTransformer] Significant performance improvement on MoE blocks by @ranggihwang in #31173
  • fix loading special_tokens_map_file by @ZhiyuanChen in #31012
  • Make mamba use cache by @zucchini-nlp in #31116
  • Generation: fix handling of special tokens by @zucchini-nlp in #31254
  • Switch from cached_download to hf_hub_download in remaining occurrences by @Wauplin in #31284
  • fix: str should be used not int when setting env variables by @statelesshz in #31272
  • Fix _save_tpu: use _maybe_convert_to_cpu instead of to cpu. by @baoleai in #31264
  • fix accelerate tests for roberta xl by @SunMarc in #31288
  • Enable dynamic resolution input for Beit by @OmarManzoor in #31053
  • Mark MobileNetV1ModelTest::test_batching_equivalence as flaky by @amyeroberts in #31258
  • Pipeline VQA: Add support for list of images and questions as pipeline input by @BlacCod in #31217
  • Fix SwinLayer / DonutSwinLayer / ClapAudioLayer attention mask device by @gorodnitskiy in #31295
  • Update text-to-speech.md by @jaguaryang in #31269
  • Fixed Wav2Vec2ProcessorWithLM decoding error by @karicotiza in #31188
  • Fix jetmoe model by @Cyrilvallez in #31279
  • Extend save_pretrained to offloaded models by @blbadger in #27412
  • Implement JSON dump conversion for torch_dtype in TrainingArguments by @junrae6454 in #31224
  • interpolation added for TVP. by @bhuvanmdev in #30863
  • Rename test_model_common_attributes -> test_model_get_set_embeddings by @amyeroberts in #31321
  • Use unused prepare_img() function in dinov2 conversion script by @IbrahimAmin1 in #31335
  • docs: fix style by @imba-tjd in #31340
  • Fix paligemma inverted mask by @molbap in #31207
  • docs/zh: fix style by @imba-tjd in #31334
  • Decorators for deprecation and named arguments validation by @qubvel in #30799
  • Improve error msg when using bitsandbytes by @SunMarc in #31350
  • Fix Cohere CI by @ydshieh in #31263
  • Fix gradio tool demos by @aymeric-roucher in #31230
  • Fast image processor by @amyeroberts in #28847
  • Add french translation of AutoBackbone by @jadechoghari in #31300
  • Add support to declare imports for code agent by @JasonZhu1313 in #31355
  • Fix idefics cache by @zucchini-nlp in #31377
  • [Bug Fix] Renamed loss to losses to suppress UnboundLocalError by @her0e1c1 in #31365
  • docs: fix broken link by @imba-tjd in #31370
  • backbone_utils - fix relative import by @amyeroberts in #31382
  • README underline between badges fix by @novialriptide in #31376
  • Update comment in modeling_utils.py by @inf3rnus in #31299
  • Use huggingface_hub helper function to split state dict by @SunMarc in #31091
  • Change JSON serialization to custom json.dumps by @junrae6454 in #31100
  • feat(ci): add trufflehog secrets detection by @McPatate in #31344
  • [QoL fix] [Image processing] Add warning on assumption of channel dim and avoid infering when inputs are PIL.Image by @aliencaocao in #31364
  • Make chat templates part of ProcessorMixin by @Rocketknight1 in #30744
  • add initial design for uniform processors + align model by @molbap in #31197
  • Add missing French translation of tutoriel_pipeline.md by @jadechoghari in #31396
  • Temporarily pin datasets upper version to fix CI by @albertvillanova in #31407
  • Support Clip QKV for MPT by @akakakakakaa in #31307
  • Pin datasets<2.20.0 for examples by @amyeroberts in #31417
  • Fix MusicGen SDPA by @ylacombe in #31208
  • Set seed for M4T retain grad test by @ylacombe in #31419
  • Fix SpeechT5 decoder_attention_mask shape by @ylacombe in #28071
  • Change potential inputs_embeds padding logger.warning to logger.warning_once by @naimenz in #31411
  • Remove duplicate image processor in auto map by @amyeroberts in #31383
  • Install the tensorflow example requirements in docker by @amyeroberts in #31428
  • Remove empty create_and_test_config_common_properties tests by @amyeroberts in #31359
  • xpu: support xpu backend from stock pytorch (>=2.4) by @dvrogozh in #31238
  • Musicgen special tokens in tensors by @zucchini-nlp in #31420
  • Fix Bark logits processors device misplacement by @ylacombe in #31416
  • Rename misnamed image processor test files by @amyeroberts in #31430
  • Generate: fix tokenizer being popped twice by @gante in #31427
  • [tests] make TestDeepSpeedModelZoo device-agnostic by @faaany in #31402
  • Support multiple validation datasets when dataloader_persistent_workers=True by @bastienlc in #30627
  • Pass datasets trust_remote_code by @albertvillanova in #31406
  • simple fix by @tokenizer-decode in #31456
  • Fix typing errors in Qwen2ForTokenClassification by @kevinhu in #31440
  • Agents: Improve python interpreter by @aymeric-roucher in #31409
  • Donut: fix generate call from local path by @gante in #31470
  • Make "tool_use" the default chat template key when tools are passed by @Rocketknight1 in #31429
  • Fix single letter stop strings by @Rocketknight1 in #31448
  • Update chat template docs and bump Jinja version by @Rocketknight1 in #31455
  • Improve PreTrainedTokenizerFast loading time when there are many added tokens by @ydshieh in #31404
  • Fix documentation typos by @qgallouedec in #31476
  • Give more useful metric_for_best_model errors by @tomaarsen in #31450
  • Update perf_train_gpu_many.md by @remyleone in #31451
  • [GPT2] Add SDPA support by @vasqu in #31172
  • Fix autocast incompatibility in RecurrentGemma by @xplip in #30832
  • Use self.config_tester.run_common_tests() by @amyeroberts in #31431
  • [tests] rename test_config_object to test_ds_config_object by @faaany in #31403
  • Docs / AQLM: Clarify torch.compile support for AQLM by @younesbelkada in #31473
  • Mamba: add generative tests by @gante in #31478
  • Update object_detection.md by @jajupmochi in #31488
  • Add docs on zeroshot image classification prompt templates by @aliencaocao in #31343
  • auto-detect device when no device is passed to pipeline by @faaany in #31398
  • Fix typo: pas_token_id by @ftnext in #30894
  • Fix wandb integration with SetFit model by @timothepearce in #30021
  • Consider inheritance in type checking for tensors by @daemyung in #31378
  • Add valid columns check in _remove_unused_columns method by @arthasking123 in #31466
  • Fix a teeny-tiny typo in tokenization_utils_base.py's docstring by @sadra-barikbin in #31510
  • Fix mismatched ` in doc & other common typos by @jhwei in #31516
  • RWKV: enable generation tests by @gante in #31490
  • unskip 2 tests in cohere by @ydshieh in #31517
  • Revive Nightly/Past CI by @ydshieh in #31159
  • Deprecate legacy cache + use cache position by @zucchini-nlp in #31491
  • SPLIT PR: add user defined symbols and control symbols by @itazap in #31305
  • Removed torch.cuda.empty_cache from train loop. by @FoamoftheSea in #31530
  • Update mask_generation.md by @nicholicaron in #31543
  • Correct @is_flaky test decoration by @qubvel in #31480
  • Add implementation of spectrogram_batch by @ravenouse in #27159
  • chore: fix typos by @xiaoxianBoy in #31559
  • Update git templates by @ArthurZucker in #31539
  • Fix the error caused by incorrect use of logger in pipeline by @lanyun1103 in #31565
  • Fix bug about add_special_tokens and so on by @hiroshi-matsuda-rit in #31496
  • Add Jinja as a requirement with the right version cutoff by @Rocketknight1 in #31536
  • Fix doc typo in TrainingArguments by @qgallouedec in #31503
  • Fix is_torch_xpu_available for torch < 2.3 by @amyeroberts in #31573
  • Added version constraint on numpy for version <2.0 by @Resteklicken in #31569
  • Siglip: add _no_split_module by @zucchini-nlp in #31566
  • fix output data type of image classification by @jiqing-feng in #31444
  • add preprocessing_num_workers to run_classification.py by @jiahuanluo in #31586
  • Improve error message for mismatched copies in code blocks by @molbap in #31535
  • Add ViTImageProcessorFast to tests by @amyeroberts in #31424
  • docs: move translations to i18n by @SauravMaheshkar in #31584
  • Removed unnecessary self.projection call in VivitTubeletEmbeddings by @v-iashin in #31632
  • [GPT-NeoX] Add SDPA support by @vasqu in #31031
  • Update RT-DETR code snippet by @qubvel in #31631
  • Llama et al. / FSDP : Fix breaking change in 4.40 for FSDP by @younesbelkada in #31161
  • Fix RT-DETR inference with float16 and bfloat16 by @qubvel in #31639
  • Fix paligemma detection inference by @molbap in #31587
  • Generate: fix assisted generation with past_key_values passed as kwargs by @gante in #31644
  • Fix dtype casting in swinv2 and swinv2sr to allow non-FP32 inference by @aliencaocao in #31589
  • Skip tests properly by @amyeroberts in #31308
  • Generation: past kv can be None by @zucchini-nlp in #31051
  • Fix ONNX exports for Optimum compatible models by @merveenoyan in #31311

Significant community contributions

The following contributors have made significant changes to the library over the last release:

  • @josephenguehard
    • Add TokenClassification for Mistral, Mixtral and Qwen2 (#29878)
  • @vasqu
    • Fix a shape annotation and typos in mamba slow forward (#30691)
    • [GPT2] Add SDPA support (#31172)
    • [GPT-NeoX] Add SDPA support (#31031)
  • @ariG23498
    • [Port] TensorFlow implementation of Mistral (#29708)
  • @bhuvanmdev
    • added interpolation for vitmae model in pytorch as well as tf. (#30732)
    • interpolation added for TVP. (#30863)
  • @SangbumChoi
    • fix the get_size_with_aspect_ratio in max_size situation (#30902)
    • New model support RTDETR (#29077)
  • @Cyrilvallez
    • Reduce by 2 the memory requirement in generate() 🔥🔥🔥 (#30536)
    • Fix jetmoe model (#31279)
  • @ravenouse
    • Add implementation of spectrogram_batch (#27159)
May 30, 2024
Release v4.41.2

Mostly fixing some issues related to trust_remote_code=True and from_pretrained.

Loading with local_files_only=True was failing when a .safetensors file did not exist. This is not expected: instead of trying to convert, we should simply fall back to loading the .bin files.

  • Do not trigger autoconversion if local_files_only (#31004) by @Wauplin, which fixes this!
  • Paligemma: Fix devices and dtype assignments (#31008) by @molbap
  • Redirect transformers_agents doc to agents (#31054) by @aymeric-roucher
  • Fix from_pretrained in offline mode when model is preloaded in cache (#31010) by @oOraph
  • Fix faulty rstrip in module loading (#31108) by @Rocketknight1