This patch mostly finishes the gradient accumulation fixes! Thanks to @techkang and @Ryukijano 🤗
This is mostly for fx and onnx issues!
* Fix regression loading dtype #34409 by @SunMarc
* LLaVa: latency issues #34460 by @zucchini-nlp
* Fix pix2struct #34374 by @IlyasMoutawwakil
* Fix onnx non-exposable inplace aten op #34376 by @IlyasMoutawwakil
* Fix torch.fx issue related to the new loss_kwargs keyword argument #34380 by @michaelbenayoun
The Moshi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour.
Moshi is a speech-text foundation model that casts spoken dialogue as speech-to-speech generation. Starting from a text language model backbone, Moshi generates speech as tokens from the residual quantizer of a neural audio codec, while modeling separately its own speech and that of the user into parallel streams. This allows for the removal of explicit speaker turns, and the modeling of arbitrary conversational dynamics. Moshi also predicts time-aligned text tokens as a prefix to audio tokens. This “Inner Monologue” method significantly improves the linguistic quality of generated speech and provides streaming speech recognition and text-to-speech. As a result, Moshi is the first real-time full-duplex spoken large language model, with a theoretical latency of 160ms, 200ms in practice.
Zamba-7B-v1 is a hybrid between state-space models (Specifically Mamba) and transformer, and was trained using next-token prediction. Zamba uses a shared transformer layer after every 6 mamba blocks. It uses the Mistral v0.1 tokenizer. We came to this architecture after a series of ablations at small scales. Zamba-7B-v1 was pre-trained on 1T tokens of text and code data.
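The interleaving described above (a single shared transformer layer reused after every 6 Mamba blocks) can be sketched in a few lines. This is a hypothetical illustration of the layer plan, not the actual modeling code; the function name and block labels are made up.

```python
# Hypothetical sketch of Zamba's layer interleaving: one shared transformer
# layer (same weights every time) is inserted after every 6 Mamba blocks.
def zamba_layer_plan(n_mamba_blocks: int, period: int = 6) -> list:
    plan = []
    for i in range(1, n_mamba_blocks + 1):
        plan.append(f"mamba_{i}")
        if i % period == 0:
            plan.append("shared_transformer")  # reused, not a fresh layer
    return plan

plan = zamba_layer_plan(12)
print(plan)  # 12 mamba blocks with the shared layer appearing twice
```

Because the transformer layer is shared, its parameters are counted once regardless of how many times it appears in the plan.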
<img width="400" alt="zamba" src="https://github.com/user-attachments/assets/a86428b8-4d24-4e5a-bf78-222312693bb2">

The GLM Model was proposed in ChatGLM: A Family of Large Language Models from GLM-130B to GLM-4 All Tools by GLM Team, THUDM & ZhipuAI.
The abstract from the paper starts with the following:
We introduce ChatGLM, an evolving family of large language models that we have been developing over time. This report primarily focuses on the GLM-4 language series, which includes GLM-4, GLM-4-Air, and GLM-4-9B.
The Idefics3 model was proposed in Building and better understanding vision-language models: insights and future directions by Hugo Laurençon, Andrés Marafioti, Victor Sanh, and Léo Tronchon.
Idefics3 is an adaptation of the Idefics2 model with three main differences:
The PhiMoE model was proposed in Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone by Microsoft.
This model is very similar to Mixtral with the main difference of Phi3LongRoPEScaledRotaryEmbedding, where they are used to extend the context of the rotary embeddings. The query, key and values are fused, and the MLP’s up and gate projection layers are also fused.
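The fused projections mentioned above can be illustrated with a toy example: one matrix multiply produces the concatenated query/key/value output, which is then sliced into three equal chunks. This is an illustrative sketch, not the library's modeling code; the names are made up.

```python
# Toy illustration of fused QKV: the query, key and value projections share
# one weight matrix, so a single matmul yields all three, and the fused
# output is sliced into equal chunks afterwards.
hidden_size = 4
fused_out = list(range(3 * hidden_size))  # stand-in for the fused projection output

q = fused_out[0 * hidden_size : 1 * hidden_size]
k = fused_out[1 * hidden_size : 2 * hidden_size]
v = fused_out[2 * hidden_size : 3 * hidden_size]

print(q, k, v)
```

Fusing the projections replaces three kernel launches with one, which is why it helps latency.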
This release adds SynthID, a novel state-of-the-art watermarking technique by Google DeepMind. SynthID has a low generation-time computational cost and can be configured to be nearly imperceptible (at the cost of harder watermarking detection). The release also comes with the code to train and run the corresponding detector, which is a machine learning model itself.
from transformers import AutoModelForCausalLM, AutoTokenizer, SynthIDTextWatermarkingConfig

tokenizer = AutoTokenizer.from_pretrained('google/gemma-2-2b', padding_side="left")
model = AutoModelForCausalLM.from_pretrained('google/gemma-2-2b')

# SynthID Text configuration
watermarking_config = SynthIDTextWatermarkingConfig(
    keys=[654, 400, 836, 123, 340, 443, 597, 160, 57],
    ngram_len=5,
)

# Generation with watermarking
tokenized_prompts = tokenizer(["Once upon a time, "], return_tensors="pt", padding=True)
output_sequences = model.generate(
    **tokenized_prompts, watermarking_config=watermarking_config, do_sample=True, max_new_tokens=10
)
watermarked_text = tokenizer.batch_decode(output_sequences, skip_special_tokens=True)
print(watermarked_text)
Docs for applying SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkLogitsProcessor

Docs for detecting SynthID watermarking: https://huggingface.co/docs/transformers/internal/generation_utils#transformers.SynthIDTextWatermarkDetector
<img width="750" alt="how-synthid-works-high-level" src="https://github.com/user-attachments/assets/c5702b21-e7e6-490d-8fe6-b73783e78e6b">

BitNet is an architecture introduced by Microsoft Research that uses extreme quantization, representing each parameter with only three values: -1, 0, and 1. This results in a model that uses just 1.58 bits per parameter, significantly reducing computational and memory requirements. It replaces traditional Linear layers in Multi-Head Attention and Feed-Forward Networks with specialized layers called BitLinears that use ternary precision (or even binary, in the initial version).
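A minimal sketch of the idea behind ternary quantization, assuming per-tensor absmean scaling (as described for BitNet b1.58); this is not the library's BitLinear implementation, just an illustration of mapping weights to {-1, 0, 1}:

```python
# Minimal sketch (not the library code) of absmean ternary quantization:
# every weight is mapped to {-1, 0, 1} using a single per-tensor scale.
def ternary_quantize(weights):
    scale = sum(abs(w) for w in weights) / len(weights)  # mean |w|
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

q, scale = ternary_quantize([0.9, -0.05, 0.4, -1.2])
print(q)  # every entry is -1, 0, or 1
```

At inference time, a matmul against ternary weights reduces to additions and subtractions, which is where the compute savings come from.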
More architectures are now supported in our GGUF loader; GGUF files saved with these architectures can now be loaded directly in transformers to be fine-tuned. We recommend using tooling from llama.cpp to requantize the models after further training has been done.
We are pushing for a unified inference API across multiple libraries. As part of this, we are cleaning up the input and output signatures for our pipeline classes and deprecating some rarely-used arguments. This is still a work-in-progress, but when it's finished, transformers pipelines should exactly match workflows in deployment libraries like transformers.js or TGI, allowing you to seamlessly move from development to production.
Also, pipelines now fully support the Processor class, used by vision-language models. Expect full pipeline support for chatting with VLMs in the very near future!
pipeline able to load processor by @qubvel in #32514

ExecuTorch is an end-to-end solution for enabling on-device inference capabilities across mobile and edge devices including wearables, embedded devices and microcontrollers. It is part of the PyTorch ecosystem and supports the deployment of PyTorch models with a focus on portability, productivity, and performance.
We are collaborating with the executorch team so that 🤗 Transformers models can be exported using torch.export. The goal of this integration is not only to enable export but also to ensure that the exported artifact can be further lowered and optimized to run efficiently in ExecuTorch, particularly for mobile and edge use cases.
* [MllamaProcessor] Update errors and API with multiple image by @ArthurZucker in #33715
* can_generate() recursive check by @gante in #33718
* image_size in Convnextv2 config by @lucianosrp in #33734
* [clean_up_tokenization_spaces] Pl bart was failing, updating by @ArthurZucker in #33735
* [MllamaImageProcessing] Update doc by @ArthurZucker in #33747
* load_balancing_loss_func function of modeling_mixtral.py by @PhilipMay in #33641
* [modular] fixes! by @ArthurZucker in #33820
* prepare_inputs_for_generation to GenerationMixin by @gante in #33677
* accelerate dependency error in case of defaulting low_cpu_mem_usage=True by @kylesayrs in #33830
* compressed_tensors by @kylesayrs in #33828
* tokenizer kwarg deprecation with decorator by @qubvel in #33887
* test_static_cache_matches_dynamic as flaky by @gante in #33630
* SplinterTokenizer unit test by @ariepratama in #32652
* weights_only flag when loading state_dict by @jerryzh168 in #32481
* save_pretrained exception to warning by @gante in #33906
* logits.float() by @ringohoffman in #33902
* validate_rope by @zucchini-nlp in #33753
* [PR run-slow] by @ArthurZucker in #33939
* self.position_embeddings->self.position_embedding by @ArthurZucker in #33958
* char_to_token documentation to note behaviour when trim_offsets is True by @Craigacp in #33919
* [TF] Fix Tensorflow XLA Generation on limited seq_len models by @vasqu in #33903
* [Red CIs] Fix hub failures by @ArthurZucker in #34001
* [pytes collection] Fix flax test collection by @ArthurZucker in #34004
* gguf.md to Korean by @yijun-lee in #33764
* swinv2.md to Korean by @mreraser in #33566
* audio_utils.md to Korean by @yijun-lee in #33802
* esm.md to Korean by @yijun-lee in #33796
* time_series_utils.md to Korean by @yijun-lee in #33806
* pipelines_utils.md to Korean by @yijun-lee in #33809
* trainer.md to Korean by @yijun-lee in #33797
* chameleon.md to Korean by @yijun-lee in #33799
* logging.md to Korean by @chhaewxn in #33543
* auto.md to Korean by @boyunJang in #33590
* swin2sr.md to Korean by @mreraser in #33795
* vit.md to Korean by @mreraser in #33884
* gemma.md to Korean by @yijun-lee in #33936
* decoder_config=None by @SunMarc in #34014
* trainer_seq2seq.py's __init__ type annotations by @benglewis in #34021
* feature_extractor.md to Korean by @yijun-lee in #33775
* bertweet.md to Korean by @ahnjj in #33891
* gpt_neox_japanese.md to Korean by @ahnjj in #33894
* rag.md to Korean by @chhaewxn in #33989
* main_classes/quantization.md to Korean by @fabxoe in #33959
* main_classes/configuration.md to Korean by @fabxoe in #33952
* model_doc/mamba.md to Korean by @fabxoe in #33626
* model_doc/autoformer.md to Korean by @fabxoe in #33574
* model_doc/patchtsmixer.md to Korean by @fabxoe in #33587
* model_doc/clip.md to Korean by @fabxoe in #33610
* model_doc/paligemma.md to Korean by @fabxoe in #33612
* model_doc/llama3.md to Korean by @fabxoe in #33635
* model_doc/mistral.md to Korean by @fabxoe in #33648
* model_doc/cohere.md to Korean by @fabxoe in #33885
* model_doc/dbrx.md to Korean by @fabxoe in #33951
* model_doc/deberta-v2.md to Korean by @fabxoe in #33968
* main_classes/onnx.md to Korean by @fabxoe in #33601
* tokenization_utils.md to Korean by @yijun-lee in #33813
* swin.md to Korean by @mreraser in #33510
* file_utils.md to Korean by @yijun-lee in #33803
* openai-gpt.md to Korean by @yijun-lee in #33801
* biogpt.md to Korean by @yijun-lee in #33773
* blip.md to Korean by @cjfghk5697 in #33515
* image_processing_utils.md to Korean by @yijun-lee in #33804
* modular_transformers.md to Korean by @yijun-lee in #33772
* [Patch helper] update to not have to checkout main by @ArthurZucker in #34006
* prepare_inputs_for_generation by @gante in #33870
* model_doc/bart.md to Korean by @fabxoe in #33893
* model_doc/deberta.md to Korean by @fabxoe in #33967
* main_classes/keras_callbacks.md to Korean by @fabxoe in #33955
* model_doc/mamba2.md to Korean by @fabxoe in #33629
* main_classes/model.md to Korean by @fabxoe in #33606
* model_doc/trajectory_transformer.md to Korean by @fabxoe in #33597
* model_doc/time_series_transformer.md to Korean by @fabxoe in #33596
* model_doc/informer.md to Korean by @fabxoe in #33585
* model_doc/graphormer.md to Korean by @fabxoe in #33569
* modeling_utils.md to Korean by @yijun-lee in #33808
* main_classes/data_collator.md to Korean by @fabxoe in #33954
* model_doc/patchtst.md to Korean by @fabxoe in #33589
* text_generation.md to Korean by @yijun-lee in #33777
* main_classes/callback.md to Korean by @Jwaminju in #33572
* generation_utils.md to Korean by @yijun-lee in #33818
* is_pipeline_test_to_skip method signature by @qubvel in #34067
* synced_gpus to True when using FullyShardedDataParallel by @ringohoffman in #33483
* logits to float() by @gante in #34042
* prepare_inputs_for_generation in encoder-decoder llms by @gante in #34048
* LlavaNextVideoForConditionalGeneration by @ydshieh in #34070
* generate calls with synced_gpus by @gante in #34095
* logits to same device as input_ids by @gante in #34076
* vivit.md to Korean by @mreraser in #33935
* gemma2.md to Korean by @yijun-lee in #33937
* trainer_utils.md to Korean by @yijun-lee in #33817
* blip-2.md to Korean by @cjfghk5697 in #33516
* accelerate error caused by 46d09af by @steveepreston in #34197
* trainer._get_eval_sampler() to support group_by_length arg by @larin92 in #33514
* prepare_inputs_for_generation by @gante in #34199
* require_torch_up_to_2_accelerators by @byi8220 in #34201
* MLFLOW_MAX_LOG_PARAMS to MLflowCallback by @cecheta in #34279
* executorch.md to Korean by @ahnjj in #33888
* bert japanese.md to Korean by @ahnjj in #33890
* model_doc/bartpho.md to Korean by @Jwaminju in #33981

The following contributors have made significant changes to the library over the last release:
* [MllamaProcessor] Update errors and API with multiple image (#33715)
* [clean_up_tokenization_spaces] Pl bart was failing, updating (#33735)
* [MllamaImageProcessing] Update doc (#33747)
* [modular] fixes! (#33820)
* [PR run-slow] (#33939)
* self.position_embeddings->self.position_embedding (#33958)
* [Red CIs] Fix hub failures (#34001)
* [pytes collection] Fix flax test collection (#34004)
* [Patch helper] update to not have to checkout main (#34006)
* [TF] Fix Tensorflow XLA Generation on limited seq_len models (#33903)
* LlavaNextVideoForConditionalGeneration (#34070)
* logits.float() (#33902)
* synced_gpus to True when using FullyShardedDataParallel (#33483)
* gguf.md to Korean (#33764)
* audio_utils.md to Korean (#33802)
* esm.md to Korean (#33796)
* time_series_utils.md to Korean (#33806)
* pipelines_utils.md to Korean (#33809)
* trainer.md to Korean (#33797)
* chameleon.md to Korean (#33799)
* gemma.md to Korean (#33936)
* feature_extractor.md to Korean (#33775)
* tokenization_utils.md to Korean (#33813)
* file_utils.md to Korean (#33803)
* openai-gpt.md to Korean (#33801)
* biogpt.md to Korean (#33773)
* image_processing_utils.md to Korean (#33804)
* modular_transformers.md to Korean (#33772)
* modeling_utils.md to Korean (#33808)
* text_generation.md to Korean (#33777)
* generation_utils.md to Korean (#33818)
* gemma2.md to Korean (#33937)
* trainer_utils.md to Korean (#33817)
* main_classes/quantization.md to Korean (#33959)
* main_classes/configuration.md to Korean (#33952)
* model_doc/mamba.md to Korean (#33626)
* model_doc/autoformer.md to Korean (#33574)
* model_doc/patchtsmixer.md to Korean (#33587)
* model_doc/clip.md to Korean (#33610)
* model_doc/paligemma.md to Korean (#33612)
* model_doc/llama3.md to Korean (#33635)
* model_doc/mistral.md to Korean (#33648)
* model_doc/cohere.md to Korean (#33885)
* model_doc/dbrx.md to Korean (#33951)
* model_doc/deberta-v2.md to Korean (#33968)
* main_classes/onnx.md to Korean (#33601)
* model_doc/bart.md to Korean (#33893)
* model_doc/deberta.md to Korean (#33967)
* main_classes/keras_callbacks.md to Korean (#33955)
* model_doc/mamba2.md to Korean (#33629)
* main_classes/model.md to Korean (#33606)
* model_doc/trajectory_transformer.md to Korean (#33597)
* model_doc/time_series_transformer.md to Korean (#33596)
* model_doc/informer.md to Korean (#33585)
* model_doc/graphormer.md to Korean (#33569)
* main_classes/data_collator.md to Korean (#33954)
* model_doc/patchtst.md to Korean (#33589)

Mostly some warnings that were not properly removed ⚠️ :
🔴 Had a small regression with dynamic Cache 🔴

* Cache: revert DynamicCache init for BC #33861 by @gante
A small fix for Idefics 🐩 :
And a fix for Siglip 🤧 !
The Llama 3.2-Vision collection of multimodal large language models (LLMs) is a collection of pretrained and instruction-tuned image reasoning generative models in 11B and 90B sizes (text + images in / text out). The Llama 3.2-Vision instruction-tuned models are optimized for visual recognition, image reasoning, captioning, and answering general questions about an image. The models outperform many of the available open source and closed multimodal models on common industry benchmarks.
Qwen2-VL is a major update over the previous Qwen-VL by the Qwen team.
An extract from the Qwen2-VL blogpost available here is as follows:
Qwen2-VL is the latest version of the vision language models based on Qwen2 in the Qwen model family. Compared with Qwen-VL, Qwen2-VL adds the following capabilities:
The Qwen2-Audio is the new model series of large audio-language models from the Qwen team. Qwen2-Audio is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.
They introduce two distinct audio interaction modes:
OLMoE is a series of Open Language Models using sparse Mixture-of-Experts designed to enable the science of language models. The team releases all code, checkpoints, logs, and details involved in training these models.
LLaVA-Onevision is a Vision-Language Model that can generate text conditioned on one or several images/videos. The model consists of a SigLIP vision encoder and a Qwen2 language backbone. Images are processed with the anyres-9 technique, where the image is split into 9 patches to better handle high-resolution images and capture as much detail as possible. Videos, however, are pooled so that each frame contributes a total sequence length of 196 tokens, for more memory-efficient computation. LLaVA-Onevision is available in three sizes: 0.5B, 7B and 72B, and achieves remarkable performance on benchmark evaluations.
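The anyres-9 splitting can be sketched as follows. This is an illustrative sketch only, with made-up sizes; the real processor also keeps a resized copy of the full image alongside the patches.

```python
# Hedged sketch of the anyres-9 idea: split an image into a 3x3 grid of
# equally sized patches, each described as (top, left, height, width).
def split_into_patches(height, width, grid=3):
    ph, pw = height // grid, width // grid
    return [(r * ph, c * pw, ph, pw)
            for r in range(grid) for c in range(grid)]

patches = split_into_patches(384, 384)
print(len(patches))  # 9
```

Each patch is then encoded by the vision tower separately, which is how high-resolution detail survives a fixed-resolution encoder.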
The FalconMamba model was proposed by TII UAE (Technology Innovation Institute) in their release.
The model has been trained on approximately 6T tokens consisting of a mixture of many data sources such as RefineWeb, Cosmopedia and Math data.
The team releases an accompanying blog post.
The Granite model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerLM-3B is a 3B state-of-the-art small language model trained with the Power learning rate scheduler. It is trained on a wide range of open-source and synthetic datasets with permissive licenses. PowerLM-3B has shown promising results compared to other models in its size category across various benchmarks, including natural language multiple-choice tasks, code generation, and math reasoning.
The GraniteMoe model was proposed in Power Scheduler: A Batch Size and Token Number Agnostic Learning Rate Scheduler by Yikang Shen, Matthew Stallone, Mayank Mishra, Gaoyuan Zhang, Shawn Tan, Aditya Prasad, Adriana Meza Soria, David D. Cox and Rameswar Panda.
PowerMoE-3B is a 3B sparse Mixture-of-Experts (sMoE) language model trained with the Power learning rate scheduler. It sparsely activates 800M parameters for each token. It is trained on a mix of open-source and proprietary datasets. PowerMoE-3B has shown promising results compared to other dense models with 2x the active parameters across various benchmarks, including natural language multiple-choice tasks, code generation, and math reasoning.
The Descript Audio Codec (DAC) model is a powerful tool for compressing audio data, making it highly efficient for storage and transmission. By compressing 44.1 kHz audio into tokens at just 8 kbps bandwidth, the DAC model enables high-quality audio processing while significantly reducing the data footprint. This is particularly useful in scenarios where bandwidth is limited or storage space is at a premium, such as in streaming applications, remote conferencing, and archiving large audio datasets.
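A back-of-the-envelope calculation shows how large that compression is, assuming 16-bit mono PCM as the uncompressed baseline (the paper's exact accounting may differ):

```python
# Rough compression ratio for DAC: 44.1 kHz, 16-bit mono PCM vs an
# 8 kbps token stream (baseline assumptions are illustrative).
sample_rate_hz = 44_100
bits_per_sample = 16
codec_bitrate_bps = 8_000

pcm_bitrate_bps = sample_rate_hz * bits_per_sample  # 705,600 bits/s
ratio = pcm_bitrate_bps / codec_bitrate_bps
print(f"{ratio:.1f}x")  # 88.2x
```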
The Pixtral model was released by the Mistral AI team. Pixtral is a multimodal model, taking images and text as input, and producing text as output. This model follows the Llava family, meaning image embeddings are placed instead of the [IMG] token placeholders.
The model uses PixtralVisionModel for its vision encoder, and MistralForCausalLM for its language decoder. The main contribution is the 2D RoPE (rotary position embeddings) on the images, and support for arbitrary image sizes (the images are neither padded together nor resized).
The Mimi model was proposed in Moshi: a speech-text foundation model for real-time dialogue by Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave and Neil Zeghidour. Mimi is a high-fidelity audio codec model developed by the Kyutai team, that combines semantic and acoustic information into audio tokens running at 12Hz and a bitrate of 1.1kbps. In other words, it can be used to map audio waveforms into “audio tokens”, known as “codebooks”.
The OmDet-Turbo model was proposed in Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head by Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee. OmDet-Turbo incorporates components from RT-DETR and introduces a swift multimodal fusion module to achieve real-time open-vocabulary object detection capabilities while maintaining high accuracy. The base model achieves performance of up to 100.2 FPS and 53.4 AP on COCO zero-shot.
GGUF support continues to be enhanced in the library by offering a way to load GGUF models within transformers by unquantizing them, before re-quantizing them for re-use within the GGUF/GGML ecosystem.
An ongoing effort is to add the ability to use torchao as a quantization backend. Future PRs will enable saving and fine-tuning with peft.
The Liger kernel is now supported in the Trainer class.
This PR introduces modularity for transformers through inheritance, something that has always been prohibited when working with transformers (see the blog post for the accompanying design philosophy).
The core idea behind this PR is to facilitate model additions by enabling Pythonic inheritance while staying true to our single-file policy, in which models/processors must be contained within a single file, so that users can work on the object without going through 10 layers of abstraction.
It is heavily recommended to read the PR description in order to understand the depth of the change: https://github.com/huggingface/transformers/pull/33248
transformers: modularity and inheritance for new model additions by @ArthurZucker in #33248

Agents continue being improved at each release; this time making it much simpler to leverage a local engine through a local Transformers Engine.
This PR adds to all decoder-only models (except for XLNet) support for dynamic cache.
The documentation for the Dynamic cache can be found here, and documentation related to the KV cache in transformers in general can be found here.
We've made several updates to our handling of chat models and chat templates. The most noticeable change is that assistant prefill is now supported. This means you can end a chat with an assistant message, and the model will continue that message instead of starting a new one, allowing you to guide the model's response:
pipe = pipeline("text-generation", model_checkpoint)
chat = [
    {"role": "user", "content": "Can you format the answer in JSON?"},
    {"role": "assistant", "content": '{"name": "'}
]
output = pipe(chat) # The model will continue outputting JSON!
We've also enabled several new functionalities in Jinja that will allow more powerful templates in future, including Loop Controls and a strftime_now function that can get the current date and time, which is commonly used in system messages. For more details, see the updated chat template docs.
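As a sketch of how strftime_now behaves inside a template, the helper is essentially a thin wrapper over datetime.strftime. The wiring below is our own stdlib mock, shown only to illustrate the semantics; in transformers the function is made available to Jinja chat templates automatically.

```python
from datetime import datetime

def strftime_now(fmt):
    # Mimics the strftime_now helper exposed to Jinja chat templates:
    # formats the current date/time with the given strftime format string.
    return datetime.now().strftime(fmt)

# A chat template containing {{ strftime_now('%Y-%m-%d') }} would render
# today's date into the system message, e.g.:
system_message = f"Today is {strftime_now('%Y-%m-%d')}. Be concise."
print(system_message)
```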
* mask_generation.md to Korean by @jeongiin in #32257
* idefics.md to Korean by @boyunJang in #32258
* image_to_image.md to Korean by @shinhyunji36 in #32327
* gptq.md to Korean by @1kmmk1 in #32293
* prompting.md to Korean by @chhaewxn in #32294
* quantization/quanto.md to Korean by @fabxoe in #32281
* image_feature_extraction.md to Korean by @mreraser in #32239
* chat_templating.md to Korean by @enchantee00 in #32362
* _supports_sdpa to True by @pocca2048 in #32457
* ko-llm_tutorial_optimization.md to Korean by @010kim in #32372
* trainer.md to Korean by @cjfghk5697 in #32260
* eetq.md to Korean by @jun048098 in #32352
* fsdp.md to Korean by @win2dvp21 in #32261
* bitsandbytes.md to Korean by @SeungAhSon in #32408
* inputs_embeds as input by @molbap in #32493
* test_static_cache_exportability with torch 2.4.0 by @guangy10 in #32516
* agent.md to Korean by @Jwaminju in #32351
* encodec model names by @Sai-Suraj-27 in #32581
* .push_to_hub(..., create_pr=True, revision="my-branch") when creating PR on not-owned repo by @Wauplin in #32094
* deepspeed.md to Korean by @4N3MONE in #32431
* awq.md to Korean by @ahnjj in #32324
* test_find_base_model_checkpoint by @Sai-Suraj-27 in #32638
* is_torch_mps_available() function to include min_version argument by @Sai-Suraj-27 in #32545
* transformers tag to the modelcard by @LysandreJik in #32623
* WhisperGenerationMixin by @faaany in #32316
* test_tokenization_utils.py by @Sai-Suraj-27 in #32601
* tests/utils/test_add_new_model_like.py by @Sai-Suraj-27 in #32678
* JetMoeIntegrationTest by @ydshieh in #32332
* doctest_glob by @Sai-Suraj-27 in #32475
* falcon-mamba-7b model checkpoint name by @Sai-Suraj-27 in #32837
* LogitsWarper and LogitsProcessor by @gante in #32626
* batch_size instead of max_batch_size by @gante in #32657
* to in DoLa body, causing exceptions in multi-gpu generation by @gante in #32856
* test_sdpa_can_compile_dynamic device-agnostic by @faaany in #32519
* whisper-large-v2 model link in docs by @Sai-Suraj-27 in #32871
* norm_before_gate usage by @vasqu in #32686
* tensor.norm() with decomposed version for CLIP executorch export by @qubvel in #32887
* return_timestamps when return_timestamps is not passed to generate function by @hrl in #31296
* huggingface_hub installation to workflows by @Sai-Suraj-27 in #32891
* exceptions.ConnectionError by @younesbelkada in #31469
* AttributeError raised when using Trainer with eval_on_start=True in Jupyter Notebook by @fshp971 in #32849
* Processor.save_pretrained caused by #31691 by @leloykun in #32921
* use_cache=False by @gante in #32863
* PretrainedConfig from saving generate parameters; Update deprecations in generate-related code 🧹 by @gante in #32659
* atol in test_forward_with_num_logits_to_keep by @gante in #33093
* isin_mps_friendly, a wrapper function for torch.isin by @gante in #33099
* pydantic required version in dockerfiles to make it compatible with DeepSpeed by @Sai-Suraj-27 in #33105
* efficientnet pipeline timeout and prevent future similar issues due to large image size by @gante in #33123
* conversations.md to Korean by @newfull5 in #32468
* llm_optims.md to Korean by @yijun-lee in #32325
* return_dict_in_generate is False but should be True by @gante in #33146
* bitsandbytes) in docstrings by @rapsealk in #33230
* torch.from_numpy() to create tensors for np.ndarrays by @shinyano in #33201
* num_logits_to_keep in composite models by @zucchini-nlp in #33168
* FalconMamba training issues due to incompatible kernels by @younesbelkada in #33195
* torch.jit.trace for interpolate_pos_encoding in all vision models by @xenova in #33226
* inputs_embeds by @zucchini-nlp in #32932
* transformers[en] Documentation by @nnilayy in #33350
* FalconMambaForCausalLM by @younesbelkada in #33381
* FbgemmFp8Linear not preserving tensor shape by @vgel in #33239
* Zero-shot object detection documentation by @sergiopaniego in #33430
* SSH into runner info. to DM by @ydshieh in #33346
* train with a script by @faaany in #33423
* padding_side as call time kwargs by @zucchini-nlp in #33385
* Agents and tools documentation links typos by @sergiopaniego in #33471
* Agents, supercharged - Multi-agents, External tools, and more docs typo fixed by @sergiopaniego in #33478
* docs/source/ar/_toctree.yml by @AhmedAlmaghz in #32696
* accelerator.use_fp16 in examples by @hlky in #33513
* sequences_scores in the Whisper beam search output by @Nik-Kras in #32970
* model.config and model.generation_config 🔫 by @gante in #33480
* past_key_values is None by @gante in #33541
* attention_mask is 2D by @gante in #33575
* [Mamba2] Move dt calculations to kernel by @vasqu in #33520
* gemma2 when instantiating a new cache by @gante in #33595
* test_generate_from_inputs_embeds_decoder_only by @gante in #33602
* torch_job by @ydshieh in #33593
* PreTrainedModel inheriting from GenerationMixin by @gante in #33203
* cache_implementation) by @gante in #33684

The following contributors have made significant changes to the library over the last release:
* chat_templating.md to Korean (#32362)
* ko-llm_tutorial_optimization.md to Korean (#32372)
* trainer.md to Korean (#32260)
* exceptions.ConnectionError (#31469)
* FalconMamba training issues due to incompatible kernels (#33195)
* FalconMambaForCausalLM (#33381)
* deepspeed.md to Korean (#32431)
* docs/source/ar/_toctree.yml (#32696)

Patch release v4.44.2: mostly two regressions that were not caught, for Jamba and for processors!
Full Changelog: https://github.com/huggingface/transformers/compare/v4.44.0...v4.44.1
This release comes a bit early in our cycle because we wanted to ship important and requested models along with improved performances for everyone!
All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to @gante, @sanchit-gandhi and @xenova
Generate: end-to-end compilation #30788 by @gante: model.generate now supports compiling! There are a few limitations, but here is a small snippet:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")
# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id
model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

You can pass cache_implementation="offloaded" when calling from_pretrained, or use this:

from transformers import GenerationConfig

gen_config = GenerationConfig(
    cache_implementation="offloaded",
    # other generation options such as
    num_beams=4,
    num_beam_groups=2,
    num_return_sequences=4,
    diversity_penalty=1.0,
    max_new_tokens=50,
    early_stopping=True,
)
outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
The PyTorch team gave us a great gift: you can now use torch.export in a way that is directly compatible with ExecuTorch! Find examples here.
This also unlocks support for prompt reuse:
import os, torch, copy
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values
prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
Gemma 2: support assisted generation #32357 by @gante
We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for gemma 2), so you might observe some output mismatch on long sequences. Read more about it here.
# transformers assisted generation reference:
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# we DON’T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'
tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
    reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
    assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
"assistant_model": assistant_model,
"do_sample": True,
"temperature": 0.7,
"max_new_tokens": 64,
}
outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.
The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to @suiyoubi. See:
Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.
Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.
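Fill-in-the-middle works by reordering the prompt so the model can condition on both the code before and the code after the gap, then generate the missing span. A minimal sketch of the prompt construction is below; the `[PREFIX]`/`[SUFFIX]` sentinel strings are illustrative placeholders, not Codestral's actual control tokens.

```python
def build_fim_prompt(prefix: str, suffix: str,
                     pre_tok: str = "[PREFIX]", suf_tok: str = "[SUFFIX]") -> str:
    """Arrange a fill-in-the-middle prompt: the model generates the span
    that belongs between `prefix` and `suffix`.

    The sentinel tokens here are hypothetical placeholders for illustration.
    """
    # Suffix-first ordering lets the model see both sides of the gap
    # before it starts generating the middle.
    return f"{suf_tok}{suffix}{pre_tok}{prefix}"

prompt = build_fim_prompt("def add(a, b):\n    return ", "\n\nprint(add(1, 2))")
```

The model's completion is then stitched back between the original prefix and suffix.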
It's the Mamba 2 architecture. It was a bit of a pain to remove all the einops, but we hope we made it better for everyone!
We removed the chat templates from the code; they should all be on the Hub!
Our great @sanchit-gandhi worked on porting the recent compile upgrades to long form decoding in
- ruff to the latest version by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/31926
- unittest method with the correct one by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32198
- eos for assisted decoding by @zucchini-nlp in https://github.com/huggingface/transformers/pull/31301
- object base class by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32230
- target_sizes is None in post_process_image_guided_detection for owlv2 by @catalys1 in https://github.com/huggingface/transformers/pull/31934
- static cache implementation is not compatible with attn_implementation==flash_attention_2 by @faaany in https://github.com/huggingface/transformers/pull/32039
- convert_blip_checkpoint function call by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32262
- check_docstrings by @gante in https://github.com/huggingface/transformers/pull/32259
- p_mask a numpy array before passing to select_starts_ends by @faaany in https://github.com/huggingface/transformers/pull/32076
- gguf==0.9.1 by @Isotr0py in https://github.com/huggingface/transformers/pull/32298
- fetch-depth: 0 in trufflehog checkout step by @McPatate in https://github.com/huggingface/transformers/pull/31663
- inv_freq assignment by @gante in https://github.com/huggingface/transformers/pull/32330
- 3-5x faster torch.compile forward compilation for autoregressive decoder models by @fxmarty in https://github.com/huggingface/transformers/pull/32227
- staticmethods with self as first argument by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32361
- speech dep to the consistency docker image by @gante in https://github.com/huggingface/transformers/pull/32374
- transformers/examples/flax/language-modeling/t5_tokenizer_model.py by @fshp971 in https://github.com/huggingface/transformers/pull/32157
- test_embeded_special_tokens for luke and mluke models by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32413
- preprocess with decorator by @qubvel in https://github.com/huggingface/transformers/pull/32024

Full Changelog: https://github.com/huggingface/transformers/compare/v4.43.4...v4.44.0
There was a mix-up; the DeepSpeed issues are now properly fixed with:
🤗 Enjoy holidays
Patch release v4.43.3:
We still saw some bugs so @zucchini-nlp added:
- Resize embeds with DeepSpeed #32214
Other fixes:
The Llama 3.1 models are released by Meta and come in three flavours: 8B, 70B, and 405B.
To get an overview of Llama 3.1, please visit the Hugging Face announcement blog post.
We release a repository of llama recipes to showcase usage for inference, as well as full and partial fine-tuning of the different variants.
The Chameleon model was proposed in Chameleon: Mixed-Modal Early-Fusion Foundation Models by the META AI Chameleon Team. Chameleon is a Vision-Language Model that uses vector quantization to tokenize images, which enables the model to generate multimodal output. The model takes images and text as input, including an interleaved format, and generates textual responses.
The ZoeDepth model was proposed in ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth by Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, Matthias Müller. ZoeDepth extends the DPT framework for metric (also called absolute) depth estimation. ZoeDepth is pre-trained on 12 datasets using relative depth and fine-tuned on two domains (NYU and KITTI) using metric depth. A lightweight head is used with a novel bin adjustment design called metric bins module for each domain. During inference, each input image is automatically routed to the appropriate head using a latent classifier.
Hiera was proposed in Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles by Chaitanya Ryali, Yuan-Ting Hu, Daniel Bolya, Chen Wei, Haoqi Fan, Po-Yao Huang, Vaibhav Aggarwal, Arkabandhu Chowdhury, Omid Poursaeed, Judy Hoffman, Jitendra Malik, Yanghao Li, and Christoph Feichtenhofer.
The paper introduces “Hiera,” a hierarchical Vision Transformer that simplifies the architecture of modern hierarchical vision transformers by removing unnecessary components without compromising on accuracy or efficiency. Unlike traditional transformers that add complex vision-specific components to improve supervised classification performance, Hiera demonstrates that such additions, often termed “bells-and-whistles,” are not essential for high accuracy. By leveraging a strong visual pretext task (MAE) for pretraining, Hiera retains simplicity and achieves superior accuracy and speed both in inference and training across various image and video recognition tasks. The approach suggests that spatial biases required for vision tasks can be effectively learned through proper pretraining, eliminating the need for added architectural complexity.
Our ReactAgent has a specific way to return its final output: it calls the tool final_answer, added to the user-defined toolbox upon agent initialization, with the answer as the tool argument. We found that even for a one-shot agent like CodeAgent, using a specific final_answer tool helps the llm_engine find what to return: so we generalized the final_answer tool for all agents.
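The mechanic described above can be sketched in a few lines: the final_answer tool both records the answer and signals the run loop to stop. The class and loop below are a toy illustration of the idea, not the agents implementation itself.

```python
class FinalAnswerTool:
    """Toy stand-in for the agent's final_answer tool: calling it records
    the answer so the run loop knows the episode is finished."""
    name = "final_answer"

    def __init__(self):
        self.answer = None

    def __call__(self, answer):
        self.answer = answer
        return answer

def run_agent(steps, toolbox):
    """Execute (tool_name, argument) steps until final_answer is called."""
    final = toolbox["final_answer"]
    for tool_name, arg in steps:
        toolbox[tool_name](arg)
        if final.answer is not None:
            return final.answer

toolbox = {"final_answer": FinalAnswerTool()}
result = run_agent([("final_answer", "42")], toolbox)
```

Because the stop condition lives in a tool rather than in the prompt, the same convention works for both ReAct-style and one-shot code agents.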
Now if your code-based agent (like ReactCodeAgent) defines a function at step 1, it will remember the function definition indefinitely. This means your agent can create its own tools for later re-use!
This is a transformative PR: it allows the agent to regularly run a specific step for planning its actions in advance. This gets activated if you set an int for planning_interval upon agent initialization. At step 0, a first plan will be made. At later steps (like steps 3, 6, 9 if you set planning_interval=3), this plan will be updated by the agent depending on the history of previous steps. More details soon!
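The scheduling rule is simple: plan at step 0, then re-plan every planning_interval steps. A minimal sketch of which steps trigger (re)planning, assuming this straightforward modulo rule:

```python
def planning_steps(total_steps: int, planning_interval: int) -> list[int]:
    """Return the step indices at which the agent (re)plans: an initial
    plan at step 0, then an update every `planning_interval` steps."""
    return [s for s in range(total_steps) if s % planning_interval == 0]

# With planning_interval=3 over 10 steps, planning happens at steps 0, 3, 6, 9.
```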
A significant RoPE refactor was done to make it model agnostic and more easily adaptable to any architecture. It is only applied to Llama for now but will be applied to all models using RoPE over the coming days.
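At its core, RoPE rotates each (even, odd) feature pair of the query/key vectors by an angle that depends on the token position and the feature index, which is what the refactor centralizes across architectures. A minimal pure-Python sketch of the standard rotation (real model code operates on whole tensors, not single pairs):

```python
import math

def rope_rotate(pair, position, dim_index, dim, base=10000.0):
    """Rotate one (even, odd) feature pair by the RoPE angle for this
    position and feature pair index. `dim` is the head dimension and
    `base` the conventional frequency base."""
    x1, x2 = pair
    theta = position * base ** (-2.0 * dim_index / dim)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Standard 2D rotation; it preserves the pair's norm.
    return (x1 * cos_t - x2 * sin_t, x1 * sin_t + x2 * cos_t)
```

Because the rotation angle grows linearly with position, the dot product of two rotated vectors depends only on their relative distance, which is what makes the embedding "rotary".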
🚨🚨 This PR changes the code to rely on the tokenizer's defaults when these flags are unset. Some models using TextGenerationPipeline previously did not add a <bos> by default, which (negatively) impacted their performance; this is now fixed, but in practice it is a breaking change.
Example of a script changed as a result of this PR:
from transformers import AutoModelForCausalLM, AutoTokenizer, pipeline
import torch
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b-it")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b-it", torch_dtype=torch.bfloat16, device_map="auto")
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("Foo bar"))
- get_seq_length method by @sanchit-gandhi in #31661
- keras-nlp<0.14 pin by @gante in #31684
- tests/test_xxx_utils.py to tests/utils by @ydshieh in #31730
- pytest_num_workers=4 for some CircleCI jobs by @ydshieh in #31764
- sdpa support for SigLIP by @qubvel in #31499
- TFBlipModelTest::test_pipeline_image_to_text by @ydshieh in #31827
- TrainingArguments by @andstor in #31812
- vocab_size in other two VLMs by @zucchini-nlp in #31681
- .generate() by @voidism in #29619
- _init_weights for ResNetPreTrainedModel by @ydshieh in #31851
- _init_weights for ResNetPreTrainedModel by @ydshieh in #31868
- duplicate field definitions in some classes by @Sai-Suraj-27 in #31888
- push_to_hub=True in TrainingArguments by @SunMarc in #31808
- warnings in a with block to avoid flaky tests by @ydshieh in #31893
- [ConvertSlow] make sure the order is preserved for addedtokens by @ArthurZucker in #31902
- [Gemma2] Support FA2 softcapping by @ArthurZucker in #31887
- 1st argument name in classmethods by @Sai-Suraj-27 in #31907
- SlidingWindowCache.reset() by @gante in #31917
- Trainer.get_optimizer_cls_and_kwargs to be overridden by @apoorvkh in #31875
- GenerationMixin.generate compatibility with pytorch profiler by @fxmarty in #31935
- Cache and cache_position being default by @gante in #31898
- sigmoid_focal_loss() function call by @Sai-Suraj-27 in #31951
- logits_warper update in models with custom generate fn by @gante in #31957
- create_repo() function call by @Sai-Suraj-27 in #31947
- test_stage3_nvme_offload by @faaany in #31881
- src/transformers/__init__.py by @Sai-Suraj-27 in #31993
- log messages that are resulting in TypeError due to too many arguments by @Sai-Suraj-27 in #32017
- SeamlessM4Tv2ConformerEncoderLayer.forward() when gradient checkpointing is enabled by @anferico in #31945
- sdpa and FA2 for CLIP by @qubvel in #31940
- numpy<2.0 by @ydshieh in #32018
- head_dim through config (and do not require head_dim * num_heads == hidden_size) by @xenova in #32050
- duplicate entries in a dictionary by @Sai-Suraj-27 in #32041
- huggingface_hub 0.24 by @Wauplin in #32054
- mktemp() function by @Sai-Suraj-27 in #32123
- ko/_toctree.yml and remove custom_tools.md to reflect latest changes by @jungnerd in #31969
- TypeError instead of ValueError for invalid type by @Sai-Suraj-27 in #32111
- trust_remote_code when loading Libri Dummy by @sanchit-gandhi in #31748
- GPTNeoX and GPT2 by @vasqu in #31944

The following contributors have made significant changes to the library over the last release:
.generate() (#29619), but also fixes the sliding window for long context and other typos.
I was off last week and could not get this out; thanks all for your patience 🥳
After experimenting, we noticed that softcapping is a must, mostly for the 27b model. So we're adding it back (it should have been there, but an error on my side made it disappear). Sorry all! 😭
Thanks to our 2 contributors for their prompt fixes; this mostly applies to training and FA2!
Patch release for commit:
The Gemma2 model was proposed in Gemma2: Open Models Based on Gemini Technology and Research by Gemma2 Team, Google. Gemma2 models are trained on 6T tokens, and released with 2 versions, 2b and 7b.
The abstract from the paper is the following:
This work introduces Gemma2, a new family of open language models demonstrating strong performance across academic benchmarks for language understanding, reasoning, and safety. We release two sizes of models (2 billion and 7 billion parameters), and provide both pretrained and fine-tuned checkpoints. Gemma2 outperforms similarly sized open models on 11 out of 18 text-based tasks, and we present comprehensive evaluations of safety and responsibility aspects of the models, alongside a detailed description of our model development. We believe the responsible release of LLMs is critical for improving the safety of frontier models, and for enabling the next wave of LLM innovations
The RT-DETR model was proposed in DETRs Beat YOLOs on Real-time Object Detection by Wenyu Lv, Yian Zhao, Shangliang Xu, Jinman Wei, Guanzhong Wang, Cheng Cui, Yuning Du, Qingqing Dang, Yi Liu.
RT-DETR is an object detection model that stands for “Real-Time DEtection Transformer.” This model is designed to perform object detection tasks with a focus on achieving real-time performance while maintaining high accuracy. Leveraging the transformer architecture, which has gained significant popularity in various fields of deep learning, RT-DETR processes images to identify and locate multiple objects within them.
The InstructBLIP model was proposed in InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning by Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi. InstructBLIP leverages the BLIP-2 architecture for visual instruction tuning.
InstructBLIP uses the same architecture as BLIP-2 with a tiny but important difference: it also feeds the text prompt (instruction) to the Q-Former.
The LLaVa-NeXT-Video model was proposed in LLaVA-NeXT: A Strong Zero-shot Video Understanding Model by Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, Chunyuan Li. LLaVa-NeXT-Video improves upon LLaVa-NeXT by fine-tuning on a mix of video and image data, thus increasing the model's performance on videos.
LLaVA-NeXT surprisingly has strong performance in understanding video content in zero-shot fashion thanks to the AnyRes technique that it uses. The AnyRes technique naturally represents a high-resolution image as multiple images. This technique is naturally generalizable to videos, because a video can be considered as a set of frames (similar to the set of images in LLaVa-NeXT). The current version of LLaVA-NeXT makes use of AnyRes and trains with supervised fine-tuning (SFT) on top of LLaVA-NeXT on video data to achieve better video understanding capabilities. The model is currently SOTA among open-source models on the VideoMME benchmark.
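The core of AnyRes is just a tiling step: one high-resolution image is split into a grid of sub-images that the vision encoder can handle, and a video is treated analogously with each frame acting as one image. A minimal sketch of the tiling, assuming a fixed square tile size:

```python
def anyres_tiles(width: int, height: int, tile: int) -> list[tuple]:
    """Split an image's pixel grid into (left, top, right, bottom) tile
    boxes, the way AnyRes represents one high-resolution image as
    several sub-images. Edge tiles may be smaller than `tile`."""
    tiles = []
    for top in range(0, height, tile):
        for left in range(0, width, tile):
            tiles.append((left, top,
                          min(left + tile, width), min(top + tile, height)))
    return tiles

# A 672x672 image with 336-pixel tiles yields a 2x2 grid of sub-images.
grid = anyres_tiles(672, 672, 336)
```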
A very significant change makes its way within the transformers codebase, introducing a new way to add models to transformers. We recommend reading the description of the PR below, but here is the gist of it:
The diff_converter tool is here to replace our old Copied from statements, while keeping our core transformers philosophy:
- single model single file
- explicit code
- standardization of modeling code
- readable and educative code
- simple code
- least amount of modularity
This additionally unlocks the ability to very quickly see the differences between new architectures that get developed. While many architectures are similar, the "single model, single file" policy can obfuscate the changes. With this diff converter, we want to make the changes between architectures very explicit.
We've made major updates to our support for tool-use and RAG models. We can now automatically generate JSON schema descriptions for Python functions which are suitable for passing to tool models, and we've defined a standard API for tool models which should allow the same tool inputs to be used with many different models. Models will need updates to their chat templates to support the new API, and we're targeting the Nous-Hermes, Command-R and Mistral/Mixtral model families for support in the very near future. Please see the updated chat template docs for more information.
If you are the owner of a model that supports tool use, but you're not sure how to update its chat template to support the new API, feel free to reach out to us for assistance with the update, for example on the Hugging Face Discord server. Ping Matt and yell key phrases like "chat templates" and "Jinja" and your issue will probably get resolved.
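The idea behind the automatic schema generation is that a Python function's signature and docstring already contain most of what a tool model needs. The snippet below is a simplified illustration of that idea using the standard library's inspect module, not the transformers implementation (which handles types, docstring parsing, and nesting far more thoroughly):

```python
import inspect

# Map a few Python annotations to JSON-schema type names.
TYPE_MAP = {int: "integer", float: "number", str: "string", bool: "boolean"}

def describe_tool(fn) -> dict:
    """Build a JSON-schema-style tool description from a function's
    signature and docstring. Simplified sketch for illustration only."""
    sig = inspect.signature(fn)
    params = {
        name: {"type": TYPE_MAP.get(p.annotation, "string")}
        for name, p in sig.parameters.items()
    }
    return {
        "name": fn.__name__,
        "description": (fn.__doc__ or "").strip(),
        "parameters": {"type": "object", "properties": params},
    }

def get_weather(city: str, days: int):
    """Get the weather forecast for a city."""

schema = describe_tool(get_weather)
```

The resulting dict is the kind of structure that gets rendered into the model's chat template so the model knows which tools it may call and with which arguments.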
We have extended support for GGUF files to offer fine-tuning within the Python/HF ecosystem, before converting models back to the GGUF/GGML/llama.cpp libraries.
A new optimizer is added in the Trainer.
Several improvements are done related to quantization: a new cache (the quantized KV cache) is added, offering the ability to convert the cache of generative models, further reducing the memory requirements.
Additionally, the documentation related to quantization is entirely redone with the aim of helping users choose which is the best quantization method.
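The memory savings of a quantized KV cache come from storing cache tensors in a low-bit integer format and dequantizing them on the fly. The snippet below sketches the basic affine int8 quantize/dequantize round trip on plain Python floats; the actual cache works on tensors with per-channel or per-group parameters, so treat this purely as an illustration of the arithmetic:

```python
def quantize_int8(values):
    """Affine quantization of a list of floats to integers in [0, 255],
    returning the quantized values plus the (scale, zero-point) needed
    to reconstruct them. Sketch of the idea behind a quantized KV cache."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0        # avoid div-by-zero on constant input
    q = [round((v - lo) / scale) for v in values]
    return q, scale, lo

def dequantize_int8(q, scale, lo):
    """Reconstruct approximate floats from the quantized representation."""
    return [x * scale + lo for x in q]

q, scale, lo = quantize_int8([-1.0, 0.0, 0.5, 1.0])
restored = dequantize_int8(q, scale, lo)
```

Each stored value costs one byte instead of four (fp32), at the price of a reconstruction error bounded by the scale.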
New instance segmentation examples are added by @qubvel
As a notable improvement to the HF vision models that leverage backbones, we enable leveraging HF pretrained model weights as backbones, with the following API:
from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation
config = MaskFormerConfig(backbone="microsoft/resnet-50", use_pretrained_backbone=True)
model = MaskFormerForInstanceSegmentation(config)
Additionally, we thank @Cyrilvallez for diving into our generate method and greatly reducing the memory requirements.
generate() 🔥🔥🔥 by @Cyrilvallez in #30536

Both the ConversationalPipeline and the Conversation object have been deprecated for a while, and are due for removal in 4.42, which is the upcoming version.
The TextGenerationPipeline is recommended for this use-case, and now accepts inputs in the form of the OpenAI API.
Removes a duplicate softmax application in FLAVA attention. This is likely to change the outputs slightly, so we're flagging it with 🚨.
- ignore_index attribute of the loss is updated to -100

timm being updated

Recent updates to timm changed the type of the attribute model.feature_info.out_indices. Previously, out_indices would reflect the input type of out_indices on the create_model call, i.e. either tuple or list. Now, this value is always a tuple.
As lists are more useful and consistent for us -- we cannot save tuples in configs; they must be converted to lists first -- we instead choose to cast out_indices to always be a list.
This is potentially a slight breaking change if users are creating models and relying on out_indices being a tuple. As this conversion only happens when a new model is created, and not when it's saved and reloaded (because of the config), I think it has a low chance of having much of an impact.
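The cast itself is trivial, and accepting either container type makes the behavior independent of which timm version produced the value. A minimal sketch of the normalization described above:

```python
def normalize_out_indices(out_indices):
    """Cast timm's out_indices to a list so it round-trips through a
    JSON config, regardless of whether timm returned a tuple or a list."""
    return list(out_indices)

# Both timm's old (list) and new (tuple) return types normalize the same way.
```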
- mamba slow forward by @vasqu in #30691
- tokenizer_class = "AutoTokenizer" Llava Family by @ArthurZucker in #30912
- optimum-benchmark by @ydshieh in #30615
- torch.use_deterministic_algorithms for XPU by @faaany in #30774
- MptIntegrationTests expected outputs by @ydshieh in #30989
- uv==0.1.45 by @ydshieh in #31006
- test_model_parallelism device-agnostic by @faaany in #30844
- test_model_parallelism for 2 model test classes by @ydshieh in #31067
- @main by @ydshieh in #31065
- ninja from docker image build by @ydshieh in #31080
- accelerate as a hard requirement by @younesbelkada in #31090
- OPTForQuestionAnswering by @younesbelkada in #31092
- test_multi_gpu_data_parallel_forward for vit and deit by @ydshieh in #31086
- HF_HUB_OFFLINE + fix has_file in offline mode by @Wauplin in #31016
- transformers-cli env reporting by @statelesshz in #31003
- load_in_8bit with bnb config by @younesbelkada in #31136
- IS_GITHUB_CI by @younesbelkada in #31147
- [GemmaModel] fix small typo by @ArthurZucker in #31202
- test_compile_static_cache by @ydshieh in #30991
- mistral.py::Mask4DTestHard by @ydshieh in #31212
- MistralIntegrationTest by @ydshieh in #31231
- BlipModel by @younesbelkada in #31235
- name 'torch' is not defined in bitsandbytes integration by @jamesbraza in #31243
- benchmark job in push-important-models.yml by @ydshieh in #31259
- [SwitchTransformer] Significant performance improvement on MoE blocks by @ranggihwang in #31173
- cached_download to hf_hub_download in remaining occurrences by @Wauplin in #31284
- str should be used not int when setting env variables by @statelesshz in #31272
- decoder_attention_mask shape by @ylacombe in #28071
- inputs_embeds padding logger.warning to logger.warning_once by @naimenz in #31411
- tokenizer being popped twice by @gante in #31427
- TestDeepSpeedModelZoo device-agnostic by @faaany in #31402
- dataloader_persistent_workers=True by @bastienlc in #30627
- Qwen2ForTokenClassification by @kevinhu in #31440
- generate call from local path by @gante in #31470
- PreTrainedTokenizerFast loading time when there are many added tokens by @ydshieh in #31404
- metric_for_best_model errors by @tomaarsen in #31450
- [GPT2] Add SDPA support by @vasqu in #31172
- test_config_object to test_ds_config_object by @faaany in #31403
- torch.compile support for AQLM by @younesbelkada in #31473
- wandb integration with SetFit model by @timothepearce in #30021
- tokenization_utils_base.py's docstring by @sadra-barikbin in #31510
- spectrogram_batch by @ravenouse in #27159
- TrainingArguments by @qgallouedec in #31503
- _no_split_module by @zucchini-nlp in #31566
- i18n by @SauravMaheshkar in #31584
- self.projection call in VivitTubeletEmbeddings by @v-iashin in #31632
- [GPT-NeoX] Add SDPA support by @vasqu in #31031
- past_key_values passed as kwargs by @gante in #31644

The following contributors have made significant changes to the library over the last release:
- mamba slow forward (#30691)
- [GPT2] Add SDPA support (#31172)
- [GPT-NeoX] Add SDPA support (#31031)
- generate() 🔥🔥🔥 (#30536)
- spectrogram_batch (#27159)

Mostly fixing some stuff related to trust_remote_code=True and from_pretrained
The local_files_only flag was having a hard time when a .safetensors file did not exist. This is not expected; instead of trying to convert, we should just fall back to loading the .bin files.