Release v4.44.0
This release comes a bit early in our cycle because we wanted to ship important and requested models along with performance improvements for everyone!
All of these are included with examples in the awesome https://github.com/huggingface/local-gemma repository! 🎈 We tried to share examples of what is now possible with all the shipped features! Kudos to @gante, @sanchit-gandhi and @xenova
Generate: end-to-end compilation #30788 by @gante. model.generate can now be compiled end to end with torch.compile! There are a few limitations, but here is a small snippet:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import copy
model = AutoModelForCausalLM.from_pretrained(
"meta-llama/Meta-Llama-3.1-8B", torch_dtype=torch.bfloat16, device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3.1-8B")
# compile generate
compiled_generate = torch.compile(model.generate, fullgraph=True, mode="reduce-overhead")
# compiled generate does NOT accept parameterization except a) model inputs b) a generation config
generation_config = copy.deepcopy(model.generation_config)
generation_config.pad_token_id = model.config.eos_token_id
model_inputs = tokenizer(["Write a poem about the market crashing in summer"], return_tensors="pt")
model_inputs = model_inputs.to(model.device)
output_compiled = compiled_generate(**model_inputs, generation_config=generation_config)
print(output_compiled)
Compiling just the forward pass, as before, also remains possible:

model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

Offloaded KV cache: you can reduce GPU memory use by offloading the key/value cache to the CPU. Pass cache_implementation="offloaded" when calling generate(), or set it in a GenerationConfig:

from transformers import GenerationConfig

gen_config = GenerationConfig(
    cache_implementation="offloaded",
    # other generation options such as num_beams=4, num_beam_groups=2, num_return_sequences=4,
    # diversity_penalty=1.0, max_new_tokens=50, early_stopping=True
)
outputs = model.generate(inputs["input_ids"], generation_config=gen_config)
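Passing the option directly to generate() works too; for example (re-using the model and inputs from above, max_new_tokens chosen arbitrarily):

# equivalent: request the offloaded cache directly in the generate() call
outputs = model.generate(inputs["input_ids"], cache_implementation="offloaded", max_new_tokens=50)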
The PyTorch team gave us a great gift: you can now use torch.export on transformers models, and the exported program is directly compatible with ExecuTorch! Find examples here.
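For context, torch.export captures a single, full graph of a module ahead of time. Here is a tiny standalone sketch of the raw API (a toy module, not the transformers/ExecuTorch recipe that the linked examples cover):

import torch

class Toy(torch.nn.Module):
    def forward(self, x):
        return torch.nn.functional.relu(x) + 1

# export() traces the module into an ExportedProgram with explicit input specs
exported_program = torch.export.export(Toy(), (torch.randn(2, 3),))
print(exported_program.graph)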
This release also unlocks support for prompt reuse:
import torch, copy
from transformers import AutoModelForCausalLM, AutoTokenizer, DynamicCache
device = "cuda"
ckpt = "meta-llama/Meta-Llama-3.1-8B-Instruct"
INITIAL_PROMPT = "From now on, you are going to answer all my questions with historical details. Make sure to always add a bit of french here and there, for style."
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.float16)
model.to(device)
tokenizer = AutoTokenizer.from_pretrained(ckpt)
prompt_cache = DynamicCache()
inputs = tokenizer(INITIAL_PROMPT, return_tensors="pt").to("cuda")
# run the shared prefix once and keep its KV cache for re-use across calls
prompt_cache = model(**inputs, past_key_values=prompt_cache).past_key_values
prompt = "Why are french people obsessed with french?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
# deepcopy so the cached prefix is not mutated by this generation call
past_key_values = copy.deepcopy(prompt_cache)
outputs = model.generate(**new_inputs, past_key_values=past_key_values, max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
prompt = "What is the best city to swim in?"
new_inputs = tokenizer(INITIAL_PROMPT + prompt, return_tensors="pt").to("cuda")
outputs = model.generate(**new_inputs, past_key_values=copy.deepcopy(prompt_cache), max_new_tokens=20)
response = tokenizer.batch_decode(outputs)[0]
print(response)
Gemma 2: support assisted generation #32357 by @gante
We now have a 2B Gemma 2 model -- a perfect sidekick for the 27B with assisted generation. We've enabled assisted generation in Gemma 2, with a caveat: assisted generation currently requires the use of a windowless cache (as opposed to the default cache for Gemma 2), so you might observe some output mismatch on long sequences. Read more about it here.
# transformers assisted generation reference:
# https://huggingface.co/docs/transformers/main/en/llm_optims#speculative-decoding
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
# we DON’T recommend using the 9b model with the 2b model as its assistant
assistant_model_name = 'google/gemma-2-2b-it'
reference_model_name = 'google/gemma-2-27b-it'
tokenizer = AutoTokenizer.from_pretrained(reference_model_name)
model = AutoModelForCausalLM.from_pretrained(
reference_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
assistant_model = AutoModelForCausalLM.from_pretrained(
assistant_model_name, device_map='auto', torch_dtype=torch.bfloat16
)
model_inputs = tokenizer("Einstein's theory of relativity states", return_tensors="pt").to(model.device)
generation_options = {
"assistant_model": assistant_model,
"do_sample": True,
"temperature": 0.7,
"max_new_tokens": 64,
}
outputs = model.generate(**model_inputs, **generation_options)
tokenizer.batch_decode(outputs, skip_special_tokens=True)
Nemotron-4-340B-Instruct is a large language model (LLM) that can be used as part of a synthetic data generation pipeline to create training data that helps researchers and developers build their own LLMs. It is a fine-tuned version of the Nemotron-4-340B-Base model, optimized for English-based single and multi-turn chat use-cases. It supports a context length of 4,096 tokens.
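Once a checkpoint has been converted to the Hugging Face format, it loads like any other causal LM. A minimal sketch (the checkpoint path below is a placeholder, not an official model id):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# placeholder path: point this at your converted Nemotron / Minitron checkpoint
ckpt = "path/to/converted-nemotron-checkpoint"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

# remember the 4,096-token context limit when building prompts
inputs = tokenizer("Synthetic training data is useful because", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])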
The conversion script should be able to cover Minitron and Nemotron, thanks and kudos to @suiyoubi. See:
Codestral is trained on a diverse dataset of 80+ programming languages, including the most popular ones, such as Python, Java, C, C++, JavaScript, and Bash. It also performs well on more specific ones like Swift and Fortran. This broad language base ensures Codestral can assist developers in various coding environments and projects.
Codestral saves developers time and effort: it can complete coding functions, write tests, and complete any partial code using a fill-in-the-middle mechanism. Interacting with Codestral will help level up the developer’s coding game and reduce the risk of errors and bugs.
It uses the Mamba 2 architecture; it was a bit of a pain to remove all the einops, but we hope we made it better for everyone!
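A quick code-completion sketch with the new Mamba 2 backbone (the checkpoint id and prompt are illustrative assumptions, not an official recommendation):

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# illustrative checkpoint id; substitute the Codestral / Mamba 2 checkpoint you actually use
ckpt = "mistralai/Mamba-Codestral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(ckpt)
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype=torch.bfloat16, device_map="auto")

inputs = tokenizer("def fibonacci(n):", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=64)
print(tokenizer.batch_decode(outputs, skip_special_tokens=True)[0])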
We removed the chat templates from the code; they should all be on the Hub!
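As a reminder, templates stored on the Hub are picked up transparently by apply_chat_template (the checkpoint below is just an example of a chat model):

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")
messages = [{"role": "user", "content": "Hello, how are you?"}]
# the template now lives on the Hub next to the tokenizer, not in the library code
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(prompt)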
Whisper: our great @sanchit-gandhi worked on porting the recent compile upgrades to long-form decoding in
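A rough long-form transcription sketch that benefits from this work (the model id and audio path are placeholders):

from transformers import pipeline
import torch

asr = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    torch_dtype=torch.float16,
    device="cuda:0",
)
# audio longer than 30 seconds goes through Whisper's long-form decoding
result = asr("path/to/long_audio.wav", return_timestamps=True)
print(result["text"])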
Bugfixes and improvements:

ruff to the latest version by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/31926
unittest method with the correct one by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32198
eos for assisted decoding by @zucchini-nlp in https://github.com/huggingface/transformers/pull/31301
object base class by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32230
target_sizes is None in post_process_image_guided_detection for owlv2 by @catalys1 in https://github.com/huggingface/transformers/pull/31934
static cache implementation is not compatible with attn_implementation==flash_attention_2 by @faaany in https://github.com/huggingface/transformers/pull/32039
convert_blip_checkpoint function call by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32262
check_docstrings by @gante in https://github.com/huggingface/transformers/pull/32259
p_mask a numpy array before passing to select_starts_ends by @faaany in https://github.com/huggingface/transformers/pull/32076
gguf==0.9.1 by @Isotr0py in https://github.com/huggingface/transformers/pull/32298
fetch-depth: 0 in trufflehog checkout step by @McPatate in https://github.com/huggingface/transformers/pull/31663
inv_freq assignment by @gante in https://github.com/huggingface/transformers/pull/32330
3-5x faster torch.compile forward compilation for autoregressive decoder models by @fxmarty in https://github.com/huggingface/transformers/pull/32227
staticmethods with self as first argument by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32361
speech dep to the consistency docker image by @gante in https://github.com/huggingface/transformers/pull/32374
transformers/examples/flax/language-modeling/t5_tokenizer_model.py. by @fshp971 in https://github.com/huggingface/transformers/pull/32157
test_embeded_special_tokens for luke and mluke models by @Sai-Suraj-27 in https://github.com/huggingface/transformers/pull/32413
preprocess with decorator by @qubvel in https://github.com/huggingface/transformers/pull/32024

Full Changelog: https://github.com/huggingface/transformers/compare/v4.43.4...v4.44.0