Previously, vLLM could only be used on a single dedicated GPU, which ruled out both the scalability benefits of vLLM and multi-node training. This limitation has now been removed!
GRPO can now scale efficiently to models exceeding 70B parameters, supporting multi-node training with super-fast performance.
To take advantage of this, simply launch a vLLM server using the following command:

```
trl vllm-serve --model <model_name> --tensor_parallel_size <tp_size>
```
Then, start GRPO training with `use_vllm=True`.
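A minimal sketch of the training side, assuming the server launched above is reachable (the `output_dir` value is just a placeholder):

```python
from trl import GRPOConfig

# Generation is offloaded to the vLLM server started with `trl vllm-serve`,
# instead of running an in-process engine on a dedicated GPU.
training_args = GRPOConfig(output_dir="grpo-output", use_vllm=True)
```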
Below is a comparison of GRPO throughput with and without vLLM, across different TP values and model sizes.
@binary-husky and @qgallouedec in https://github.com/huggingface/trl/pull/3094
This release introduces the multi-step trick, which allows for the reuse of generated data across multiple steps, speeding up the training process.
To support this, we've implemented importance sampling and clipping logic. This enhancement should lead to significant improvements in training speed.
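The core idea, as a minimal sketch (variable names are assumptions, not TRL's internal code):

```python
import torch

def clipped_policy_loss(per_token_logps, old_per_token_logps, advantages, epsilon=0.2):
    """PPO-style clipped importance-sampling loss (illustrative only)."""
    # Ratio between the current policy and the policy that generated the data.
    ratio = torch.exp(per_token_logps - old_per_token_logps)
    # Clipping keeps updates from reused (now slightly stale) data in check.
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    return -torch.min(ratio * advantages, clipped_ratio * advantages)
```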
<img width="1097" alt="Screenshot 2025-03-23 at 14 52 28" src="https://github.com/user-attachments/assets/8f1ee339-63c5-43cf-9b0f-5395432513ae" />

To use it, simply set `num_iterations` to a value greater than 1:

```python
training_args = GRPOConfig(..., num_iterations=4)
```
by @qgallouedec in https://github.com/huggingface/trl/pull/2899
As demonstrated in Dr. GRPO, sequence-level normalization can introduce a response-level length bias: when each sequence's loss is averaged over its own length, every token of a short completion contributes far more to the gradient than a token of a long one.
To address this, we now normalize the loss by the total number of tokens in the batch, ensuring more consistent and unbiased training.
```diff
- loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
+ loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()
```
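A toy illustration of the bias, with made-up numbers:

```python
import torch

# Toy batch: a 2-token completion with per-token loss 1.0 and an
# 8-token completion with per-token loss 0.0.
per_token_loss = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                               [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
completion_mask = torch.tensor([[1, 1, 0, 0, 0, 0, 0, 0],
                                [1, 1, 1, 1, 1, 1, 1, 1]], dtype=torch.float32)

# Old sequence-level normalization: the short completion dominates.
seq_norm = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
print(seq_norm)  # tensor(0.5000)

# New token-level normalization: every token carries equal weight.
tok_norm = (per_token_loss * completion_mask).sum() / completion_mask.sum()
print(tok_norm)  # tensor(0.2000)
```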
by @edbeeching in https://github.com/huggingface/trl/pull/2881
As demonstrated in Dr. GRPO, scaling rewards by their per-question standard deviation can introduce a question-level difficulty bias: questions with low reward variance (the very easy and the very hard ones) get disproportionately amplified advantages. To address this, we have added an option to disable reward scaling in GRPO.
```python
training_args = GRPOConfig(..., scale_rewards=False)
```
```diff
  advantages = rewards - mean_grouped_rewards
- advantages = advantages / std_grouped_rewards
+ if self.args.scale_rewards:
+     advantages = advantages / std_grouped_rewards
```
It's likely that `scale_rewards=False` will become the default behavior in a future release.
by @qgallouedec in https://github.com/huggingface/trl/pull/3135
When optimizing across multiple domains, not all reward functions are relevant for every sample. For example, a math verifier's reward does not apply to grammar samples, and a grammar verifier's reward does not apply to math samples.
It is now possible to return `None` for rewards that do not make sense for a given sample. For instance, when the domain is specified in a dataset column named `domain`, you can implement the reward as follows:
```python
def math_reward(completions, domain, **kwargs):
    rewards = []
    for completion, dom in zip(completions, domain):
        if dom == "math":
            rewards.append(verify(completion))  # your math verifier
        else:
            rewards.append(None)  # reward not applicable to this sample
    return rewards
```
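The reward functions can then be passed to the trainer together. A sketch, where `grammar_reward`, the model name, and the dataset are placeholders:

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",          # placeholder model
    reward_funcs=[math_reward, grammar_reward],  # grammar_reward defined analogously
    args=training_args,
    train_dataset=dataset,                       # must contain a "domain" column
)
```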
This allows for more domain-specific reward handling, ensuring that irrelevant rewards are ignored and don't interfere with optimization.
by @shirinyamani in https://github.com/huggingface/trl/pull/3079
It has been observed that not minimizing the KL divergence between the trained model and the reference model can still yield good results, while significantly reducing memory usage and compute: there is no need to keep the reference model in memory or to run a forward pass for it.
When `beta` is set to `0.0`, the reference model is not loaded and the KL divergence is not computed, saving both time and memory.
```python
training_args = GRPOConfig(..., beta=0.0)
```
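For reference, the per-token KL penalty that gets skipped is the k3 estimator from the GRPO paper. A sketch, under assumed variable names:

```python
import torch

def per_token_kl(ref_logps: torch.Tensor, logps: torch.Tensor) -> torch.Tensor:
    """k3 KL estimator; with beta=0.0 this (and the ref forward pass) is skipped."""
    log_ratio = ref_logps - logps
    return torch.exp(log_ratio) - log_ratio - 1
```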
by @ingambe in https://github.com/huggingface/trl/pull/2806
Padding-free batching is an alternative approach to packing for reducing memory usage. In this method, a batch is first sampled and then flattened into a single sequence, avoiding padding. Unlike packing, which can result in incomplete sequences by combining parts of different samples, padding-free batching ensures that all sequences remain complete and intact.

To enable padding-free batching in SFT, simply set `padding_free=True` in the `SFTConfig`, and make sure to use `flash_attention_2` as the attention implementation:

```python
training_args = SFTConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"})
```
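An illustration of the difference, with made-up token ids:

```python
# Two samples of lengths 3 and 2.
batch = [[5, 6, 7], [8, 9]]

# Regular padding (PAD = 0):
#   input_ids    = [[5, 6, 7],
#                   [8, 9, 0]]
#
# Padding-free: the batch is flattened into a single row, and
# position_ids restart at 0 for each sample so flash attention
# can keep the sequences separate without any padding tokens:
#   input_ids    = [[5, 6, 7, 8, 9]]
#   position_ids = [[0, 1, 2, 0, 1]]
```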
by @qgallouedec in https://github.com/huggingface/trl/pull/3076
As outlined in the DAPO paper, increasing the upper clipping bound leads to higher entropy during generation, promoting better exploration. To enable this, we've added support for setting the upper bound `epsilon_high` directly in the default GRPO trainer:

```python
training_args = GRPOConfig(epsilon_high=0.28)
```
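In terms of the clipped objective sketched earlier, this makes the clip range asymmetric (variable names are again assumptions):

```python
# The lower bound stays at 1 - epsilon, while the upper bound is raised to
# 1 + epsilon_high, letting token probabilities grow more before clipping.
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon_high)
```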
by @shirinyamani in https://github.com/huggingface/trl/pull/3118
Other changes and fixes in this release:

- `unwrap_model_for_generation` for DeepSpeed Stage-3 compatibility by @kiddj in https://github.com/huggingface/trl/pull/2871
- `inference_mode` to `no_grad` when computing `old_per_token_logps` by @qgallouedec in https://github.com/huggingface/trl/pull/2987
- `maybe_convert_to_chatml` map for conversational datasets in SFT by @kashif in https://github.com/huggingface/trl/pull/2862
- `max_seq_length` to `max_length` by @qgallouedec in https://github.com/huggingface/trl/pull/2895
- `max_seq_length` to `max_length` in `SFTConfig` by @qgallouedec in https://github.com/huggingface/trl/pull/2947
- `num_tokens` and some logging fixes by @qgallouedec in https://github.com/huggingface/trl/pull/3006
- `max_tool_response` by @shenxiangzhuang in https://github.com/huggingface/trl/pull/2921
- `enable_prefix_caching` by @ji-huazhong in https://github.com/huggingface/trl/pull/2900
- `num_iterations` by @qgallouedec in https://github.com/huggingface/trl/pull/2966
- `KTOConfig` by @sileod in https://github.com/huggingface/trl/pull/2912
- `apply_chat_template` by @jamesbraza in https://github.com/huggingface/trl/pull/3005
- `deepspeed>=0.16.4`'s rename by @jamesbraza in https://github.com/huggingface/trl/pull/2963
- `GPROTrainer.generation_config` by @jamesbraza in https://github.com/huggingface/trl/pull/3046
- `SFTTrainer.compute_loss` crash with accelerate by @jamesbraza in https://github.com/huggingface/trl/pull/3048
- `SFTTrainer.compute_loss` hang from #3048's PR comments by @jamesbraza in https://github.com/huggingface/trl/pull/3056

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.0...v0.16.0