
v0.20.0


Breaking and major changes

๐ŸŽž๏ธ GSPO

GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token.

<img width="930" height="538" alt="Screenshot 2025-07-28 at 10 54 15 PM" src="https://github.com/user-attachments/assets/923835af-dc61-4fd4-8a99-44242d02bb7b" />

📜 Paper: https://huggingface.co/papers/2507.18071

To reproduce the paper's setting, use this configuration:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    steps_per_generation=...,
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # https://x.com/ChujieZheng/status/1948933507696525392
)
```
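As a rough illustration of the difference (a sketch, not TRL's internal code), the sequence-level weight is a length-normalized geometric mean of the per-token ratios, broadcast to every token of the completion. `importance_weights` below is a hypothetical helper:

```python
import numpy as np

def importance_weights(logp_new, logp_old, level="sequence"):
    """Importance-sampling weights for one completion from per-token log-probs.

    level="token" gives GRPO-style per-token ratios; level="sequence"
    follows GSPO's sequence-level ratio: exp of the mean per-token
    log-ratio, identical for every token of the sequence.
    """
    log_ratio = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    if level == "token":
        return np.exp(log_ratio)  # one ratio per token
    seq_ratio = np.exp(log_ratio.mean())  # length-normalized sequence ratio
    return np.full_like(log_ratio, seq_ratio)
```

With `importance_sampling_level="sequence"`, every token of a completion thus shares one clipped ratio, which is what motivates the much smaller `epsilon` than in token-level GRPO.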

by @qgallouedec in https://github.com/huggingface/trl/pull/3775

👁️ [GRPO] Add VLM training capabilities to the GRPO trainer

<img width="1136" height="594" alt="Group 291-4" src="https://github.com/user-attachments/assets/04850e80-9689-472d-acd7-fda331e66dc3" />

The GRPOTrainer can now be used for VLM training. Give it a try with this dummy example:

```python
from trl import GRPOTrainer
from datasets import load_dataset

# Dummy vision-language dataset
dataset = load_dataset("trl-internal-testing/zen-image", "conversational_prompt_only", split="train")

# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c[0]["content"])) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    reward_funcs=[reward_num_unique_chars],
    train_dataset=dataset,
)

trainer.train()
```
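In conversational mode, each completion passed to the reward function is a list of chat messages, so the dummy reward scores the first message's content. It can be sanity-checked standalone (the example inputs below are made up for illustration):

```python
# Same dummy reward as above: count unique characters in each completion
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c[0]["content"])) for c in completions]

completions = [
    [{"role": "assistant", "content": "hello"}],  # {h, e, l, o} -> 4
    [{"role": "assistant", "content": "aaa"}],    # {a} -> 1
]
print(reward_num_unique_chars(completions))  # [4, 1]
```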

by @CompN3rd and @kashif in https://github.com/huggingface/trl/pull/3072 and https://github.com/huggingface/trl/pull/3760

๐Ÿ™ MPO

<img width="440" height="438" alt="Screenshot 2025-07-28 at 10 52 15 PM" src="https://github.com/user-attachments/assets/e07a7936-c4c5-480d-9ffd-db5b77a5445e" />

The DPO trainer supports combining multiple loss functions with different weights, enabling more sophisticated optimization strategies. This is particularly useful for implementing algorithms like MPO (Mixed Preference Optimization). MPO is a training approach that combines multiple optimization objectives, as described in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization.

To combine multiple losses, specify the loss types and corresponding weights as lists:

```python
from trl import DPOConfig

# MPO: combines DPO (sigmoid), BCO (bco_pair), and SFT losses
training_args = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # loss types to combine
    loss_weights=[0.8, 0.2, 1.0],  # corresponding weights, as used in the MPO paper
)
```
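Under the hood, combining losses this way amounts to a weighted sum of the per-objective losses. A minimal sketch (the `combined_loss` helper and the example loss values are hypothetical, not TRL internals):

```python
def combined_loss(losses, weights):
    """Weighted sum of per-objective losses, mirroring what
    loss_type/loss_weights lists produce in DPOConfig."""
    return sum(w * l for l, w in zip(losses, weights))

# e.g. sigmoid=0.40, bco_pair=0.55, sft=1.20 with the MPO weights [0.8, 0.2, 1.0]
total = combined_loss([0.40, 0.55, 1.20], [0.8, 0.2, 1.0])
print(total)  # 0.8*0.40 + 0.2*0.55 + 1.0*1.20 = 1.63
```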

by @qgallouedec in https://github.com/huggingface/trl/pull/2544

Add support for Continuous Batching with native transformers

Continuous Batching allows for faster generation using the transformers backend. You can now use it with the GRPOTrainer by setting use_transformers_paged=True in the config.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    use_transformers_paged=True,
)
```

by @ArthurZucker in https://github.com/huggingface/trl/pull/3471

Add entropy based filtering inside the GRPOTrainer

<img width="788" height="438" alt="Screenshot 2025-07-28 at 10 27 20 PM" src="https://github.com/user-attachments/assets/8073a5db-a98e-4534-aea9-d3dbd2e75f4a" />

In Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, it is shown that training on only the top 20% highest-entropy tokens yields performance similar to training on all tokens. You can now enable this feature in the GRPOTrainer by setting top_entropy_quantile in the config.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    top_entropy_quantile=0.2,  # use only the top 20% of tokens based on entropy
)
```
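The filtering itself reduces to masking out tokens below an entropy quantile threshold. A minimal sketch of the idea, assuming per-token entropies are already computed (`entropy_mask` is a hypothetical helper, not TRL's implementation):

```python
import numpy as np

def entropy_mask(token_entropies, quantile=0.2):
    """Boolean mask keeping only tokens whose entropy falls in the top
    `quantile` fraction, mirroring top_entropy_quantile=0.2."""
    ent = np.asarray(token_entropies, dtype=float)
    threshold = np.quantile(ent, 1.0 - quantile)  # e.g. the 80th percentile
    return ent >= threshold
```

Only tokens where the mask is True would contribute to the policy-gradient loss; the rest are ignored.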

by @pramodith in https://github.com/huggingface/trl/pull/3563

๐Ÿ‘ FSDP2+GRPO

GRPO now supports FSDP2 training. Just run your script with an FSDP2 config:

```shell
accelerate launch --config_file examples/accelerate_configs/fsdp2.yaml run_grpo.py
```

by @SalmanMohammadi in https://github.com/huggingface/trl/pull/3687


Full Changelog: https://github.com/huggingface/trl/compare/v0.19.0...v0.20.0
