
v0.20.0


Breaking and major changes

๐ŸŽž๏ธ GSPO

GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token.

<img width="930" height="538" alt="Screenshot 2025-07-28 at 10 54 15 PM" src="https://github.com/user-attachments/assets/923835af-dc61-4fd4-8a99-44242d02bb7b" />

📜 Paper: https://huggingface.co/papers/2507.18071

To reproduce the paper's setting, use this configuration:

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    steps_per_generation=...,
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # https://x.com/ChujieZheng/status/1948933507696525392
)
```
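As a rough illustration of the difference (a sketch, not TRL's internal code), the sequence-level weight is a length-normalized geometric mean of the per-token ratios, broadcast to every token of the completion. `importance_weights` below is a hypothetical helper:

```python
import numpy as np

def importance_weights(logp_new, logp_old, level="sequence"):
    """Importance-sampling weights for one completion from per-token log-probs.

    level="token" gives GRPO-style per-token ratios; level="sequence"
    follows GSPO's sequence-level ratio: exp of the mean per-token
    log-ratio, identical for every token of the sequence.
    """
    log_ratio = np.asarray(logp_new, dtype=float) - np.asarray(logp_old, dtype=float)
    if level == "token":
        return np.exp(log_ratio)  # one ratio per token
    seq_ratio = np.exp(log_ratio.mean())  # length-normalized sequence ratio
    return np.full_like(log_ratio, seq_ratio)
```

With `importance_sampling_level="sequence"`, every token of a completion thus shares one clipped ratio, which is what motivates the much smaller `epsilon` than in token-level GRPO.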

by @qgallouedec in https://github.com/huggingface/trl/pull/3775

👁️ [GRPO] Add VLM training capabilities to the GRPO trainer

<img width="1136" height="594" alt="Group 291-4" src="https://github.com/user-attachments/assets/04850e80-9689-472d-acd7-fda331e66dc3" />

The GRPOTrainer can now be used for VLM training. Give it a try with this dummy example:

```python
from trl import GRPOTrainer
from datasets import load_dataset

# Dummy vision-language dataset
dataset = load_dataset("trl-internal-testing/zen-image", "conversational_prompt_only", split="train")

# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c[0]["content"])) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    reward_funcs=[reward_num_unique_chars],
    train_dataset=dataset,
)

trainer.train()
```
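In conversational mode, each completion passed to the reward function is a list of chat messages, so the dummy reward scores the first message's content. It can be sanity-checked standalone (the example inputs below are made up for illustration):

```python
# Same dummy reward as above: count unique characters in each completion
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c[0]["content"])) for c in completions]

completions = [
    [{"role": "assistant", "content": "hello"}],  # {h, e, l, o} -> 4
    [{"role": "assistant", "content": "aaa"}],    # {a} -> 1
]
print(reward_num_unique_chars(completions))  # [4, 1]
```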

by @CompN3rd and @kashif in https://github.com/huggingface/trl/pull/3072 and https://github.com/huggingface/trl/pull/3760

๐Ÿ™ MPO

<img width="440" height="438" alt="Screenshot 2025-07-28 at 10 52 15 PM" src="https://github.com/user-attachments/assets/e07a7936-c4c5-480d-9ffd-db5b77a5445e" />

The DPO trainer supports combining multiple loss functions with different weights, enabling more sophisticated optimization strategies. This is particularly useful for implementing algorithms like MPO (Mixed Preference Optimization). MPO is a training approach that combines multiple optimization objectives, as described in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization.

To combine multiple losses, specify the loss types and corresponding weights as lists:

```python
from trl import DPOConfig

# MPO: combines DPO (sigmoid), BCO (bco_pair), and SFT losses
training_args = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # loss types to combine
    loss_weights=[0.8, 0.2, 1.0],  # corresponding weights, as used in the MPO paper
)
```
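Under the hood, combining losses this way amounts to a weighted sum of the per-objective losses. A minimal sketch (the `combined_loss` helper and the example loss values are hypothetical, not TRL internals):

```python
def combined_loss(losses, weights):
    """Weighted sum of per-objective losses, mirroring what
    loss_type/loss_weights lists produce in DPOConfig."""
    return sum(w * l for l, w in zip(losses, weights))

# e.g. sigmoid=0.40, bco_pair=0.55, sft=1.20 with the MPO weights [0.8, 0.2, 1.0]
total = combined_loss([0.40, 0.55, 1.20], [0.8, 0.2, 1.0])
print(total)  # 0.8*0.40 + 0.2*0.55 + 1.0*1.20 = 1.63
```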

by @qgallouedec in https://github.com/huggingface/trl/pull/2544

Add support for Continuous Batching with native transformers

Continuous Batching allows for faster generation using the transformers backend. You can now use it with the GRPOTrainer by setting use_transformers_paged=True in the config.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    use_transformers_paged=True,
)
```

by @ArthurZucker in https://github.com/huggingface/trl/pull/3471

Add entropy based filtering inside the GRPOTrainer

<img width="788" height="438" alt="Screenshot 2025-07-28 at 10 27 20 PM" src="https://github.com/user-attachments/assets/8073a5db-a98e-4534-aea9-d3dbd2e75f4a" />

In Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, it is shown that training on only the top 20% highest-entropy tokens yields performance similar to training on all tokens. You can now enable this feature in the GRPOTrainer by setting top_entropy_quantile in the config.

```python
from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    top_entropy_quantile=0.2,  # use only the top 20% of tokens based on entropy
)
```
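The filtering itself reduces to masking out tokens below an entropy quantile threshold. A minimal sketch of the idea, assuming per-token entropies are already computed (`entropy_mask` is a hypothetical helper, not TRL's implementation):

```python
import numpy as np

def entropy_mask(token_entropies, quantile=0.2):
    """Boolean mask keeping only tokens whose entropy falls in the top
    `quantile` fraction, mirroring top_entropy_quantile=0.2."""
    ent = np.asarray(token_entropies, dtype=float)
    threshold = np.quantile(ent, 1.0 - quantile)  # e.g. the 80th percentile
    return ent >= threshold
```

Only tokens where the mask is True would contribute to the policy-gradient loss; the rest are ignored.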

by @pramodith in https://github.com/huggingface/trl/pull/3563

๐Ÿ‘ FSDP2+GRPO

GRPO now supports FSDP2 training. Just run your script with an FSDP2 config:

```shell
accelerate launch --config_file examples/accelerate_configs/fsdp2.yaml run_grpo.py
```

by @SalmanMohammadi in https://github.com/huggingface/trl/pull/3687


Full Changelog: https://github.com/huggingface/trl/compare/v0.19.0...v0.20.0
