
v1.0.0

TRL · March 31, 2026
<img width="1800" height="1013" alt="thumbnail-2" src="https://github.com/user-attachments/assets/5c55b86a-0600-4f70-bf37-41ab240af851" />

Read our blog post for an overview of TRL v1.

Features

Asynchronous GRPO

Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5293

Variational Sequence-Level Soft Policy Optimization (VESPO)

<img width="465" height="279" alt="Screenshot 2026-03-20 at 5 49 50 PM" src="https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa" />

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in https://github.com/huggingface/trl/pull/5199

Divergence Proximal Policy Optimization (DPPO)

<img width="3180" height="1187" alt="z_TXYw37xZqsQ21YiDkYL" src="https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b" /> <img width="1189" height="490" alt="SfgWotuuuRKPkg-0bxWv1" src="https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689" />

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
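As a rough illustration of the difference, here is a toy contrast between PPO's clipped surrogate and a divergence-penalized one. The function names, the `beta` coefficient, and the use of the k3 KL estimator are assumptions for the sketch, not TRL's actual DPPO implementation:

```python
import math

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Standard PPO: hard-clip the importance ratio into [1 - eps, 1 + eps]
    # and take the pessimistic (min) surrogate.
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * advantage, clipped * advantage)

def divergence_penalty_loss(ratio, advantage, beta=0.05):
    # Divergence-constrained variant: no hard clip; instead, penalize an
    # estimate of the KL divergence. The k3 estimator (ratio - 1) - log(ratio)
    # is non-negative and vanishes when the policies agree (ratio == 1).
    kl = (ratio - 1.0) - math.log(ratio)
    return -(ratio * advantage) + beta * kl
```

The clipped loss zeroes the gradient outside the trust region, while the penalty version keeps a smooth gradient that grows with the divergence.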

by @LeonEricsson in https://github.com/huggingface/trl/pull/5117

Self-Distillation Policy Optimization (SDPO)

SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.

from datasets import load_dataset
from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

# Any prompt/answer dataset works; this one matches the GRPO example above.
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)

trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()

by @MengAiDev in https://github.com/huggingface/trl/pull/4935

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    # extract_answer is a user-supplied parser for the completion text
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards

<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" /> <img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />

by @manueldeprada in https://github.com/huggingface/trl/pull/5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
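A minimal sketch of what this enables: the tool schema below follows the OpenAI-style format that vLLM serves, while the import path and the `tools` parameter name are assumptions based on the PR description, so check the TRL docs for the exact signature.

```python
# OpenAI-style tool definition (the format vLLM's chat endpoint accepts).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Hypothetical usage (parameter name assumed from the PR description):
# from trl.extras.vllm_client import VLLMClient
# client = VLLMClient()
# response = client.chat(
#     [{"role": "user", "content": "What's the weather in Paris?"}],
#     tools=tools,
# )
```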

by @kansalaman in https://github.com/huggingface/trl/pull/4889

35% faster packing

Best-Fit Decreasing (BFD) packing is now 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.

<img width="1784" height="732" alt="benchmark_results" src="https://github.com/user-attachments/assets/8f0a35ad-cf1a-4fe1-a1f4-9b102637bdca" />
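Assuming the renamed strategy is selected through `SFTConfig` as before (the option name below is taken from this release note; MIGRATION.md is the authoritative reference), usage would look like:

```python
from trl import SFTConfig

args = SFTConfig(
    packing=True,                  # pack several short examples per sequence
    packing_strategy="bfd_split",  # renamed from "bfd-requeue" in v1.0.0
)
```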

by @mariosasko in https://github.com/huggingface/trl/pull/5189

[GKD] Buffer implementation and vLLM inference for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.
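The buffered-rollout idea can be sketched generically. This is not TRL's implementation, just the underlying pattern with hypothetical names: a bounded queue lets generation run ahead of the gradient loop while capping how stale the buffered data can get.

```python
from collections import deque

class RolloutBuffer:
    """Toy bounded buffer decoupling rollout generation from training."""

    def __init__(self, maxlen=8):
        # A bounded deque evicts the oldest rollout when full,
        # limiting staleness of the data the trainer consumes.
        self.queue = deque(maxlen=maxlen)

    def put(self, rollout):
        # Called by the generation side (e.g. a vLLM worker).
        self.queue.append(rollout)

    def get_batch(self, n):
        # Called by the training side to fetch up to n rollouts.
        return [self.queue[i] for i in range(min(n, len(self.queue)))]
```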

by @cmpatino in https://github.com/huggingface/trl/pull/5137 and https://github.com/huggingface/trl/pull/5388

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in https://github.com/huggingface/trl/pull/5255

Other

Fixes

Documentation and Examples

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0
