
v1.0.0rc1

March 20, 2026 · TRL

Features

Variational Sequence-Level Soft Policy Optimization (VESPO)

<img width="465" height="279" alt="Screenshot 2026-03-20 at 5 49 50 PM" src="https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa" />

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in https://github.com/huggingface/trl/pull/5199
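To build intuition for what a Gamma-shaped reshaping kernel does to sequence-level importance weights, here is a minimal sketch. The functional form w^(alpha-1) * exp(-beta * w) and the values of alpha and beta are illustrative assumptions, not TRL's actual kernel or defaults:

```python
import math

def gamma_reshape(w: float, alpha: float = 2.0, beta: float = 2.0) -> float:
    """Illustrative Gamma-density-style reshaping of a sequence-level
    importance weight w. Weights near 1 are kept large, while extreme
    weights (very large or near zero) are smoothly suppressed, and the
    suppression is asymmetric: exponential for large w, polynomial for
    small w. alpha/beta here are assumptions for illustration only."""
    return (w ** (alpha - 1.0)) * math.exp(-beta * w)

# Moderate weights survive; extreme weights are damped without a hard clip.
for w in (0.01, 0.5, 1.0, 5.0):
    print(f"w={w:>5}: reshaped={gamma_reshape(w):.4f}")
```

Unlike a hard clip, this curve is smooth everywhere, so gradients do not vanish abruptly at a clipping boundary.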

Divergence Proximal Policy Optimization (DPPO)

<img width="3180" height="1187" alt="z_TXYw37xZqsQ21YiDkYL" src="https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b" /> <img width="1189" height="490" alt="SfgWotuuuRKPkg-0bxWv1" src="https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689" />

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.

by @LeonEricsson in https://github.com/huggingface/trl/pull/5117
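As a toy illustration of the difference between clipping and a divergence constraint, the sketch below compares PPO's clipped surrogate with a divergence-penalized one. The penalty form r - 1 - log(r) (which is nonnegative and zero at r = 1) and the coefficient beta are assumptions chosen for illustration, not DPPO's actual objective:

```python
import math

def ppo_clip_obj(ratio: float, adv: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate, for comparison: the ratio is hard-
    clipped to [1 - eps, 1 + eps] before multiplying the advantage."""
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * adv, clipped * adv)

def divergence_obj(ratio: float, adv: float, beta: float = 0.05) -> float:
    """Illustrative divergence-penalized surrogate: instead of clipping,
    subtract a smooth penalty that grows as the ratio drifts from 1.
    The term (r - 1 - log r) is >= 0 and vanishes exactly at r = 1."""
    return ratio * adv - beta * (ratio - 1.0 - math.log(ratio))

print(ppo_clip_obj(2.0, 1.0))     # clipped at 1 + eps
print(divergence_obj(1.0, 1.0))   # no penalty when ratio == 1
```

The divergence penalty keeps the objective differentiable in the ratio everywhere, whereas the clipped surrogate has zero gradient outside the trust region.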

Reward functions can now log extra columns and scalar metrics

Reward functions can now log extra values (scalars or per-sample columns) alongside the reward via optional log_extra and log_metric callbacks. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" /> <img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />

by @manueldeprada in https://github.com/huggingface/trl/pull/5233
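To make the callback contract concrete, here is a small standalone harness that invokes a reward function the way a trainer might. The collector closures and the exact invocation details are assumptions for illustration; only the log_extra / log_metric callback names come from the release notes:

```python
# Collectors standing in for the trainer's logging backend (an assumption
# for illustration; the real trainer wires these up internally).
logged_columns: dict = {}
logged_metrics: dict = {}

def log_extra(name, values):
    # Per-sample columns, e.g. one value per completion.
    logged_columns[name] = values

def log_metric(name, value):
    # Scalar metrics aggregated over the batch.
    logged_metrics[name] = value

def exact_match_reward(completions, answer, log_extra=None, log_metric=None, **kwargs):
    rewards = [1.0 if c.strip() == a else 0.0 for c, a in zip(completions, answer)]
    if log_extra:
        log_extra("golden_answer", list(answer))
    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards

rewards = exact_match_reward(
    ["4", "5"], ["4", "6"], log_extra=log_extra, log_metric=log_metric
)
print(rewards, logged_metrics)  # rewards plus the logged accuracy
```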

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in https://github.com/huggingface/trl/pull/4889
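A hedged sketch of what a tool-calling request might look like: the tool definition below follows the widely used OpenAI-style function schema, but the exact keyword argument accepted by VLLMClient.chat() is an assumption here, so the call itself is shown commented out:

```python
# OpenAI-style tool schema (a common convention; check the TRL docs for
# the exact format VLLMClient.chat() expects).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool for illustration
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

# Assumed usage (requires a running vLLM server; parameter name is a guess):
# from trl.extras.vllm_client import VLLMClient
# client = VLLMClient()
# client.chat(
#     messages=[{"role": "user", "content": "Weather in Paris?"}],
#     tools=tools,
# )
```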

35% faster packing

Best-Fit Decreasing (BFD) packing is now 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split"; see MIGRATION.md for details.

<img width="1784" height="732" alt="benchmark_results" src="https://github.com/user-attachments/assets/8f0a35ad-cf1a-4fe1-a1f4-9b102637bdca" />

by @mariosasko in https://github.com/huggingface/trl/pull/5189
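For readers unfamiliar with the strategy, Best-Fit Decreasing packs sequences longest-first, placing each one into the fullest bin it still fits in. A minimal sketch of the algorithm on raw sequence lengths (an illustration of the general technique, not TRL's implementation, which operates on tokenized examples):

```python
def pack_bfd(lengths: list[int], max_len: int) -> list[int]:
    """Best-Fit Decreasing bin packing: sort lengths descending, then put
    each sequence into the *fullest* bin with enough remaining capacity,
    opening a new bin only when nothing fits. Returns the fill level of
    each bin."""
    bins: list[int] = []
    for length in sorted(lengths, reverse=True):
        best = None
        for i, used in enumerate(bins):
            if used + length <= max_len and (best is None or used > bins[best]):
                best = i
        if best is None:
            bins.append(length)  # open a new bin
        else:
            bins[best] += length
    return bins

print(pack_bfd([5, 4, 3, 3, 2, 1], max_len=6))  # three fully packed bins
```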

[GKD] Buffer implementation for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.

by @cmpatino in https://github.com/huggingface/trl/pull/5137
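The core idea of buffered rollout generation is that a producer fills a bounded buffer with generated rollouts while the trainer samples batches from it independently. A minimal sketch of that decoupling (the class and method names here are hypothetical; this is the concept, not TRL's implementation):

```python
import random
from collections import deque

class RolloutBuffer:
    """Bounded buffer decoupling rollout generation from gradient updates:
    a generation loop appends rollouts, the training loop samples batches.
    Old rollouts are evicted once the buffer is full."""

    def __init__(self, maxlen: int = 1024):
        self.buffer: deque = deque(maxlen=maxlen)

    def add(self, rollouts) -> None:
        self.buffer.extend(rollouts)

    def sample(self, batch_size: int) -> list:
        # Sample without replacement, capped at the current buffer size.
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

buf = RolloutBuffer(maxlen=4)
buf.add(["r1", "r2", "r3", "r4", "r5"])  # "r1" is evicted (maxlen=4)
print(buf.sample(2))
```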

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in https://github.com/huggingface/trl/pull/5255

Other

Fixes

Documentation and Examples

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0rc1
