VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:
```python
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
```
by @casinca in https://github.com/huggingface/trl/pull/5199
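The contrast with hard clipping can be pictured with a toy comparison. The Gamma-shaped weight below is an illustrative stand-in (the `alpha`/`beta` kernel is invented for this sketch, not TRL's actual `vespo` implementation): it passes a weight of 1 through unchanged, but decays smoothly and asymmetrically for large importance weights instead of flattening them at a hard cutoff.

```python
import math

def clip_weight(w: float, eps: float = 0.2) -> float:
    """Hard PPO-style clipping: flat outside [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, w))

def gamma_weight(w: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Toy Gamma-shaped reshaping (illustrative only): equals w near
    w = 1 but decays exponentially for large w, so extreme sequence-level
    importance weights are suppressed without a hard cutoff."""
    return (w ** alpha) * math.exp(beta * (1.0 - w))

for w in (0.5, 1.0, 2.0, 8.0):
    print(w, round(clip_weight(w), 3), round(gamma_weight(w), 3))
```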
DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
by @LeonEricsson in https://github.com/huggingface/trl/pull/5117
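To make the distinction concrete, here is a generic per-sample comparison between PPO's clipped surrogate and a divergence-penalized one. The penalized objective below is a common trust-region sketch using Schulman's nonnegative k3 KL estimator, not necessarily DPPO's exact loss:

```python
import math

def ppo_clip_loss(logp_new: float, logp_old: float, adv: float, eps: float = 0.2) -> float:
    """Standard PPO surrogate: hard-clip the probability ratio."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return -min(ratio * adv, clipped * adv)

def divergence_loss(logp_new: float, logp_old: float, adv: float, beta: float = 0.1) -> float:
    """Divergence-penalized surrogate (generic sketch): instead of
    clipping, a nonnegative KL penalty (the k3 estimator) pulls the
    new policy back toward the old one."""
    ratio = math.exp(logp_new - logp_old)
    kl = (ratio - 1.0) - math.log(ratio)  # >= 0, zero iff ratio == 1
    return -(ratio * adv) + beta * kl
```

When the policies coincide (`logp_new == logp_old`), the penalty vanishes and both objectives reduce to the plain policy-gradient surrogate; as the ratio drifts, the penalty grows smoothly rather than going flat the way the clipped objective does.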
Reward functions can now log extra values (scalars or per-sample columns) alongside the reward, via optional `log_extra` and `log_metric` callables passed into the function. This makes it easier to track intermediate signals without writing custom callbacks.
```python
def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)
    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards
```
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" />
<img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />
by @manueldeprada in https://github.com/huggingface/trl/pull/5233
`VLLMClient.chat()` now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
by @kansalaman in https://github.com/huggingface/trl/pull/4889
BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.
by @mariosasko in https://github.com/huggingface/trl/pull/5189
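For context, best-fit-decreasing (BFD) packing itself is easy to sketch. The function below is a generic illustration of the strategy (not TRL's optimized implementation): sort sequences by length, then place each into the bin with the least remaining room that still fits it.

```python
def bfd_pack(lengths, max_len):
    """Pack sequence lengths into bins of capacity max_len using
    best-fit-decreasing. Returns lists of original indices per bin."""
    bins = []  # each bin: [remaining_capacity, [sequence indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        # best fit: the bin with the smallest remaining room that still fits
        best = None
        for b in bins:
            if b[0] >= lengths[i] and (best is None or b[0] < best[0]):
                best = b
        if best is None:
            bins.append([max_len - lengths[i], [i]])
        else:
            best[0] -= lengths[i]
            best[1].append(i)
    return [b[1] for b in bins]

print(bfd_pack([7, 5, 4, 3, 1], max_len=8))
```

BFD gives near-optimal packing in practice while staying O(n²) worst case (and O(n log n) with a tree over bin capacities), which is why it is a popular strategy for batching variable-length sequences.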
The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.
by @cmpatino in https://github.com/huggingface/trl/pull/5137
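The decoupling can be pictured with a minimal buffer (a hypothetical class for illustration, not the trainer's real API): generation fills the buffer ahead of time, possibly in bulk or on another worker, so gradient steps consume pre-generated completions instead of blocking on the sampler every step.

```python
from collections import deque

class RolloutBuffer:
    """Toy buffered-rollout sketch: generation and consumption are
    decoupled through a FIFO queue of (prompt, completion) pairs."""

    def __init__(self, capacity: int):
        self.buf = deque(maxlen=capacity)

    def fill(self, generate_fn, prompts):
        # Run generation in bulk ahead of training steps.
        for p in prompts:
            self.buf.append((p, generate_fn(p)))

    def next_batch(self, n: int):
        # Training consumes pre-generated rollouts without waiting.
        return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]

buf = RolloutBuffer(capacity=8)
buf.fill(lambda p: p.upper(), ["a", "b", "c"])  # stand-in for model generation
batch = buf.next_batch(2)
```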
A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.
by @qgallouedec in https://github.com/huggingface/trl/pull/5255
* vllm_mode to "colocate" and add v0→v1 migration guide by @qgallouedec in https://github.com/huggingface/trl/pull/5255
* truncation_mode in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306
* max_length in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284
* pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180
* accuracy_reward crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281
* RewardFunc type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246
* prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
* AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5236
* prompts in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225
* truncate_prompt_tokens for vLLM 0.17.0 by @winglian in https://github.com/huggingface/trl/pull/5248
* rollout_func from _generate_single_turn to _generate by @qgallouedec in https://github.com/huggingface/trl/pull/5232
* _generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5239
* _generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5240
* bfd-requeue to bfd_split by @mariosasko in https://github.com/huggingface/trl/pull/5189
* Variational Sequence-Level Soft Policy Optimization (VESPO) (grpo_trainer.py) by @casinca in https://github.com/huggingface/trl/pull/5199
* hasattr and getattr with defaults in AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5294
* RewardFunc type annotation to allow None values in reward list by @qgallouedec in https://github.com/huggingface/trl/pull/5297
* Json() type for tool calling dataset format by @lhoestq in https://github.com/huggingface/trl/pull/5307

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0rc1