Read our blog post for an overview of TRL v1.
Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.
```python
from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()
```
by @qgallouedec in https://github.com/huggingface/trl/pull/5293
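Asynchronous rollouts assume a vLLM server is already reachable; with TRL's CLI, one is typically launched in a separate process (the model name here just mirrors the snippet above):

```shell
# Launch a vLLM generation server for async rollouts (separate process/GPU)
trl vllm-serve --model Qwen/Qwen2.5-0.5B-Instruct
```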
VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:
```python
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
```
by @casinca in https://github.com/huggingface/trl/pull/5199
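The exact kernel is derived in the PR; purely to illustrate the qualitative shape involved (this is *not* VESPO's actual formula), a Gamma-style weight such as `w * exp(1 - w)` peaks at an importance ratio of 1 and decays smoothly and asymmetrically on both sides:

```python
import math

def gamma_style_reshape(w: float) -> float:
    """Illustrative Gamma-shaped weighting: maximal at w == 1, smoothly
    suppressing both very large and very small importance ratios.
    NOT the exact VESPO kernel, just a sketch of the behavior described above."""
    return w * math.exp(1.0 - w)

for w in (0.1, 0.5, 1.0, 2.0, 10.0):
    print(f"w={w:>4}: reshaped={gamma_style_reshape(w):.4f}")
```

Unlike hard clipping, the suppression is smooth, so gradients never vanish abruptly at a fixed threshold.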
DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
by @LeonEricsson in https://github.com/huggingface/trl/pull/5117
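DPPO's precise objective lives in the PR; the generic contrast between PPO's clipped surrogate and a divergence-penalized, trust-region-style surrogate, for a single token with importance ratio `r` and advantage `A`, can be sketched as follows (a hedged illustration, not DPPO's implementation):

```python
import math

def ppo_clip_objective(r: float, adv: float, eps: float = 0.2) -> float:
    """Standard PPO clipped surrogate: the incentive is capped once the
    importance ratio leaves [1 - eps, 1 + eps]."""
    clipped = min(max(r, 1.0 - eps), 1.0 + eps)
    return min(r * adv, clipped * adv)

def kl_penalized_objective(r: float, adv: float, beta: float = 0.05) -> float:
    """Divergence-penalized alternative: unclipped surrogate minus a KL
    penalty (k3 estimator: r - 1 - log r), which shrinks updates smoothly
    as the policy drifts instead of hard-clipping them."""
    return r * adv - beta * (r - 1.0 - math.log(r))

print(ppo_clip_objective(5.0, 1.0))  # capped at 1.2
print(kl_penalized_objective(5.0, 1.0))
```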
SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.
```python
from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)
trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,  # any prompt dataset, e.g. loaded with load_dataset
)
trainer.train()
```
by @MengAiDev in https://github.com/huggingface/trl/pull/4935
Reward functions can now log extra values (scalars or per-sample columns) alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.
```python
def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)
    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards
```
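Both hooks default to `None`, so the same function still works outside the trainer. A quick standalone check, with `extract_answer` stubbed out purely for illustration:

```python
def extract_answer(completion):
    # Toy stand-in for a real answer parser
    return completion.split("=")[-1].strip()

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
    if log_extra:
        log_extra("extracted_answer", extracted)
    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards

logged = {}
rewards = my_reward_fn(
    ["2 + 2 = 4", "3 + 3 = 7"],
    ["4", "6"],
    log_extra=lambda key, values: logged.update({key: values}),
    log_metric=lambda key, value: logged.update({key: value}),
)
print(rewards, logged)  # [1.0, 0.0] {'extracted_answer': ['4', '7'], 'accuracy': 0.5}
```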
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" />
<img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />
by @manueldeprada in https://github.com/huggingface/trl/pull/5233
`VLLMClient.chat()` now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
by @kansalaman in https://github.com/huggingface/trl/pull/4889
BFD packing is 35% faster. The `"bfd-requeue"` packing strategy has also been renamed to `"bfd_split"`. See MIGRATION.md for details.
by @mariosasko in https://github.com/huggingface/trl/pull/5189
The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.
by @cmpatino in https://github.com/huggingface/trl/pull/5137 and https://github.com/huggingface/trl/pull/5388
A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.
by @qgallouedec in https://github.com/huggingface/trl/pull/5255
- `vllm_mode` to "colocate" by @qgallouedec in https://github.com/huggingface/trl/pull/5255
- `truncation_mode` in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306
- `max_length` in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284
- `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180
- `print_prompt_completions_sample` to include reasoning content by @qgallouedec in https://github.com/huggingface/trl/pull/5327
- `pixel_position_ids` vision key by @qgallouedec in https://github.com/huggingface/trl/pull/5374
- `None` to `apply_chat_template` when it is an empty list by @rabinadk1 in https://github.com/huggingface/trl/pull/5380
- `accuracy_reward` crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281
- `RewardFunc` type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246
- `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
- AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5236
- `environment_factory` by @sergiopaniego in https://github.com/huggingface/trl/pull/5235
- `.ai` by @qgallouedec in https://github.com/huggingface/trl/pull/5268
- `prompts` in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225
- `truncate_prompt_tokens` for vLLM 0.17.0 by @winglian in https://github.com/huggingface/trl/pull/5248
- `rollout_func` from `_generate_single_turn` to `_generate` by @qgallouedec in https://github.com/huggingface/trl/pull/5232
- `_generate_single_turn` by @qgallouedec in https://github.com/huggingface/trl/pull/5239
- `_generate_single_turn` by @qgallouedec in https://github.com/huggingface/trl/pull/5240
- `bfd-requeue` to `bfd_split` by @mariosasko in https://github.com/huggingface/trl/pull/5189
- `vllm_mode` to "colocate" and add v0→v1 migration guide by @qgallouedec in https://github.com/huggingface/trl/pull/5255
- `grpo_trainer.py`: Variational Sequence-Level Soft Policy Optimization (VESPO) by @casinca in https://github.com/huggingface/trl/pull/5199
- `hasattr` and `getattr` with defaults in AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5294
- `RewardFunc` type annotation to allow `None` values in reward list by @qgallouedec in https://github.com/huggingface/trl/pull/5297
- `Json()` type for tool calling dataset format by @lhoestq in https://github.com/huggingface/trl/pull/5307
- GRPOTrainer/async: fix prefix EOS slicing for tool suffix (with Qwen3/3.5 type of chat templates) by @casinca in https://github.com/huggingface/trl/pull/5330
- `grpo_trainer.py` by @casinca in https://github.com/huggingface/trl/pull/5332
- AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5280
- `TRACKIO_SPACE_ID` env var from all scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/5365
- `pr_template_check.yml` by @qgallouedec in https://github.com/huggingface/trl/pull/5393
- `disable_config=True` from generate to GenerationConfig by @qgallouedec in https://github.com/huggingface/trl/pull/5384

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0