v1.0.0rc1
Features
Variational Sequence-Level Soft Policy Optimization (VESPO)
<img width="465" height="279" alt="Screenshot 2026-03-20 at 5 49 50 PM" src="https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa" />
VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
by @casinca in https://github.com/huggingface/trl/pull/5199
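The difference between hard clipping and a smooth Gamma-shaped weighting can be pictured with a small numerical sketch. The functional form `w**k * exp(lam * (1 - w))`, the function names, and the values `k=2.0` and `lam=3.0` are illustrative assumptions made for this sketch; they are not taken from the PR or from TRL's implementation.

```python
import math


def hard_clip(w, eps=0.2):
    """Clip an importance weight to [1 - eps, 1 + eps] (PPO/GRPO-style)."""
    return min(max(w, 1.0 - eps), 1.0 + eps)


def gamma_kernel(w, k=2.0, lam=3.0):
    """Assumed Gamma-shaped kernel: w**k * exp(lam * (1 - w)).

    It equals 1 at w == 1 and decays smoothly toward 0 on both sides:
    polynomially as w -> 0 and exponentially as w grows large.
    """
    return (w ** k) * math.exp(lam * (1.0 - w))


# Compare the two reshaping rules on a few importance weights.
for w in (0.1, 0.5, 1.0, 2.0, 5.0):
    print(f"w={w:>4}: clip={hard_clip(w):.3f}  gamma={gamma_kernel(w):.6f}")
```

Under these assumptions, clipping saturates at the boundary value however extreme the weight is, while the kernel keeps shrinking the contribution of very stale sequences.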
Divergence Proximal Policy Optimization (DPPO)
<img width="3180" height="1187" alt="z_TXYw37xZqsQ21YiDkYL" src="https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b" /> <img width="1189" height="490" alt="SfgWotuuuRKPkg-0bxWv1" src="https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689" />
DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
by @LeonEricsson in https://github.com/huggingface/trl/pull/5117
Reward functions can now log extra columns and scalar metrics
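Reward functions can now log extra values (scalars or per-sample columns) alongside the reward. They do so through the log_extra and log_metric callables, which the example below accepts as optional arguments: log_extra(name, values) records a per-sample column and log_metric(name, value) records a scalar. This makes it easier to track intermediate signals without writing custom callbacks.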
def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" />
<img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />
by @manueldeprada in https://github.com/huggingface/trl/pull/5233
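The reward function shown in this section can be tried outside a trainer with stand-in loggers. In this sketch, `extract_answer` is a hypothetical helper (the release notes do not define it), the completions are assumed to be plain strings, and the two stub loggers simply collect what a trainer would otherwise receive.

```python
import re


def extract_answer(completion):
    """Hypothetical helper: return the last integer found in the text, or ""."""
    numbers = re.findall(r"-?\d+", completion)
    return numbers[-1] if numbers else ""


def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)
    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards


# Stub loggers standing in for the callables the trainer would pass.
columns, metrics = {}, {}


def log_extra(name, values):
    columns[name] = list(values)


def log_metric(name, value):
    metrics[name] = value


rewards = my_reward_fn(
    completions=["The answer is 4", "I think it is 7"],
    answer=["4", "5"],
    log_extra=log_extra,
    log_metric=log_metric,
)
print(rewards)   # [1.0, 0.0]
print(columns)   # per-sample columns keyed by name
print(metrics)   # {'accuracy': 0.5}
```

Because both arguments default to None, the same function still works when it is called without any logger.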
Tool calling support in VLLMClient.chat()
VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
by @kansalaman in https://github.com/huggingface/trl/pull/4889
35% faster packing
BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.
by @mariosasko in https://github.com/huggingface/trl/pull/5189
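For readers unfamiliar with the term, BFD stands for best-fit decreasing, a bin-packing heuristic. The sketch below is a plain, unoptimized illustration of the idea applied to sequence lengths; it is not TRL's implementation, the function name `bfd_pack` is hypothetical, and it assumes every sequence already fits within `max_length`.

```python
def bfd_pack(lengths, max_length):
    """Group sequence lengths into bins of capacity max_length.

    Best-fit decreasing: visit sequences from longest to shortest and put
    each one into the bin with the least remaining room that still fits it.
    """
    bins = []       # each bin is a list of sequence lengths
    remaining = []  # free capacity of each bin, same order as `bins`
    for length in sorted(lengths, reverse=True):
        if length > max_length:
            raise ValueError(f"sequence of length {length} exceeds {max_length}")
        best = None
        for i, free in enumerate(remaining):
            if free >= length and (best is None or free < remaining[best]):
                best = i
        if best is None:
            # No existing bin has room: open a new one.
            bins.append([length])
            remaining.append(max_length - length)
        else:
            bins[best].append(length)
            remaining[best] -= length
    return bins


packed = bfd_pack([5, 3, 4, 2, 6, 1], max_length=8)
print(packed)  # [[6, 2], [5, 3], [4, 1]]
```

Here six sequences totalling 21 tokens fit into three packed rows of length 8, leaving only 3 positions of padding.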
[GKD] Buffer implementation for distillation trainer
The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.
by @cmpatino in https://github.com/huggingface/trl/pull/5137
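One way to picture the decoupling of generation from gradient updates is a queue that is refilled by a single generation call and then drained over several training steps. The class below is a generic sketch under that assumption; `RolloutBuffer`, `generate_fn`, and `batches_per_refill` are hypothetical names and do not describe the trainer's actual code.

```python
from collections import deque


class RolloutBuffer:
    """Serve rollout batches, regenerating only when the buffer runs dry."""

    def __init__(self, generate_fn, batches_per_refill):
        self.generate_fn = generate_fn          # returns a list of batches
        self.batches_per_refill = batches_per_refill
        self.num_generation_calls = 0
        self._queue = deque()

    def next_batch(self):
        if not self._queue:
            # One (expensive) generation call feeds several optimizer steps.
            self._queue.extend(self.generate_fn(self.batches_per_refill))
            self.num_generation_calls += 1
        return self._queue.popleft()


def fake_generate(n):
    """Stand-in for student generation: returns n placeholder batches."""
    return [f"rollout-batch-{i}" for i in range(n)]


buffer = RolloutBuffer(fake_generate, batches_per_refill=3)
for step in range(6):
    batch = buffer.next_batch()  # a gradient update would consume `batch`

print(buffer.num_generation_calls)  # 2
```

Six training steps trigger only two generation calls in this toy setup, which is the efficiency argument behind buffering.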
v0 → v1 migration guide
A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.
by @qgallouedec in https://github.com/huggingface/trl/pull/5255
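Two of the changes listed in these notes can be expressed as a small rewrite of a keyword-argument dictionary: the "bfd-requeue" to "bfd_split" rename and the new "colocate" default for vllm_mode. The helper below is a hypothetical sketch, not part of TRL. It assumes the config is held as a plain dict and that the previous vllm_mode default was "server"; MIGRATION.md remains the authoritative list of changes.

```python
def migrate_config_kwargs(old):
    """Return a copy of v0-style config kwargs adjusted for v1 (sketch only)."""
    new = dict(old)

    # The "bfd-requeue" packing strategy was renamed to "bfd_split".
    if new.get("packing_strategy") == "bfd-requeue":
        new["packing_strategy"] = "bfd_split"

    # The default vllm_mode is now "colocate". Pin the assumed old default
    # explicitly so that existing vLLM setups keep their behavior.
    if new.get("use_vllm") and "vllm_mode" not in new:
        new["vllm_mode"] = "server"

    return new


print(migrate_config_kwargs({"packing": True, "packing_strategy": "bfd-requeue"}))
print(migrate_config_kwargs({"use_vllm": True}))
```

Configs that already set vllm_mode, or that use another packing strategy, pass through unchanged.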
Other
- Change default vllm_mode to "colocate" by @qgallouedec in https://github.com/huggingface/trl/pull/5255
- Support truncation_mode in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306
- Support max_length in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284
- Add pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180
- Support sequence sampling in Liger Kernel by @michaelroyzen in https://github.com/huggingface/trl/pull/5190
- Add tool calling support to VLLMClient.chat() by @kansalaman in https://github.com/huggingface/trl/pull/4889
- Add support for raw token IDs in vLLM client prompts by @qgallouedec in https://github.com/huggingface/trl/pull/5225
- Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in https://github.com/huggingface/trl/pull/5227
Fixes
- Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in https://github.com/huggingface/trl/pull/5305
- Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in https://github.com/huggingface/trl/pull/5286
- Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5279
- Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in https://github.com/huggingface/trl/pull/5295
- Fix accuracy_reward crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281
- Fix GRPOTrainer attribute access for vLLM model config by @falcondai in https://github.com/huggingface/trl/pull/5302
- [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in https://github.com/huggingface/trl/pull/5242
- [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in https://github.com/huggingface/trl/pull/4639
- Fix RewardFunc type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246
- fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in https://github.com/huggingface/trl/pull/5245
- Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
- Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5230
- Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5274
- Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5266
- Sync entire prompt/completion token tensors before indexing by @shawnghu in https://github.com/huggingface/trl/pull/5218
- Clean up model update group on worker exit by @AmineDiro in https://github.com/huggingface/trl/pull/5325
Documentation and Examples
- Add minimal CARLA example script by @sergiopaniego in https://github.com/huggingface/trl/pull/5161
- Nemotron 3 examples added by @sergiopaniego in https://github.com/huggingface/trl/pull/5272
- Align docs about tool calling in trainers with dataset format by @albertvillanova in https://github.com/huggingface/trl/pull/5311
- Add repository-specific guidance for agents (AGENTS.md) by @qgallouedec in https://github.com/huggingface/trl/pull/5236
- Align documentation with the intended public API by @qgallouedec in https://github.com/huggingface/trl/pull/5162
What's Changed
- ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/5182
- Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in https://github.com/huggingface/trl/pull/5178
- Document parameters with differing default values in core configs by @albertvillanova in https://github.com/huggingface/trl/pull/5168
- Make _BaseConfig and _BaseTrainer explicitly private by @albertvillanova in https://github.com/huggingface/trl/pull/5169
- Refactor CLI [4/N]: Replace top-level TrlParser with ArgumentParser by @albertvillanova in https://github.com/huggingface/trl/pull/5170
- Add minimal CARLA example script by @sergiopaniego in https://github.com/huggingface/trl/pull/5161
- Align documentation with the intended public API by @qgallouedec in https://github.com/huggingface/trl/pull/5162
- Fix deprecation warning of create_reference_model by @albertvillanova in https://github.com/huggingface/trl/pull/5184
- Fix deprecation warning of fork in multi-threaded process by @albertvillanova in https://github.com/huggingface/trl/pull/5185
- Refactor CLI [5/N]: Refactor TrainingCommand with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5186
- Refactor CLI [6/N]: Refactor env/vllm-serve commands with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5187
- Fix CI tests patching BaseTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/5192
- Add pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180
- Re-add liger-kernel to dev deps by @qgallouedec in https://github.com/huggingface/trl/pull/5164
- Set CI PYTORCH_ALLOC_CONF env variable to avoid OOM by @albertvillanova in https://github.com/huggingface/trl/pull/5197
- Support sequence sampling in Liger Kernel and pass importance_samplin… by @michaelroyzen in https://github.com/huggingface/trl/pull/5190
- Mark CI test_training_vlm_and_liger as xfail by @albertvillanova in https://github.com/huggingface/trl/pull/5202
- Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in https://github.com/huggingface/trl/pull/5122
- CI: Add Qwen 3.5 tiny model to tests by @qgallouedec in https://github.com/huggingface/trl/pull/5204
- Add support for Qwen3.5 for agent training by @qgallouedec in https://github.com/huggingface/trl/pull/5205
- Update vLLM version support to include 0.13.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5206
- feat: Add tool calling support to VLLMClient.chat() by @kansalaman in https://github.com/huggingface/trl/pull/4889
- Refactor CLI [7/N]: Move patching to compat and import transformers conditionally by @albertvillanova in https://github.com/huggingface/trl/pull/5208
- Update vLLM version support to include 0.14.0 and 0.14.1 by @qgallouedec in https://github.com/huggingface/trl/pull/5214
- Refactor CLI [8/N]: Refactor scripts/utils with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5209
- Simplify logic for structured outputs across vLLM versions by @albertvillanova in https://github.com/huggingface/trl/pull/5215
- Refactor CLI [9/N]: Replace HfArgumentParser from transformers with local by @albertvillanova in https://github.com/huggingface/trl/pull/5210
- Refactor CLI [10/N]: Refactor scripts with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5219
- Refactor CLI [11/N]: Refactor scripts/vllm_serve with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5220
- Refactor CLI [12/N]: Fix command name in scripts help usage by @albertvillanova in https://github.com/huggingface/trl/pull/5221
- Refactor CLI [13/N]: Pass clean training args to scripts by @albertvillanova in https://github.com/huggingface/trl/pull/5223
- Fix prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
- Fix link to Hugging Face Hub in OpenEnv documentation by @thesteve0 in https://github.com/huggingface/trl/pull/5229
- Fix type for model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5230
- Add repository-specific guidance for agents (AGENTS.md) by @qgallouedec in https://github.com/huggingface/trl/pull/5236
- Add support for raw ids in prompts in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225
- Deprecate truncate_prompt_tokens for vLLM 0.17.0 by @winglian in https://github.com/huggingface/trl/pull/5248
- Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in https://github.com/huggingface/trl/pull/5227
- Move rollout_func from _generate_single_turn to _generate by @qgallouedec in https://github.com/huggingface/trl/pull/5232
- Fix RewardFunc type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246
- [GRPO] In-place temperature scaling operation by @winglian in https://github.com/huggingface/trl/pull/5254
- Update vLLM version support to 0.15.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5251
- Sync entire prompt/completion token tensors before indexing by @shawnghu in https://github.com/huggingface/trl/pull/5218
- Update vLLM version support to 0.16.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5252
- Update vLLM version support to 0.17.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5253
- [GRPO/RLOO] Tokenize before vLLM generation call by @qgallouedec in https://github.com/huggingface/trl/pull/5238
- Refactor CLI [14/N] : Remove TrainingArguments import from core trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5257
- Support JSON string parsing of teacher_model_init_kwargs in MiniLLMConfig by @albertvillanova in https://github.com/huggingface/trl/pull/5259
- Fix typo in docstring for teacher_model_init_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/5260
- Remove extra_fields dead code [1/N]: Remove extra_fields handling from VLLMGeneration.generate by @albertvillanova in https://github.com/huggingface/trl/pull/5262
- [GRPO/RLOO] Unify tokenization across all generation backends in _generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5239
- Remove extra_fields dead code [2/N]: Remove extra_fields from VLLMGeneration.generate return value by @albertvillanova in https://github.com/huggingface/trl/pull/5263
- Remove extra_fields dead code [3/N]: Remove extra_fields from GRPOTrainer._generate_single_turn return value by @albertvillanova in https://github.com/huggingface/trl/pull/5264
- fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in https://github.com/huggingface/trl/pull/5245
- [GRPO/RLOO] Extract tokenize prompts from _generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5240
- [CPO/ORPO] Fix handling of different length chosen/rejected prompts. by @davmels in https://github.com/huggingface/trl/pull/4639
- Fix type for teacher_model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5258
- Align GOLDConfig docstrings for optional params with None default by @albertvillanova in https://github.com/huggingface/trl/pull/5261
- Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5266
- Update TRL banner to support light/dark mode by @qgallouedec in https://github.com/huggingface/trl/pull/5270
- Fix error message in OnlineDPO by @qgallouedec in https://github.com/huggingface/trl/pull/5237
- Fix title consistency from "Transformer Reinforcement Learning" to "Transformers Reinforcement Learning" by @qgallouedec in https://github.com/huggingface/trl/pull/5183
- Nemotron 3 examples added by @sergiopaniego in https://github.com/huggingface/trl/pull/5272
- Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5279
- Simplify get_train_dataloader in GRPO and RLOO by @albertvillanova in https://github.com/huggingface/trl/pull/5276
- Raise ValueError for None train_dataset in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5275
- 35% faster packing + rename bfd-requeue to bfd_split by @mariosasko in https://github.com/huggingface/trl/pull/5189
- Change default vllm_mode to "colocate" and add v0→v1 migration guide by @qgallouedec in https://github.com/huggingface/trl/pull/5255
- Allow nullable logprobs in vLLM serve responses by @LeonEricsson in https://github.com/huggingface/trl/pull/5203
- feat(grpo_trainer.py): Variational Sequence-Level Soft Policy Optimization (VESPO) by @casinca in https://github.com/huggingface/trl/pull/5199
- Simplify structured outputs logic across vLLM versions in scripts/vllm_serve by @albertvillanova in https://github.com/huggingface/trl/pull/5273
- Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5274
- Fix accuracy_reward crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281
- Remove TrainingArguments import from experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5290
- Remove custom get_train/eval_dataloader from OnlineDPO by @albertvillanova in https://github.com/huggingface/trl/pull/5291
- [GKD] Buffer Implementation for Distillation Trainer by @cmpatino in https://github.com/huggingface/trl/pull/5137
- Support max_length in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284
- Prevent corruption of DPO VLM training if "keep_end" truncation_mode by @albertvillanova in https://github.com/huggingface/trl/pull/5286
- Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in https://github.com/huggingface/trl/pull/5295
- Apply docstyle by @qgallouedec in https://github.com/huggingface/trl/pull/5296
- Add guidance to avoid hasattr and getattr with defaults in AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5294
- Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in https://github.com/huggingface/trl/pull/5305
- Update RewardFunc type annotation to allow None values in reward list by @qgallouedec in https://github.com/huggingface/trl/pull/5297
- Suggest the Json() type for tool calling dataset format by @lhoestq in https://github.com/huggingface/trl/pull/5307
- Allow reward functions to log extra columns and scalar metrics by @manueldeprada in https://github.com/huggingface/trl/pull/5233
- Fix GRPOTrainer attribute access for vLLM model config by @falcondai in https://github.com/huggingface/trl/pull/5302
- Support truncation_mode in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306
- 🔌 Asynchronous GRPO by @qgallouedec in https://github.com/huggingface/trl/pull/5293
- Fix datasets version supporting Json dtype in docs about tool calling dataset format by @albertvillanova in https://github.com/huggingface/trl/pull/5310
- Align docs about tool calling in trainers with dataset format by @albertvillanova in https://github.com/huggingface/trl/pull/5311
- [GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs by @qgallouedec in https://github.com/huggingface/trl/pull/5242
- feat(experimental): Divergence Proximal Policy Optimization by @LeonEricsson in https://github.com/huggingface/trl/pull/5117
- Clean up model update group on worker exit by @AmineDiro in https://github.com/huggingface/trl/pull/5325
- Fix style in DPPO docstrings by @albertvillanova in https://github.com/huggingface/trl/pull/5326
New Contributors
- @czkkkkkk made their first contribution in https://github.com/huggingface/trl/pull/5180
- @michaelroyzen made their first contribution in https://github.com/huggingface/trl/pull/5190
- @thesteve0 made their first contribution in https://github.com/huggingface/trl/pull/5229
- @s-zx made their first contribution in https://github.com/huggingface/trl/pull/5246
- @shawnghu made their first contribution in https://github.com/huggingface/trl/pull/5218
- @davmels made their first contribution in https://github.com/huggingface/trl/pull/4639
- @manueldeprada made their first contribution in https://github.com/huggingface/trl/pull/5233
- @falcondai made their first contribution in https://github.com/huggingface/trl/pull/5302
- @AmineDiro made their first contribution in https://github.com/huggingface/trl/pull/5325
Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0rc1
