VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:
```python
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
```
by @casinca in https://github.com/huggingface/trl/pull/5199
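The contrast with hard clipping can be pictured with a toy comparison. The Gamma-shaped weight below is an illustrative stand-in (the `alpha`/`beta` kernel is invented for this sketch, not TRL's actual `vespo` implementation): it passes a weight of 1 through unchanged, but decays smoothly and asymmetrically for large importance weights instead of flattening them at a hard cutoff.

```python
import math

def clip_weight(w: float, eps: float = 0.2) -> float:
    """Hard PPO-style clipping: flat outside [1 - eps, 1 + eps]."""
    return max(1.0 - eps, min(1.0 + eps, w))

def gamma_weight(w: float, alpha: float = 1.0, beta: float = 1.0) -> float:
    """Toy Gamma-shaped reshaping (illustrative only): equals w near
    w = 1 but decays exponentially for large w, so extreme sequence-level
    importance weights are suppressed without a hard cutoff."""
    return (w ** alpha) * math.exp(beta * (1.0 - w))

for w in (0.5, 1.0, 2.0, 8.0):
    print(w, round(clip_weight(w), 3), round(gamma_weight(w), 3))
```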
DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
by @LeonEricsson in https://github.com/huggingface/trl/pull/5117
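To make the distinction concrete, here is a generic per-sample comparison between PPO's clipped surrogate and a divergence-penalized one. The penalized objective below is a common trust-region sketch using Schulman's nonnegative k3 KL estimator, not necessarily DPPO's exact loss:

```python
import math

def ppo_clip_loss(logp_new: float, logp_old: float, adv: float, eps: float = 0.2) -> float:
    """Standard PPO surrogate: hard-clip the probability ratio."""
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return -min(ratio * adv, clipped * adv)

def divergence_loss(logp_new: float, logp_old: float, adv: float, beta: float = 0.1) -> float:
    """Divergence-penalized surrogate (generic sketch): instead of
    clipping, a nonnegative KL penalty (the k3 estimator) pulls the
    new policy back toward the old one."""
    ratio = math.exp(logp_new - logp_old)
    kl = (ratio - 1.0) - math.log(ratio)  # >= 0, zero iff ratio == 1
    return -(ratio * adv) + beta * kl
```

When the policies coincide (`logp_new == logp_old`), the penalty vanishes and both objectives reduce to the plain policy-gradient surrogate; as the ratio drifts, the penalty grows smoothly rather than going flat the way the clipped objective does.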
Reward functions can now log extra values (scalars or per-sample columns) alongside the reward, via optional `log_extra` and `log_metric` callables passed into the function. This makes it easier to track intermediate signals without writing custom callbacks.
```python
def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)
    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards
```
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" />
<img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />
by @manueldeprada in https://github.com/huggingface/trl/pull/5233
`VLLMClient.chat()` now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
by @kansalaman in https://github.com/huggingface/trl/pull/4889
BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.
by @mariosasko in https://github.com/huggingface/trl/pull/5189
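For context, best-fit-decreasing (BFD) packing itself is easy to sketch. The function below is a generic illustration of the strategy (not TRL's optimized implementation): sort sequences by length, then place each into the bin with the least remaining room that still fits it.

```python
def bfd_pack(lengths, max_len):
    """Pack sequence lengths into bins of capacity max_len using
    best-fit-decreasing. Returns lists of original indices per bin."""
    bins = []  # each bin: [remaining_capacity, [sequence indices]]
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        # best fit: the bin with the smallest remaining room that still fits
        best = None
        for b in bins:
            if b[0] >= lengths[i] and (best is None or b[0] < best[0]):
                best = b
        if best is None:
            bins.append([max_len - lengths[i], [i]])
        else:
            best[0] -= lengths[i]
            best[1].append(i)
    return [b[1] for b in bins]

print(bfd_pack([7, 5, 4, 3, 1], max_len=8))
```

BFD gives near-optimal packing in practice while staying O(n²) worst case (and O(n log n) with a tree over bin capacities), which is why it is a popular strategy for batching variable-length sequences.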
The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.
by @cmpatino in https://github.com/huggingface/trl/pull/5137
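The decoupling can be pictured with a minimal buffer (a hypothetical class for illustration, not the trainer's real API): generation fills the buffer ahead of time, possibly in bulk or on another worker, so gradient steps consume pre-generated completions instead of blocking on the sampler every step.

```python
from collections import deque

class RolloutBuffer:
    """Toy buffered-rollout sketch: generation and consumption are
    decoupled through a FIFO queue of (prompt, completion) pairs."""

    def __init__(self, capacity: int):
        self.buf = deque(maxlen=capacity)

    def fill(self, generate_fn, prompts):
        # Run generation in bulk ahead of training steps.
        for p in prompts:
            self.buf.append((p, generate_fn(p)))

    def next_batch(self, n: int):
        # Training consumes pre-generated rollouts without waiting.
        return [self.buf.popleft() for _ in range(min(n, len(self.buf)))]

buf = RolloutBuffer(capacity=8)
buf.fill(lambda p: p.upper(), ["a", "b", "c"])  # stand-in for model generation
batch = buf.next_batch(2)
```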
A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.
by @qgallouedec in https://github.com/huggingface/trl/pull/5255
* vllm_mode to "colocate" and add v0→v1 migration guide by @qgallouedec in https://github.com/huggingface/trl/pull/5255
* truncation_mode in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306
* max_length in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284
* pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180
* accuracy_reward crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281
* RewardFunc type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246
* prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
* AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5236
* prompts in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225
* truncate_prompt_tokens for vLLM 0.17.0 by @winglian in https://github.com/huggingface/trl/pull/5248
* rollout_func from _generate_single_turn to _generate by @qgallouedec in https://github.com/huggingface/trl/pull/5232
* _generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5239
* _generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5240
* bfd-requeue to bfd_split by @mariosasko in https://github.com/huggingface/trl/pull/5189
* Variational Sequence-Level Soft Policy Optimization (VESPO) (grpo_trainer.py) by @casinca in https://github.com/huggingface/trl/pull/5199
* hasattr and getattr with defaults in AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5294
* RewardFunc type annotation to allow None values in reward list by @qgallouedec in https://github.com/huggingface/trl/pull/5297
* Json() type for tool calling dataset format by @lhoestq in https://github.com/huggingface/trl/pull/5307

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0rc1