
v1.0.0

TRL · March 31, 2026
<img width="1800" height="1013" alt="thumbnail-2" src="https://github.com/user-attachments/assets/5c55b86a-0600-4f70-bf37-41ab240af851" />

Read our blog post for an overview of TRL v1.

Features

Asynchronous GRPO

Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5293

Variational Sequence-Level Soft Policy Optimization (VESPO)

<img width="465" height="279" alt="Screenshot 2026-03-20 at 5 49 50 PM" src="https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa" />

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in https://github.com/huggingface/trl/pull/5199

Divergence Proximal Policy Optimization (DPPO)

<img width="3180" height="1187" alt="z_TXYw37xZqsQ21YiDkYL" src="https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b" /> <img width="1189" height="490" alt="SfgWotuuuRKPkg-0bxWv1" src="https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689" />

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
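As a rough illustration of the difference, here is a toy contrast between PPO's clipped surrogate and a divergence-penalized one. The function names, the `beta` coefficient, and the use of the k3 KL estimator are assumptions for the sketch, not TRL's actual DPPO implementation:

```python
import math

def ppo_clip_loss(ratio, advantage, eps=0.2):
    # Standard PPO: hard-clip the importance ratio into [1 - eps, 1 + eps]
    # and take the pessimistic (min) surrogate.
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return -min(ratio * advantage, clipped * advantage)

def divergence_penalty_loss(ratio, advantage, beta=0.05):
    # Divergence-constrained variant: no hard clip; instead, penalize an
    # estimate of the KL divergence. The k3 estimator (ratio - 1) - log(ratio)
    # is non-negative and vanishes when the policies agree (ratio == 1).
    kl = (ratio - 1.0) - math.log(ratio)
    return -(ratio * advantage) + beta * kl
```

The clipped loss zeroes the gradient outside the trust region, while the penalty version keeps a smooth gradient that grows with the divergence.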

by @LeonEricsson in https://github.com/huggingface/trl/pull/5117

Self-Distillation Policy Optimization (SDPO)

SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.

from datasets import load_dataset
from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

# Any prompt/answer dataset works; this one matches the GRPO example above.
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)

trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()

by @MengAiDev in https://github.com/huggingface/trl/pull/4935

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    # extract_answer is a user-supplied parser for the completion text
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards

<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" /> <img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />

by @manueldeprada in https://github.com/huggingface/trl/pull/5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
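A minimal sketch of what this enables: the tool schema below follows the OpenAI-style format that vLLM serves, while the import path and the `tools` parameter name are assumptions based on the PR description, so check the TRL docs for the exact signature.

```python
# OpenAI-style tool definition (the format vLLM's chat endpoint accepts).
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Return the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

# Hypothetical usage (parameter name assumed from the PR description):
# from trl.extras.vllm_client import VLLMClient
# client = VLLMClient()
# response = client.chat(
#     [{"role": "user", "content": "What's the weather in Paris?"}],
#     tools=tools,
# )
```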

by @kansalaman in https://github.com/huggingface/trl/pull/4889

35% faster packing

Best-Fit Decreasing (BFD) packing is now 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.

<img width="1784" height="732" alt="benchmark_results" src="https://github.com/user-attachments/assets/8f0a35ad-cf1a-4fe1-a1f4-9b102637bdca" />
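Assuming the renamed strategy is selected through `SFTConfig` as before (the option name below is taken from this release note; MIGRATION.md is the authoritative reference), usage would look like:

```python
from trl import SFTConfig

args = SFTConfig(
    packing=True,                  # pack several short examples per sequence
    packing_strategy="bfd_split",  # renamed from "bfd-requeue" in v1.0.0
)
```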

by @mariosasko in https://github.com/huggingface/trl/pull/5189

[GKD] Buffer implementation and vLLM inference for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.
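The buffered-rollout idea can be sketched generically. This is not TRL's implementation, just the underlying pattern with hypothetical names: a bounded queue lets generation run ahead of the gradient loop while capping how stale the buffered data can get.

```python
from collections import deque

class RolloutBuffer:
    """Toy bounded buffer decoupling rollout generation from training."""

    def __init__(self, maxlen=8):
        # A bounded deque evicts the oldest rollout when full,
        # limiting staleness of the data the trainer consumes.
        self.queue = deque(maxlen=maxlen)

    def put(self, rollout):
        # Called by the generation side (e.g. a vLLM worker).
        self.queue.append(rollout)

    def get_batch(self, n):
        # Called by the training side to fetch up to n rollouts.
        return [self.queue[i] for i in range(min(n, len(self.queue)))]
```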

by @cmpatino in https://github.com/huggingface/trl/pull/5137 and https://github.com/huggingface/trl/pull/5388

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in https://github.com/huggingface/trl/pull/5255

Other

Fixes

Documentation and Examples

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0
