v0.26.0

Features

🕵️‍♂️ GRPO: Agent training

GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.

from datasets import Dataset
from trl import GRPOTrainer

def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


dataset = Dataset.from_list(
    [
        {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
        {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
        {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
        {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
        {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
        {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
    ]
)

def accuracy(completions, answer, **kwargs):
    predictions = [completion[-1]["content"] for completion in completions]
    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    tools=[multiply],
    reward_funcs=accuracy,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/4300

ScaleRL: Add CISPO Loss

CISPO Loss was first introduced in the Minimax-M1 paper, the ScaleRL paper subsequently showed that CISPO loss scales the best in terms of performance and efficiency as models are trained for longer.

GRPOTrainer now supports the CISPO loss using loss_type="cispo" in the GRPOConfig.

by @pramodith in https://github.com/huggingface/trl/pull/4495

Add vLLM quantization option for colocate

When the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.

by @sergiopaniego in https://github.com/huggingface/trl/pull/4496

Reasoning reward

TRL nows includes a reasoning reward function

from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
        }
    ],
]
reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0]

As any other reward function, it can be used in GRPOTrainer or RLOOTrainer.

from trl import GRPOTrainer
from trl.rewards import reasoning_accuracy_reward

trainer = GRPOTrainer(
    ...,
    reward_funcs=reasoning_accuracy_reward,
)

by @lewtun in https://github.com/huggingface/trl/pull/4563

Add `shuffle_dataset` option to `SFTTrainer`

You can now shuffle the dataset in SFTTrainer by setting the shuffle_dataset argument to True in SFTConfig. This is useful when the dataset features high similarity between consecutive samples.

from trl import SFTTrainer, SFTConfig

SFTConfig(shuffle_dataset=True)

by @qgallouedec in https://github.com/huggingface/trl/pull/4564

Add SAPO Loss in GRPO

Soft Adaptive Policy Optimization (SAPO), replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.

You can now use SAPO loss in GRPOTrainer by setting loss_type="sapo" in the GRPOConfig.

by @pramodith in https://github.com/huggingface/trl/pull/4600

Other Features

Support completion bootstrap for VLM in GRPO/RLOO by @SolarWindRider in https://github.com/huggingface/trl/pull/4452
Add support for images inside tables with Trackio completions logging by @taha-yassine in https://github.com/huggingface/trl/pull/4505
Add step time metric to GRPO Trainer for performance tracking by @qgallouedec in https://github.com/huggingface/trl/pull/4516
Add target_parameters to LoraConfig by @jonnyli1125 in https://github.com/huggingface/trl/pull/4536
[SFT] Log mean token accuracy from Liger kernel by @kashif in https://github.com/huggingface/trl/pull/4302
Add num_generations_eval parameter for efficient evaluation by @mingxuetian in https://github.com/huggingface/trl/pull/4458
[GRPO] Sequence-level TIS & MIS by @LeonEricsson in https://github.com/huggingface/trl/pull/4530
TRL supports vLLM 0.11 by @qgallouedec in https://github.com/huggingface/trl/pull/4633
feat: implement DeepSeek unbiased KL estimator for GRPO by @jlcanta in https://github.com/huggingface/trl/pull/4638

Experimental

Move XPOTrainer to trl.experimental.xpo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4485
Move judges to experimental submodule by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4439
Add MiniLLM Trainer by @t1101675 in https://github.com/huggingface/trl/pull/4504
refactor: Move CPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4470
Move GKDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4474
Move NashMDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4477
Move PPOTrainer to trl.experimental.ppo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4482
[ORPO] Move ORPOTrainer to experimental by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4480
Move PRMTrainer to trl.experimental.prm by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4483
Move OnlineDPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4473
Move WinRateCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4558
Move tests for GSPOTokenTrainer to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4572
Raise FutureWarning for classes moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4605
Move MergeModelCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4608
Raise FutureWarning for trainer moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4620
Remove no longer applicable warning once BCO was moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4628
Refactor suppression of warning at experimental import by @albertvillanova in https://github.com/huggingface/trl/pull/4629
🚚 Move KTO to trl.experimental by @neha222222 in https://github.com/huggingface/trl/pull/4575

Fixes

Buffer samples based on group level stds. by @pramodith in https://github.com/huggingface/trl/pull/4492
Fix bugs in CISPO conditions by @pramodith in https://github.com/huggingface/trl/pull/4499
device_map and dtype to "auto" by default by @qgallouedec in https://github.com/huggingface/trl/pull/4509
MiniLLM: Fix arguments in config & add to documentation index by @t1101675 in https://github.com/huggingface/trl/pull/4518
[Bug Fix] OnlineDPOTrainer with vLLM Server Mode by @YangKai0616 in https://github.com/huggingface/trl/pull/4500
Rename flash-attn to flash-attn2 by @qgallouedec in https://github.com/huggingface/trl/pull/4514
fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type by @fabio-sim in https://github.com/huggingface/trl/pull/4526
Fix bug with VLM processors in prompt-completion completion text-only training by @kschwethelm in https://github.com/huggingface/trl/pull/4553
fix+docs: device_map=None for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in https://github.com/huggingface/trl/pull/4551
Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process by @qgallouedec in https://github.com/huggingface/trl/pull/4571
fix: use shift_labels for metrics when using CP or SP by @jue-jue-zi in https://github.com/huggingface/trl/pull/4579
Fix 'generation_config' AttributeError by @albertvillanova in https://github.com/huggingface/trl/pull/4596
Fix FSDP2 model key miss match when sync LoRA model to vLLM server by @Xiao-Chenguang in https://github.com/huggingface/trl/pull/4603
Fix KTOTrainer CUDA error for large-vocab models via tensor indexing by @bhuvanprakash in https://github.com/huggingface/trl/pull/4635

Documentation and Examples

docs: Add PEFT subsection to reducing memory usage guide by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4430
[DOCS] update and fix openenv by @burtenshaw in https://github.com/huggingface/trl/pull/4490
Fix link to OpenEnv docs by @lukehinds in https://github.com/huggingface/trl/pull/4502
Tweak description for vLLM sleep mode by @lewtun in https://github.com/huggingface/trl/pull/4506
Paper Index: Change num_completions to num_generations by @pramodith in https://github.com/huggingface/trl/pull/4515
docs: Extend CLI basic usage examples to all supported CLIs by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4425
[OpenEnv] add vllm colocate mode to openenv scripts by @kashif in https://github.com/huggingface/trl/pull/4510
[Doc] Drop dummy reward and dataset for DeepMath-103K and accuracy reward by @qgallouedec in https://github.com/huggingface/trl/pull/4524
Add OpenEnv Script examples to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4533
Update OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4547
[OpenEnv] browsergym example script by @kashif in https://github.com/huggingface/trl/pull/4539
Update OpenEnv guide with latest details by @sergiopaniego in https://github.com/huggingface/trl/pull/4552
Add GRPO Wordle OpenEnv Colab by @sergiopaniego in https://github.com/huggingface/trl/pull/4542
Update OpenEnv guide with new notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4555
docs: add KTO (2402.01306) to Paper Index + link ref to KTOTrainer by @SSusantAchary in https://github.com/huggingface/trl/pull/4440
Add LFM2 to SFT notebook examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4455
docs: Rewrite PEFT integration guide with comprehensive examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4421
Reorder documentation TOC to surface key trainer sections by @qgallouedec in https://github.com/huggingface/trl/pull/4565
Fix typo in GRPO description in README by @iliasmerigh in https://github.com/huggingface/trl/pull/4573
Fix Replay Buffer docs. by @pramodith in https://github.com/huggingface/trl/pull/4574
Fix PPO example by @qgallouedec in https://github.com/huggingface/trl/pull/4556
docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index by @xuanduy04 in https://github.com/huggingface/trl/pull/4580
docs: Expand training customization examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4427
docs: Expand speeding up training guide with acceleration methods by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4428
Update How-to guides by @qgallouedec in https://github.com/huggingface/trl/pull/4604
Fixed OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4610
Add ministral 3 free notebooks by @sergiopaniego in https://github.com/huggingface/trl/pull/4614
Replace arXiv paper links with HF links by @albertvillanova in https://github.com/huggingface/trl/pull/4613
Add experimental imports to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4616
Fix README style by @sergiopaniego in https://github.com/huggingface/trl/pull/4619
Fix link to OpenEnv blog in docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4625
Update ministral notebooks with official bf16 ckpt by @sergiopaniego in https://github.com/huggingface/trl/pull/4626
Add missing experimental autodoc classes to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4618
Add logos as assets by @qgallouedec in https://github.com/huggingface/trl/pull/4627
fix(PPO examples): passing model dict to models by @casinca in https://github.com/huggingface/trl/pull/4630
[ALST/Ulysses] Added ALST/Ulysses documentation by @kashif in https://github.com/huggingface/trl/pull/4420
Adding EssentialAI/rnj-1-instruct GRPO example by @sergiopaniego in https://github.com/huggingface/trl/pull/4640
Update rnj_1_instruct notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4646
Add agent training notebook to examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4645

Deprecations

Replace wandb_log_unique_prompts with log_unique_prompts by @taha-yassine in https://github.com/huggingface/trl/pull/4508
Remove deprecations for 0.26 release by @albertvillanova in https://github.com/huggingface/trl/pull/4607
Remove deprecated batched formatting in GOLDTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4622

Miscellaneous

⛴️ Add kernels to Docker images by @ishitab02 in https://github.com/huggingface/trl/pull/4445
Replace accelerate logging with stdlib in CLI by @lewtun in https://github.com/huggingface/trl/pull/4512
Replace flash attention2 with kernels-community/flash-attn2 by @tamoghnokandar in https://github.com/huggingface/trl/pull/4426
Fix Docker images for Liger by @lewtun in https://github.com/huggingface/trl/pull/4522
Remove test trainer args by @qgallouedec in https://github.com/huggingface/trl/pull/4517
Prevent upcasting norm layers in prepare_model_for_kbit_training by @sergiopaniego in https://github.com/huggingface/trl/pull/4457
Remove module-level imports of extra deps in experimental.judges by @albertvillanova in https://github.com/huggingface/trl/pull/4598
Clean up model preparation by @qgallouedec in https://github.com/huggingface/trl/pull/4577
Remove deprecation warning from RLOOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4644
Disable gradient checkpointing during no-grad inference to avoid PyTorch warning by @qgallouedec in https://github.com/huggingface/trl/pull/4636

What's Changed

⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/4479
Add LFM2 to SFT notebook examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4455
Add tiny model Qwen3VLForConditionalGeneration to CI by @albertvillanova in https://github.com/huggingface/trl/pull/4494
Buffer samples based on group level stds. by @pramodith in https://github.com/huggingface/trl/pull/4492
Move XPOTrainer to trl.experimental.xpo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4485
⛴️ Add kernels to Docker images by @ishitab02 in https://github.com/huggingface/trl/pull/4445
ScaleRL: Add CISPO Loss by @pramodith in https://github.com/huggingface/trl/pull/4495
Support completion bootstrap for VLM in GRPO/RLOO by @SolarWindRider in https://github.com/huggingface/trl/pull/4452
docs: Add PEFT subsection to reducing memory usage guide by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4430
Fix bugs in CISPO conditions by @pramodith in https://github.com/huggingface/trl/pull/4499
Move judges to experimental submodule by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4439
[DOCS] update and fix openenv by @burtenshaw in https://github.com/huggingface/trl/pull/4490
Consistency regarding relative imports by @qgallouedec in https://github.com/huggingface/trl/pull/4498
Fix link to OpenEnv docs by @lukehinds in https://github.com/huggingface/trl/pull/4502
Tweak description for vLLM sleep mode by @lewtun in https://github.com/huggingface/trl/pull/4506
Add support for images inside tables with Trackio completions logging by @taha-yassine in https://github.com/huggingface/trl/pull/4505
Add MiniLLM Trainer by @t1101675 in https://github.com/huggingface/trl/pull/4504
Replace accelerate logging with stdlib in CLI by @lewtun in https://github.com/huggingface/trl/pull/4512
Add temporary workaround for lr_scheduler_kwargs dtype issue in Transformers 4.57.0 by @qgallouedec in https://github.com/huggingface/trl/pull/4513
device_map and dtype to "auto" by default by @qgallouedec in https://github.com/huggingface/trl/pull/4509
Replace wandb_log_unique_prompts with log_unique_prompts by @taha-yassine in https://github.com/huggingface/trl/pull/4508
refactor: Move CPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4470
MiniLLM: Fix arguments in config & add to documentation index by @t1101675 in https://github.com/huggingface/trl/pull/4518
Replace flash attention2 with kernels-community/flash-attn2 by @tamoghnokandar in https://github.com/huggingface/trl/pull/4426
Move GKDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4474
Paper Index: Change num_completions to num_generations by @pramodith in https://github.com/huggingface/trl/pull/4515
Fix Docker images for Liger by @lewtun in https://github.com/huggingface/trl/pull/4522
[Bug Fix] OnlineDPOTrainer with vLLM Server Mode by @YangKai0616 in https://github.com/huggingface/trl/pull/4500
Move NashMDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4477
Move PPOTrainer to trl.experimental.ppo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4482
Add step time metric to GRPO Trainer for performance tracking by @qgallouedec in https://github.com/huggingface/trl/pull/4516
Rename flash-attn to flash-attn2 by @qgallouedec in https://github.com/huggingface/trl/pull/4514
Remove test trainer args by @qgallouedec in https://github.com/huggingface/trl/pull/4517
docs: Extend CLI basic usage examples to all supported CLIs by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4425
Prevent upcasting norm layers in prepare_model_for_kbit_training by @sergiopaniego in https://github.com/huggingface/trl/pull/4457
Add vLLM quantization option for colocate by @sergiopaniego in https://github.com/huggingface/trl/pull/4496
fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type by @fabio-sim in https://github.com/huggingface/trl/pull/4526
[OpenEnv] add vllm colocate mode to openenv scripts by @kashif in https://github.com/huggingface/trl/pull/4510
[Doc] Drop dummy reward and dataset for DeepMath-103K and accuracy reward by @qgallouedec in https://github.com/huggingface/trl/pull/4524
Add OpenEnv Script examples to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4533
Update OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4547
[OpenEnv] browsergym example script by @kashif in https://github.com/huggingface/trl/pull/4539
Update OpenEnv guide with latest details by @sergiopaniego in https://github.com/huggingface/trl/pull/4552
Fix bug with VLM processors in prompt-completion completion text-only training by @kschwethelm in https://github.com/huggingface/trl/pull/4553
Add target_parameters to LoraConfig by @jonnyli1125 in https://github.com/huggingface/trl/pull/4536
fix+docs: device_map=None for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in https://github.com/huggingface/trl/pull/4551
[ORPO] Move ORPOTrainer to experimental by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4480
Add GRPO Wordle OpenEnv Colab by @sergiopaniego in https://github.com/huggingface/trl/pull/4542
Update OpenEnv guide with new notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4555
Move PRMTrainer to trl.experimental.prm by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4483
docs: add KTO (2402.01306) to Paper Index + link ref to KTOTrainer by @SSusantAchary in https://github.com/huggingface/trl/pull/4440
[SFT] Log mean token accuracy from Liger kernel by @kashif in https://github.com/huggingface/trl/pull/4302
Move OnlineDPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4473
Add num_generations_eval parameter for efficient evaluation by @mingxuetian in https://github.com/huggingface/trl/pull/4458
docs: Rewrite PEFT integration guide with comprehensive examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4421
Reorder documentation TOC to surface key trainer sections by @qgallouedec in https://github.com/huggingface/trl/pull/4565
Reasoning reward by @lewtun in https://github.com/huggingface/trl/pull/4563
Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process by @qgallouedec in https://github.com/huggingface/trl/pull/4571
Fix typo in GRPO description in README by @iliasmerigh in https://github.com/huggingface/trl/pull/4573
Add shuffle_dataset option to SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4564
Fix Replay Buffer docs. by @pramodith in https://github.com/huggingface/trl/pull/4574
Fix PPO example by @qgallouedec in https://github.com/huggingface/trl/pull/4556
Move WinRateCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4558
Move tests for GSPOTokenTrainer to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4572
Revert hotfix Fall back to config.text_config._name_or_path by @albertvillanova in https://github.com/huggingface/trl/pull/4581
fix: use shift_labels for metrics when using CP or SP by @jue-jue-zi in https://github.com/huggingface/trl/pull/4579
Add missing require_bitsandbytes marker to CI tests by @albertvillanova in https://github.com/huggingface/trl/pull/4586
Remove module-level imports of extra deps in experimental.judges by @albertvillanova in https://github.com/huggingface/trl/pull/4598
Fix 'generation_config' AttributeError by @albertvillanova in https://github.com/huggingface/trl/pull/4596
Revert "Hotfix CI with dev dependencies: xfail test_prepare_inputs_for_generation" by @albertvillanova in https://github.com/huggingface/trl/pull/4587
docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index by @xuanduy04 in https://github.com/huggingface/trl/pull/4580
[GRPO] Sequence-level TIS & MIS by @LeonEricsson in https://github.com/huggingface/trl/pull/4530
docs: Expand training customization examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4427
docs: Expand speeding up training guide with acceleration methods by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4428
Raise FutureWarning for classes moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4605
Update How-to guides by @qgallouedec in https://github.com/huggingface/trl/pull/4604
Silence experimental warnings when imported in the stable by @qgallouedec in https://github.com/huggingface/trl/pull/4606
Remove deprecations for 0.26 release by @albertvillanova in https://github.com/huggingface/trl/pull/4607
Fixed OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4610
Move MergeModelCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4608
[GRPOTrainer]: Add SAPO Loss by @pramodith in https://github.com/huggingface/trl/pull/4600
Add ministral 3 free notebooks by @sergiopaniego in https://github.com/huggingface/trl/pull/4614
Replace arXiv paper links with HF links by @albertvillanova in https://github.com/huggingface/trl/pull/4613
Add experimental imports to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4616
Fix README style by @sergiopaniego in https://github.com/huggingface/trl/pull/4619
Fix link to OpenEnv blog in docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4625
Update ministral notebooks with official bf16 ckpt by @sergiopaniego in https://github.com/huggingface/trl/pull/4626
Remove deprecated batched formatting in GOLDTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4622
Clean up model preparation by @qgallouedec in https://github.com/huggingface/trl/pull/4577
Silence experimental warning during docs build by @albertvillanova in https://github.com/huggingface/trl/pull/4623
Raise warnings at 2nd stack level by @albertvillanova in https://github.com/huggingface/trl/pull/4621
Raise FutureWarning for trainer moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4620
Add missing experimental autodoc classes to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4618
Add logos as assets by @qgallouedec in https://github.com/huggingface/trl/pull/4627
Remove no longer applicable warning once BCO was moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4628
Refactor suppression of warning at experimental import by @albertvillanova in https://github.com/huggingface/trl/pull/4629
fix(PPO examples): passing model dict to models by @casinca in https://github.com/huggingface/trl/pull/4630
Fix FSDP2 model key miss match when sync LoRA model to vLLM server by @Xiao-Chenguang in https://github.com/huggingface/trl/pull/4603
TRL supports vLLM 0.11 by @qgallouedec in https://github.com/huggingface/trl/pull/4633
[ALST/Ulysses] Added ALST/Ulysses documentation by @kashif in https://github.com/huggingface/trl/pull/4420
Adding EssentialAI/rnj-1-instruct GRPO example by @sergiopaniego in https://github.com/huggingface/trl/pull/4640
🚚 Move KTO to trl.experimental by @neha222222 in https://github.com/huggingface/trl/pull/4575
🕵️‍♂️ GRPO: Agent training by @qgallouedec in https://github.com/huggingface/trl/pull/4300
feat: implement DeepSeek unbiased KL estimator for GRPO by @jlcanta in https://github.com/huggingface/trl/pull/4638
Update rnj_1_instruct notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4646
Remove deprecation warning from RLOOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4644
Add agent training notebook to examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4645
Fix KTOTrainer CUDA error for large-vocab models via tensor indexing by @bhuvanprakash in https://github.com/huggingface/trl/pull/4635
Disable gradient checkpointing during no-grad inference to avoid PyTorch warning by @qgallouedec in https://github.com/huggingface/trl/pull/4636
Release: v0.26 by @qgallouedec in https://github.com/huggingface/trl/pull/4649

New Contributors

@lukehinds made their first contribution in https://github.com/huggingface/trl/pull/4502
@t1101675 made their first contribution in https://github.com/huggingface/trl/pull/4504
@tamoghnokandar made their first contribution in https://github.com/huggingface/trl/pull/4426
@fabio-sim made their first contribution in https://github.com/huggingface/trl/pull/4526
@kschwethelm made their first contribution in https://github.com/huggingface/trl/pull/4553
@jonnyli1125 made their first contribution in https://github.com/huggingface/trl/pull/4536
@JenWei0312 made their first contribution in https://github.com/huggingface/trl/pull/4551
@SSusantAchary made their first contribution in https://github.com/huggingface/trl/pull/4440
@mingxuetian made their first contribution in https://github.com/huggingface/trl/pull/4458
@iliasmerigh made their first contribution in https://github.com/huggingface/trl/pull/4573
@xuanduy04 made their first contribution in https://github.com/huggingface/trl/pull/4580
@casinca made their first contribution in https://github.com/huggingface/trl/pull/4630
@Xiao-Chenguang made their first contribution in https://github.com/huggingface/trl/pull/4603
@neha222222 made their first contribution in https://github.com/huggingface/trl/pull/4575
@jlcanta made their first contribution in https://github.com/huggingface/trl/pull/4638
@bhuvanprakash made their first contribution in https://github.com/huggingface/trl/pull/4635

Full Changelog: https://github.com/huggingface/trl/compare/v0.25.0...v0.26.0

Features

🕵️‍♂️ GRPO: Agent training

ScaleRL: Add CISPO Loss

Add vLLM quantization option for colocate

Reasoning reward

Add `shuffle_dataset` option to `SFTTrainer`

Add SAPO Loss in GRPO

Other Features

Experimental

Fixes

Documentation and Examples

Deprecations

Miscellaneous

What's Changed

New Contributors

More from Hugging Face

More from Hugging Face

v0.26.0

Features

🕵️‍♂️ GRPO: Agent training

ScaleRL: Add CISPO Loss

Add vLLM quantization option for colocate

Reasoning reward

Add shuffle_dataset option to SFTTrainer

Add SAPO Loss in GRPO

Other Features

Experimental

Fixes

Documentation and Examples

Deprecations

Miscellaneous

What's Changed

New Contributors

More from Hugging Face

More from Hugging Face

Add `shuffle_dataset` option to `SFTTrainer`