v0.26.0
Features
🕵️‍♂️ GRPO: Agent training
GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.

    from datasets import Dataset
    from trl import GRPOTrainer


    def multiply(a: int, b: int) -> int:
        """
        Multiplies two integers.

        Args:
            a: The first integer.
            b: The second integer.

        Returns:
            The product of the two integers.
        """
        return a * b


    dataset = Dataset.from_list(
        [
            {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
            {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
            {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
            {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
            {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
            {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
        ]
    )


    def accuracy(completions, answer, **kwargs):
        predictions = [completion[-1]["content"] for completion in completions]
        rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
        return rewards


    trainer = GRPOTrainer(
        model="Qwen/Qwen3-0.6B",
        train_dataset=dataset,
        tools=[multiply],
        reward_funcs=accuracy,
    )
    trainer.train()

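The reward function above can be checked in isolation, without loading a model or running training. The sketch below repeats the same logic and applies it to two hand-written conversations; the completions are hypothetical and were not produced by a model.

```python
def accuracy(completions, answer, **kwargs):
    # Same logic as the reward function above: 1.0 if the expected answer
    # appears as a substring of the last message of the completion, else 0.0.
    predictions = [completion[-1]["content"] for completion in completions]
    return [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]


# Hand-written conversations (hypothetical, for illustration only).
completions = [
    [{"role": "assistant", "content": "3 multiplied by 4 is 12."}],
    [{"role": "assistant", "content": "7 times 8 is 54."}],
]
rewards = accuracy(completions, answer=[12, 56])
print(rewards)
```

Because the check is a plain substring match, it is lenient: an expected answer of 12 would also be rewarded for a completion that says "120". For anything beyond a demo, a stricter comparison is probably preferable.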
by @qgallouedec in https://github.com/huggingface/trl/pull/4300
ScaleRL: Add CISPO Loss
The CISPO loss was first introduced in the MiniMax-M1 paper. The ScaleRL paper subsequently showed that, among the losses it compared, CISPO scales best in terms of performance and efficiency as models are trained for longer.
GRPOTrainer now supports the CISPO loss using loss_type="cispo" in the GRPOConfig.
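For intuition, here is a minimal plain-Python sketch of a CISPO-style per-token loss, based on our reading of the MiniMax-M1 formulation rather than on TRL's source code. The function names and the eps_low / eps_high parameters are hypothetical. The idea, as we understand it, is that the importance-sampling weight is clipped and then treated as a constant, so every token keeps a gradient through its log-probability.

```python
import math


def clipped_is_weight(logp_new, logp_old, eps_low, eps_high):
    """Importance-sampling ratio clipped to [1 - eps_low, 1 + eps_high]."""
    ratio = math.exp(logp_new - logp_old)
    return max(1.0 - eps_low, min(ratio, 1.0 + eps_high))


def cispo_token_loss(logp_new, logp_old, advantage, eps_low, eps_high):
    """Per-token loss value: -(clipped weight) * advantage * log-prob."""
    # In an autograd framework the weight would be detached (stop-gradient),
    # so the gradient would flow only through logp_new. Plain Python has no
    # autograd, so this function only computes the loss value.
    weight = clipped_is_weight(logp_new, logp_old, eps_low, eps_high)
    return -weight * advantage * logp_new


# Hypothetical numbers: the token's probability went from 0.1 to 0.5.
logp_old, logp_new = math.log(0.1), math.log(0.5)
weight = clipped_is_weight(logp_new, logp_old, eps_low=1.0, eps_high=1.0)
loss = cispo_token_loss(logp_new, logp_old, advantage=1.0, eps_low=1.0, eps_high=1.0)
print(weight, loss)
```

Here the raw ratio is about 5, so the weight is clipped to the upper bound of 2.0. The clipping bounds used by the actual trainer may differ from this sketch.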
by @pramodith in https://github.com/huggingface/trl/pull/4495
Add vLLM quantization option for colocate
When the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.
by @sergiopaniego in https://github.com/huggingface/trl/pull/4496
Reasoning reward
TRL now includes a reasoning reward function:

    from trl.rewards import reasoning_accuracy_reward

    solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
    completions = [
        [
            {
                "role": "assistant",
                "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
            }
        ],
        [
            {
                "role": "assistant",
                "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
            }
        ],
        [
            {
                "role": "assistant",
                "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
            }
        ],
    ]
    reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0]

Like any other reward function, it can be used in GRPOTrainer or RLOOTrainer:

    from trl import GRPOTrainer
    from trl.rewards import reasoning_accuracy_reward

    trainer = GRPOTrainer(
        ...,
        reward_funcs=reasoning_accuracy_reward,
    )

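To make the scoring in the example above concrete, here is a toy stand-in that reproduces the same three rewards. It is not TRL's implementation: toy_reasoning_reward is a hypothetical name, and it assumes that only text after a closing </think> tag counts and that the answer must match the solution string exactly inside \boxed{...}. The real function presumably compares answers more robustly than a string match.

```python
def toy_reasoning_reward(completions, solutions):
    """Toy reward: 1.0 only if the text after </think> contains \\boxed{solution}."""
    rewards = []
    for completion, solution in zip(completions, solutions):
        content = completion[-1]["content"]
        # sep is "" when the reasoning block was never closed.
        _, sep, final = content.partition("</think>")
        boxed = "\\boxed{" + solution + "}"
        rewards.append(1.0 if sep and boxed in final else 0.0)
    return rewards


solutions = [r"\frac{1}{3}"] * 3
completions = [
    [{"role": "assistant", "content": r"<think> Reasoning </think> The final answer is \boxed{\frac{1}{3}}"}],
    [{"role": "assistant", "content": r"<think> Reasoning </think> The final answer is \boxed{\frac{1}{2}}"}],
    [{"role": "assistant", "content": r"<think> Partial answer \boxed{\frac{1}{3}} but no final answer"}],
]
rewards = toy_reasoning_reward(completions, solutions)
print(rewards)
```

The third completion scores 0.0 even though it contains the correct boxed value, because that value appears inside an unclosed reasoning block rather than in a final answer.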
by @lewtun in https://github.com/huggingface/trl/pull/4563
Add shuffle_dataset option to SFTTrainer
You can now shuffle the dataset in SFTTrainer by setting the shuffle_dataset argument to True in SFTConfig. This is useful when the dataset features high similarity between consecutive samples.

    from trl import SFTTrainer, SFTConfig

    SFTConfig(shuffle_dataset=True)

by @qgallouedec in https://github.com/huggingface/trl/pull/4564
Add SAPO Loss in GRPO
Soft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.
You can now use SAPO loss in GRPOTrainer by setting loss_type="sapo" in the GRPOConfig.
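The "smooth, temperature-controlled gate" can be illustrated with a small plain-Python sketch. The gate formula used here, 4/tau * sigmoid(tau * (ratio - 1)), reflects our understanding of the SAPO formulation and is an assumption, not something stated in these release notes; the function names and the value of tau are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def sapo_gate(ratio, tau):
    """Soft gate applied to the importance ratio (assumed form)."""
    return 4.0 / tau * sigmoid(tau * (ratio - 1.0))


def sapo_gate_slope(ratio, tau):
    """Derivative of the gate w.r.t. the ratio: the weight an update receives."""
    s = sigmoid(tau * (ratio - 1.0))
    return 4.0 * s * (1.0 - s)


tau = 1.0  # illustrative temperature, not a recommended setting
for ratio in (0.5, 1.0, 1.5, 3.0):
    print(ratio, round(sapo_gate_slope(ratio, tau), 3))
```

The slope equals 1 when the ratio is 1 (an on-policy token) and decays smoothly as the ratio moves away from 1 in either direction. Under hard clipping, by contrast, the weight drops to zero abruptly once the ratio leaves the clipping band.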
by @pramodith in https://github.com/huggingface/trl/pull/4600
Other Features
- Support completion bootstrap for VLM in GRPO/RLOO by @SolarWindRider in https://github.com/huggingface/trl/pull/4452
- Add support for images inside tables with Trackio completions logging by @taha-yassine in https://github.com/huggingface/trl/pull/4505
- Add step time metric to GRPO Trainer for performance tracking by @qgallouedec in https://github.com/huggingface/trl/pull/4516
- Add target_parameters to LoraConfig by @jonnyli1125 in https://github.com/huggingface/trl/pull/4536
- [SFT] Log mean token accuracy from Liger kernel by @kashif in https://github.com/huggingface/trl/pull/4302
- Add num_generations_eval parameter for efficient evaluation by @mingxuetian in https://github.com/huggingface/trl/pull/4458
- [GRPO] Sequence-level TIS & MIS by @LeonEricsson in https://github.com/huggingface/trl/pull/4530
- TRL supports vLLM 0.11 by @qgallouedec in https://github.com/huggingface/trl/pull/4633
- feat: implement DeepSeek unbiased KL estimator for GRPO by @jlcanta in https://github.com/huggingface/trl/pull/4638
Experimental
- Move XPOTrainer to trl.experimental.xpo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4485
- Move judges to experimental submodule by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4439
- Add MiniLLM Trainer by @t1101675 in https://github.com/huggingface/trl/pull/4504
- refactor: Move CPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4470
- Move GKDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4474
- Move NashMDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4477
- Move PPOTrainer to trl.experimental.ppo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4482
- [ORPO] Move ORPOTrainer to experimental by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4480
- Move PRMTrainer to trl.experimental.prm by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4483
- Move OnlineDPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4473
- Move WinRateCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4558
- Move tests for GSPOTokenTrainer to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4572
- Raise FutureWarning for classes moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4605
- Move MergeModelCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4608
- Raise FutureWarning for trainer moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4620
- Remove no longer applicable warning once BCO was moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4628
- Refactor suppression of warning at experimental import by @albertvillanova in https://github.com/huggingface/trl/pull/4629
- Move KTO to trl.experimental by @neha222222 in https://github.com/huggingface/trl/pull/4575
Fixes
- Buffer samples based on group level stds. by @pramodith in https://github.com/huggingface/trl/pull/4492
- Fix bugs in CISPO conditions by @pramodith in https://github.com/huggingface/trl/pull/4499
- device_map and dtype to "auto" by default by @qgallouedec in https://github.com/huggingface/trl/pull/4509
- MiniLLM: Fix arguments in config & add to documentation index by @t1101675 in https://github.com/huggingface/trl/pull/4518
- [Bug Fix] OnlineDPOTrainer with vLLM Server Mode by @YangKai0616 in https://github.com/huggingface/trl/pull/4500
- Rename flash-attn to flash-attn2 by @qgallouedec in https://github.com/huggingface/trl/pull/4514
- fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type by @fabio-sim in https://github.com/huggingface/trl/pull/4526
- Fix bug with VLM processors in prompt-completion completion text-only training by @kschwethelm in https://github.com/huggingface/trl/pull/4553
- fix+docs: device_map=None for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in https://github.com/huggingface/trl/pull/4551
- Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process by @qgallouedec in https://github.com/huggingface/trl/pull/4571
- fix: use shift_labels for metrics when using CP or SP by @jue-jue-zi in https://github.com/huggingface/trl/pull/4579
- Fix 'generation_config' AttributeError by @albertvillanova in https://github.com/huggingface/trl/pull/4596
- Fix FSDP2 model key miss match when sync LoRA model to vLLM server by @Xiao-Chenguang in https://github.com/huggingface/trl/pull/4603
- Fix KTOTrainer CUDA error for large-vocab models via tensor indexing by @bhuvanprakash in https://github.com/huggingface/trl/pull/4635
Documentation and Examples
- docs: Add PEFT subsection to reducing memory usage guide by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4430
- [DOCS] update and fix openenv by @burtenshaw in https://github.com/huggingface/trl/pull/4490
- Fix link to OpenEnv docs by @lukehinds in https://github.com/huggingface/trl/pull/4502
- Tweak description for vLLM sleep mode by @lewtun in https://github.com/huggingface/trl/pull/4506
- Paper Index: Change num_completions to num_generations by @pramodith in https://github.com/huggingface/trl/pull/4515
- docs: Extend CLI basic usage examples to all supported CLIs by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4425
- [OpenEnv] add vllm colocate mode to openenv scripts by @kashif in https://github.com/huggingface/trl/pull/4510
- [Doc] Drop dummy reward and dataset for DeepMath-103K and accuracy reward by @qgallouedec in https://github.com/huggingface/trl/pull/4524
- Add OpenEnv Script examples to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4533
- Update OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4547
- [OpenEnv] browsergym example script by @kashif in https://github.com/huggingface/trl/pull/4539
- Update OpenEnv guide with latest details by @sergiopaniego in https://github.com/huggingface/trl/pull/4552
- Add GRPO Wordle OpenEnv Colab by @sergiopaniego in https://github.com/huggingface/trl/pull/4542
- Update OpenEnv guide with new notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4555
- docs: add KTO (2402.01306) to Paper Index + link ref to KTOTrainer by @SSusantAchary in https://github.com/huggingface/trl/pull/4440
- Add LFM2 to SFT notebook examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4455
- docs: Rewrite PEFT integration guide with comprehensive examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4421
- Reorder documentation TOC to surface key trainer sections by @qgallouedec in https://github.com/huggingface/trl/pull/4565
- Fix typo in GRPO description in README by @iliasmerigh in https://github.com/huggingface/trl/pull/4573
- Fix Replay Buffer docs. by @pramodith in https://github.com/huggingface/trl/pull/4574
- Fix PPO example by @qgallouedec in https://github.com/huggingface/trl/pull/4556
- docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index by @xuanduy04 in https://github.com/huggingface/trl/pull/4580
- docs: Expand training customization examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4427
- docs: Expand speeding up training guide with acceleration methods by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4428
- Update How-to guides by @qgallouedec in https://github.com/huggingface/trl/pull/4604
- Fixed OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4610
- Add ministral 3 free notebooks by @sergiopaniego in https://github.com/huggingface/trl/pull/4614
- Replace arXiv paper links with HF links by @albertvillanova in https://github.com/huggingface/trl/pull/4613
- Add experimental imports to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4616
- Fix README style by @sergiopaniego in https://github.com/huggingface/trl/pull/4619
- Fix link to OpenEnv blog in docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4625
- Update ministral notebooks with official bf16 ckpt by @sergiopaniego in https://github.com/huggingface/trl/pull/4626
- Add missing experimental autodoc classes to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4618
- Add logos as assets by @qgallouedec in https://github.com/huggingface/trl/pull/4627
- fix(PPO examples): passing model dict to models by @casinca in https://github.com/huggingface/trl/pull/4630
- [ALST/Ulysses] Added ALST/Ulysses documentation by @kashif in https://github.com/huggingface/trl/pull/4420
- Adding EssentialAI/rnj-1-instruct GRPO example by @sergiopaniego in https://github.com/huggingface/trl/pull/4640
- Update rnj_1_instruct notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4646
- Add agent training notebook to examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4645
Deprecations
- Replace wandb_log_unique_prompts with log_unique_prompts by @taha-yassine in https://github.com/huggingface/trl/pull/4508
- Remove deprecations for 0.26 release by @albertvillanova in https://github.com/huggingface/trl/pull/4607
- Remove deprecated batched formatting in GOLDTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4622
Miscellaneous
- ⛴️ Add kernels to Docker images by @ishitab02 in https://github.com/huggingface/trl/pull/4445
- Replace accelerate logging with stdlib in CLI by @lewtun in https://github.com/huggingface/trl/pull/4512
- Replace flash attention2 with kernels-community/flash-attn2 by @tamoghnokandar in https://github.com/huggingface/trl/pull/4426
- Fix Docker images for Liger by @lewtun in https://github.com/huggingface/trl/pull/4522
- Remove test trainer args by @qgallouedec in https://github.com/huggingface/trl/pull/4517
- Prevent upcasting norm layers in prepare_model_for_kbit_training by @sergiopaniego in https://github.com/huggingface/trl/pull/4457
- Remove module-level imports of extra deps in experimental.judges by @albertvillanova in https://github.com/huggingface/trl/pull/4598
- Clean up model preparation by @qgallouedec in https://github.com/huggingface/trl/pull/4577
- Remove deprecation warning from RLOOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4644
- Disable gradient checkpointing during no-grad inference to avoid PyTorch warning by @qgallouedec in https://github.com/huggingface/trl/pull/4636
What's Changed
- ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/4479
- Add LFM2 to SFT notebook examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4455
- Add tiny model Qwen3VLForConditionalGeneration to CI by @albertvillanova in https://github.com/huggingface/trl/pull/4494
- Buffer samples based on group level stds. by @pramodith in https://github.com/huggingface/trl/pull/4492
- Move XPOTrainer to trl.experimental.xpo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4485
- ⛴️ Add kernels to Docker images by @ishitab02 in https://github.com/huggingface/trl/pull/4445
- ScaleRL: Add CISPO Loss by @pramodith in https://github.com/huggingface/trl/pull/4495
- Support completion bootstrap for VLM in GRPO/RLOO by @SolarWindRider in https://github.com/huggingface/trl/pull/4452
- docs: Add PEFT subsection to reducing memory usage guide by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4430
- Fix bugs in CISPO conditions by @pramodith in https://github.com/huggingface/trl/pull/4499
- Move judges to experimental submodule by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4439
- [DOCS] update and fix openenv by @burtenshaw in https://github.com/huggingface/trl/pull/4490
- Consistency regarding relative imports by @qgallouedec in https://github.com/huggingface/trl/pull/4498
- Fix link to OpenEnv docs by @lukehinds in https://github.com/huggingface/trl/pull/4502
- Tweak description for vLLM sleep mode by @lewtun in https://github.com/huggingface/trl/pull/4506
- Add support for images inside tables with Trackio completions logging by @taha-yassine in https://github.com/huggingface/trl/pull/4505
- Add MiniLLM Trainer by @t1101675 in https://github.com/huggingface/trl/pull/4504
- Replace accelerate logging with stdlib in CLI by @lewtun in https://github.com/huggingface/trl/pull/4512
- Add temporary workaround for lr_scheduler_kwargs dtype issue in Transformers 4.57.0 by @qgallouedec in https://github.com/huggingface/trl/pull/4513
- device_map and dtype to "auto" by default by @qgallouedec in https://github.com/huggingface/trl/pull/4509
- Replace wandb_log_unique_prompts with log_unique_prompts by @taha-yassine in https://github.com/huggingface/trl/pull/4508
- refactor: Move CPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4470
- MiniLLM: Fix arguments in config & add to documentation index by @t1101675 in https://github.com/huggingface/trl/pull/4518
- Replace flash attention2 with kernels-community/flash-attn2 by @tamoghnokandar in https://github.com/huggingface/trl/pull/4426
- Move GKDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4474
- Paper Index: Change num_completions to num_generations by @pramodith in https://github.com/huggingface/trl/pull/4515
- Fix Docker images for Liger by @lewtun in https://github.com/huggingface/trl/pull/4522
- [Bug Fix] OnlineDPOTrainer with vLLM Server Mode by @YangKai0616 in https://github.com/huggingface/trl/pull/4500
- Move NashMDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4477
- Move PPOTrainer to trl.experimental.ppo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4482
- Add step time metric to GRPO Trainer for performance tracking by @qgallouedec in https://github.com/huggingface/trl/pull/4516
- Rename flash-attn to flash-attn2 by @qgallouedec in https://github.com/huggingface/trl/pull/4514
- Remove test trainer args by @qgallouedec in https://github.com/huggingface/trl/pull/4517
- docs: Extend CLI basic usage examples to all supported CLIs by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4425
- Prevent upcasting norm layers in prepare_model_for_kbit_training by @sergiopaniego in https://github.com/huggingface/trl/pull/4457
- Add vLLM quantization option for colocate by @sergiopaniego in https://github.com/huggingface/trl/pull/4496
- fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type by @fabio-sim in https://github.com/huggingface/trl/pull/4526
- [OpenEnv] add vllm colocate mode to openenv scripts by @kashif in https://github.com/huggingface/trl/pull/4510
- [Doc] Drop dummy reward and dataset for DeepMath-103K and accuracy reward by @qgallouedec in https://github.com/huggingface/trl/pull/4524
- Add OpenEnv Script examples to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4533
- Update OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4547
- [OpenEnv] browsergym example script by @kashif in https://github.com/huggingface/trl/pull/4539
- Update OpenEnv guide with latest details by @sergiopaniego in https://github.com/huggingface/trl/pull/4552
- Fix bug with VLM processors in prompt-completion completion text-only training by @kschwethelm in https://github.com/huggingface/trl/pull/4553
- Add target_parameters to LoraConfig by @jonnyli1125 in https://github.com/huggingface/trl/pull/4536
- fix+docs: device_map=None for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in https://github.com/huggingface/trl/pull/4551
- [ORPO] Move ORPOTrainer to experimental by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4480
- Add GRPO Wordle OpenEnv Colab by @sergiopaniego in https://github.com/huggingface/trl/pull/4542
- Update OpenEnv guide with new notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4555
- Move PRMTrainer to trl.experimental.prm by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4483
- docs: add KTO (2402.01306) to Paper Index + link ref to KTOTrainer by @SSusantAchary in https://github.com/huggingface/trl/pull/4440
- [SFT] Log mean token accuracy from Liger kernel by @kashif in https://github.com/huggingface/trl/pull/4302
- Move OnlineDPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4473
- Add num_generations_eval parameter for efficient evaluation by @mingxuetian in https://github.com/huggingface/trl/pull/4458
- docs: Rewrite PEFT integration guide with comprehensive examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4421
- Reorder documentation TOC to surface key trainer sections by @qgallouedec in https://github.com/huggingface/trl/pull/4565
- Reasoning reward by @lewtun in https://github.com/huggingface/trl/pull/4563
- Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process by @qgallouedec in https://github.com/huggingface/trl/pull/4571
- Fix typo in GRPO description in README by @iliasmerigh in https://github.com/huggingface/trl/pull/4573
- Add shuffle_dataset option to SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4564
- Fix Replay Buffer docs. by @pramodith in https://github.com/huggingface/trl/pull/4574
- Fix PPO example by @qgallouedec in https://github.com/huggingface/trl/pull/4556
- Move WinRateCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4558
- Move tests for GSPOTokenTrainer to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4572
- Revert hotfix Fall back to config.text_config._name_or_path by @albertvillanova in https://github.com/huggingface/trl/pull/4581
- fix: use shift_labels for metrics when using CP or SP by @jue-jue-zi in https://github.com/huggingface/trl/pull/4579
- Add missing require_bitsandbytes marker to CI tests by @albertvillanova in https://github.com/huggingface/trl/pull/4586
- Remove module-level imports of extra deps in experimental.judges by @albertvillanova in https://github.com/huggingface/trl/pull/4598
- Fix 'generation_config' AttributeError by @albertvillanova in https://github.com/huggingface/trl/pull/4596
- Revert "Hotfix CI with dev dependencies: xfail test_prepare_inputs_for_generation" by @albertvillanova in https://github.com/huggingface/trl/pull/4587
- docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index by @xuanduy04 in https://github.com/huggingface/trl/pull/4580
- [GRPO] Sequence-level TIS & MIS by @LeonEricsson in https://github.com/huggingface/trl/pull/4530
- docs: Expand training customization examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4427
- docs: Expand speeding up training guide with acceleration methods by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4428
- Raise FutureWarning for classes moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4605
- Update How-to guides by @qgallouedec in https://github.com/huggingface/trl/pull/4604
- Silence experimental warnings when imported in the stable by @qgallouedec in https://github.com/huggingface/trl/pull/4606
- Remove deprecations for 0.26 release by @albertvillanova in https://github.com/huggingface/trl/pull/4607
- Fixed OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4610
- Move MergeModelCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4608
- [GRPOTrainer]: Add SAPO Loss by @pramodith in https://github.com/huggingface/trl/pull/4600
- Add ministral 3 free notebooks by @sergiopaniego in https://github.com/huggingface/trl/pull/4614
- Replace arXiv paper links with HF links by @albertvillanova in https://github.com/huggingface/trl/pull/4613
- Add experimental imports to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4616
- Fix README style by @sergiopaniego in https://github.com/huggingface/trl/pull/4619
- Fix link to OpenEnv blog in docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4625
- Update ministral notebooks with official bf16 ckpt by @sergiopaniego in https://github.com/huggingface/trl/pull/4626
- Remove deprecated batched formatting in GOLDTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4622
- Clean up model preparation by @qgallouedec in https://github.com/huggingface/trl/pull/4577
- Silence experimental warning during docs build by @albertvillanova in https://github.com/huggingface/trl/pull/4623
- Raise warnings at 2nd stack level by @albertvillanova in https://github.com/huggingface/trl/pull/4621
- Raise FutureWarning for trainer moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4620
- Add missing experimental autodoc classes to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4618
- Add logos as assets by @qgallouedec in https://github.com/huggingface/trl/pull/4627
- Remove no longer applicable warning once BCO was moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4628
- Refactor suppression of warning at experimental import by @albertvillanova in https://github.com/huggingface/trl/pull/4629
- fix(PPO examples): passing model dict to models by @casinca in https://github.com/huggingface/trl/pull/4630
- Fix FSDP2 model key miss match when sync LoRA model to vLLM server by @Xiao-Chenguang in https://github.com/huggingface/trl/pull/4603
- TRL supports vLLM 0.11 by @qgallouedec in https://github.com/huggingface/trl/pull/4633
- [ALST/Ulysses] Added ALST/Ulysses documentation by @kashif in https://github.com/huggingface/trl/pull/4420
- Adding EssentialAI/rnj-1-instruct GRPO example by @sergiopaniego in https://github.com/huggingface/trl/pull/4640
- Move KTO to trl.experimental by @neha222222 in https://github.com/huggingface/trl/pull/4575
- 🕵️‍♂️ GRPO: Agent training by @qgallouedec in https://github.com/huggingface/trl/pull/4300
- feat: implement DeepSeek unbiased KL estimator for GRPO by @jlcanta in https://github.com/huggingface/trl/pull/4638
- Update rnj_1_instruct notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4646
- Remove deprecation warning from RLOOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4644
- Add agent training notebook to examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4645
- Fix KTOTrainer CUDA error for large-vocab models via tensor indexing by @bhuvanprakash in https://github.com/huggingface/trl/pull/4635
- Disable gradient checkpointing during no-grad inference to avoid PyTorch warning by @qgallouedec in https://github.com/huggingface/trl/pull/4636
- Release: v0.26 by @qgallouedec in https://github.com/huggingface/trl/pull/4649
New Contributors
- @lukehinds made their first contribution in https://github.com/huggingface/trl/pull/4502
- @t1101675 made their first contribution in https://github.com/huggingface/trl/pull/4504
- @tamoghnokandar made their first contribution in https://github.com/huggingface/trl/pull/4426
- @fabio-sim made their first contribution in https://github.com/huggingface/trl/pull/4526
- @kschwethelm made their first contribution in https://github.com/huggingface/trl/pull/4553
- @jonnyli1125 made their first contribution in https://github.com/huggingface/trl/pull/4536
- @JenWei0312 made their first contribution in https://github.com/huggingface/trl/pull/4551
- @SSusantAchary made their first contribution in https://github.com/huggingface/trl/pull/4440
- @mingxuetian made their first contribution in https://github.com/huggingface/trl/pull/4458
- @iliasmerigh made their first contribution in https://github.com/huggingface/trl/pull/4573
- @xuanduy04 made their first contribution in https://github.com/huggingface/trl/pull/4580
- @casinca made their first contribution in https://github.com/huggingface/trl/pull/4630
- @Xiao-Chenguang made their first contribution in https://github.com/huggingface/trl/pull/4603
- @neha222222 made their first contribution in https://github.com/huggingface/trl/pull/4575
- @jlcanta made their first contribution in https://github.com/huggingface/trl/pull/4638
- @bhuvanprakash made their first contribution in https://github.com/huggingface/trl/pull/4635
Full Changelog: https://github.com/huggingface/trl/compare/v0.25.0...v0.26.0
