GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.
```python
from datasets import Dataset
from trl import GRPOTrainer


def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


dataset = Dataset.from_list(
    [
        {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
        {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
        {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
        {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
        {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
        {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
    ]
)


def accuracy(completions, answer, **kwargs):
    predictions = [completion[-1]["content"] for completion in completions]
    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
    return rewards


trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    tools=[multiply],
    reward_funcs=accuracy,
)
trainer.train()
```
by @qgallouedec in https://github.com/huggingface/trl/pull/4300
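Since the `accuracy` reward above is plain Python, it can be sanity-checked on its own, outside the trainer. The completion messages below are made-up examples for illustration:

```python
def accuracy(completions, answer, **kwargs):
    # Take the last message of each completion and check whether
    # the ground-truth answer appears in its text.
    predictions = [completion[-1]["content"] for completion in completions]
    return [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]


demo_completions = [
    [{"role": "assistant", "content": "3 multiplied by 4 is 12."}],
    [{"role": "assistant", "content": "The answer is 54."}],  # wrong on purpose
]
print(accuracy(demo_completions, answer=[12, 56]))  # [1.0, 0.0]
```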
The CISPO loss was first introduced in the MiniMax-M1 paper; the ScaleRL paper subsequently showed that CISPO scales best in terms of performance and efficiency as models are trained for longer.
GRPOTrainer now supports the CISPO loss via `loss_type="cispo"` in the `GRPOConfig`.
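For intuition, here is a rough NumPy sketch of a CISPO-style objective. This is illustrative only, not TRL's implementation; the function name and epsilon defaults are assumptions. The key idea is that CISPO clips the importance-sampling ratio itself and treats it as a constant (a stop-gradient in an autodiff framework), rather than clipping the surrogate objective as PPO/GRPO do, so no token's gradient is zeroed out entirely:

```python
import numpy as np


def cispo_objective(logp_new, logp_old, advantages, eps_low=0.2, eps_high=0.2):
    """Illustrative token-level CISPO-style objective (to be maximized)."""
    ratio = np.exp(logp_new - logp_old)  # importance-sampling ratio
    # Clip the ratio itself; in a real implementation this factor is detached
    # from the computation graph (stop-gradient).
    clipped = np.clip(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # REINFORCE-style term weighted by the clipped (constant) ratio.
    return np.mean(clipped * advantages * logp_new)


# On-policy tokens (ratio == 1) reduce to a plain advantage-weighted log-prob.
logp = np.log(np.array([0.5, 0.25]))
adv = np.array([1.0, 1.0])
print(cispo_objective(logp, logp, adv))
```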
by @pramodith in https://github.com/huggingface/trl/pull/4495
When the input model is quantized with bitsandbytes, vLLM now also applies quantization in colocate mode.
by @sergiopaniego in https://github.com/huggingface/trl/pull/4496
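For reference, colocate mode is selected through the config. A minimal sketch, assuming the standard `use_vllm` / `vllm_mode` options of `GRPOConfig`; model-loading and quantization details depend on your setup:

```python
from trl import GRPOConfig

# vLLM runs inside the training process ("colocate" mode); when the input
# model is bitsandbytes-quantized, generation now uses quantization as well.
config = GRPOConfig(use_vllm=True, vllm_mode="colocate")
```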
TRL now includes a reasoning reward function.
```python
from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
        }
    ],
]
reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0]
```
Like any other reward function, it can be used in GRPOTrainer or RLOOTrainer.
```python
from trl import GRPOTrainer
from trl.rewards import reasoning_accuracy_reward

trainer = GRPOTrainer(
    ...,
    reward_funcs=reasoning_accuracy_reward,
)
```
by @lewtun in https://github.com/huggingface/trl/pull/4563
shuffle_dataset option to SFTTrainer

You can now shuffle the dataset in SFTTrainer by setting the `shuffle_dataset` argument to `True` in SFTConfig. This is useful when the dataset features high similarity between consecutive samples.
```python
from trl import SFTTrainer, SFTConfig

SFTConfig(shuffle_dataset=True)
```
by @qgallouedec in https://github.com/huggingface/trl/pull/4564
Soft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.
You can now use SAPO loss in GRPOTrainer by setting loss_type="sapo" in the GRPOConfig.
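To build intuition for soft gating versus hard clipping, here is a schematic NumPy illustration. This is not SAPO's actual gate (see the paper for its exact form); the tanh gate and the temperature value below are made up purely to show the qualitative difference:

```python
import numpy as np


def hard_clip(ratio, eps=0.2):
    # PPO/GRPO-style hard clipping: contributions are cut off sharply at the
    # band edges, so gradients vanish entirely outside [1 - eps, 1 + eps].
    return np.clip(ratio, 1.0 - eps, 1.0 + eps)


def soft_gate(ratio, temperature=0.1):
    # Schematic smooth gate: attenuates off-policy ratios gradually and
    # saturates near 1 +/- temperature instead of cutting off abruptly,
    # forming a continuous trust region around ratio == 1.
    return 1.0 + np.tanh((ratio - 1.0) / temperature) * temperature


r = np.array([0.5, 1.0, 1.5])
print(hard_clip(r))  # piecewise-flat outside the clipping band
print(soft_gate(r))  # smooth everywhere, saturating near the same band
```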
by @pramodith in https://github.com/huggingface/trl/pull/4600
- num_generations_eval parameter for efficient evaluation by @mingxuetian in https://github.com/huggingface/trl/pull/4458
- WinRateCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4558
- device_map and dtype to "auto" by default by @qgallouedec in https://github.com/huggingface/trl/pull/4509
- flash-attn to flash-attn2 by @qgallouedec in https://github.com/huggingface/trl/pull/4514
- device_map=None for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in https://github.com/huggingface/trl/pull/4551
- num_completions to num_generations by @pramodith in https://github.com/huggingface/trl/pull/4515
- rnj_1_instruct notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4646
- wandb_log_unique_prompts with log_unique_prompts by @taha-yassine in https://github.com/huggingface/trl/pull/4508
- prepare_model_for_kbit_training by @sergiopaniego in https://github.com/huggingface/trl/pull/4457
- lr_scheduler_kwargs dtype issue in Transformers 4.57.0 by @qgallouedec in https://github.com/huggingface/trl/pull/4513
- shuffle_dataset option to SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4564

Full Changelog: https://github.com/huggingface/trl/compare/v0.25.0...v0.26.0