TRL
Apr 17, 2026

Features

New SSDTrainer — Simple Self-Distillation

<img width="778" height="334" alt="Screenshot 2026-04-16 at 9 08 04 PM" src="https://github.com/user-attachments/assets/8ca223f0-6740-48a8-967c-ec10cb262a93" />

A new experimental SSDTrainer implements the method described in Embarrassingly Simple Self-Distillation Improves Code Generation. SSD samples completions from the model itself at a training-time temperature/truncation setting, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. No reward model, verifier, teacher model, or RL: just prompts and the model.

from datasets import Dataset
from trl.experimental.ssd import SSDConfig, SSDTrainer

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "Write a function to add two numbers."}],
        [{"role": "user", "content": "Write a function to check if a number is prime."}],
    ],
})

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(
        output_dir="ssd-model",
        temperature=0.6,      # T_train from the paper
        top_k=20,
        top_p=0.95,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,
)
trainer.train()

by @kashif in https://github.com/huggingface/trl/pull/5505

Drop, don't truncate, overlong tool results in GRPOTrainer

When tool calls produce more tokens than max_completion_length allows, GRPOTrainer now rolls back the tool messages/images added in the current iteration instead of trying to truncate them. This removes ~80 lines of fragile, image-boundary-aware bookkeeping in favor of a ~15-line snapshot-and-rollback. Since overlong samples almost always get rewarded as failures anyway, the learning signal is effectively unchanged — but the code is dramatically simpler and no longer needs per-VLM-family vision-token lookup tables.
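The rollback pattern is easy to picture. Here is an illustrative sketch of the idea, not TRL's actual code; `token_len` stands in for whatever tokenized-length check the trainer performs:

```python
def append_tool_results(messages, tool_results, token_len, max_completion_length):
    """Snapshot-and-rollback: append this iteration's tool messages, but if the
    conversation would exceed the token budget, drop them entirely instead of
    truncating them mid-message."""
    snapshot = len(messages)          # snapshot: remember where this turn starts
    messages.extend(tool_results)
    if token_len(messages) > max_completion_length:
        del messages[snapshot:]       # roll back everything added this iteration
        return False                  # signal the rollout to stop here
    return True
```

Because overlong rollouts are dropped whole, there is no need to reason about where an image or special token boundary falls inside a truncated message.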

by @qgallouedec in https://github.com/huggingface/trl/pull/5521

Expanded tool-calling model support: LLaMA 3.1 / 3.2 & DeepSeek-V3

Continuing the effort from v1.1:

  • LLaMA 3.1 and 3.2 tool-calling response schemas, with dedicated templates for identity matching. Note that these templates only support a single tool call and no content alongside the tool call — limitations inherited from the models' native templates. By @qgallouedec in https://github.com/huggingface/trl/pull/5518
  • DeepSeek-V3 training chat template with {% generation %} markers, enabling assistant-only loss masking for DeepSeek-V3 models. By @RudrenduPaul in https://github.com/huggingface/trl/pull/5527

As a result of tightened detection (see Fixes below), the list of templates reported as tool-calling capable is now correct — notably, the basic Llama 3 template is no longer falsely classified as tool-calling capable.

KTO/DPO alignment push

A major cleanup sweep keeps KTOTrainer and DPOTrainer in lockstep: same initialization patterns, same config surface, same precompute behavior.

All by @albertvillanova.

Other

Fixes

Deprecations

  • Deprecate use_transformers_paged in GRPOConfig and RLOOConfig (and remove entirely from experimental OnlineDPOConfig, GOLDConfig, SelfDistillationConfig). Will be removed from the remaining configs in v2.0.0. In a small A/B benchmark (Qwen3-0.6B GRPO), the paged path is ~20% slower and uses ~6x more peak VRAM than the default; it's also superseded by transformers continuous batching. By @qgallouedec in https://github.com/huggingface/trl/pull/5544

Documentation and Examples

CI

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v1.1.0...v1.2.0

Apr 12, 2026

Features

DistillationTrainer for efficient on-policy distillation

Read the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-distillation-trainer

The new DistillationTrainer implements on-policy knowledge distillation as described in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. It extends the ideas from the GKDTrainer with three key optimizations: a generation buffer that decouples the training microbatch size from the generation batch size (up to 40x speedup), external teacher server support so the teacher doesn't need to fit on training GPUs, and binary-encoded logprob payloads that shrink transfer payloads by ~5x.

from datasets import load_dataset
from trl.experimental.distillation import DistillationConfig, DistillationTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda x: {"messages": [{"role": "user", "content": x["question"]}]},
    remove_columns=dataset.column_names,
)

trainer = DistillationTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    teacher_model="Qwen/Qwen2.5-7B-Instruct",
    args=DistillationConfig(
        output_dir="results/distill-qwen-gsm8k",
        lmbda=1.0,                   # fully on-policy (student generates)
        beta=1.0,                    # reverse KL
        teacher_model_init_kwargs={"torch_dtype": "bfloat16"},
    ),
    train_dataset=dataset,
)
trainer.train()

by @cmpatino in https://github.com/huggingface/trl/pull/5407, https://github.com/huggingface/trl/pull/5500 and https://github.com/huggingface/trl/pull/5501

Chunked LM head for memory-efficient log-prob computation in AsyncGRPOTrainer

AsyncGRPOTrainer now supports a chunked LM-head path that computes per-token log-probs and entropy via online logsumexp without materializing the full [N, V] logits tensor. Combined with completion_mask filtering to skip prompt tokens, this brings massive memory savings on long sequences — up to 44x lower peak-allocated memory on an 8192-token sequence:

| chunk_lm_head_size | Peak Alloc (GB) | Reduction | Wall Time (ms) |
|--------------------|-----------------|-----------|----------------|
| None (baseline)    | 18.55           | 1.00x     | 808.7          |
| 4096               | 0.42            | 44.32x    | 459.0          |
| 8192               | 0.76            | 24.34x    | 393.0          |
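The online-logsumexp trick can be sketched in a few lines. This is an illustrative reimplementation, not TRL's code: only one `[N, chunk_size]` logits slice exists at a time, yet the result matches a full-vocabulary log-softmax.

```python
import torch

def chunked_token_logprobs(hidden, lm_head_weight, target_ids, chunk_size=4096):
    """Per-token log-probs via online logsumexp over vocab chunks,
    without materializing the full [N, V] logits tensor."""
    n_tokens = hidden.size(0)
    vocab_size = lm_head_weight.size(0)
    running_lse = torch.full((n_tokens,), float("-inf"), dtype=hidden.dtype)
    target_logit = torch.empty(n_tokens, dtype=hidden.dtype)
    for start in range(0, vocab_size, chunk_size):
        w = lm_head_weight[start : start + chunk_size]   # [C, H]
        logits = hidden @ w.T                            # [N, C] — only slice in memory
        # online logsumexp update over the vocab dimension
        running_lse = torch.logaddexp(running_lse, torch.logsumexp(logits, dim=-1))
        # pick up the logits of targets that fall inside this chunk
        in_chunk = (target_ids >= start) & (target_ids < start + chunk_size)
        if in_chunk.any():
            idx = in_chunk.nonzero(as_tuple=True)[0]
            target_logit[idx] = logits[idx, target_ids[idx] - start]
    return target_logit - running_lse   # log p(target) per token
```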

Enable it via the new chunk_lm_head_size option in AsyncGRPOConfig:

from trl.experimental.async_grpo import AsyncGRPOConfig, AsyncGRPOTrainer

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=AsyncGRPOConfig(chunk_lm_head_size=4096),
    ...
)

Note: mutually exclusive with use_liger_kernel (both replace the LM head forward pass).

by @AmineDiro in https://github.com/huggingface/trl/pull/5349

{% generation %} support in training chat templates

SFT with assistant_only_loss=True requires chat templates to include {% generation %} / {% endgeneration %} markers so that return_assistant_tokens_mask=True produces correct masks. Very few models ship these markers natively, so users hit a cryptic error when enabling assistant-only loss with models like Qwen3, Llama 3 or GPT-OSS.

SFTTrainer now automatically swaps in a patched training chat template when the original template lacks generation markers — no manual template surgery required. Training templates are shipped for Qwen2.5, Qwen3, Llama 3 and GPT-OSS, stored as standalone .jinja files under trl/chat_templates/ for readability, diffability, and editor syntax highlighting.

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(assistant_only_loss=True),  # now just works
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5459, https://github.com/huggingface/trl/pull/5470, by @RudrenduPaul in https://github.com/huggingface/trl/pull/5493 and https://github.com/huggingface/trl/pull/5522, and by @casinca in https://github.com/huggingface/trl/pull/5484

Expanded tool-calling model support

Agent training now supports a broader family of models via native tool-call response schemas:

A new supports_tool_calling() utility detects whether a tokenizer/processor can render a full tool-calling turn, and GRPOTrainer now validates tool support at initialization — raising a clear error upfront instead of failing cryptically mid-training.

by @qgallouedec in https://github.com/huggingface/trl/pull/5462, https://github.com/huggingface/trl/pull/5464, https://github.com/huggingface/trl/pull/5463, https://github.com/huggingface/trl/pull/5469 and https://github.com/huggingface/trl/pull/5454

Multimodal tool responses for VLM training

environment_factory tool methods can now return multimodal content blocks (images + text) for VLM training. Previously, tool responses were always converted to str(result), discarding any visual information. Now tools can return content block lists with images, and the trainer handles them end-to-end through tokenization, generation, and the forward pass — including correct pixel_values plumbing.

class ScreenshotEnv:
    def take_screenshot(self) -> list[dict]:
        return [
            {"type": "image", "image": self.browser.screenshot()},
            {"type": "text", "text": "Current page state"},
        ]

The OpenEnv browsergym.py example has been migrated to this pattern, and a new carla_vlm.py example demonstrates VLM training against CARLA with camera-image tool responses.

by @sergiopaniego in https://github.com/huggingface/trl/pull/5323 and https://github.com/huggingface/trl/pull/5437, and by @qgallouedec in https://github.com/huggingface/trl/pull/5448

Built-in reward functions now log extra columns

accuracy_reward and reasoning_accuracy_reward now emit extra diagnostic columns (solution, gold_parsed, answer_parsed) via the log_extra callback introduced in v1.0.0. These show up in the rich completions table, making it much easier to debug why a reward was (or wasn't) assigned.

<img width="1627" alt="accuracy reward with extra columns" src="https://github.com/user-attachments/assets/d7f6e9c2-4d7b-4886-ba7a-f58f0ccfcb9b" />
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=GRPOConfig(log_completions=True),
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5308

Other

Fixes

Deprecations and Removals

  • Deprecate keep_end truncation mode in DPOConfig and SFTConfig — will be removed in v2.0.0. Use keep_start instead. By @albertvillanova in https://github.com/huggingface/trl/pull/5465
  • Deprecate pad_token config parameter in DPOConfig, SFTConfig, and RewardConfig — will be removed in v2.0.0. Set tokenizer.pad_token directly on the processing_class instead. By @albertvillanova in https://github.com/huggingface/trl/pull/5480
  • Remove trl.experimental.judges module and all judge support from trainers. Judges were experimental, unused in practice, and llm-blender (backing PairRMJudge) was unmaintained and incompatible with transformers v5 — actively blocking v5 adoption. Everything judges did can be achieved with reward_funcs. OnlineDPOTrainer, NashMDTrainer, and XPOTrainer are now unified on reward-model scoring only. By @qgallouedec in https://github.com/huggingface/trl/pull/5485

Documentation and Examples

CI

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v1.0.0...v1.1.0

Mar 31, 2026
<img width="1800" height="1013" alt="thumbnail-2" src="https://github.com/user-attachments/assets/5c55b86a-0600-4f70-bf37-41ab240af851" />

Read our blog post for an overview of TRL v1.

Features

Asynchronous GRPO

Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5293

Variational Sequence-Level Soft Policy Optimization (VESPO)

<img width="465" height="279" alt="Screenshot 2026-03-20 at 5 49 50 PM" src="https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa" />

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in https://github.com/huggingface/trl/pull/5199

Divergence Proximal Policy Optimization (DPPO)

<img width="3180" height="1187" alt="z_TXYw37xZqsQ21YiDkYL" src="https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b" /> <img width="1189" height="490" alt="SfgWotuuuRKPkg-0bxWv1" src="https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689" />

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
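A minimal usage sketch, assuming the trainer follows the usual experimental-module pattern; the exact import path and config fields may differ, so check the DPPO docs:

```python
from trl.experimental.dppo import DPPOConfig, DPPOTrainer  # import path assumed

trainer = DPPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=DPPOConfig(output_dir="dppo-model"),
    ...
)
```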

by @LeonEricsson in https://github.com/huggingface/trl/pull/5117

Self-Distillation Policy Optimization (SDPO)

SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.

from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)

trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()

by @MengAiDev in https://github.com/huggingface/trl/pull/4935

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" /> <img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />

by @manueldeprada in https://github.com/huggingface/trl/pull/5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in https://github.com/huggingface/trl/pull/4889

35% faster packing

BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.

<img width="1784" height="732" alt="benchmark_results" src="https://github.com/user-attachments/assets/8f0a35ad-cf1a-4fe1-a1f4-9b102637bdca" />
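Packing remains opt-in via SFTConfig. A minimal sketch; the strategy parameter name here is an assumption, so check the SFT docs for the exact field:

```python
from trl import SFTConfig

training_args = SFTConfig(
    packing=True,                  # enable BFD packing
    packing_strategy="bfd_split",  # parameter name assumed; value renamed from "bfd-requeue"
    ...
)
```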

by @mariosasko in https://github.com/huggingface/trl/pull/5189

[GKD] Buffer implementation and vLLM inference for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.

by @cmpatino in https://github.com/huggingface/trl/pull/5137 and https://github.com/huggingface/trl/pull/5388

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in https://github.com/huggingface/trl/pull/5255

Other

Fixes

Documentation and Examples

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0

Mar 20, 2026

Features

Variational Sequence-Level Soft Policy Optimization (VESPO)

<img width="465" height="279" alt="Screenshot 2026-03-20 at 5 49 50 PM" src="https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa" />

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in https://github.com/huggingface/trl/pull/5199

Divergence Proximal Policy Optimization (DPPO)

<img width="3180" height="1187" alt="z_TXYw37xZqsQ21YiDkYL" src="https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b" /> <img width="1189" height="490" alt="SfgWotuuuRKPkg-0bxWv1" src="https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689" />

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.

by @LeonEricsson in https://github.com/huggingface/trl/pull/5117

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" /> <img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />

by @manueldeprada in https://github.com/huggingface/trl/pull/5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in https://github.com/huggingface/trl/pull/4889

35% faster packing

BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.

<img width="1784" height="732" alt="benchmark_results" src="https://github.com/user-attachments/assets/8f0a35ad-cf1a-4fe1-a1f4-9b102637bdca" />

by @mariosasko in https://github.com/huggingface/trl/pull/5189

[GKD] Buffer implementation for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.

by @cmpatino in https://github.com/huggingface/trl/pull/5137

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in https://github.com/huggingface/trl/pull/5255

Other

Fixes

Documentation and Examples

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0rc1

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v0.29.1

Feb 25, 2026

Features

Add environment_factory to GRPOTrainer

GRPOTrainer now accepts an environment_factory argument, allowing users to specify a custom environment class for training. This enables more flexible and diverse training scenarios by letting users define their own environments with specific dynamics and reward structures.

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": f"Increment the counter by {i}."}] for i in range(1, 7)]
})

def reward_func(environments, **kwargs):
    return [env.counter for env in environments]

class IncrementEnv:
    def reset(self):
        self.counter = 0

    def increment(self, step: int) -> int:
        """
        Increment the internal counter.

        Args:
            step: Value to add to the counter.

        Returns:
            The updated counter value.
        """
        self.counter += step
        return self.counter

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(chat_template_kwargs={"enable_thinking": False}),
    train_dataset=dataset,
    reward_funcs=reward_func,
    environment_factory=IncrementEnv,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5093

Skills

TRL introduces agent-native CLI integration: trl-training, a first-class Agent Skill that exposes TRL's training workflows (SFT, DPO, GRPO, etc.) in a structured, agent-readable format. The skill ships directly with the trl library and can be installed via the CLI:

# Install into the project's agent directory (default scope=project), by agent name: claude, codex, opencode
trl skills install trl-training --target <agent>

This enables AI agents to safely and reproducibly execute TRL training workflows using a well-defined interface.

Skills can be installed at the project or global scope, and support explicit targets and overwrite controls.

Other

Fixes

Documentation and Examples

Deprecations

CI Improvements

Miscellaneous

Refactor CLI

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.28.0...v0.29.0

Feb 10, 2026
v0.28.0

Features

Experimental

Fixes

Documentation and Examples

Deprecations

CI Improvements

Miscellaneous

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.27.0...v0.28.0

Feb 3, 2026

What's Changed

  • Remove access to warnings_issued by @qgallouedec in #4960
  • Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
  • Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in https://github.com/huggingface/trl/pull/4908

Full Changelog: https://github.com/huggingface/trl/compare/v0.27.1...v0.27.2

Jan 24, 2026

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.27.0...v0.27.1

Jan 16, 2026

Features

Experimental

Fixes

Documentation and Examples

Deprecations

CI Improvements

Miscellaneous

--

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.26.0...v0.27.0

Dec 18, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.26.1...v0.26.2

Dec 12, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.26.0...v0.26.1

Dec 9, 2025

Features

🕵️‍♂️ GRPO: Agent training

GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.

from datasets import Dataset
from trl import GRPOTrainer

def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


dataset = Dataset.from_list(
    [
        {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
        {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
        {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
        {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
        {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
        {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
    ]
)

def accuracy(completions, answer, **kwargs):
    predictions = [completion[-1]["content"] for completion in completions]
    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    tools=[multiply],
    reward_funcs=accuracy,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/4300

ScaleRL: Add CISPO Loss

CISPO loss was first introduced in the MiniMax-M1 paper; the ScaleRL paper subsequently showed that CISPO scales best in terms of performance and efficiency as models are trained for longer.

GRPOTrainer now supports the CISPO loss using loss_type="cispo" in the GRPOConfig.
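Mirroring the other loss variants, enabling it is a one-line config change:

```python
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="cispo"),
    ...
)
```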

by @pramodith in https://github.com/huggingface/trl/pull/4495

Add vLLM quantization option for colocate

When the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.

by @sergiopaniego in https://github.com/huggingface/trl/pull/4496

Reasoning reward

TRL now includes a reasoning reward function:

from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
        }
    ],
]
reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0] 

Like any other reward function, it can be used in GRPOTrainer or RLOOTrainer.

from trl import GRPOTrainer
from trl.rewards import reasoning_accuracy_reward

trainer = GRPOTrainer(
    ...,
    reward_funcs=reasoning_accuracy_reward,
)

by @lewtun in https://github.com/huggingface/trl/pull/4563

Add shuffle_dataset option to SFTTrainer

You can now shuffle the dataset in SFTTrainer by setting the shuffle_dataset argument to True in SFTConfig. This is useful when the dataset features high similarity between consecutive samples.

from trl import SFTTrainer, SFTConfig

SFTConfig(shuffle_dataset=True)

by @qgallouedec in https://github.com/huggingface/trl/pull/4564

Add SAPO Loss in GRPO

Soft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.

You can now use SAPO loss in GRPOTrainer by setting loss_type="sapo" in the GRPOConfig.
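As with the other loss variants:

```python
from trl import GRPOConfig

training_args = GRPOConfig(loss_type="sapo", ...)
```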

by @pramodith in https://github.com/huggingface/trl/pull/4600

Other Features

Experimental

Fixes

Documentation and Examples

Deprecations

Miscellaneous

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.25.0...v0.26.0

Nov 12, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.25.0...0.25.1

Nov 6, 2025

Features

Experimental

Fixes

Documentation and Examples

Deprecations

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.24.0...v0.25.0

Oct 16, 2025

Features

Fixes

Documentation

Deprecations

Experimental

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.23.0...v0.24.0

Oct 2, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.23.0...v0.23.1

Sep 10, 2025

Major

🥓 Context Parallelism

SFT now supports Context Parallelism (CP) for training large language models on very long sequences, enabling training with arbitrarily long sequence lengths.

<img width="844" height="336" alt="Screenshot 2025-09-09 at 10 39 30 PM" src="https://github.com/user-attachments/assets/f1dfc349-440a-4e05-aac9-439a3c286f08" />

by @kashif in https://github.com/huggingface/trl/pull/3994

🧨 Dynamic Fine-Tuning

Dynamic Fine-Tuning (DFT) is now supported in TRL.

from trl import SFTConfig

training_args = SFTConfig(
    loss_type="dft",
    ...
)
<img width="692" height="472" alt="Screenshot 2025-09-09 at 10 37 36 PM" src="https://github.com/user-attachments/assets/4ee2b4ab-7cc6-4578-bfac-c38124891510" />
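Conceptually, DFT rescales each token's cross-entropy by the model's own (detached) probability of that token. A sketch of the loss, assuming the formulation from the DFT paper rather than TRL's exact implementation:

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, labels):
    """DFT sketch: cross-entropy reweighted by the (detached) probability the
    model assigns to each target token, down-weighting unlikely tokens."""
    logp = F.log_softmax(logits, dim=-1)                       # [N, V]
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # [N]
    return -(tok_logp.exp().detach() * tok_logp).mean()
```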

by @qgallouedec in https://github.com/huggingface/trl/pull/4042

🪵 Truncated Importance Sampling (TIS) to address rollout-training mismatch

Rollout generation (vLLM) and model training use different implementations, and this gap implicitly turns on-policy RL into off-policy RL. Truncated Importance Sampling (TIS) is a simple yet effective importance-sampling technique for handling this discrepancy, and it is now implemented in GRPO.

from trl import GRPOConfig

training_args = GRPOConfig(
    ...
    use_vllm=True,
    vllm_importance_sampling_correction=True, # default True
    vllm_importance_sampling_cap=2.0, # hyper-parameter C
)
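The correction itself is small. An illustrative sketch of the truncated weight, not TRL's exact implementation:

```python
import torch

def tis_weights(logp_train, logp_rollout, cap=2.0):
    """Per-token importance ratio between the training policy and the vLLM
    rollout policy, truncated at C (the cap) to bound variance."""
    ratio = torch.exp(logp_train - logp_rollout)
    return torch.clamp(ratio, max=cap)
```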

by @LeonEricsson in https://github.com/huggingface/trl/pull/3867

🥣 [SFTTrainer]: Add Aux Loss for MoE models

Mixture of Experts (MoE) models require an auxiliary loss to ensure that the different experts are used evenly. This auxiliary loss is now supported in SFTTrainer.

training_args = SFTConfig(
    model_init_kwargs={"output_router_logits": True},
    ...
)

by @pramodith in https://github.com/huggingface/trl/pull/4012

💤 [GRPO/RLOO] Adds an option to sleep vllm when running in colocated mode

When running GRPO (or RLOO) with vLLM in colocated mode, the vLLM engine consumes VRAM during optimization while sitting idle. There is now an option to put vLLM to sleep during optimization to free up that VRAM.

from trl import GRPOConfig

training_args = GRPOConfig(..., vllm_sleep_enabled=True)

by @edbeeching in https://github.com/huggingface/trl/pull/3968

⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer

You can now use vLLM server mode with OnlineDPOTrainer. Additionally, VLM models are now supported.
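A minimal configuration sketch; the parameter names are assumed from the GRPO-style vLLM integration, so check the OnlineDPO docs for the exact fields:

```python
from trl import OnlineDPOConfig

training_args = OnlineDPOConfig(
    use_vllm=True,
    vllm_mode="server",  # assumed; connects to a separately launched vLLM server
    ...
)
```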

by @vaelev in https://github.com/huggingface/trl/pull/3783

Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations

The paper index has been significantly enhanced with the addition of 9+ new algorithm implementations, providing a more comprehensive resource for users.

by @behroozazarkhalili in https://github.com/huggingface/trl/pull/3990

Other Notable Changes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.22.0...v0.23.0

Sep 3, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.22.1...v0.22.2

Aug 29, 2025

What's Changed

  • Refactor version retrieval to use importlib.metadata by @qgallouedec
  • Release: 0.22.1 by @qgallouedec

Full Changelog: https://github.com/huggingface/trl/compare/v0.22.0...v0.22.1
