TRL

Releases: 10 · Avg: 3/mo · Versions: v0.27.0 → v1.2.0
Aug 29, 2025

Major

🔮 Native VLM support for SFTTrainer

SFTTrainer now natively supports Vision-Language Models (VLMs). This includes support for both language-modeling and prompt-completion data, as well as completion-only training.

<img width="1136" height="586" alt="Group 291-6" src="https://github.com/user-attachments/assets/2629b8e7-d853-4b7c-91d5-f4c128287e04" />
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None),
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/3862, https://github.com/huggingface/trl/pull/3907 and https://github.com/huggingface/trl/pull/3908

🔥 RLOOTrainer refactor

RLOOTrainer has been refactored to align with the design principles of the other trainers in the library. You can now use this trainer exactly like GRPOTrainer.

from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer

dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    train_dataset=dataset,
)
trainer.train()

by @shirinyamani in https://github.com/huggingface/trl/pull/3801

🧭 HF jobs x TRL guide

You can now leverage Hugging Face Jobs to easily train and deploy your models with TRL.

hf jobs uv run --flavor a100-large --secrets HF_TOKEN "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara

A guide is available in the docs.

by @sergiopaniego in https://github.com/huggingface/trl/pull/3890

🏌️ DAPO loss type

GRPOTrainer now supports the DAPO loss type, which aggregates token-level losses by normalizing with the number of active tokens in the global accumulated batch. This method was introduced to eliminate length bias. Simply use

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    loss_type="dapo",
    ...
)
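To make the aggregation concrete, here is a minimal plain-Python sketch (illustrative numbers, not TRL internals) contrasting per-sequence averaging with DAPO's single global normalization over active tokens:

```python
# Sketch of DAPO-style token-level loss aggregation (illustrative only).
# per_token_loss[i][j] is the loss of token j in sequence i; mask marks active tokens.
per_token_loss = [[0.5, 1.0, 1.5], [2.0, 0.0, 0.0]]
mask = [[1, 1, 1], [1, 0, 0]]  # second sequence has a single active token

# GRPO-style: average per sequence, then average across sequences.
per_seq = [
    sum(l * m for l, m in zip(losses, ms)) / sum(ms)
    for losses, ms in zip(per_token_loss, mask)
]
grpo_loss = sum(per_seq) / len(per_seq)  # (1.0 + 2.0) / 2 = 1.5

# DAPO-style: one global normalization by the number of active tokens.
total = sum(l * m for losses, ms in zip(per_token_loss, mask) for l, m in zip(losses, ms))
dapo_loss = total / sum(m for ms in mask for m in ms)  # 5.0 / 4 = 1.25
```

Note how the short sequence dominates the GRPO-style average but contributes proportionally to its token count under DAPO, which is what removes the length bias.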

by @qgallouedec in https://github.com/huggingface/trl/pull/3938

🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch

The authors of Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO) find that the combination of:

  1. scaling rewards by the standard deviation computed over the entire batch and
  2. aggregating loss over the total number of tokens

can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

TRL lets you apply these findings when training a GRPO model:

from trl import GRPOConfig

training_args = GRPOConfig(
    scale_rewards="batch",
    loss_type="dapo",
    ...
)
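For intuition, here is a small sketch (hypothetical reward values, plain Python) of the difference between the default group-level scaling and batch-level scaling:

```python
from statistics import mean, pstdev

# Sketch: rewards for two prompt groups of two completions each (illustrative values).
groups = [[1.0, 3.0], [10.0, 14.0]]

# Group-level scaling (GRPO default): divide each group's centered rewards
# by that group's own standard deviation.
group_adv = [[(r - mean(g)) / pstdev(g) for r in g] for g in groups]

# Batch-level scaling (scale_rewards="batch"): still center within the group,
# but divide by a single standard deviation computed over the whole batch.
flat = [r for g in groups for r in g]
batch_std = pstdev(flat)
batch_adv = [[(r - mean(g)) / batch_std for r in g] for g in groups]
```

With group-level scaling both groups produce identical advantages; batch-level scaling preserves the relative spread between easy and hard prompts.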

by @pramodith in https://github.com/huggingface/trl/pull/3935

🎢 [Callbacks] BEMA

Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can now use the BEMACallback:

from trl import BEMACallback, SFTTrainer

trainer = SFTTrainer(
    ...
    callbacks=[BEMACallback()],
)
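For intuition, here is a toy sketch of a bias-corrected EMA on a scalar, in the spirit of BEMA (this is not the exact update rule from the paper, just the bias-correction idea):

```python
# Sketch: bias-corrected exponential moving average of a scalar "weight".
def bema(values, beta=0.9):
    ema, corrected = 0.0, []
    for t, v in enumerate(values, start=1):
        ema = beta * ema + (1 - beta) * v
        corrected.append(ema / (1 - beta ** t))  # divide out the zero-init bias
    return corrected

# Early iterates are no longer dragged toward the zero initialization.
smoothed = bema([1.0, 1.0, 1.0])
```

Without the correction term, the first few averages would be close to 0 simply because the EMA starts at 0; the correction removes that bias from step one.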

by @kashif in https://github.com/huggingface/trl/pull/3855

Minor

Deprecations

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.21.0...v0.22.0

Aug 5, 2025

Major and breaking

🌺 OpenAI GPT OSS & Harmony support

<img width="4544" height="2344" alt="Group 293-2" src="https://github.com/user-attachments/assets/17241da2-1b1d-41bc-a5f8-0983ea46606f" />

OpenAI GPT OSS models are here! Check out the OpenAI Cookbook to see an example of how to SFT these models.

by @qgallouedec in https://github.com/huggingface/trl/pull/3848

Add vLLM transformers backend to online methods

You can now pass vllm_model_impl to the TRL vLLM server. For example, for the transformers backend:

trl vllm-serve ... --vllm_model_impl transformers

by @merveenoyan in https://github.com/huggingface/trl/pull/3773

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.20.0...v0.21.0

Jul 29, 2025

Breaking and major changes

🎞️ GSPO

GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token.

<img width="930" height="538" alt="Screenshot 2025-07-28 at 10 54 15 PM" src="https://github.com/user-attachments/assets/923835af-dc61-4fd4-8a99-44242d02bb7b" />

📜 Paper: https://huggingface.co/papers/2507.18071

To reproduce the paper's setting, use this configuration:

from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    steps_per_generation=...,
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # https://x.com/ChujieZheng/status/1948933507696525392
)
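The core difference from GRPO can be sketched in a few lines (illustrative log-ratios, not TRL code): instead of one importance weight per token, GSPO uses a single weight per sequence, the geometric mean of the token ratios.

```python
from math import exp

# Sketch: sequence-level importance weight (GSPO) vs per-token weights (GRPO).
# log_ratios[j] = log pi_theta(token_j) - log pi_old(token_j), illustrative values.
log_ratios = [0.2, -0.1, 0.5]

token_weights = [exp(lr) for lr in log_ratios]       # one weight per token (GRPO)
seq_weight = exp(sum(log_ratios) / len(log_ratios))  # one weight per sequence (GSPO)
```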

by @qgallouedec in https://github.com/huggingface/trl/pull/3775

👁️ [GRPO] Add VLM training capabilities to the GRPO trainer

<img width="1136" height="594" alt="Group 291-4" src="https://github.com/user-attachments/assets/04850e80-9689-472d-acd7-fda331e66dc3" />

The GRPOTrainer can now be used for VLM training. Give a try with this dummy example:

from trl import GRPOTrainer
from datasets import load_dataset

# Dummy vision-language dataset
dataset = load_dataset("trl-internal-testing/zen-image", "conversational_prompt_only", split="train")

# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c[0]["content"])) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    reward_funcs=[reward_num_unique_chars],
    train_dataset=dataset,
)

trainer.train()

by @CompN3rd and @kashif in https://github.com/huggingface/trl/pull/3072 and https://github.com/huggingface/trl/pull/3760

🐙 MPO

<img width="440" height="438" alt="Screenshot 2025-07-28 at 10 52 15 PM" src="https://github.com/user-attachments/assets/e07a7936-c4c5-480d-9ffd-db5b77a5445e" />

The DPO trainer supports combining multiple loss functions with different weights, enabling more sophisticated optimization strategies. This is particularly useful for implementing algorithms like MPO (Mixed Preference Optimization). MPO is a training approach that combines multiple optimization objectives, as described in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization.

To combine multiple losses, specify the loss types and corresponding weights as lists:

from trl import DPOConfig

# MPO: Combines DPO (sigmoid) for preference and BCO (bco_pair) for quality
training_args = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # Loss types to combine
    loss_weights=[0.8, 0.2, 1.0]  # Corresponding weights, as used in the MPO paper
)

by @qgallouedec in https://github.com/huggingface/trl/pull/2544

Add support for CB with native transformers

Continuous Batching allows for faster generation using the transformers backend. You can now use it with the GRPOTrainer by setting use_transformers_paged=True in the config.

from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    use_transformers_paged=True,
)

by @ArthurZucker in https://github.com/huggingface/trl/pull/3471

Add entropy based filtering inside the GRPOTrainer

<img width="788" height="438" alt="Screenshot 2025-07-28 at 10 27 20 PM" src="https://github.com/user-attachments/assets/8073a5db-a98e-4534-aea9-d3dbd2e75f4a" />

In Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, it is shown that utilizing only 20% of the highest-entropy tokens leads to similar performance as using all tokens. You can now enable this feature in the GRPOTrainer by setting top_entropy_quantile in the config.

from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    top_entropy_quantile=0.2,  # Use only the top 20% of tokens based on entropy
)
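A toy sketch of the selection step (hypothetical per-token entropies, plain Python): keep only the top-quantile highest-entropy tokens in the loss mask.

```python
# Sketch: keep only the top 20% highest-entropy tokens (illustrative values).
entropies = [0.1, 0.4, 0.2, 2.5, 0.3, 1.9, 0.15, 0.05, 3.1, 0.2]
quantile = 0.2

k = max(1, int(len(entropies) * quantile))          # number of tokens to keep
threshold = sorted(entropies, reverse=True)[k - 1]  # entropy of the k-th highest token
mask = [1 if e >= threshold else 0 for e in entropies]
```

Only the two highest-entropy tokens survive the mask; the loss is then computed on those tokens alone.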

by @pramodith in https://github.com/huggingface/trl/pull/3563

👐 FSDP2+GRPO

GRPO now supports FSDP2 training. Just run your script with an FSDP2 config:

accelerate launch --config_file examples/accelerate_configs/fsdp2.yaml run_grpo.py

by @SalmanMohammadi in https://github.com/huggingface/trl/pull/3687

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.19.0...v0.20.0

Jul 8, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.19.0...v0.19.1

Jun 21, 2025

Breaking and major changes

🧰 [SFT] Tool support

SFTTrainer now supports training with tools! You just have to add a tools column to your dataset, containing a list of tool definitions as JSON schemas. The tools will be automatically registered and can be used in the training process.

from datasets import Dataset
from transformers.utils import get_json_schema
from trl import SFTTrainer

# Fictitious functions to simulate tool calls
def start_timer(duration: int) -> int:
    """
    Starts a timer for the specified duration in seconds.

    Args:
        duration: Duration in seconds to set the timer for.

    Returns:
        The duration set for the timer.
    """
    return duration

def create_reminder(time: str, note: str) -> str:
    """
    Creates a reminder for the specified time and note.

    Args:
        time: The time for the reminder.
        note: The note for the reminder.

    Returns:
        A confirmation message indicating that the reminder has been set.
    """
    return "I'll remind you to call mom at 7 PM."

# Define the JSON schemas for the tools
start_timer = get_json_schema(start_timer)
create_reminder = get_json_schema(create_reminder)

dataset = Dataset.from_dict({
    "messages": [
        [
            {"role": "user", "content": "Set a timer for 10 minutes."},
            {"role": "assistant", "tool_calls": [{"type": "function", "function": {"name": "start_timer", "arguments": {"duration": 600}}}]},
            {"role": "tool", "name": "start_timer", "content": "600"},
            {"role": "assistant", "content": "Timer set for 10 minutes."},
        ],
        ...,
    ],
    "tools": [
        [start_timer, create_reminder],
        ...,
    ]
})

# Initialize the trainer
trainer = SFTTrainer(model="Qwen/Qwen3-0.6B", train_dataset=dataset)

# Train the model
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/3597

📉 FFD packing

We introduce a new packing method: FFD (First Fit Decreasing) packing. This method optimizes how sequences are packed, reducing the size of the training dataset by grouping examples more efficiently. Previously, we used a wrapped packing method, which often truncated sequences even when they were not longer than the maximum sequence length. The new FFD packing method avoids unnecessary truncation by grouping sequences more intelligently. This packing strategy is now the default when packing is enabled.

training_args = SFTConfig(..., packing=True)
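The bin-packing idea behind FFD can be sketched in a few lines (this operates on raw lengths for illustration; the actual TRL implementation works on tokenized examples):

```python
# Sketch of First Fit Decreasing (FFD) packing: sort sequences by length,
# longest first, then place each one into the first bin with enough room.
def ffd_pack(lengths, max_len):
    bins = []  # each bin holds sequence lengths whose sum is <= max_len
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= max_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

packed = ffd_pack([6, 2, 5, 4, 3], max_len=8)  # [[6, 2], [5, 3], [4]]
```

Because whole sequences are assigned to bins, nothing has to be split across a boundary, which is exactly how FFD avoids the truncation the wrapped method suffered from.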

by @qgallouedec in https://github.com/huggingface/trl/pull/3521 and accelerated by @mariosasko in https://github.com/huggingface/trl/pull/3537

[Liger] liger DPO support

The DPOTrainer now supports the Liger-powered DPO loss, enabling faster training with lower memory usage.

training_args = DPOConfig(..., use_liger_loss=True)

by @kashif in https://github.com/huggingface/trl/pull/2568

💬 Fix setup_chat_format and add clone_chat_template

We introduce clone_chat_template, a more convenient and flexible function for setting up chat templates from any tokenizer that already includes one. It handles EOS tokens and copies all added tokens from the source tokenizer, preserving their "special" status. You can either use this function directly:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import clone_chat_template

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-4B")

or use the chat_template_path parameter in SFTConfig to specify a chat template, which will be automatically cloned when the SFTTrainer is initialized.

from trl import SFTConfig

training_args = SFTConfig(chat_template_path="Qwen/Qwen3-4B")

by @qgallouedec in https://github.com/huggingface/trl/pull/3404 and https://github.com/huggingface/trl/pull/3599

📚 SFTTrainer support chat template kwargs

SFTTrainer now supports passing additional keyword arguments to the chat template. This allows for more flexibility in customizing the chat format during training. To enable it, just add a chat_template_kwargs column to your dataset.

example = {'messages': [{'content': 'What is better than ugly?', 'role': 'user'},
                        {'content': 'Beautiful.', 'role': 'assistant'}],
           'chat_template_kwargs': {'my_template_arg': 'my_value'}}

by @qgallouedec in https://github.com/huggingface/trl/pull/3609

🤵‍♂️ SFT on assistant messages only

The SFTTrainer now supports training on assistant messages only:

example = {'messages': [
    {'role': 'user', 'content': 'What is better than ugly?'},          # masked in the loss
    {'role': 'assistant', 'content': 'Beautiful.'},                    # used in the loss
    {'role': 'user', 'content': 'And what is better than implicit?'},  # masked in the loss
    {'role': 'assistant', 'content': 'Explicit.'},                     # used in the loss
]}

by @qgallouedec in https://github.com/huggingface/trl/pull/3586

🧬 Add generation_kwargs as a property of GRPOConfig to support additional generation arguments

The GRPOConfig now includes a generation_kwargs property, allowing users to specify additional generation arguments for the GRPOTrainer. This allows for further customization of the generation behavior, such as setting suppress_tokens, num_beams, etc. Depending on the generation backend used (transformers or vLLM), this property will be passed either to transformers.GenerationConfig (if using transformers) or vllm.SamplingParams (if using vLLM).

from trl import GRPOConfig

training_args = GRPOConfig(..., generation_kwargs={"length_penalty": -0.1})

by @pramodith in https://github.com/huggingface/trl/pull/3617

New defaults

Minor changes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.18.0...v0.19.0

Jun 15, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.18.1...v0.18.2

Jun 3, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.18.0...v0.18.1

May 28, 2025

Major or breaking

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.17.0...v0.18.0

Apr 25, 2025

Major and breaking

The TRL v0.17 release introduces three major changes that, together, enable significantly faster generation performance in GRPO—up to 10x faster in some configurations.

These three changes are:

  • Data parallelism (DP) for the vLLM server
  • A new GRPO training strategy that generates once per effective batch
  • Support for the V1 engine in vLLM

Below, we provide a summary of these changes and how to use them.

⚡ Up to 4x faster: Data Parallel for vLLM server

The TRL vLLM server now supports data parallelism (DP), enabling significantly faster generation speeds—especially for smaller models. This new feature can be used by adding the --data_parallel_size N argument when launching the vLLM server.

trl vllm-serve --model Qwen/Qwen2.5-14B-Instruct --tensor_parallel_size 2 --data_parallel_size 2

by @qgallouedec in https://github.com/huggingface/trl/pull/3310

☝️ [GRPO] Generate once per effective batch

Previously, GRPO made one generation request per global batch. The global batch is the total of all local batches, without accounting for gradient accumulation. In other words, if the gradient accumulation step was 8, GRPO would make 8 generation requests per training step.

Now, GRPO groups these global batches into a single "effective batch" and makes only one generation request per effective batch. Since vLLM applies optimizations that are especially effective for large batches, this new approach leads to significantly faster training overall.

No changes are required in the training script, as this is handled internally by the GRPO trainer.
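To make the bookkeeping concrete, here is a small arithmetic sketch with hypothetical batch sizes (not values from the PR):

```python
# Sketch: generation requests per optimization cycle (illustrative numbers).
per_device_batch = 8
num_devices = 4
grad_accum_steps = 8

global_batch = per_device_batch * num_devices      # 32 prompts per request
requests_before = grad_accum_steps                 # one request per global batch

effective_batch = global_batch * grad_accum_steps  # 256 prompts in one request
requests_after = 1                                 # one request per effective batch
```

The total number of generated prompts is unchanged; they are simply batched into a single, much larger vLLM request, where vLLM's batching optimizations pay off most.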

by @qgallouedec in https://github.com/huggingface/trl/pull/3283

⏱️ Fix vLLM server to support V1 Engine

vLLM provides two versions of its engine (V0 and V1), and V1 is significantly faster. This version is now supported by TRL and requires vLLM version 0.8.3 or higher.

by @I-l-l-I in https://github.com/huggingface/trl/pull/3276

👎 [GRPO] Adds option to disable dropout

Disabling dropout has been shown to stabilize training. You can now disable dropout in GRPO by setting disable_dropout=True in the GRPO config.

from trl import GRPOConfig

training_args = GRPOConfig(..., disable_dropout=True)

by @edbeeching in https://github.com/huggingface/trl/pull/3234

🩺 Dr. GRPO loss

GRPO now supports the various losses proposed in the recent literature, including the Dr. GRPO loss. The loss type can be set in the GRPO config:

from trl import GRPOConfig

training_args = GRPOConfig(..., loss_type="dr_grpo")

by @qgallouedec in https://github.com/huggingface/trl/pull/3256

🎲 [GRPO] Make training dataset shuffle optional

The GRPO trainer now has an option to disable shuffling of the training dataset. This is useful for curriculum learning, where the order of the training data is important.

from trl import GRPOConfig

training_args = GRPOConfig(..., shuffle_dataset=False)

by @LeonEricsson in https://github.com/huggingface/trl/pull/3334

☕ Overlong-filtering for GRPO

Overlong filtering has been shown to significantly stabilize learning and improve performance. You can now use it in TRL!

It simply consists of masking the loss of truncated samples:

from trl import GRPOConfig

training_args = GRPOConfig(..., mask_truncated_completions=True)
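A toy sketch of the idea (illustrative values; for simplicity it treats any completion that fills the full token budget as truncated):

```python
# Sketch: mask out the loss of completions that hit the max completion length.
max_completion_len = 4
completions = [[3, 9, 2], [7, 7, 7, 7], [1, 4]]  # token ids, illustrative

# A completion that uses the whole budget was likely cut off mid-thought.
keep = [len(c) < max_completion_len for c in completions]

per_seq_loss = [0.8, 1.2, 0.4]
masked_loss = [l if k else 0.0 for l, k in zip(per_seq_loss, keep)]
```

The truncated completion contributes nothing to the gradient, so the policy is never pushed toward outputs it was not allowed to finish.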

by @shirinyamani in https://github.com/huggingface/trl/pull/3248

🐯 Integrate Liger GRPO Loss to GRPO Trainer

Liger significantly reduces the peak memory of the loss computation. You can now use it in TRL with the use_liger_loss argument in the GRPO config:

from trl import GRPOConfig

training_args = GRPOConfig(..., use_liger_loss=True)

by @shivam15s in https://github.com/huggingface/trl/pull/3184

Bug fixes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.16.0...v0.17.0

Apr 4, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.16.0...v0.16.1

Mar 22, 2025

Major and breaking

🚀 Scaling GRPO to 70B+ Models and Multi-Node Training with vLLM Server & NCCL Communication

Previously, vLLM could only be used by dedicating a single GPU, preventing both the scalability benefits of vLLM and multi-node training. This limitation has now been removed!

GRPO can now scale efficiently with models exceeding 70B parameters, supporting multi-node training with super-fast performance.

To take advantage of this, simply launch a vLLM server using the following command:

trl vllm-serve --model <model_name> --tensor_parallel_size <tp_size>

Then, start GRPO training with use_vllm=True.

Below is a comparison of GRPO throughput with and without vLLM, across different TP values and model sizes.

by @binary-husky and @qgallouedec in https://github.com/huggingface/trl/pull/3094

🐦‍🔥 6x faster GRPO with multi-step optimization

This release introduces the multi-step trick, which allows for the reuse of generated data across multiple steps, speeding up the training process.

To support this, we've implemented importance sampling and clipping logic. This enhancement should lead to significant improvements in training speed.

<img width="1097" alt="Screenshot 2025-03-23 at 14 52 28" src="https://github.com/user-attachments/assets/8f1ee339-63c5-43cf-9b0f-5395432513ae" />

To use it, simply set num_iterations to a value greater than 1.

training_args = GRPOConfig(..., num_iterations=4)

by @qgallouedec in https://github.com/huggingface/trl/pull/2899

🌍 Use global normalization in GRPO

As demonstrated in Dr GRPO, sequence-level normalization can introduce a response level length bias.

To address this, we have now switched to normalizing the loss by the total number of tokens in the batch, ensuring more consistent and unbiased training.

- loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
+ loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()

by @edbeeching in https://github.com/huggingface/trl/pull/2881

⚖️ Add option not to scale rewards

As demonstrated in Dr GRPO, scaling rewards can introduce a question-level difficulty bias. To address this, we have now added an option to disable reward scaling in GRPO.

training_args = GRPOConfig(..., scale_rewards=False)
  advantages = rewards - mean_grouped_rewards
- advantages = advantages / std_grouped_rewards
+ if self.args.scale_rewards:
+     advantages = advantages / std_grouped_rewards

It's likely that we'll make scale_rewards=False the default behavior in the future.

by @qgallouedec in https://github.com/huggingface/trl/pull/3135

🤸‍♀️ Domain-specific rewards in GRPO

When optimizing across multiple domains, not all reward functions are relevant for every sample. For example, a math verifier's reward does not apply to grammar samples, and a grammar verifier's reward does not apply to math samples.

It is now possible to return None for rewards that do not make sense for a given sample. For instance, when the domain is specified in a column like domain, you can implement it as follows:

def math_reward(completions, domain, **kwargs):
    rewards = []
    for completion, dom in zip(completions, domain):
        if dom == "math":
            rewards.append(verify(completion))
        else:
            rewards.append(None)
    return rewards

This allows for more domain-specific reward handling, ensuring that irrelevant rewards are ignored and don’t interfere with optimization.

by @shirinyamani in https://github.com/huggingface/trl/pull/3079

🍃 Do not load reference model when beta == 0.0

It has been observed that not minimizing the KL divergence between the trained model and the reference model can still yield good results, while significantly reducing memory usage and compute. This is because there is no need to store the reference model in memory or perform a forward pass for it.

When beta is set to 0.0, the reference model is not loaded, and the KL divergence is not computed, leading to savings in both time and memory.

training_args = GRPOConfig(..., beta=0.0)

by @ingambe in https://github.com/huggingface/trl/pull/2806

🕊️ Padding-free for SFT

Padding-free batching is an alternative approach to packing for reducing memory usage. In this method, a batch is first sampled and then flattened into a single sequence, avoiding padding. Unlike packing, which can result in incomplete sequences by combining parts of different samples, padding-free batching ensures that all sequences remain complete and intact.

To enable padding-free batching in SFT, simply set padding_free=True in the SFTConfig, and make sure to use flash_attention_2 as the attention implementation.

training_args = SFTConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"})
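The flattening step can be sketched as follows (hypothetical token ids; real collators also flatten labels and attention metadata):

```python
# Sketch: flatten a batch into one sequence with position ids, instead of padding.
batch = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # token ids, illustrative

input_ids = [t for seq in batch for t in seq]
# Positions restart at 0 for every sequence so the attention kernel (e.g.
# FlashAttention's variable-length path) keeps the sequences independent
# without any pad tokens.
position_ids = [p for seq in batch for p in range(len(seq))]
```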

by @qgallouedec in https://github.com/huggingface/trl/pull/3076

🎬 Clip Higher for Better Exploration

As outlined in the DAPO paper, increasing the upper bound epsilon leads to higher entropy during generation, promoting better exploration. To enable this, we’ve added support for adjusting the upper bound epsilon directly in the default GRPO trainer.

training_args = GRPOConfig(epsilon_high=0.28)
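The effect on the PPO-style ratio clipping can be sketched like this (illustrative epsilon values, plain Python):

```python
# Sketch: asymmetric PPO-style clipping with a larger upper bound (clip higher).
def clipped_ratio(ratio, eps_low=0.2, eps_high=0.28):
    # A higher upper bound leaves more room for up-weighting low-probability
    # tokens, raising entropy during generation and encouraging exploration.
    return min(max(ratio, 1 - eps_low), 1 + eps_high)

high = clipped_ratio(1.5)  # clipped at 1 + eps_high
low = clipped_ratio(0.5)   # clipped at 1 - eps_low
```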

by @shirinyamani in https://github.com/huggingface/trl/pull/3118

Bug fixes

Minor

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.0...v0.16.0

Feb 25, 2025

What changed

  • ♻️ Fix caching in SFT by @qgallouedec in #2945
  • 🐯 Fix LigerKernel for SFTTrainer by @lewtun in #2940
  • 📌 Pin liger-kernel and vLLM by @qgallouedec in #2952

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.1...v0.15.2

Feb 18, 2025

What's Changed

  • 💬 Add maybe_convert_to_chatml map for conversational datasets in SFT by @kashif in #2862
  • [SFT] fix check for AutoLigerKernelForCausalLM by @kashif in #2874
  • 🍟 [SFT] Handles the dataset if it has been preprocessed by @BenasdTW in #2863
  • 🧶 [GRPO][vLLM + LoRA] Move unmerge of PEFT model after weight loading by @XZ-X in #2873
  • 🪂 Don't gather logits in SFT to avoid hanging by @qgallouedec in #2890
  • Release: v0.15.1 by @qgallouedec

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.0...v0.15.1

Feb 13, 2025

Major and breaking changes

Coming soon

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.15.0

Jan 29, 2025

Major and breaking changes

👨‍👨‍👧‍👧 GRPO

by @qgallouedec in https://github.com/huggingface/trl/pull/2565

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.13.0...v0.14.0

Dec 16, 2024

Major and breaking changes

🐾 Process-supervised RM Trainer

We introduced a new trainer to train a Process-supervised Reward Model (PRM) in TRL. A PRM rewards the quality of intermediate steps, promoting structured reasoning over focusing solely on the final outcome. With this trainer, we introduce a new dataset type: stepwise supervision, a variant of the prompt-completion type in which the completion is divided into several intermediate steps, each associated with a label. Find out more in the stepwise-supervision section of the TRL documentation.

Here is an example of how to use the PRMTrainer to train a PRM on the Math Shepherd dataset:

# train_prm.py
from datasets import load_dataset
from trl import PRMConfig, PRMTrainer
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
train_dataset = load_dataset("trl-lib/math_shepherd", split="train[:10%]")

training_args = PRMConfig(output_dir="Qwen2-0.5B-Reward-Math-Sheperd", logging_steps=10)
trainer = PRMTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()

For more information, check out the PRMTrainer documentation.

by @qgallouedec and @gaetanlop in https://github.com/huggingface/trl/pull/2127 and https://github.com/huggingface/trl/pull/2148

🔀 Add MergeModelCallBack

Various works show that model merging can non-trivially improve performance, especially if the models belong to the same architecture. TRL now features a callback that merges the reference model with the current policy and optionally pushes the merged checkpoint to the Hub. This can be done at the end of each step/epoch and/or at the end of training. The callback uses Arcee's mergekit library: https://github.com/arcee-ai/mergekit

from trl import DPOTrainer, MergeModelCallback
from trl.mergekit_utils import MergeConfig

config = MergeConfig()
merge_callback = MergeModelCallback(config)
trainer = DPOTrainer(...,  callbacks=[merge_callback])

by @August-murr in https://github.com/huggingface/trl/pull/2282

🔨 Support for tools for data utils

TRL preprocessing utils now support tooling. A first step toward agent fine-tuning.

from trl import apply_chat_template

def get_current_temperature(location: str):
    """
    Gets the temperature at a given location.

    Args:
        location: The location to get the temperature for
    """
    return 22.0

example = apply_chat_template(example, tokenizer, tools=[get_current_temperature])

by @August-murr in https://github.com/huggingface/trl/pull/2455

🌋 Add support for LLaVA-Next in DPOTrainer

VLMs have their own specificities which require special treatment in the trainer. DPOTrainer now supports LLaVA-Next models natively.

model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
trainer = DPOTrainer(model=model, ...)

by @chenweize1998 in https://github.com/huggingface/trl/pull/2413

🕹️ CLI and TRLParser refactor

TRL CLI has been refactored to be more user-friendly and easy to extend. We plan to extend the support to all trainers soon.

(simplified output, for readability)

$ trl dpo --help
usage: trl dpo [-h] --dataset_name DATASET_NAME [--dataset_config DATASET_CONFIG] --output_dir OUTPUT_DIR [--loss_type {sigmoid,hinge,ipo}]

options:
  -h, --help            show this help message and exit
  --dataset_name DATASET_NAME, --dataset-name DATASET_NAME
  --dataset_config DATASET_CONFIG, --dataset-config DATASET_CONFIG
  --output_dir OUTPUT_DIR, --output-dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written. (default: None)
  --loss_type {sigmoid,hinge,ipo}, --loss-type {sigmoid,hinge,ipo}

by @qgallouedec in https://github.com/huggingface/trl/pull/2380 and https://github.com/huggingface/trl/pull/2412

🤝 Mixture of judges

TRL features a new judge AllTrueJudge that unifies the decision of multiple binary judges. This judge implements the Mixture of Judges as described in the CGPO paper.

import random

from trl import AllTrueJudge, BaseBinaryJudge

class RandomBinaryJudge(BaseBinaryJudge):
    """
    Random binary judge, for testing purposes.
    """

    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        return [random.choice([0, 1, -1]) for _ in range(len(prompts))]


prompts = ["The capital of France is", "The biggest planet in the solar system is"]
completions = [["Paris", "Marseille"], ["Saturn", "Jupiter"]]
judge = AllTrueJudge(judges=[RandomBinaryJudge(), RandomBinaryJudge()])
judgements = judge.judge(prompts=prompts, completions=completions)
print(judgements)  # e.g. [0, 1]

by @gaetanlop in https://github.com/huggingface/trl/pull/2159

❄️ DPO trainer supports num_logits_to_keep to save memory

Save memory by keeping only num_logits_to_keep logits in the DPO trainer.

training_args = DPOConfig(..., use_num_logits_to_keep=True)

by @xyangk in https://github.com/huggingface/trl/pull/2129

🗺️ Implementation DiscoPOP Loss

The DiscoPOP paper uses LLMs to discover more efficient offline preference optimization losses. In the paper the proposed DiscoPOP loss (which is a log-ratio modulated loss) outperformed other optimization losses on different tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0).

training_args = DPOConfig(..., loss_type="discopop", discopop_tau=0.05)

by @fanconic in https://github.com/huggingface/trl/pull/2323

🧑‍🍳 Add precompute batch size argument in DPOTrainer for reference model

We can now control the batch size for precomputing reference model logits.

training_args = DPOConfig(
...
    precompute_ref_log_probs=True,
    precompute_ref_batch_size=4,
)

by @SwayamInSync in https://github.com/huggingface/trl/pull/2426

📦 Support for packing tokenized datasets for SFT

SFTTrainer has supported packing datasets for faster training. Now, it supports packing tokenized datasets as well.

by @kmehant in https://github.com/huggingface/trl/pull/2011
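
A minimal sketch of what this enables, assuming a dataset that already carries an input_ids column; the model name and token ids below are illustrative:

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# A dataset whose rows are already tokenized: "input_ids" instead of raw text
train_dataset = Dataset.from_dict(
    {"input_ids": [[101, 2009, 2003], [101, 2307], [101, 2204, 2154, 102]]}
)

training_args = SFTConfig(output_dir="sft-packed", packing=True)
trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # illustrative base model
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```
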

📉 Add PEFT support for PPOTrainer

PPOTrainer now supports PEFT for efficient training.

PPOTrainer(
    ...,
    peft_config=peft_config,
)

by @ccs96307 in https://github.com/huggingface/trl/pull/2344
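
For instance, combined with a LoRA adapter from peft (the ranks and hyperparameters below are hypothetical; adapt them to your model):

```python
from peft import LoraConfig

# Hypothetical LoRA configuration for a causal LM policy
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

trainer = PPOTrainer(
    ...,  # model, args, datasets, etc. as before
    peft_config=peft_config,
)
```
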

💾 Deprecate config in favor of args in PPOTrainer

config has been deprecated in favor of args in PPOTrainer.

  PPOTrainer(
-   config=training_args,
+   args=training_args,
  )

by @qgallouedec in https://github.com/huggingface/trl/pull/2384

👮 Deprecate policy in favor of model in PPOTrainer

policy has been deprecated in favor of model in PPOTrainer.

  PPOTrainer(
-   policy=model,
+   model=model,
  )

by @qgallouedec in https://github.com/huggingface/trl/pull/2386

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.12.0...v0.13.0

Dec 6, 2024

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.12.1...v0.12.2

Nov 15, 2024

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.12.0...v0.12.1

Nov 4, 2024

Major and breaking changes

General reward model support for Online DPO

Online DPO initially only supported a reward model that had the same tokenizer and chat template as the trained model. Now, you can use any reward model.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name, num_labels=1)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name, truncation=True, truncation_side="left")

dataset = load_dataset(dataset_name)

training_args = OnlineDPOConfig(output_dir="...")
trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    reward_processing_class=reward_tokenizer,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/2276

Migration PPOv2 -> PPO

The PPOv2Trainer has been renamed to PPOTrainer, and the old PPOTrainer implementation has been removed. The PPOv2Trainer name is now deprecated and will be removed in the next release.

- trainer = PPOv2Trainer(...)
+ trainer = PPOTrainer(...)

by @qgallouedec in https://github.com/huggingface/trl/pull/2174

Refactor ScriptArguments

We had ScriptArguments, SFTScriptArguments, DPOScriptArguments and RewardScriptArguments. Since they all share mostly the same fields, we've merged them into a single ScriptArguments class. SFTScriptArguments, DPOScriptArguments and RewardScriptArguments still exist but are deprecated and will be removed in the next release.

- script_args = DPOScriptArguments(...)
+ script_args = ScriptArguments(...)

by @qgallouedec in https://github.com/huggingface/trl/pull/2145

Soft judges for PairRM

The PairRMJudge's judge method now accepts a return_scores flag that makes it return the probability score of the first completion in each pair (instead of the rank of the preferred completion). The logits used to compute the probability score can be scaled by an optional temperature parameter.

from trl import PairRMJudge
pairrm_judge = PairRMJudge()
prompts = ["Translate 'hello' to French", "What's the capital of Japan?"]
completions = [["Bonjour", "Salut"], ["Kyoto", "Tokyo"]]
results = pairrm_judge.judge(prompts, completions, return_scores=True)
print(results)  # [0.7492601275444031, 0.0005497377132996917]

by @kashif in https://github.com/huggingface/trl/pull/2221

Use pairwise judges for online methods

The OnlineDPOTrainer and any trainers that inherit from it (NashMDTrainer and XPOTrainer) can now accept an initialized PairwiseJudge instead of a reward model.

from datasets import load_dataset
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO", logging_steps=10)
trainer = OnlineDPOTrainer(
    model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()

by @kashif in https://github.com/huggingface/trl/pull/2243

Rename trainer arg tokenizer to processing_class

The tokenizer argument in the trainers has been renamed to processing_class to better reflect the fact that it can be not only a tokenizer but also a processor.

- trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, tokenizer=tokenizer)
+ trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)

tokenizer is still supported for SFTTrainer and DPOTrainer but deprecated and will be removed in the next release.

by @qgallouedec in https://github.com/huggingface/trl/pull/2162

Adding weighted preference optimization (WPO) to DPO

The WPO paper adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. To use this method, set the use_weighting flag to True in the DPOConfig.

DPOConfig(..., use_weighting=True)
<img width="1112" alt="Screenshot 2024-11-04 at 10 59 38" src="https://github.com/user-attachments/assets/544ddc02-bd09-4f21-b8a4-b81c21561a9b">
<img width="539" alt="Screenshot 2024-11-04 at 10 59 22" src="https://github.com/user-attachments/assets/8d5afe9e-89bd-4d00-8483-dd7ba98997e7">

by @gaetanlop in https://github.com/huggingface/trl/pull/2141
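
As we read the method, each pair's loss is scaled by the current policy's (length-normalized) likelihood of producing both completions, so off-policy pairs contribute less. A toy pure-Python sketch; the helper name is ours, and TRL's exact implementation may differ in normalization and clamping:

```python
import math

def wpo_weight(chosen_token_logps, rejected_token_logps):
    """Sketch of WPO-style reweighting for one preference pair.

    Averages per-token log-probs of each completion under the current
    policy, then returns a weight in (0, 1]: close to 1 for pairs the
    policy finds likely (on-policy-like), small for off-policy pairs.
    """
    chosen_avg = sum(chosen_token_logps) / len(chosen_token_logps)
    rejected_avg = sum(rejected_token_logps) / len(rejected_token_logps)
    return min(1.0, math.exp(chosen_avg + rejected_avg))
```
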

🃏 Model card for TRL

Using trainer.push_to_hub() now automatically creates a model card that includes:

  • A link to the base model used
  • A link to the dataset used for training
  • A link to the TRL repository
  • Sample demo code
  • A link to the associated Weights & Biases run
  • A link to the paper detailing the training procedure
  • Versions of dependencies
  • BibTeX citations for both the training procedure and TRL

All links are properly formatted to allow cross-referencing, enabling traceability back to sources (e.g., the model appears linked on the paper’s page).

https://github.com/user-attachments/assets/b903964e-9087-45cc-8fb0-2418fdd87b72

by @qgallouedec in https://github.com/huggingface/trl/pull/2123

Minor

Conversational dataset support

You can now use conversational datasets directly, without needing to apply a chat template beforehand, for the following trainers:

  • BCOTrainer (by @qgallouedec in PR #2107)
  • CPOTrainer (by @qgallouedec in PR #2144)
  • DPOTrainer (by @qgallouedec in PR #2131)
  • KTOTrainer (by @qgallouedec in PR #2248)
  • ORPOTrainer (by @qgallouedec in PR #2184)

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset(dataset_name, split="train")

# Not needed anymore:
#
# def process(example):
#     prompt = tokenizer.apply_chat_template(example["prompt"], tokenize=False, add_generation_prompt=True)
#     prompt_chosen = tokenizer.apply_chat_template(example["prompt"] + example["chosen"], tokenize=False)
#     chosen = prompt_chosen[len(prompt) :]
#     prompt_rejected = tokenizer.apply_chat_template(example["prompt"] + example["rejected"], tokenize=False)
#     rejected = prompt_rejected[len(prompt) :]
#     return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
#
# dataset = dataset.map(process)

training_args = DPOConfig(output_dir="...")
trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()

Refactor DPO data processing

For more information, see PR #2209.

trl env for printing system info

You can now use trl env to print system information, including the platform, Python version, PyTorch version, CUDA device(s), and versions of various libraries.

$ trl env

Copy-paste the following information when reporting an issue:

- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0
- CUDA device(s): NVIDIA H100 80GB HBM3
- Transformers version: 4.47.0.dev0
- Accelerate version: 0.19.0
- Accelerate config: not found
- Datasets version: 3.0.2
- HF Hub version: 0.26.1
- TRL version: 0.12.0+14ef1ab
- bitsandbytes version: 0.44.1
- DeepSpeed version: 0.15.3
- Diffusers version: 0.30.3
- Liger-Kernel version: 0.3.0
- LLM-Blender version: 0.0.2
- OpenAI version: 1.46.0
- PEFT version: 0.13.2

by @qgallouedec in https://github.com/huggingface/trl/pull/2104

Sequence-Level KD

From GKD paper:

Sequence-Level KD (Kim & Rush, 2016). SeqKD maximizes the likelihood of high probability sequences generated by the teacher, and can be viewed as supervised FT on teacher-generated outputs.

SeqKD is taken as a baseline in the paper. It is now possible to use Sequence-Level KD in the GKDTrainer by setting seq_kd=True in the GKDConfig.

training_args = GKDConfig(..., seq_kd=True)

by @mst272 in https://github.com/huggingface/trl/pull/2220

Default dataset_text_field to "text"

Since many users use "text" as the column name for textual data in datasets, we've made it the default (previously a required argument) in SFTConfig. Now, specifying dataset_text_field="text" is no longer necessary.

  SFTConfig(
      ...,
-     dataset_text_field="text",
  )

by @qgallouedec in https://github.com/huggingface/trl/pull/2078

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.0...v0.12.0

Oct 15, 2024

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.3...v0.11.4
