SFTTrainer now natively supports Vision-Language Models (VLMs). This includes support for both language modeling and prompt-completion data, as well as completion-only training.
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset
trainer = SFTTrainer(
model="Qwen/Qwen2.5-VL-3B-Instruct",
args=SFTConfig(max_length=None),
train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()
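Completion-only training is controlled by the completion_only_loss flag in SFTConfig. A minimal sketch (flag name per recent TRL versions, where it defaults to True for prompt-completion datasets):

```python
from trl import SFTConfig

# Minimal sketch: only the completion tokens contribute to the loss
# (completion_only_loss is an SFTConfig flag; defaults may vary by TRL version)
training_args = SFTConfig(completion_only_loss=True)
```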
by @qgallouedec in https://github.com/huggingface/trl/pull/3862, https://github.com/huggingface/trl/pull/3907 and https://github.com/huggingface/trl/pull/3908
RLOOTrainer has been refactored to align with the design principles of the other trainers in the library. You can now use this trainer exactly like GRPO.
from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer
dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
"""Reward function that rewards completions with more unique letters."""
completion_contents = [completion[0]["content"] for completion in completions]
return [float(len(set(content))) for content in completion_contents]
trainer = RLOOTrainer(
model="Qwen/Qwen2-0.5B-Instruct",
reward_funcs=reward_num_unique_letters,
train_dataset=dataset,
)
trainer.train()
by @shirinyamani in https://github.com/huggingface/trl/pull/3801
You can now leverage Hugging Face Jobs to easily train and deploy your models with TRL.
hf jobs uv run --flavor a100-large --secrets HF_TOKEN "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara
A guide is available in the docs.
by @sergiopaniego in https://github.com/huggingface/trl/pull/3890
GRPOTrainer now supports the DAPO loss type, which aggregates token-level losses by normalizing with the number of active tokens in the global accumulated batch. This method was introduced to eliminate length bias. Simply use:
from trl import GRPOConfig, GRPOTrainer
training_args = GRPOConfig(
loss_type="dapo",
...
)
by @qgallouedec in https://github.com/huggingface/trl/pull/3938
The authors of Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO) find that the combination of batch-level reward scaling and token-level loss aggregation can unlock the learning capability of critic-free policies using the vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
TRL lets you apply these findings when training a GRPO model:
from trl import GRPOConfig
training_args = GRPOConfig(
scale_rewards="batch",
loss_type="dapo",
...
)
by @pramodith in https://github.com/huggingface/trl/pull/3935
Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can now use the BEMACallback:
from trl import BEMACallback, SFTTrainer
trainer = SFTTrainer(
...
callbacks=[BEMACallback()],
)
by @kashif in https://github.com/huggingface/trl/pull/3855
- gradient_checkpointing=True by @qgallouedec in https://github.com/huggingface/trl/pull/3510
- position_ids for flash_attention_3 by @jue-jue-zi in https://github.com/huggingface/trl/pull/3942
- setup_chat_format by @qgallouedec in https://github.com/huggingface/trl/pull/3929
- IterativeSFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/3905
- AutoModelForVision2Seq to AutoModelForImageTextToText by @qgallouedec in https://github.com/huggingface/trl/pull/3836
- --bf16 value in scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/3869
- vllm_mode param in GRPO by @sergiopaniego in https://github.com/huggingface/trl/pull/3866
- unittest.TestCase with TrlTestCase that handles tmp dir by @qgallouedec in https://github.com/huggingface/trl/pull/3863
- SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/3862
- use_cache should be set in the forward pass by @qgallouedec in https://github.com/huggingface/trl/pull/3891
- logger.warning instead of warnings.warn by @qgallouedec in https://github.com/huggingface/trl/pull/3923
- SFTTrainer in GRPOTrainer by @MengAiDev in https://github.com/huggingface/trl/pull/3919
- trackio to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/3965

Full Changelog: https://github.com/huggingface/trl/compare/v0.21.0...v0.22.0
OpenAI GPT OSS models are here! Check out the OpenAI Cookbook for an example of how to SFT these models.
by @qgallouedec in https://github.com/huggingface/trl/pull/3848
You can now pass vllm_model_impl to the TRL vLLM server.
For example, with the transformers backend:
trl vllm-serve ... --vllm_model_impl transformers
by @merveenoyan in https://github.com/huggingface/trl/pull/3773
- using_llama_models.md by @bwook00 in https://github.com/huggingface/trl/pull/3794

Full Changelog: https://github.com/huggingface/trl/compare/v0.20.0...v0.21.0
GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token.
<img width="930" height="538" alt="Screenshot 2025-07-28 at 10 54 15 PM" src="https://github.com/user-attachments/assets/923835af-dc61-4fd4-8a99-44242d02bb7b" />
📜 Paper: https://huggingface.co/papers/2507.18071
To reproduce the paper's setting, use this configuration:
from trl import GRPOConfig
training_args = GRPOConfig(
importance_sampling_level="sequence",
loss_type="grpo",
steps_per_generation=...,
beta=0.04, # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
epsilon=3e-4, # https://x.com/ChujieZheng/status/1948933507696525392
)
by @qgallouedec in https://github.com/huggingface/trl/pull/3775
The GRPOTrainer can now be used for VLM training. Give it a try with this dummy example:
from trl import GRPOTrainer
from datasets import load_dataset
# Dummy vision-language dataset
dataset = load_dataset("trl-internal-testing/zen-image", "conversational_prompt_only", split="train")
# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
return [len(set(c[0]["content"])) for c in completions]
trainer = GRPOTrainer(
model="Qwen/Qwen2.5-VL-3B-Instruct",
reward_funcs=[reward_num_unique_chars],
train_dataset=dataset,
)
trainer.train()
by @CompN3rd and @kashif in https://github.com/huggingface/trl/pull/3072 and https://github.com/huggingface/trl/pull/3760
The DPO trainer supports combining multiple loss functions with different weights, enabling more sophisticated optimization strategies. This is particularly useful for implementing algorithms like MPO (Mixed Preference Optimization). MPO is a training approach that combines multiple optimization objectives, as described in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization.
To combine multiple losses, specify the loss types and corresponding weights as lists:
from trl import DPOConfig
# MPO: Combines DPO (sigmoid) for preference and BCO (bco_pair) for quality
training_args = DPOConfig(
loss_type=["sigmoid", "bco_pair", "sft"], # Loss types to combine
loss_weights=[0.8, 0.2, 1.0] # Corresponding weights, as used in the MPO paper
)
by @qgallouedec in https://github.com/huggingface/trl/pull/2544
Continuous Batching allows for faster generation using the transformers backend. You can now use it with the GRPOTrainer by setting use_transformers_paged=True in the config.
from trl import GRPOConfig
training_args = GRPOConfig(
# ... other args
use_transformers_paged=True,
)
by @ArthurZucker in https://github.com/huggingface/trl/pull/3471
In Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, it is shown that using only the top 20% highest-entropy tokens leads to performance similar to using all tokens. You can now enable this feature in the GRPOTrainer by setting top_entropy_quantile in the config.
from trl import GRPOConfig
training_args = GRPOConfig(
# ... other args
top_entropy_quantile=0.2, # Use only the top 20% of tokens based on entropy
)
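To illustrate the idea in plain Python (a simplified sketch with made-up entropy values, not TRL's internal implementation): keep only the top-quantile highest-entropy tokens in the loss.

```python
# Made-up per-token entropies for a 5-token completion (illustrative only)
entropies = [0.2, 1.5, 0.9, 1.8, 0.1]

# Keep only the top 20% highest-entropy tokens in the loss
k = max(1, int(0.2 * len(entropies)))
threshold = sorted(entropies, reverse=True)[k - 1]
loss_mask = [e >= threshold for e in entropies]
print(loss_mask)  # [False, False, False, True, False]
```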
by @pramodith in https://github.com/huggingface/trl/pull/3563
GRPO now supports FSDP2 training. Just run your script with an FSDP2 config:
accelerate launch --config_file examples/accelerate_configs/fsdp2.yaml run_grpo.py
by @SalmanMohammadi in https://github.com/huggingface/trl/pull/3687
- position_ids computation for FFD packing by @mariosasko in https://github.com/huggingface/trl/pull/3649
- dpo_trainer.py by @bvantuan in https://github.com/huggingface/trl/pull/3631
- seq_lengths to signature columns by @LeonEricsson in https://github.com/huggingface/trl/pull/3699
- processor.tokenizer by @Tavish9 in https://github.com/huggingface/trl/pull/3720
- --bf16 flag from training scripts by @qgallouedec in https://github.com/huggingface/trl/pull/3724
- wandb.run.url instead of wandb.run.get_url() (deprecated) by @qgallouedec in https://github.com/huggingface/trl/pull/3726
- processing_class docs for trainers by @sergiopaniego in https://github.com/huggingface/trl/pull/3737
- processing_class docs for rest of trainers by @sergiopaniego in https://github.com/huggingface/trl/pull/3745
- average_tokens_across_devices by @qgallouedec in https://github.com/huggingface/trl/pull/3746
- steps_per_generation in vllm max_num_seqs by @akakakakakaa in https://github.com/huggingface/trl/pull/3747
- AlignPropTrainer and DDPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/3755
- GeometricMixtureWrapper by @qgallouedec in https://github.com/huggingface/trl/pull/3779
- clone_chat_template vocab size and support PEFT instruction tuning by @qgallouedec in https://github.com/huggingface/trl/pull/3763
- pixel_attention_mask (SmolVLM2) and image_sizes (LLaVa-Next) by @kashif in https://github.com/huggingface/trl/pull/3760
- max_length value and include visualization tool by @qgallouedec in https://github.com/huggingface/trl/pull/3630

Full Changelog: https://github.com/huggingface/trl/compare/v0.19.0...v0.20.0
Full Changelog: https://github.com/huggingface/trl/compare/v0.19.0...v0.19.1
SFTTrainer now supports training with tools! You just have to add a tools column to your dataset containing a list of tool definitions as JSON schemas. The tools are automatically registered and can be used during training.
from datasets import Dataset
from transformers.utils import get_json_schema
from trl import SFTTrainer
# Fictitious functions to simulate tool calls
def start_timer(duration: int) -> int:
"""
Starts a timer for the specified duration in seconds.
Args:
duration: Duration in seconds to set the timer for.
Returns:
The duration set for the timer.
"""
return duration
def create_reminder(time: str, note: str) -> str:
"""
Creates a reminder for the specified time and note.
Args:
time: The time for the reminder.
note: The note for the reminder.
Returns:
A confirmation message indicating that the reminder has been set.
"""
return "I'll remind you to call mom at 7 PM."
# Define the JSON schemas for the tools
start_timer = get_json_schema(start_timer)
create_reminder = get_json_schema(create_reminder)
dataset = Dataset.from_dict({
"messages": [
[
{"role": "user", "content": "Set a timer for 10 minutes."},
{"role": "assistant", "tool_calls": [{"type": "function", "function": {"name": "start_timer", "arguments": {"duration": 600}}}]},
{"role": "tool", "name": "start_timer", "content": "600"},
{"role": "assistant", "content": "Timer set for 10 minutes."},
],
...,
],
"tools": [
[start_timer, create_reminder],
...,
]
})
# Initialize the trainer
trainer = SFTTrainer(model="Qwen/Qwen3-0.6B", train_dataset=dataset)
# Train the model
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/3597
We introduce a new packing method: FFD (First Fit Decreasing) packing. This method reduces the size of the training dataset by grouping examples more efficiently. Previously, we used a wrapped packing method, which often truncated sequences even when they were shorter than the maximum sequence length. The new FFD packing method avoids unnecessary truncation by grouping sequences more intelligently, and is now the default when packing is enabled.
training_args = SFTConfig(..., packing=True)
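To illustrate the idea (a simplified sketch of first-fit-decreasing bin packing, not TRL's actual implementation): sort sequences by length, then place each one into the first group that still has room.

```python
def ffd_pack(lengths, max_len):
    """Pack sequence lengths into groups of capacity max_len, first-fit-decreasing."""
    bins = []  # each bin is a list of sequence lengths
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= max_len:  # first bin with enough room
                b.append(length)
                break
        else:
            bins.append([length])  # no bin fits: open a new one
    return bins

print(ffd_pack([7, 2, 5, 4, 3], max_len=8))  # [[7], [5, 3], [4, 2]]
```

No sequence is split or truncated; each group simply holds whole sequences whose lengths fit within the maximum.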
by @qgallouedec in https://github.com/huggingface/trl/pull/3521 and accelerated by @mariosasko in https://github.com/huggingface/trl/pull/3537
The DPOTrainer now supports the Liger-powered DPO loss, enabling faster training with lower memory usage.
training_args = DPOConfig(..., use_liger_loss=True)
by @kashif in https://github.com/huggingface/trl/pull/2568
setup_chat_format is deprecated in favor of the new clone_chat_template, a more convenient and flexible function for setting up chat templates from any tokenizer that already includes one. It handles EOS tokens and copies all added tokens from the source tokenizer, preserving their "special" status.
You can either use this function directly:
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import clone_chat_template
model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")
model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-4B")
or use the chat_template_path parameter in SFTConfig to specify a chat template, which will be automatically cloned when the SFTTrainer is initialized.
from trl import SFTConfig
training_args = SFTConfig(chat_template_path="Qwen/Qwen3-4B")
by @qgallouedec in https://github.com/huggingface/trl/pull/3404 and https://github.com/huggingface/trl/pull/3599
SFTTrainer now supports passing additional keyword arguments to the chat template. This allows for more flexibility in customizing the chat format during training. To enable it, just add a chat_template_kwargs column to your dataset.
example = {
    'messages': [
        {'content': 'What is better than ugly?', 'role': 'user'},
        {'content': 'Beautiful.', 'role': 'assistant'},
    ],
    'chat_template_kwargs': {'my_template_arg': 'my_value'},
}
by @qgallouedec in https://github.com/huggingface/trl/pull/3609
The SFTTrainer now supports training on assistant messages only:
example = {'messages': [
{'role': 'user', 'content': 'What is better than ugly?'}, # masked in the loss
{'role': 'assistant', 'content': 'Beautiful.'}, # used in the loss
{'role': 'user', 'content': 'And what is better than implicit?'}, # masked in the loss
{'role': 'assistant', 'content': 'Explicit.'}, # used in the loss
]}
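A minimal sketch of enabling this mode, assuming the assistant_only_loss flag in SFTConfig (check the flag name against your TRL version):

```python
from trl import SFTConfig

# Minimal sketch: mask user turns, train only on assistant turns
# (assistant_only_loss is assumed to be the relevant SFTConfig flag)
training_args = SFTConfig(assistant_only_loss=True)
```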
by @qgallouedec in https://github.com/huggingface/trl/pull/3586
The GRPOConfig now includes a generation_kwargs property, allowing users to specify additional generation arguments for the GRPOTrainer. This allows for further customization of the generation behavior, such as setting suppress_tokens, num_beams, etc.
Depending on the generation backend used (transformers or vLLM), this property will be passed either to transformers.GenerationConfig (if using transformers) or vllm.SamplingParams (if using vLLM).
from trl import GRPOConfig
training_args = GRPOConfig(..., generation_kwargs={"length_penalty": -0.1})
by @pramodith in https://github.com/huggingface/trl/pull/3617
- beta=0.0 for GRPO by @qgallouedec in https://github.com/huggingface/trl/pull/3516
- logging_steps=10 by @qgallouedec in https://github.com/huggingface/trl/pull/3514
- bf16=True by @qgallouedec in https://github.com/huggingface/trl/pull/3515
- IterableDataset in DPO Trainer by @h-tonywu in https://github.com/huggingface/trl/pull/3559
- labels are retained in self._signature_columns by @sxndqc in https://github.com/huggingface/trl/pull/3589
- vllm_gpu_memory_utilization recommendation script by @toslali-ibm in https://github.com/huggingface/trl/pull/3554
- setup.cfg by @qgallouedec in https://github.com/huggingface/trl/pull/3511
- getattr to get gradient_checkpointing by @qgallouedec in https://github.com/huggingface/trl/pull/3535
- _VALID_DICT_FIELDS by @qgallouedec in https://github.com/huggingface/trl/pull/3553
- torch.autocast and make it cover XPU by @yao-matrix in https://github.com/huggingface/trl/pull/3541
- setup_chat_format and add clone_chat_template by @qgallouedec in https://github.com/huggingface/trl/pull/3404
- logging_steps parameter for simpler setup by @qgallouedec in https://github.com/huggingface/trl/pull/3612
- Trainer::create_model_card by @LeonEricsson in https://github.com/huggingface/trl/pull/3613
- enforce_eager default value in vLLM server by @LeonEricsson in https://github.com/huggingface/trl/pull/3607
- max_prompt_length) with vLLM by @LeonEricsson in https://github.com/huggingface/trl/pull/3601
- generation_kwargs as a property of GRPOConfig to support additional generation arguments by @pramodith in https://github.com/huggingface/trl/pull/3617
- chat_template_path parameter to SFTConfig by @qgallouedec in https://github.com/huggingface/trl/pull/3599

Full Changelog: https://github.com/huggingface/trl/compare/v0.18.0...v0.19.0
Full Changelog: https://github.com/huggingface/trl/compare/v0.18.1...v0.18.2
- setup.cfg by @qgallouedec in https://github.com/huggingface/trl/pull/3511

Full Changelog: https://github.com/huggingface/trl/compare/v0.18.0...v0.18.1
- GRPOTrainer and SFTTrainer by @I-l-l-I in https://github.com/huggingface/trl/pull/3337
- TextEnvironment and tools by @lewtun in https://github.com/huggingface/trl/pull/3389
- formatting_func is used with completion_only_loss by @LeonEricsson in https://github.com/huggingface/trl/pull/3385
- setup.py to setup.cfg and make rich an optional dep by @qgallouedec in https://github.com/huggingface/trl/pull/3403
- trl env on xpu by @yao-matrix in https://github.com/huggingface/trl/pull/3438
- base_url parameter for vLLM client initialization by @re-imagined in https://github.com/huggingface/trl/pull/3324
- keep_end leading to zero'd out samples by @LeonEricsson in https://github.com/huggingface/trl/pull/3398

Full Changelog: https://github.com/huggingface/trl/compare/v0.17.0...v0.18.0
The TRL v0.17 release introduces three major changes that, together, enable significantly faster generation performance in GRPO—up to 10x faster in some configurations.
These three changes are:
Below, we provide a summary of these changes and how to use them.
The TRL vLLM server now supports data parallelism (DP), enabling significantly faster generation speeds—especially for smaller models. This new feature can be used by adding the --data_parallel_size N argument when launching the vLLM server.
trl vllm-serve --model Qwen/Qwen2.5-14B-Instruct --tensor_parallel_size 2 --data_parallel_size 2
by @qgallouedec in https://github.com/huggingface/trl/pull/3310
Previously, GRPO made one generation request per global batch (the total of all local batches, not accounting for gradient accumulation). In other words, if gradient accumulation was set to 8, GRPO made 8 generation requests per training step.
Now, GRPO groups these global batches into a single "effective batch" and makes only one generation request per effective batch. Since vLLM applies optimizations that are especially effective for large batches, this new approach leads to significantly faster training overall.
No changes are required in the training script, as this is handled internally by the GRPO trainer.
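The effect can be illustrated with a small arithmetic sketch (hypothetical numbers, not a TRL API):

```python
# Hypothetical batch configuration
per_device_batch_size = 4
num_devices = 8
gradient_accumulation_steps = 8

# One global batch = all local batches, ignoring gradient accumulation
global_batch = per_device_batch_size * num_devices            # 32 sequences
# The effective batch groups gradient-accumulation steps together
effective_batch = global_batch * gradient_accumulation_steps  # 256 sequences

requests_before = gradient_accumulation_steps  # one request per global batch
requests_after = 1                             # one request per effective batch
print(requests_before, requests_after)  # 8 1
```

Because vLLM handles one request of 256 sequences much more efficiently than 8 requests of 32, the larger batched request is faster overall.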
by @qgallouedec in https://github.com/huggingface/trl/pull/3283
vLLM provides two versions of its engine (V0 and V1), and V1 is significantly faster. This version is now supported by TRL and requires vLLM version 0.8.3 or higher.
by @I-l-l-I in https://github.com/huggingface/trl/pull/3276
Disabling dropout has been shown to stabilize training. You can now disable dropout in GRPO by setting disable_dropout=True in the GRPO config.
from trl import GRPOConfig
training_args = GRPOConfig(..., disable_dropout=True)
by @edbeeching in https://github.com/huggingface/trl/pull/3234
GRPO now supports the various losses proposed in the recent literature, including the Dr. GRPO loss. The loss type can be set in the GRPO config:
from trl import GRPOConfig
training_args = GRPOConfig(..., loss_type="dr_grpo")
by @qgallouedec in https://github.com/huggingface/trl/pull/3256
The GRPO trainer now has an option to disable shuffling of the training dataset. This is useful for curriculum learning, where the order of the training data is important.
from trl import GRPOConfig
training_args = GRPOConfig(..., shuffle_dataset=False)
by @LeonEricsson in https://github.com/huggingface/trl/pull/3334
Overlong filtering has been shown to significantly stabilize learning and improve performance. You can now use it in TRL!
It simply consists of masking the loss of truncated samples:
from trl import GRPOConfig
training_args = GRPOConfig(..., mask_truncated_completions=True)
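A toy sketch of the idea (illustrative token lists, not TRL's internals): a completion that does not end with the EOS token was cut off at the generation limit, so its loss weight is zeroed out.

```python
EOS = 2  # hypothetical end-of-sequence token id

# Three generated completions as token-id lists
completions = [[5, 7, EOS], [9, 4, 1], [3, EOS]]

# Zero the loss weight of completions that were cut off before emitting EOS
loss_weights = [1.0 if seq[-1] == EOS else 0.0 for seq in completions]
print(loss_weights)  # [1.0, 0.0, 1.0]
```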
by @shirinyamani in https://github.com/huggingface/trl/pull/3248
Liger significantly reduces the peak memory of the loss computation. You can now use it in TRL with the use_liger_loss argument in the GRPO config:
from trl import GRPOConfig
training_args = GRPOConfig(..., use_liger_loss=True)
by @shivam15s in https://github.com/huggingface/trl/pull/3184
- clip_ratio logging and better document logged values by @qgallouedec in https://github.com/huggingface/trl/pull/3145
- worker_cls as string by @qgallouedec in https://github.com/huggingface/trl/pull/3159
- ConstantLengthDataset by @qgallouedec in https://github.com/huggingface/trl/pull/3242
- formatting_func by @YeFD in https://github.com/huggingface/trl/pull/3147
- is_liger_kernel_available with min version by @qgallouedec in https://github.com/huggingface/trl/pull/3266
- test_raise_error_not_causallm by @qgallouedec in https://github.com/huggingface/trl/pull/3265
- _generate_and_score_completions by @syt-nju in https://github.com/huggingface/trl/pull/3336
- max_prompt_length < max_length by @LeonEricsson in https://github.com/huggingface/trl/pull/3341

Full Changelog: https://github.com/huggingface/trl/compare/v0.16.0...v0.17.0
- learning_rate argument to _maybe_log_save_evaluate by @qgallouedec in https://github.com/huggingface/trl/pull/3206

Full Changelog: https://github.com/huggingface/trl/compare/v0.16.0...v0.16.1
Previously, vLLM could only be used by dedicating a single GPU, preventing both the scalability benefits of vLLM and multi-node training. This limitation has now been removed!
GRPO can now scale efficiently with models exceeding 70B parameters, supporting multi-node training with super-fast performance.
To take advantage of this, simply launch a vLLM server using the following command:
trl vllm-serve --model <model_name> --tensor_parallel_size <tp_size>
Then, start GRPO training with use_vllm=True.
Below is a comparison of GRPO throughput with and without vLLM, across different TP values and model sizes.
by @binary-husky and @qgallouedec in https://github.com/huggingface/trl/pull/3094
This release introduces the multi-step trick, which allows for the reuse of generated data across multiple steps, speeding up the training process.
To support this, we've implemented importance sampling and clipping logic. This enhancement should lead to significant improvements in training speed.
<img width="1097" alt="Screenshot 2025-03-23 at 14 52 28" src="https://github.com/user-attachments/assets/8f1ee339-63c5-43cf-9b0f-5395432513ae" />
To use it, simply set num_iterations to a value greater than 1.
training_args = GRPOConfig(..., num_iterations=4)
by @qgallouedec in https://github.com/huggingface/trl/pull/2899
As demonstrated in Dr GRPO, sequence-level normalization can introduce a response-level length bias.
To address this, we have now switched to normalizing the loss by the total number of tokens in the batch, ensuring more consistent and unbiased training.
- loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
+ loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()
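A toy numeric example of the bias (illustrative numbers only): with two completions of lengths 2 and 1, per-sequence normalization weights each token of the short completion more heavily than global token normalization does.

```python
# Per-token losses for two completions of different lengths
per_token_loss = [[1.0, 1.0], [2.0]]

# Old: normalize each sequence by its own length, then average over sequences
seq_norm = sum(sum(seq) / len(seq) for seq in per_token_loss) / len(per_token_loss)

# New: normalize by the total number of tokens in the batch
global_norm = sum(sum(seq) for seq in per_token_loss) / sum(len(seq) for seq in per_token_loss)

print(seq_norm, global_norm)  # the two normalizations disagree: 1.5 vs 4/3
```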
by @edbeeching in https://github.com/huggingface/trl/pull/2881
As demonstrated in Dr GRPO, scaling rewards can introduce a question-level difficulty bias. To address this, we have now added an option to disable reward scaling in GRPO.
training_args = GRPOConfig(..., scale_rewards=False)
advantages = rewards - mean_grouped_rewards
- advantages = advantages / std_grouped_rewards
+ if self.args.scale_rewards:
+ advantages = advantages / std_grouped_rewards
It's likely that scale_rewards=False will become the default behavior in the future.
by @qgallouedec in https://github.com/huggingface/trl/pull/3135
When optimizing across multiple domains, not all reward functions are relevant for every sample. For example, a math verifier's reward does not apply to grammar samples, and a grammar verifier's reward does not apply to math samples.
It is now possible to return None for rewards that do not make sense for a given sample. For instance, when the domain is specified in a column like domain, you can implement it as follows:
def math_reward(completions, domain, **kwargs):
rewards = []
for completion, dom in zip(completions, domain):
if dom == "math":
rewards.append(verify(completion))
else:
rewards.append(None)
return rewards
This allows for more domain-specific reward handling, ensuring that irrelevant rewards are ignored and don’t interfere with optimization.
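One simple way to aggregate such rewards downstream (an illustration only; TRL's actual aggregation may differ) is to average just the non-None entries per sample:

```python
# Per-sample rewards from two reward functions; None = not applicable
rewards_per_sample = [
    [1.0, None],   # math sample: only the math reward applies
    [None, 0.5],   # grammar sample: only the grammar reward applies
]

# Average only the rewards that apply to each sample
combined = [
    sum(r for r in rs if r is not None) / sum(r is not None for r in rs)
    for rs in rewards_per_sample
]
print(combined)  # [1.0, 0.5]
```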
by @shirinyamani in https://github.com/huggingface/trl/pull/3079
It has been observed that not minimizing the KL divergence between the trained model and the reference model (i.e., setting beta == 0.0) can still yield good results, while significantly reducing memory usage and compute, since there is no need to store the reference model in memory or perform a forward pass for it.
When beta is set to 0.0, the reference model is not loaded, and the KL divergence is not computed, leading to savings in both time and memory.
training_args = GRPOConfig(..., beta=0.0)
by @ingambe in https://github.com/huggingface/trl/pull/2806
Padding-free batching is an alternative approach to packing for reducing memory usage. In this method, a batch is first sampled and then flattened into a single sequence, avoiding padding. Unlike packing, which can result in incomplete sequences by combining parts of different samples, padding-free batching ensures that all sequences remain complete and intact.
To enable padding-free batching in SFT, simply set padding_free=True in the SFTConfig and make sure to use flash_attention_2 as the attention implementation.
training_args = SFTConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"})
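The difference from ordinary padded batching can be sketched with token-id lists (illustrative only; the per-sequence position ids are what let flash attention keep the flattened sequences separate):

```python
batch = [[1, 2, 3], [4, 5]]  # two sequences of token ids
max_len = max(len(seq) for seq in batch)

# Ordinary batching: pad every sequence to the longest one
padded = [seq + [0] * (max_len - len(seq)) for seq in batch]

# Padding-free: flatten the batch into one sequence, no pad tokens
flattened = [tok for seq in batch for tok in seq]
position_ids = [i for seq in batch for i in range(len(seq))]

print(padded)        # [[1, 2, 3], [4, 5, 0]]
print(flattened)     # [1, 2, 3, 4, 5]
print(position_ids)  # [0, 1, 2, 0, 1]
```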
by @qgallouedec in https://github.com/huggingface/trl/pull/3076
As outlined in the DAPO paper, increasing the upper bound epsilon leads to higher entropy during generation, promoting better exploration. To enable this, we’ve added support for adjusting the upper bound epsilon directly in the default GRPO trainer.
training_args = GRPOConfig(epsilon_high=0.28)
by @shirinyamani in https://github.com/huggingface/trl/pull/3118
- unwrap_model_for_generation for DeepSpeed Stage-3 compatibility by @kiddj in https://github.com/huggingface/trl/pull/2871
- inference_mode to no_grad when computing old_per_token_logps by @qgallouedec in https://github.com/huggingface/trl/pull/2987
- maybe_convert_to_chatml map for conversational datasets in SFT by @kashif in https://github.com/huggingface/trl/pull/2862
- max_seq_length to max_length by @qgallouedec in https://github.com/huggingface/trl/pull/2895 and https://github.com/huggingface/trl/pull/2947
- num_tokens and some logging fixes by @qgallouedec in https://github.com/huggingface/trl/pull/3006
- max_tool_response by @shenxiangzhuang in https://github.com/huggingface/trl/pull/2921
- enable_prefix_caching by @ji-huazhong in https://github.com/huggingface/trl/pull/2900
- num_iterations by @qgallouedec in https://github.com/huggingface/trl/pull/2966
- KTOConfig by @sileod in https://github.com/huggingface/trl/pull/2912
- apply_chat_template by @jamesbraza in https://github.com/huggingface/trl/pull/3005
- deepspeed>=0.16.4's rename by @jamesbraza in https://github.com/huggingface/trl/pull/2963
- GPROTrainer.generation_config by @jamesbraza in https://github.com/huggingface/trl/pull/3046
- SFTTrainer.compute_loss crash with accelerate by @jamesbraza in https://github.com/huggingface/trl/pull/3048
- SFTTrainer.compute_loss hang from #3048's PR comments by @jamesbraza in https://github.com/huggingface/trl/pull/3056

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.0...v0.16.0
Full Changelog: https://github.com/huggingface/trl/compare/v0.15.1...v0.15.2
- maybe_convert_to_chatml map for conversational datasets in SFT by @kashif in #2862

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.0...v0.15.1
Coming soon
- trl.templates in excluded packages by @qgallouedec in https://github.com/huggingface/trl/pull/2690
- num_logits_to_keep to logits_to_keep by @qgallouedec in https://github.com/huggingface/trl/pull/2721
- per_device_batch_size as generations per device by @qgallouedec in https://github.com/huggingface/trl/pull/2776
- set_seed() call in GRPO to ensure unique seed for each process by @qgallouedec in https://github.com/huggingface/trl/pull/2824
- tokenizer parameter to processing_class in tests by @qgallouedec in https://github.com/huggingface/trl/pull/2828

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.15.0
by @qgallouedec in https://github.com/huggingface/trl/pull/2565
- disable_dropout by @qgallouedec in https://github.com/huggingface/trl/pull/2511
- PreferenceCollator to DataCollatorForPreference by @qgallouedec in https://github.com/huggingface/trl/pull/2510
- formatting_func's documentation in ConstantLengthDataset by @SamuelLarkin in https://github.com/huggingface/trl/pull/2549
- max_prompt_length parameter in tests by @qgallouedec in https://github.com/huggingface/trl/pull/2588
- max_seq_length instead of max_length by @skandermoalla in https://github.com/huggingface/trl/pull/2590
- max_prompt_length and loop usage in logp computation by @qgallouedec in https://github.com/huggingface/trl/pull/2598
- truncation_mode in DPOTrainer by @anakin87 in https://github.com/huggingface/trl/pull/2551
- num_logits_to_keep to reduce memory usage in GRPO by @qgallouedec in https://github.com/huggingface/trl/pull/2683

Full Changelog: https://github.com/huggingface/trl/compare/v0.13.0...v0.14.0
We introduced a new trainer to train Process-supervised Reward Models (PRMs) in TRL. A PRM rewards the quality of intermediate steps, promoting structured reasoning rather than focusing solely on the final outcome. With this trainer, we introduce a new dataset type: stepwise supervision, a variant of the prompt-completion type in which the completion is divided into several intermediate steps, each associated with a label. Find out more in the stepwise-supervision section of the TRL documentation.
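A stepwise-supervision row might look like the following (a hypothetical example illustrating the format described above; field names follow the stepwise-supervision dataset type):

```python
# Hypothetical stepwise-supervision row: the completion is split into steps,
# and each step carries its own correctness label.
example = {
    "prompt": "Which number is larger, 9.8 or 9.11?",
    "completions": [
        "Compare the tenths digit: 8 vs 1.",
        "Since 8 > 1, 9.8 is larger than 9.11.",
    ],
    "labels": [True, True],  # one label per intermediate step
}

# The two lists must stay aligned step by step.
assert len(example["completions"]) == len(example["labels"])
```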
Here is an example of how to use the PRMTrainer to train a PRM on the Math Shepherd dataset:
# train_prm.py
from datasets import load_dataset
from trl import PRMConfig, PRMTrainer
from transformers import AutoModelForTokenClassification, AutoTokenizer
model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
train_dataset = load_dataset("trl-lib/math_shepherd", split="train[:10%]")
training_args = PRMConfig(output_dir="Qwen2-0.5B-Reward-Math-Sheperd", logging_steps=10)
trainer = PRMTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()
For more information, check out the PRMTrainer documentation.
by @qgallouedec and @gaetanlop in https://github.com/huggingface/trl/pull/2127 and https://github.com/huggingface/trl/pull/2148
MergeModelCallback

Various works show that model merging can non-trivially improve performance, especially when the models share an architecture. TRL now features a callback that merges the reference model with the current policy and optionally pushes the merged checkpoint to the Hub. Merging can happen at the end of each step or epoch, and/or at the end of training. The callback relies on Arcee's mergekit library: https://github.com/arcee-ai/mergekit
from trl import DPOTrainer, MergeModelCallback
from trl.mergekit_utils import MergeConfig
config = MergeConfig()
merge_callback = MergeModelCallback(config)
trainer = DPOTrainer(..., callbacks=[merge_callback])
by @August-murr in https://github.com/huggingface/trl/pull/2282
TRL's preprocessing utilities now support tools, a first step toward agent fine-tuning.
from transformers import AutoTokenizer
from trl import apply_chat_template

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")  # any tokenizer with a chat template works
def get_current_temperature(location: str):
"""
Gets the temperature at a given location.
Args:
location: The location to get the temperature for
"""
return 22.0
# `example` is a row of a conversational dataset, e.g.:
example = {"prompt": [{"role": "user", "content": "What's the temperature in Paris?"}]}
example = apply_chat_template(example, tokenizer, tools=[get_current_temperature])
by @August-murr in https://github.com/huggingface/trl/pull/2455
DPOTrainer

VLMs have their own specificities that require special treatment in the trainer. DPOTrainer now supports LLaVA-NeXT models natively.
from transformers import AutoModelForVision2Seq

model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
trainer = DPOTrainer(model=model, ...)
by @chenweize1998 in https://github.com/huggingface/trl/pull/2413
TRL CLI has been refactored to be more user-friendly and easy to extend. We plan to extend the support to all trainers soon.
(simplified output, for readability)
$ trl dpo --help
usage: trl dpo [-h] --dataset_name DATASET_NAME [--dataset_config DATASET_CONFIG] --output_dir OUTPUT_DIR [--loss_type {sigmoid,hinge,ipo}]
options:
-h, --help show this help message and exit
--dataset_name DATASET_NAME, --dataset-name DATASET_NAME
--dataset_config DATASET_CONFIG, --dataset-config DATASET_CONFIG
--output_dir OUTPUT_DIR, --output-dir OUTPUT_DIR
The output directory where the model predictions and checkpoints will be written. (default: None)
--loss_type {sigmoid,hinge,ipo}, --loss-type {sigmoid,hinge,ipo}
by @qgallouedec in https://github.com/huggingface/trl/pull/2380 and https://github.com/huggingface/trl/pull/2412
TRL features a new judge, AllTrueJudge, that unifies the decisions of multiple binary judges. It implements the Mixture of Judges described in the CGPO paper.
import random

from trl import AllTrueJudge, BaseBinaryJudge
class RandomBinaryJudge(BaseBinaryJudge):
"""
Random binary judge, for testing purposes.
"""
def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
return [random.choice([0, 1, -1]) for _ in range(len(prompts))]
prompts = ["The capital of France is", "The biggest planet in the solar system is"]
completions = [["Paris", "Marseille"], ["Saturn", "Jupiter"]]
judge = AllTrueJudge(judges=[RandomBinaryJudge(), RandomBinaryJudge()])
judgements = judge.judge(prompts=prompts, completions=completions)
print(judgements) # [0, 1]
by @gaetanlop in https://github.com/huggingface/trl/pull/2159
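The unification above can be sketched as an all-true reduction over the per-judge judgments (a plausible reading of "unifies the decision", not necessarily TRL's exact implementation; here -1 is assumed to mark an abstention that propagates):

```python
def all_true(judgments_per_judge):
    """Combine per-judge binary judgments: 1 only if every judge says 1."""
    results = []
    for per_example in zip(*judgments_per_judge):  # group judgments by example
        if -1 in per_example:  # a judge abstained or failed on this example
            results.append(-1)
        elif all(j == 1 for j in per_example):
            results.append(1)
        else:
            results.append(0)
    return results

# Two judges, two examples: both say 1 for the first, they disagree on the second.
print(all_true([[1, 0], [1, 1]]))  # [1, 0]
```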
num_logits_to_keep to save memory

Save memory in the DPO trainer by computing logits only for the tokens that are actually needed, via use_num_logits_to_keep.
training_args = DPOConfig(..., use_num_logits_to_keep=True)
by @xyangk in https://github.com/huggingface/trl/pull/2129
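To see why this saves memory, note that the logits tensor scales with sequence length times vocabulary size. A rough back-of-the-envelope calculation (illustrative numbers, not tied to any particular model):

```python
# Rough estimate of logits memory: batch × seq_len × vocab_size × bytes per float.
batch, seq_len, vocab_size, bytes_per_float = 4, 2048, 151_936, 4  # fp32

full_logits_gib = batch * seq_len * vocab_size * bytes_per_float / 2**30
# Keeping logits only for, say, the last 512 tokens of each sequence:
kept_logits_gib = batch * 512 * vocab_size * bytes_per_float / 2**30

print(f"full: {full_logits_gib:.1f} GiB, kept: {kept_logits_gib:.1f} GiB")
```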
The DiscoPOP paper uses LLMs to discover more efficient offline preference optimization losses. In the paper, the proposed DiscoPOP loss (a log-ratio modulated loss) outperformed other preference-optimization losses on several tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0).
training_args = DPOConfig(..., loss_type="discopop", discopop_tau=0.05)
by @fanconic in https://github.com/huggingface/trl/pull/2323
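For intuition, the log-ratio modulated loss blends a logistic and an exponential term, gated by a sigmoid of the log-ratio difference divided by discopop_tau. A minimal sketch of that idea in plain Python (a simplification for illustration, not TRL's implementation):

```python
import math

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def discopop_loss(logratio_diff: float, beta: float = 0.1, tau: float = 0.05) -> float:
    """Log-ratio modulated loss: a gated blend of logistic and exponential losses."""
    rho = beta * logratio_diff
    gate = sigmoid(logratio_diff / tau)   # modulation term
    logistic = -math.log(sigmoid(rho))    # DPO-style logistic loss
    exponential = math.exp(-rho)          # exponential loss
    return (1 - gate) * logistic + gate * exponential

# The loss decreases as the chosen/rejected margin grows.
assert discopop_loss(5.0) < discopop_loss(0.0)
```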
DPOTrainer reference model batch size

You can now control the batch size used when precomputing the reference model's log probabilities.
training_args = DPOConfig(
...
precompute_ref_log_probs=True,
precompute_ref_batch_size=4,
)
by @SwayamInSync in https://github.com/huggingface/trl/pull/2426
SFTTrainer has long supported packing datasets for faster training. It now supports packing pre-tokenized datasets as well.
by @kmehant in https://github.com/huggingface/trl/pull/2011
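Packing concatenates tokenized sequences and slices them into fixed-length blocks so no compute is wasted on padding. A minimal sketch of the idea (for illustration only, not TRL's implementation):

```python
def pack(sequences, block_size):
    """Concatenate token sequences and cut them into fixed-size blocks."""
    flat = [tok for seq in sequences for tok in seq]
    # Drop the trailing remainder that doesn't fill a whole block.
    return [flat[i : i + block_size] for i in range(0, len(flat) - block_size + 1, block_size)]

blocks = pack([[1, 2, 3], [4, 5], [6, 7, 8, 9]], block_size=4)
print(blocks)  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```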
PPOTrainer

PPOTrainer now supports PEFT for parameter-efficient training.
PPOTrainer(
...,
peft_config=peft_config,
)
by @ccs96307 in https://github.com/huggingface/trl/pull/2344
config deprecated in favor of args in PPOTrainer

The config argument has been deprecated in favor of args in PPOTrainer.
PPOTrainer(
- config=training_args,
+ args=training_args,
)
by @qgallouedec in https://github.com/huggingface/trl/pull/2384
policy deprecated in favor of model in PPOTrainer

The policy argument has been deprecated in favor of model in PPOTrainer.
PPOTrainer(
- policy=model,
+ model=model,
)
by @qgallouedec in https://github.com/huggingface/trl/pull/2386
- 0.13.0.dev0 by @qgallouedec in https://github.com/huggingface/trl/pull/2305
- token_id instead of token in DPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2324
- output_layer to the list of lm_head_namings in AutoModelForCausalLMWithValueHead by @qgallouedec in https://github.com/huggingface/trl/pull/2328
- tokenizer arg back and add deprecation guidelines by @qgallouedec in https://github.com/huggingface/trl/pull/2348
- tokenizer argument in BCO, GKD, Iterative SFT, Nash MD and XPO by @qgallouedec in https://github.com/huggingface/trl/pull/2349
- use_soft_judge option to WinRateCallback by @kashif in https://github.com/huggingface/trl/pull/2347
- GeometricMixtureWrapper.forward by @kashif in https://github.com/huggingface/trl/pull/2345
- data_collator in RLOOTrainer and PPOTrainer by @bartoszzuk in https://github.com/huggingface/trl/pull/2360
- PPOTrainer by @ccs96307 in https://github.com/huggingface/trl/pull/2344
- require_bitsandbytes by @qgallouedec in https://github.com/huggingface/trl/pull/2370
- start_time to _maybe_log_save_evaluate by @qgallouedec in https://github.com/huggingface/trl/pull/2373
- MergeModelCallBack by @August-murr in https://github.com/huggingface/trl/pull/2282
- start_time parameter by @qgallouedec in https://github.com/huggingface/trl/pull/2381
- config in favor of args in PPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2384
- policy in favor of model in PPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2386
- KTOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2394
- SmolVLM models via standalone script sft_vlm_smol_vlm.py by @sergiopaniego in https://github.com/huggingface/trl/pull/2409
- AutoModelForCausalLMWithValueHead by @qgallouedec in https://github.com/huggingface/trl/pull/2398
- DPOTrainer by @chenweize1998 in https://github.com/huggingface/trl/pull/2413
- DPOTrainer for reference model by @SwayamInSync in https://github.com/huggingface/trl/pull/2426
- TrlParser by @qgallouedec in https://github.com/huggingface/trl/pull/2412
- max_steps calculation in RLOOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2433
- datast_config to ScriptArguments by @qgallouedec in https://github.com/huggingface/trl/pull/2440
- ref_model in OnlineDPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2417
- model_args by @qgallouedec in https://github.com/huggingface/trl/pull/2442
- tests_latest.yml workflow file by @qgallouedec in https://github.com/huggingface/trl/pull/2457
- BitsAndBytesConfig import in doc by @August-murr in https://github.com/huggingface/trl/pull/2478

Full Changelog: https://github.com/huggingface/trl/compare/v0.12.0...v0.13.0
Full Changelog: https://github.com/huggingface/trl/compare/v0.12.1...v0.12.2
tokenizer arg back and add deprecation guidelines by @qgallouedec in https://github.com/huggingface/trl/pull/2348Full Changelog: https://github.com/huggingface/trl/compare/v0.12.0...v0.12.1
Online DPO initially supported only reward models that shared the trained model's tokenizer and chat template. Now, you can use any reward model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer
model_name = "Qwen/Qwen2-0.5B-Instruct"  # illustrative policy model
reward_model_name = "my-org/my-reward-model"  # hypothetical reward model with its own tokenizer

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name, num_labels=1)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name, truncation=True, truncation_side="left")
dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")  # illustrative prompt-only dataset
training_args = OnlineDPOConfig(output_dir="...")
trainer = OnlineDPOTrainer(
model=model,
reward_model=reward_model,
args=training_args,
train_dataset=dataset,
processing_class=tokenizer,
reward_processing_class=reward_tokenizer,
)
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/2276
PPOv2 -> PPO

PPOv2Trainer has been renamed to PPOTrainer. The legacy PPOTrainer has been removed, and the PPOv2Trainer name is now a deprecated alias that will be removed in the next release.
- trainer = PPOv2Trainer(...)
+ trainer = PPOTrainer(...)
by @qgallouedec in https://github.com/huggingface/trl/pull/2174
ScriptArguments

We had ScriptArguments, SFTScriptArguments, DPOScriptArguments, and RewardScriptArguments. Since they mostly share the same fields, we've merged them into a single ScriptArguments class.
SFTScriptArguments, DPOScriptArguments and RewardScriptArguments still exist but are deprecated and will be removed in the next release.
- script_args = DPOScriptArguments(...)
+ script_args = ScriptArguments(...)
by @qgallouedec in https://github.com/huggingface/trl/pull/2145
PairRMJudge's judge method now accepts a return_scores flag that returns the probability score of the first completion in each pair (instead of the rank of the preferred completion). The logits behind the probability score can be scaled by an optional temperature parameter.
from trl import PairRMJudge
pairrm_judge = PairRMJudge()
prompts = ["Translate 'hello' to French", "What's the capital of Japan?"]
completions = [["Bonjour", "Salut"], ["Kyoto", "Tokyo"]]
results = pairrm_judge.judge(prompts, completions, return_scores=True)
print(results) # [0.7492601275444031, 0.0005497377132996917]
by @kashif in https://github.com/huggingface/trl/pull/2221
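The temperature divides the judge's logits before the softmax, so higher values flatten the scores toward 0.5. A quick sketch of that effect (illustrative only, not PairRM's internals):

```python
import math

def pair_probability(logit_first: float, logit_second: float, temperature: float = 1.0) -> float:
    """Probability that the first completion wins, from temperature-scaled logits."""
    a = math.exp(logit_first / temperature)
    b = math.exp(logit_second / temperature)
    return a / (a + b)

sharp = pair_probability(2.0, 0.0, temperature=1.0)
flat = pair_probability(2.0, 0.0, temperature=10.0)
assert sharp > flat > 0.5  # higher temperature pulls the score toward 0.5
```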
The OnlineDPOTrainer and any trainers that inherit from it (NashMDTrainer and XPOTrainer) can now accept an initialized PairwiseJudge instead of a reward model.
from datasets import load_dataset
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO", logging_steps=10)
trainer = OnlineDPOTrainer(
model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
by @kashif in https://github.com/huggingface/trl/pull/2243
tokenizer renamed to processing_class

The tokenizer argument in the trainers has been renamed to processing_class, to reflect that it can be a processor as well as a tokenizer.
- trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, tokenizer=tokenizer)
+ trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
tokenizer is still supported for SFTTrainer and DPOTrainer but deprecated and will be removed in the next release.
by @qgallouedec in https://github.com/huggingface/trl/pull/2162
The WPO paper adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. To use this method, set the use_weighting flag to True in DPOConfig.
DPOConfig(..., use_weighting=True)
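Conceptually, WPO down-weights pairs that are unlikely under the current policy. A simplified sketch of such a reweighting (an illustrative assumption about the weighting scheme, not TRL's exact formula):

```python
import math

def wpo_weight(chosen_avg_logp: float, rejected_avg_logp: float) -> float:
    """Weight a preference pair by its likelihood under the current policy, capped at 1."""
    return min(1.0, math.exp(chosen_avg_logp + rejected_avg_logp))

# Pairs that the policy already finds likely keep more weight.
assert wpo_weight(-0.1, -0.2) > wpo_weight(-2.0, -3.0)
```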
<img width="1112" alt="Screenshot 2024-11-04 at 10 59 38" src="https://github.com/user-attachments/assets/544ddc02-bd09-4f21-b8a4-b81c21561a9b">
<img width="539" alt="Screenshot 2024-11-04 at 10 59 22" src="https://github.com/user-attachments/assets/8d5afe9e-89bd-4d00-8483-dd7ba98997e7">
by @gaetanlop in https://github.com/huggingface/trl/pull/2141
Using trainer.push_to_hub() now automatically creates a model card that includes:
All links are properly formatted to allow cross-referencing, enabling traceability back to sources (e.g., the model appears linked on the paper’s page).
https://github.com/user-attachments/assets/b903964e-9087-45cc-8fb0-2418fdd87b72
by @qgallouedec in https://github.com/huggingface/trl/pull/2123
You can now use conversational datasets directly, without needing to apply a chat template beforehand, for the following trainers:
- BCOTrainer (by @qgallouedec in PR #2107)
- CPOTrainer (by @qgallouedec in PR #2144)
- DPOTrainer (by @qgallouedec in PR #2131)
- KTOTrainer (by @qgallouedec in PR #2248)
- ORPOTrainer (by @qgallouedec in PR #2184)

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer
model_id = "Qwen/Qwen2-0.5B-Instruct"  # illustrative model
dataset_name = "trl-lib/ultrafeedback_binarized"  # illustrative conversational preference dataset

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset(dataset_name, split="train")
# Not needed anymore:
#
# def process(example):
# prompt = tokenizer.apply_chat_template(example["prompt"], tokenize=False, add_generation_prompt=True)
# prompt_chosen = tokenizer.apply_chat_template(example["prompt"] + example["chosen"], tokenize=False)
# chosen = prompt_chosen[len(prompt) :]
# prompt_rejected = tokenizer.apply_chat_template(example["prompt"] + example["rejected"], tokenize=False)
# rejected = prompt_rejected[len(prompt) :]
# return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
#
# dataset = dataset.map(process)
training_args = DPOConfig(output_dir="...")
trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
For more information, see PR #2209.
trl env for printing system info

You can now use trl env to print system information, including the platform, Python version, PyTorch version, CUDA device(s), and versions of various libraries.
$ trl env
Copy-paste the following information when reporting an issue:
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0
- CUDA device(s): NVIDIA H100 80GB HBM3
- Transformers version: 4.47.0.dev0
- Accelerate version: 0.19.0
- Accelerate config: not found
- Datasets version: 3.0.2
- HF Hub version: 0.26.1
- TRL version: 0.12.0+14ef1ab
- bitsandbytes version: 0.44.1
- DeepSpeed version: 0.15.3
- Diffusers version: 0.30.3
- Liger-Kernel version: 0.3.0
- LLM-Blender version: 0.0.2
- OpenAI version: 1.46.0
- PEFT version: 0.13.2
by @qgallouedec in https://github.com/huggingface/trl/pull/2104
From the GKD paper:
Sequence-Level KD (Kim & Rush, 2016). SeqKD maximizes the likelihood of high probability sequences generated by the teacher, and can be viewed as supervised FT on teacher-generated outputs.
SeqKD is taken as a baseline in the paper. It is now possible to use Sequence-Level KD in the GKDTrainer by setting seq_kd=True in the GKDConfig.
training_args = GKDConfig(..., seq_kd=True)
by @mst272 in https://github.com/huggingface/trl/pull/2220
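In other words, SeqKD amounts to sampling completions from the teacher and then running ordinary supervised fine-tuning of the student on them. A toy sketch of the data-construction step (hypothetical helper names, with a stand-in for the teacher's generate step):

```python
def build_seqkd_dataset(prompts, teacher_generate):
    """SeqKD data: the student is fine-tuned on teacher-generated completions."""
    return [{"prompt": p, "completion": teacher_generate(p)} for p in prompts]

# Toy stand-in for a teacher model's generation function.
toy_teacher = lambda p: p.upper()
dataset = build_seqkd_dataset(["hello", "world"], toy_teacher)
print(dataset[0])  # {'prompt': 'hello', 'completion': 'HELLO'}
```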
dataset_text_field defaults to "text"

Since many users use "text" as the column name for textual data in datasets, we've made it the default (previously a required argument) in SFTConfig. Specifying dataset_text_field="text" is no longer necessary.
SFTConfig(
...,
- dataset_text_field="text",
)
by @qgallouedec in https://github.com/huggingface/trl/pull/2078
- training_args by @qgallouedec in https://github.com/huggingface/trl/pull/2082
- trl env for printing system info by @qgallouedec in https://github.com/huggingface/trl/pull/2104
- BCOTrainer conversational dataset support by @qgallouedec in https://github.com/huggingface/trl/pull/2107
- max_length from RewardDataCollatorWithPadding by @qgallouedec in https://github.com/huggingface/trl/pull/2119
- training_step by @qgallouedec in https://github.com/huggingface/trl/pull/2117
- script_args by @qgallouedec in https://github.com/huggingface/trl/pull/2130
- WinRateCallback table by @lewtun in https://github.com/huggingface/trl/pull/2134
- dpo_visual.py example to dpo_vlm.py by @qgallouedec in https://github.com/huggingface/trl/pull/2139
- eval_strategy="steps" when no eval dataset by @qgallouedec in https://github.com/huggingface/trl/pull/2152
- DPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2131
- dataset_text_field to "text" by @qgallouedec in https://github.com/huggingface/trl/pull/2078
- CPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2144
- tokenizer to processing_class by @qgallouedec in https://github.com/huggingface/trl/pull/2162
- "unsloth" tag by @qgallouedec in https://github.com/huggingface/trl/pull/2173
- skip_prompt=True in TextIteratorStreamer by @qgallouedec in https://github.com/huggingface/trl/pull/2193
- decoder_input_ids in DPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2208
- "none" in GKD test by @qgallouedec in https://github.com/huggingface/trl/pull/2214
- trl env report all cuda devices by @qgallouedec in https://github.com/huggingface/trl/pull/2216
- ORPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2184
- PPOv2 -> PPO by @qgallouedec in https://github.com/huggingface/trl/pull/2174
- ScriptArguments by @qgallouedec in https://github.com/huggingface/trl/pull/2145
- ScriptArguments warning messages by @sergiopaniego in https://github.com/huggingface/trl/pull/2230
- remove_unused_columns by @qgallouedec in https://github.com/huggingface/trl/pull/2233
- get_batch_sample and add num_items_in_batch to compute_loss by @qgallouedec in https://github.com/huggingface/trl/pull/2246
- processing_class instead of tokenizer in LogCompletionsCallback by @qgallouedec in https://github.com/huggingface/trl/pull/2261
- KTOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2248
- max_new_tokens by @qgallouedec in https://github.com/huggingface/trl/pull/2272
- log_reports.py for Improved Logging, File Processing, and Slack Payload Handling by @Mefisto04 in https://github.com/huggingface/trl/pull/2249
- eval_dataset in to trainers when no eval strategy by @qgallouedec in https://github.com/huggingface/trl/pull/2270
- _save_checkpoint for online methods by @qgallouedec in https://github.com/huggingface/trl/pull/2288
- optimizer_cls_and_kwargs attribute to PPO and RLOO by @qgallouedec in https://github.com/huggingface/trl/pull/2302

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.0...v0.12.0
Full Changelog: https://github.com/huggingface/trl/compare/v0.11.3...v0.11.4