TRL

Releases: 10 · Avg: 3/mo · Versions: v0.27.0 → v1.2.0
Aug 29, 2025

Major

🔮 Native VLM support for SFTTrainer

SFTTrainer now natively supports Vision-Language Models (VLMs). This includes support for both language-modeling and prompt-completion data, as well as completion-only training.

<img width="1136" height="586" alt="Group 291-6" src="https://github.com/user-attachments/assets/2629b8e7-d853-4b7c-91d5-f4c128287e04" />
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None),
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/3862, https://github.com/huggingface/trl/pull/3907 and https://github.com/huggingface/trl/pull/3908

🔥 RLOOTrainer refactor

RLOOTrainer has been refactored to align with the design principles of the other trainers in the library. You can now use this trainer exactly like GRPOTrainer.

from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer

dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    train_dataset=dataset,
)
trainer.train()

by @shirinyamani in https://github.com/huggingface/trl/pull/3801

🧭 HF jobs x TRL guide

You can now leverage Hugging Face Jobs to easily train and deploy your models with TRL.

hf jobs uv run --flavor a100-large --secrets HF_TOKEN "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara

A guide is available in the docs.

by @sergiopaniego in https://github.com/huggingface/trl/pull/3890

🏌️ DAPO loss type

GRPOTrainer now supports the DAPO loss type, which aggregates token-level losses by normalizing with the number of active tokens in the global accumulated batch. This method was introduced to eliminate length bias. Simply use

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    loss_type="dapo",
    ...
)
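To make the aggregation concrete, here is a minimal plain-Python sketch (illustrative numbers, not TRL internals) contrasting per-sequence averaging with DAPO's single global normalization over active tokens:

```python
# Sketch of DAPO-style token-level loss aggregation (illustrative only).
# per_token_loss[i][j] is the loss of token j in sequence i; mask marks active tokens.
per_token_loss = [[0.5, 1.0, 1.5], [2.0, 0.0, 0.0]]
mask = [[1, 1, 1], [1, 0, 0]]  # second sequence has a single active token

# GRPO-style: average per sequence, then average across sequences.
per_seq = [
    sum(l * m for l, m in zip(losses, ms)) / sum(ms)
    for losses, ms in zip(per_token_loss, mask)
]
grpo_loss = sum(per_seq) / len(per_seq)  # (1.0 + 2.0) / 2 = 1.5

# DAPO-style: one global normalization by the number of active tokens.
total = sum(l * m for losses, ms in zip(per_token_loss, mask) for l, m in zip(losses, ms))
dapo_loss = total / sum(m for ms in mask for m in ms)  # 5.0 / 4 = 1.25
```

Note how the short sequence dominates the GRPO-style average but contributes proportionally to its token count under DAPO, which is what removes the length bias.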

by @qgallouedec in https://github.com/huggingface/trl/pull/3938

🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch

The authors of Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO) find that the combination of:

  1. scaling rewards by the standard deviation computed over the entire batch and
  2. aggregating loss over the total number of tokens

can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

TRL lets you apply these findings when training a GRPO model:

from trl import GRPOConfig

training_args = GRPOConfig(
    scale_rewards="batch",
    loss_type="dapo",
    ...
)
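For intuition, here is a small sketch (hypothetical reward values, plain Python) of the difference between the default group-level scaling and batch-level scaling:

```python
from statistics import mean, pstdev

# Sketch: rewards for two prompt groups of two completions each (illustrative values).
groups = [[1.0, 3.0], [10.0, 14.0]]

# Group-level scaling (GRPO default): divide each group's centered rewards
# by that group's own standard deviation.
group_adv = [[(r - mean(g)) / pstdev(g) for r in g] for g in groups]

# Batch-level scaling (scale_rewards="batch"): still center within the group,
# but divide by a single standard deviation computed over the whole batch.
flat = [r for g in groups for r in g]
batch_std = pstdev(flat)
batch_adv = [[(r - mean(g)) / batch_std for r in g] for g in groups]
```

With group-level scaling both groups produce identical advantages; batch-level scaling preserves the relative spread between easy and hard prompts.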

by @pramodith in https://github.com/huggingface/trl/pull/3935

🎢 [Callbacks] BEMA

Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can now use the BEMACallback:

from trl import BEMACallback, SFTTrainer

trainer = SFTTrainer(
    ...
    callbacks=[BEMACallback()],
)
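For intuition, here is a toy sketch of a bias-corrected EMA on a scalar, in the spirit of BEMA (this is not the exact update rule from the paper, just the bias-correction idea):

```python
# Sketch: bias-corrected exponential moving average of a scalar "weight".
def bema(values, beta=0.9):
    ema, corrected = 0.0, []
    for t, v in enumerate(values, start=1):
        ema = beta * ema + (1 - beta) * v
        corrected.append(ema / (1 - beta ** t))  # divide out the zero-init bias
    return corrected

# Early iterates are no longer dragged toward the zero initialization.
smoothed = bema([1.0, 1.0, 1.0])
```

Without the correction term, the first few averages would be close to 0 simply because the EMA starts at 0; the correction removes that bias from step one.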

by @kashif in https://github.com/huggingface/trl/pull/3855

Minor

Deprecations

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.21.0...v0.22.0

Aug 5, 2025

Major and breaking

🌺 OpenAI GPT OSS & Harmony support

<img width="4544" height="2344" alt="Group 293-2" src="https://github.com/user-attachments/assets/17241da2-1b1d-41bc-a5f8-0983ea46606f" />

OpenAI GPT OSS models are here! Check out the OpenAI Cookbook to see an example of how to SFT these models.

by @qgallouedec in https://github.com/huggingface/trl/pull/3848

Add vLLM transformers backend to online methods

You can now pass vllm_model_impl to the TRL vLLM server. For example, for the transformers backend:

trl vllm-serve ... --vllm_model_impl transformers

by @merveenoyan in https://github.com/huggingface/trl/pull/3773

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.20.0...v0.21.0

Jul 29, 2025

Breaking and major changes

🎞️ GSPO

GSPO is a GRPO variant that computes importance sampling weights at the sequence level instead of per-token.

<img width="930" height="538" alt="Screenshot 2025-07-28 at 10 54 15 PM" src="https://github.com/user-attachments/assets/923835af-dc61-4fd4-8a99-44242d02bb7b" />

📜 Paper: https://huggingface.co/papers/2507.18071

To reproduce the paper's setting, use this configuration:

from trl import GRPOConfig

training_args = GRPOConfig(
    importance_sampling_level="sequence",
    loss_type="grpo",
    steps_per_generation=...,
    beta=0.04,  # not explicitly specified in the paper, but they likely used the same value as in the GRPO paper
    epsilon=3e-4,  # https://x.com/ChujieZheng/status/1948933507696525392
)
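The core difference from GRPO can be sketched in a few lines (illustrative log-ratios, not TRL code): instead of one importance weight per token, GSPO uses a single weight per sequence, the geometric mean of the token ratios.

```python
from math import exp

# Sketch: sequence-level importance weight (GSPO) vs per-token weights (GRPO).
# log_ratios[j] = log pi_theta(token_j) - log pi_old(token_j), illustrative values.
log_ratios = [0.2, -0.1, 0.5]

token_weights = [exp(lr) for lr in log_ratios]       # one weight per token (GRPO)
seq_weight = exp(sum(log_ratios) / len(log_ratios))  # one weight per sequence (GSPO)
```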

by @qgallouedec in https://github.com/huggingface/trl/pull/3775

👁️ [GRPO] Add VLM training capabilities to the GRPO trainer

<img width="1136" height="594" alt="Group 291-4" src="https://github.com/user-attachments/assets/04850e80-9689-472d-acd7-fda331e66dc3" />

The GRPOTrainer can now be used for VLM training. Give a try with this dummy example:

from trl import GRPOTrainer
from datasets import load_dataset

# Dummy vision-language dataset
dataset = load_dataset("trl-internal-testing/zen-image", "conversational_prompt_only", split="train")

# Dummy reward function: count the number of unique characters in the completions
def reward_num_unique_chars(completions, **kwargs):
    return [len(set(c[0]["content"])) for c in completions]

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    reward_funcs=[reward_num_unique_chars],
    train_dataset=dataset,
)

trainer.train()

by @CompN3rd and @kashif in https://github.com/huggingface/trl/pull/3072 and https://github.com/huggingface/trl/pull/3760

🐙 MPO

<img width="440" height="438" alt="Screenshot 2025-07-28 at 10 52 15 PM" src="https://github.com/user-attachments/assets/e07a7936-c4c5-480d-9ffd-db5b77a5445e" />

The DPO trainer supports combining multiple loss functions with different weights, enabling more sophisticated optimization strategies. This is particularly useful for implementing algorithms like MPO (Mixed Preference Optimization). MPO is a training approach that combines multiple optimization objectives, as described in the paper Enhancing the Reasoning Ability of Multimodal Large Language Models via Mixed Preference Optimization.

To combine multiple losses, specify the loss types and corresponding weights as lists:

from trl import DPOConfig

# MPO: Combines DPO (sigmoid) for preference and BCO (bco_pair) for quality
training_args = DPOConfig(
    loss_type=["sigmoid", "bco_pair", "sft"],  # Loss types to combine
    loss_weights=[0.8, 0.2, 1.0]  # Corresponding weights, as used in the MPO paper
)

by @qgallouedec in https://github.com/huggingface/trl/pull/2544

Add support for CB with native transformers

Continuous Batching allows for faster generation using the transformers backend. You can now use it with the GRPOTrainer by setting use_transformers_paged=True in the config.

from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    use_transformers_paged=True,
)

by @ArthurZucker in https://github.com/huggingface/trl/pull/3471

Add entropy based filtering inside the GRPOTrainer

<img width="788" height="438" alt="Screenshot 2025-07-28 at 10 27 20 PM" src="https://github.com/user-attachments/assets/8073a5db-a98e-4534-aea9-d3dbd2e75f4a" />

In Beyond the 80/20 Rule: High-Entropy Minority Tokens Drive Effective Reinforcement Learning for LLM Reasoning, it is shown that utilizing only 20% of the highest-entropy tokens leads to similar performance as using all tokens. You can now enable this feature in the GRPOTrainer by setting top_entropy_quantile in the config.

from trl import GRPOConfig

training_args = GRPOConfig(
    # ... other args
    top_entropy_quantile=0.2,  # Use only the top 20% of tokens based on entropy
)
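A toy sketch of the selection step (hypothetical per-token entropies, plain Python): keep only the top-quantile highest-entropy tokens in the loss mask.

```python
# Sketch: keep only the top 20% highest-entropy tokens (illustrative values).
entropies = [0.1, 0.4, 0.2, 2.5, 0.3, 1.9, 0.15, 0.05, 3.1, 0.2]
quantile = 0.2

k = max(1, int(len(entropies) * quantile))          # number of tokens to keep
threshold = sorted(entropies, reverse=True)[k - 1]  # entropy of the k-th highest token
mask = [1 if e >= threshold else 0 for e in entropies]
```

Only the two highest-entropy tokens survive the mask; the loss is then computed on those tokens alone.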

by @pramodith in https://github.com/huggingface/trl/pull/3563

👐 FSDP2+GRPO

GRPO now supports FSDP2 training. Just run your script with an FSDP2 config:

accelerate launch --config_file examples/accelerate_configs/fsdp2.yaml run_grpo.py

by @SalmanMohammadi in https://github.com/huggingface/trl/pull/3687

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.19.0...v0.20.0

Jul 8, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.19.0...v0.19.1

Jun 21, 2025

Breaking and major changes

🧰 [SFT] Tool support

SFTTrainer now supports training with tools! You just have to add a tools column to your dataset, containing a list of tool definitions as JSON schemas. The tools will be automatically registered and can be used in the training process.

from datasets import Dataset
from transformers.utils import get_json_schema
from trl import SFTTrainer

# Fictitious functions to simulate tool calls
def start_timer(duration: int) -> int:
    """
    Starts a timer for the specified duration in seconds.

    Args:
        duration: Duration in seconds to set the timer for.

    Returns:
        The duration set for the timer.
    """
    return duration

def create_reminder(time: str, note: str) -> str:
    """
    Creates a reminder for the specified time and note.

    Args:
        time: The time for the reminder.
        note: The note for the reminder.

    Returns:
        A confirmation message indicating that the reminder has been set.
    """
    return "I'll remind you to call mom at 7 PM."

# Define the JSON schemas for the tools
start_timer = get_json_schema(start_timer)
create_reminder = get_json_schema(create_reminder)

dataset = Dataset.from_dict({
    "messages": [
        [
            {"role": "user", "content": "Set a timer for 10 minutes."},
            {"role": "assistant", "tool_calls": [{"type": "function", "function": {"name": "start_timer", "arguments": {"duration": 600}}}]},
            {"role": "tool", "name": "start_timer", "content": "600"},
            {"role": "assistant", "content": "Timer set for 10 minutes."},
        ],
        ...,
    ],
    "tools": [
        [start_timer, create_reminder],
        ...,
    ]
})

# Initialize the trainer
trainer = SFTTrainer(model="Qwen/Qwen3-0.6B", train_dataset=dataset)

# Train the model
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/3597

📉 FFD packing

We introduce a new packing method: FFD (First Fit Decreasing) packing. This method optimizes how sequences are packed, reducing the size of the training dataset by grouping examples more efficiently. Previously, we used a wrapped packing method, which often truncated sequences even when they were not longer than the maximum sequence length. The new FFD packing method avoids unnecessary truncation by grouping sequences more intelligently. This packing strategy is now the default when packing is enabled.

training_args = SFTConfig(..., packing=True)
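The bin-packing idea behind FFD can be sketched in a few lines (this operates on raw lengths for illustration; the actual TRL implementation works on tokenized examples):

```python
# Sketch of First Fit Decreasing (FFD) packing: sort sequences by length,
# longest first, then place each one into the first bin with enough room.
def ffd_pack(lengths, max_len):
    bins = []  # each bin holds sequence lengths whose sum is <= max_len
    for length in sorted(lengths, reverse=True):
        for b in bins:
            if sum(b) + length <= max_len:
                b.append(length)
                break
        else:
            bins.append([length])
    return bins

packed = ffd_pack([6, 2, 5, 4, 3], max_len=8)  # [[6, 2], [5, 3], [4]]
```

Because whole sequences are assigned to bins, nothing has to be split across a boundary, which is exactly how FFD avoids the truncation the wrapped method suffered from.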

by @qgallouedec in https://github.com/huggingface/trl/pull/3521 and accelerated by @mariosasko in https://github.com/huggingface/trl/pull/3537

[Liger] liger DPO support

The DPOTrainer now supports the Liger-powered DPO loss, enabling faster training with lower memory usage.

training_args = DPOConfig(..., use_liger_loss=True)

by @kashif in https://github.com/huggingface/trl/pull/2568

💬 Fix setup_chat_format and add clone_chat_template

We introduce clone_chat_template, a more convenient and flexible function for setting up chat templates from any tokenizer that already includes one. It handles EOS tokens and copies all added tokens from the source tokenizer, preserving their "special" status. You can either use this function directly:

from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import clone_chat_template

model = AutoModelForCausalLM.from_pretrained("facebook/opt-350m")
tokenizer = AutoTokenizer.from_pretrained("facebook/opt-350m")

model, tokenizer = clone_chat_template(model, tokenizer, "Qwen/Qwen3-4B")

or use the chat_template_path parameter in SFTConfig to specify a chat template, which will be automatically cloned when the SFTTrainer is initialized.

from trl import SFTConfig

training_args = SFTConfig(chat_template_path="Qwen/Qwen3-4B")

by @qgallouedec in https://github.com/huggingface/trl/pull/3404 and https://github.com/huggingface/trl/pull/3599

📚 SFTTrainer support chat template kwargs

SFTTrainer now supports passing additional keyword arguments to the chat template. This allows for more flexibility in customizing the chat format during training. To enable it, just add a chat_template_kwargs column to your dataset.

example = {'messages': [{'content': 'What is better than ugly?', 'role': 'user'},
                        {'content': 'Beautiful.', 'role': 'assistant'}],
           'chat_template_kwargs': {'my_template_arg': 'my_value'}}

by @qgallouedec in https://github.com/huggingface/trl/pull/3609

🤵‍♂️ SFT on assistant messages only

The SFTTrainer now supports training on assistant messages only:

example = {'messages': [
    {'role': 'user', 'content': 'What is better than ugly?'},          # masked in the loss
    {'role': 'assistant', 'content': 'Beautiful.'},                    # used in the loss
    {'role': 'user', 'content': 'And what is better than implicit?'},  # masked in the loss
    {'role': 'assistant', 'content': 'Explicit.'},                     # used in the loss
]}

by @qgallouedec in https://github.com/huggingface/trl/pull/3586

🧬 Add generation_kwargs as a property of GRPOConfig to support additional generation arguments

The GRPOConfig now includes a generation_kwargs property, allowing users to specify additional generation arguments for the GRPOTrainer. This allows for further customization of the generation behavior, such as setting suppress_tokens, num_beams, etc. Depending on the generation backend used (transformers or vLLM), this property will be passed either to transformers.GenerationConfig (if using transformers) or vllm.SamplingParams (if using vLLM).

from trl import GRPOConfig

training_args = GRPOConfig(..., generation_kwargs={"length_penalty": -0.1})

by @pramodith in https://github.com/huggingface/trl/pull/3617

New defaults

Minor changes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.18.0...v0.19.0

Jun 15, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.18.1...v0.18.2

Jun 3, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.18.0...v0.18.1

May 28, 2025

Major or breaking

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.17.0...v0.18.0

Apr 25, 2025

Major and breaking

The TRL v0.17 release introduces three major changes that, together, enable significantly faster generation performance in GRPO—up to 10x faster in some configurations.

These three changes are:

  • Data parallelism (DP) for the vLLM server
  • A new GRPO training strategy that generates once per effective batch
  • Support for the V1 engine in vLLM

Below, we provide a summary of these changes and how to use them.

⚡ Up to 4x faster: Data Parallel for vLLM server

The TRL vLLM server now supports data parallelism (DP), enabling significantly faster generation speeds—especially for smaller models. This new feature can be used by adding the --data_parallel_size N argument when launching the vLLM server.

trl vllm-serve --model Qwen/Qwen2.5-14B-Instruct --tensor_parallel_size 2 --data_parallel_size 2

by @qgallouedec in https://github.com/huggingface/trl/pull/3310

☝️ [GRPO] Generate once per effective batch

Previously, GRPO made one generation request per global batch. The global batch is the total of all local batches, without accounting for gradient accumulation. In other words, if the gradient accumulation step was 8, GRPO would make 8 generation requests per training step.

Now, GRPO groups these global batches into a single "effective batch" and makes only one generation request per effective batch. Since vLLM applies optimizations that are especially effective for large batches, this new approach leads to significantly faster training overall.

No changes are required in the training script, as this is handled internally by the GRPO trainer.
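To make the bookkeeping concrete, here is a small arithmetic sketch with hypothetical batch sizes (not values from the PR):

```python
# Sketch: generation requests per optimization cycle (illustrative numbers).
per_device_batch = 8
num_devices = 4
grad_accum_steps = 8

global_batch = per_device_batch * num_devices      # 32 prompts per request
requests_before = grad_accum_steps                 # one request per global batch

effective_batch = global_batch * grad_accum_steps  # 256 prompts in one request
requests_after = 1                                 # one request per effective batch
```

The total number of generated prompts is unchanged; they are simply batched into a single, much larger vLLM request, where vLLM's batching optimizations pay off most.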

by @qgallouedec in https://github.com/huggingface/trl/pull/3283

⏱️ Fix vLLM server to support V1 Engine

vLLM provides two versions of its engine (V0 and V1), and V1 is significantly faster. This version is now supported by TRL and requires vLLM version 0.8.3 or higher.

by @I-l-l-I in https://github.com/huggingface/trl/pull/3276

👎 [GRPO] Adds option to disable dropout

Disabling dropout has been shown to stabilize training. You can now disable dropout in GRPO by setting disable_dropout=True in the GRPO config.

from trl import GRPOConfig

training_args = GRPOConfig(..., disable_dropout=True)

by @edbeeching in https://github.com/huggingface/trl/pull/3234

🩺 Dr. GRPO loss

GRPO now supports the various losses proposed in the recent literature, including the Dr. GRPO loss. The loss type can be set in the GRPO config:

from trl import GRPOConfig

training_args = GRPOConfig(..., loss_type="dr_grpo")

by @qgallouedec in https://github.com/huggingface/trl/pull/3256

🎲 [GRPO] Make training dataset shuffle optional

The GRPO trainer now has an option to disable shuffling of the training dataset. This is useful for curriculum learning, where the order of the training data is important.

from trl import GRPOConfig

training_args = GRPOConfig(..., shuffle_dataset=False)

by @LeonEricsson in https://github.com/huggingface/trl/pull/3334

☕ Overlong-filtering for GRPO

Overlong filtering has been shown to significantly stabilize learning and improve performance. You can now use it in TRL!

It simply consists of masking the loss of truncated samples:

from trl import GRPOConfig

training_args = GRPOConfig(..., mask_truncated_completions=True)
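A toy sketch of the idea (illustrative values; for simplicity it treats any completion that fills the full token budget as truncated):

```python
# Sketch: mask out the loss of completions that hit the max completion length.
max_completion_len = 4
completions = [[3, 9, 2], [7, 7, 7, 7], [1, 4]]  # token ids, illustrative

# A completion that uses the whole budget was likely cut off mid-thought.
keep = [len(c) < max_completion_len for c in completions]

per_seq_loss = [0.8, 1.2, 0.4]
masked_loss = [l if k else 0.0 for l, k in zip(per_seq_loss, keep)]
```

The truncated completion contributes nothing to the gradient, so the policy is never pushed toward outputs it was not allowed to finish.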

by @shirinyamani in https://github.com/huggingface/trl/pull/3248

🐯 Integrate Liger GRPO Loss to GRPO Trainer

Liger significantly reduces the peak memory of the loss computation. You can now use it in TRL with the use_liger_loss argument in the GRPO config:

from trl import GRPOConfig

training_args = GRPOConfig(..., use_liger_loss=True)

by @shivam15s in https://github.com/huggingface/trl/pull/3184

Bug fixes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.16.0...v0.17.0

Apr 4, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.16.0...v0.16.1

Mar 22, 2025

Major and breaking

🚀 Scaling GRPO to 70B+ Models and Multi-Node Training with vLLM Server & NCCL Communication

Previously, vLLM could only be used by dedicating a single GPU, preventing both the scalability benefits of vLLM and multi-node training. This limitation has now been removed!

GRPO can now scale efficiently with models exceeding 70B parameters, supporting multi-node training with super-fast performance.

To take advantage of this, simply launch a vLLM server using the following command:

trl vllm-serve --model <model_name> --tensor_parallel_size <tp_size>

Then, start GRPO training with use_vllm=True.

Below is a comparison of GRPO throughput with and without vLLM, across different TP values and model sizes.

by @binary-husky and @qgallouedec in https://github.com/huggingface/trl/pull/3094

🐦‍🔥 6x faster GRPO with multi-step optimization

This release introduces the multi-step trick, which allows for the reuse of generated data across multiple steps, speeding up the training process.

To support this, we've implemented importance sampling and clipping logic. This enhancement should lead to significant improvements in training speed.

<img width="1097" alt="Screenshot 2025-03-23 at 14 52 28" src="https://github.com/user-attachments/assets/8f1ee339-63c5-43cf-9b0f-5395432513ae" />

To use it, simply set num_iterations to a value greater than 1.

training_args = GRPOConfig(..., num_iterations=4)

by @qgallouedec in https://github.com/huggingface/trl/pull/2899

🌍 Use global normalization in GRPO

As demonstrated in Dr GRPO, sequence-level normalization can introduce a response level length bias.

To address this, we have now switched to normalizing the loss by the total number of tokens in the batch, ensuring more consistent and unbiased training.

- loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
+ loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()

by @edbeeching in https://github.com/huggingface/trl/pull/2881

⚖️ Add option not to scale rewards

As demonstrated in Dr GRPO, scaling rewards can introduce a question-level difficulty bias. To address this, we have now added an option to disable reward scaling in GRPO.

training_args = GRPOConfig(..., scale_rewards=False)
  advantages = rewards - mean_grouped_rewards
- advantages = advantages / std_grouped_rewards
+ if self.args.scale_rewards:
+     advantages = advantages / std_grouped_rewards

It's likely that we'll make scale_rewards=False the default behavior in the future.

by @qgallouedec in https://github.com/huggingface/trl/pull/3135

🤸‍♀️ Domain-specific rewards in GRPO

When optimizing across multiple domains, not all reward functions are relevant for every sample. For example, a math verifier's reward does not apply to grammar samples, and a grammar verifier's reward does not apply to math samples.

It is now possible to return None for rewards that do not make sense for a given sample. For instance, when the domain is specified in a column like domain, you can implement it as follows:

def math_reward(completions, domain, **kwargs):
    rewards = []
    for completion, dom in zip(completions, domain):
        if dom == "math":
            rewards.append(verify(completion))
        else:
            rewards.append(None)
    return rewards

This allows for more domain-specific reward handling, ensuring that irrelevant rewards are ignored and don’t interfere with optimization.

by @shirinyamani in https://github.com/huggingface/trl/pull/3079

🍃 Do not load reference model when beta == 0.0

It has been observed that not minimizing the KL divergence between the trained model and the reference model can still yield good results, while significantly reducing memory usage and compute. This is because there is no need to store the reference model in memory or perform a forward pass for it.

When beta is set to 0.0, the reference model is not loaded, and the KL divergence is not computed, leading to savings in both time and memory.

training_args = GRPOConfig(..., beta=0.0)

by @ingambe in https://github.com/huggingface/trl/pull/2806

🕊️ Padding-free for SFT

Padding-free batching is an alternative approach to packing for reducing memory usage. In this method, a batch is first sampled and then flattened into a single sequence, avoiding padding. Unlike packing, which can result in incomplete sequences by combining parts of different samples, padding-free batching ensures that all sequences remain complete and intact.

To enable padding-free batching in SFT, simply set padding_free=True in the SFTConfig, and make sure to use flash_attention_2 as the attention implementation.

training_args = SFTConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"})
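The flattening step can be sketched as follows (hypothetical token ids; real collators also flatten labels and attention metadata):

```python
# Sketch: flatten a batch into one sequence with position ids, instead of padding.
batch = [[5, 6, 7], [8, 9], [10, 11, 12, 13]]  # token ids, illustrative

input_ids = [t for seq in batch for t in seq]
# Positions restart at 0 for every sequence so the attention kernel (e.g.
# FlashAttention's variable-length path) keeps the sequences independent
# without any pad tokens.
position_ids = [p for seq in batch for p in range(len(seq))]
```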

by @qgallouedec in https://github.com/huggingface/trl/pull/3076

🎬 Clip Higher for Better Exploration

As outlined in the DAPO paper, increasing the upper bound epsilon leads to higher entropy during generation, promoting better exploration. To enable this, we’ve added support for adjusting the upper bound epsilon directly in the default GRPO trainer.

training_args = GRPOConfig(epsilon_high=0.28)
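The effect on the PPO-style ratio clipping can be sketched like this (illustrative epsilon values, plain Python):

```python
# Sketch: asymmetric PPO-style clipping with a larger upper bound (clip higher).
def clipped_ratio(ratio, eps_low=0.2, eps_high=0.28):
    # A higher upper bound leaves more room for up-weighting low-probability
    # tokens, raising entropy during generation and encouraging exploration.
    return min(max(ratio, 1 - eps_low), 1 + eps_high)

high = clipped_ratio(1.5)  # clipped at 1 + eps_high
low = clipped_ratio(0.5)   # clipped at 1 - eps_low
```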

by @shirinyamani in https://github.com/huggingface/trl/pull/3118

Bug fixes

Minor

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.0...v0.16.0

Feb 25, 2025

What changed

  • ♻️ Fix caching in SFT by @qgallouedec in #2945
  • 🐯 Fix LigerKernel for SFTTrainer by @lewtun in #2940
  • 📌 Pin liger-kernel and vLLM by @qgallouedec in #2952

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.1...v0.15.2

Feb 18, 2025

What's Changed

  • 💬 Add maybe_convert_to_chatml map for conversational datasets in SFT by @kashif in #2862
  • [SFT] fix check for AutoLigerKernelForCausalLM by @kashif in #2874
  • 🍟 [SFT] Handles the dataset if it has been preprocessed by @BenasdTW in #2863
  • 🧶 [GRPO][vLLM + LoRA] Move unmerge of PEFT model after weight loading by @XZ-X in #2873
  • 🪂 Don't gather logits in SFT to avoid hanging by @qgallouedec in #2890
  • Release: v0.15.1 by @qgallouedec

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.0...v0.15.1

Feb 13, 2025

Major and breaking changes

Coming soon

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.15.0

Jan 29, 2025

Major and breaking changes

👨‍👨‍👧‍👧 GRPO

by @qgallouedec in https://github.com/huggingface/trl/pull/2565

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.13.0...v0.14.0

Dec 16, 2024

Major and breaking changes

🐾 Process-supervised RM Trainer

We introduced a new trainer to train a Process-supervised Reward Model (PRM) in TRL. A PRM rewards the quality of intermediate steps, promoting structured reasoning over focusing solely on the final outcome. With this trainer, we introduce a new dataset type: stepwise supervision, a variant of the prompt-completion type in which the completion is divided into several intermediate steps, each associated with a label. Find out more in the stepwise-supervision section of the TRL documentation.

Here is an example of how to use the PRMTrainer to train a PRM on the Math Shepherd dataset:

# train_prm.py
from datasets import load_dataset
from trl import PRMConfig, PRMTrainer
from transformers import AutoModelForTokenClassification, AutoTokenizer

model = AutoModelForTokenClassification.from_pretrained("Qwen/Qwen2-0.5B", num_labels=2)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B")
train_dataset = load_dataset("trl-lib/math_shepherd", split="train[:10%]")

training_args = PRMConfig(output_dir="Qwen2-0.5B-Reward-Math-Sheperd", logging_steps=10)
trainer = PRMTrainer(model=model, args=training_args, processing_class=tokenizer, train_dataset=train_dataset)
trainer.train()

For more information, check out the PRMTrainer documentation.

by @qgallouedec and @gaetanlop in https://github.com/huggingface/trl/pull/2127 and https://github.com/huggingface/trl/pull/2148

🔀 Add MergeModelCallBack

Various works show that model merging can non-trivially improve performance, especially if the models belong to the same architecture. TRL now features a callback that merges the reference model with the current policy and optionally pushes the merged checkpoint to the Hub. This can be done at the end of each step/epoch and/or at the end of training. The callback uses Arcee's mergekit library: https://github.com/arcee-ai/mergekit

from trl import DPOTrainer, MergeModelCallback
from trl.mergekit_utils import MergeConfig

config = MergeConfig()
merge_callback = MergeModelCallback(config)
trainer = DPOTrainer(...,  callbacks=[merge_callback])

by @August-murr in https://github.com/huggingface/trl/pull/2282

🔨 Support for tools for data utils

TRL preprocessing utils now support tooling. A first step toward agent fine-tuning.

from trl import apply_chat_template

def get_current_temperature(location: str):
    """
    Gets the temperature at a given location.

    Args:
        location: The location to get the temperature for
    """
    return 22.0

example = apply_chat_template(example, tokenizer, tools=[get_current_temperature])

by @August-murr in https://github.com/huggingface/trl/pull/2455

🌋 Add support for LLaVA-Next in DPOTrainer

VLMs have their own specificities which require special treatment in the trainer. DPOTrainer now supports LLaVA-Next models natively.

model = AutoModelForVision2Seq.from_pretrained("llava-hf/llava-v1.6-mistral-7b-hf")
trainer = DPOTrainer(model=model, ...)

by @chenweize1998 in https://github.com/huggingface/trl/pull/2413

🕹️ CLI and TRLParser refactor

TRL CLI has been refactored to be more user-friendly and easy to extend. We plan to extend the support to all trainers soon.

(simplified output, for readability)

$ trl dpo --help
usage: trl dpo [-h] --dataset_name DATASET_NAME [--dataset_config DATASET_CONFIG] --output_dir OUTPUT_DIR [--loss_type {sigmoid,hinge,ipo}]

options:
  -h, --help            show this help message and exit
  --dataset_name DATASET_NAME, --dataset-name DATASET_NAME
  --dataset_config DATASET_CONFIG, --dataset-config DATASET_CONFIG
  --output_dir OUTPUT_DIR, --output-dir OUTPUT_DIR
                        The output directory where the model predictions and checkpoints will be written. (default: None)
  --loss_type {sigmoid,hinge,ipo}, --loss-type {sigmoid,hinge,ipo}

by @qgallouedec in https://github.com/huggingface/trl/pull/2380 and https://github.com/huggingface/trl/pull/2412

🤝 Mixture of judges

TRL features a new judge AllTrueJudge that unifies the decision of multiple binary judges. This judge implements the Mixture of Judges as described in the CGPO paper.

import random

from trl import AllTrueJudge, BaseBinaryJudge

class RandomBinaryJudge(BaseBinaryJudge):
    """
    Random binary judge, for testing purposes.
    """

    def judge(self, prompts, completions, gold_completions=None, shuffle_order=True):
        return [random.choice([0, 1, -1]) for _ in range(len(prompts))]


prompts = ["The capital of France is", "The biggest planet in the solar system is"]
completions = [["Paris", "Marseille"], ["Saturn", "Jupiter"]]
judge = AllTrueJudge(judges=[RandomBinaryJudge(), RandomBinaryJudge()])
judgements = judge.judge(prompts=prompts, completions=completions)
print(judgements)  # e.g. [0, 1]

by @gaetanlop in https://github.com/huggingface/trl/pull/2159

❄️ DPO trainer supports num_logits_to_keep to save memory

Save memory by keeping only num_logits_to_keep logits in the DPO trainer.

training_args = DPOConfig(..., use_num_logits_to_keep=True)

by @xyangk in https://github.com/huggingface/trl/pull/2129

🗺️ Implementation DiscoPOP Loss

The DiscoPOP paper uses LLMs to discover more efficient offline preference optimization losses. In the paper the proposed DiscoPOP loss (which is a log-ratio modulated loss) outperformed other optimization losses on different tasks (IMDb positive text generation, Reddit TLDR summarization, and Alpaca Eval 2.0).

training_args = DPOConfig(..., loss_type="discopop", discopop_tau=0.05)

by @fanconic in https://github.com/huggingface/trl/pull/2323

🧑‍🍳 Add precompute batch size argument in DPOTrainer for reference model

We can now control the batch size for precomputing reference model logits.

training_args = DPOConfig(
...
    precompute_ref_log_probs=True,
    precompute_ref_batch_size=4,
)

by @SwayamInSync in https://github.com/huggingface/trl/pull/2426

📦 Support for packing tokenized datasets for SFT

SFTTrainer has supported packing datasets for faster training. Now, it supports packing tokenized datasets as well.

by @kmehant in https://github.com/huggingface/trl/pull/2011
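
A minimal sketch of what this enables, assuming a dataset that already carries an input_ids column; the model name and token ids below are illustrative:

```python
from datasets import Dataset
from trl import SFTConfig, SFTTrainer

# A dataset whose rows are already tokenized: "input_ids" instead of raw text
train_dataset = Dataset.from_dict(
    {"input_ids": [[101, 2009, 2003], [101, 2307], [101, 2204, 2154, 102]]}
)

training_args = SFTConfig(output_dir="sft-packed", packing=True)
trainer = SFTTrainer(
    model="Qwen/Qwen2-0.5B",  # illustrative base model
    args=training_args,
    train_dataset=train_dataset,
)
trainer.train()
```
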

📉 Add PEFT support for PPOTrainer

PPOTrainer now supports PEFT for efficient training.

PPOTrainer(
    ...,
    peft_config=peft_config,
)

by @ccs96307 in https://github.com/huggingface/trl/pull/2344
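
For instance, combined with a LoRA adapter from peft (the ranks and hyperparameters below are hypothetical; adapt them to your model):

```python
from peft import LoraConfig

# Hypothetical LoRA configuration for a causal LM policy
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)

trainer = PPOTrainer(
    ...,  # model, args, datasets, etc. as before
    peft_config=peft_config,
)
```
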

💾 Deprecate config in favor of args in PPOTrainer

config has been deprecated in favor of args in PPOTrainer.

  PPOTrainer(
-   config=training_args,
+   args=training_args,
  )

by @qgallouedec in https://github.com/huggingface/trl/pull/2384

👮 Deprecate policy in favor of model in PPOTrainer

policy has been deprecated in favor of model in PPOTrainer.

  PPOTrainer(
-   policy=model,
+   model=model,
  )

by @qgallouedec in https://github.com/huggingface/trl/pull/2386

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.12.0...v0.13.0

Dec 6, 2024

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.12.1...v0.12.2

Nov 15, 2024

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.12.0...v0.12.1

Nov 4, 2024

Major and breaking changes

General reward model support for Online DPO

Online DPO initially only supported a reward model that had the same tokenizer and chat template as the trained model. Now, you can use any reward model.

from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")

reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name, num_labels=1)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name, truncation=True, truncation_side="left")

dataset = load_dataset(dataset_name)

training_args = OnlineDPOConfig(output_dir="...")
trainer = OnlineDPOTrainer(
    model=model,
    reward_model=reward_model,
    args=training_args,
    train_dataset=dataset,
    processing_class=tokenizer,
    reward_processing_class=reward_tokenizer,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/2276

Migration PPOv2 -> PPO

The PPOv2Trainer has been renamed to PPOTrainer, and the old PPOTrainer implementation has been removed. The PPOv2Trainer name is now deprecated and will be removed in the next release.

- trainer = PPOv2Trainer(...)
+ trainer = PPOTrainer(...)

by @qgallouedec in https://github.com/huggingface/trl/pull/2174

Refactor ScriptArguments

We had ScriptArguments, SFTScriptArguments, DPOScriptArguments and RewardScriptArguments. Since they all share mostly the same fields, we've merged them into a single ScriptArguments class. SFTScriptArguments, DPOScriptArguments and RewardScriptArguments still exist but are deprecated and will be removed in the next release.

- script_args = DPOScriptArguments(...)
+ script_args = ScriptArguments(...)

by @qgallouedec in https://github.com/huggingface/trl/pull/2145

Soft judges for PairRM

The PairRMJudge's judge method now accepts a return_scores flag that makes it return the probability score of the first completion in each pair (instead of the rank of the preferred completion). The logits used to compute the probability score can be scaled by an optional temperature parameter.

from trl import PairRMJudge
pairrm_judge = PairRMJudge()
prompts = ["Translate 'hello' to French", "What's the capital of Japan?"]
completions = [["Bonjour", "Salut"], ["Kyoto", "Tokyo"]]
results = pairrm_judge.judge(prompts, completions, return_scores=True)
print(results)  # [0.7492601275444031, 0.0005497377132996917]

by @kashif in https://github.com/huggingface/trl/pull/2221

Use pairwise judges for online methods

The OnlineDPOTrainer and any trainers that inherit from it (NashMDTrainer and XPOTrainer) can now accept an initialized PairwiseJudge instead of a reward model.

from datasets import load_dataset
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO", logging_steps=10)
trainer = OnlineDPOTrainer(
    model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()

by @kashif in https://github.com/huggingface/trl/pull/2243

Rename trainer arg tokenizer to processing_class

The tokenizer argument in the trainers has been renamed to processing_class to better reflect the fact that it can be not only a tokenizer but also a processor.

- trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, tokenizer=tokenizer)
+ trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)

tokenizer is still supported for SFTTrainer and DPOTrainer but deprecated and will be removed in the next release.

by @qgallouedec in https://github.com/huggingface/trl/pull/2162

Adding weighted preference optimization (WPO) to DPO

The WPO paper adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. To use this method, set the use_weighting flag to True in the DPOConfig.

DPOConfig(..., use_weighting=True)
<img width="1112" alt="Screenshot 2024-11-04 at 10 59 38" src="https://github.com/user-attachments/assets/544ddc02-bd09-4f21-b8a4-b81c21561a9b">
<img width="539" alt="Screenshot 2024-11-04 at 10 59 22" src="https://github.com/user-attachments/assets/8d5afe9e-89bd-4d00-8483-dd7ba98997e7">

by @gaetanlop in https://github.com/huggingface/trl/pull/2141
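
As we read the method, each pair's loss is scaled by the current policy's (length-normalized) likelihood of producing both completions, so off-policy pairs contribute less. A toy pure-Python sketch; the helper name is ours, and TRL's exact implementation may differ in normalization and clamping:

```python
import math

def wpo_weight(chosen_token_logps, rejected_token_logps):
    """Sketch of WPO-style reweighting for one preference pair.

    Averages per-token log-probs of each completion under the current
    policy, then returns a weight in (0, 1]: close to 1 for pairs the
    policy finds likely (on-policy-like), small for off-policy pairs.
    """
    chosen_avg = sum(chosen_token_logps) / len(chosen_token_logps)
    rejected_avg = sum(rejected_token_logps) / len(rejected_token_logps)
    return min(1.0, math.exp(chosen_avg + rejected_avg))
```
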

🃏 Model card for TRL

Using trainer.push_to_hub() now automatically creates a model card that includes:

  • A link to the base model used
  • A link to the dataset used for training
  • A link to the TRL repository
  • Sample demo code
  • A link to the associated Weights & Biases run
  • A link to the paper detailing the training procedure
  • Versions of dependencies
  • BibTeX citations for both the training procedure and TRL

All links are properly formatted to allow cross-referencing, enabling traceability back to sources (e.g., the model appears linked on the paper’s page).

https://github.com/user-attachments/assets/b903964e-9087-45cc-8fb0-2418fdd87b72

by @qgallouedec in https://github.com/huggingface/trl/pull/2123

Minor

Conversational dataset support

You can now use conversational datasets directly, without needing to apply a chat template beforehand, for the following trainers:

  • BCOTrainer (by @qgallouedec in PR #2107)
  • CPOTrainer (by @qgallouedec in PR #2144)
  • DPOTrainer (by @qgallouedec in PR #2131)
  • KTOTrainer (by @qgallouedec in PR #2248)
  • ORPOTrainer (by @qgallouedec in PR #2184)

from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer

model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset(dataset_name, split="train")

# Not needed anymore:
#
# def process(example):
#     prompt = tokenizer.apply_chat_template(example["prompt"], tokenize=False, add_generation_prompt=True)
#     prompt_chosen = tokenizer.apply_chat_template(example["prompt"] + example["chosen"], tokenize=False)
#     chosen = prompt_chosen[len(prompt) :]
#     prompt_rejected = tokenizer.apply_chat_template(example["prompt"] + example["rejected"], tokenize=False)
#     rejected = prompt_rejected[len(prompt) :]
#     return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
#
# dataset = dataset.map(process)

training_args = DPOConfig(output_dir="...")
trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()

Refactor DPO data processing

For more information, see PR #2209.

trl env for printing system info

You can now use trl env to print system information, including the platform, Python version, PyTorch version, CUDA device(s), and versions of various libraries.

$ trl env

Copy-paste the following information when reporting an issue:

- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0
- CUDA device(s): NVIDIA H100 80GB HBM3
- Transformers version: 4.47.0.dev0
- Accelerate version: 0.19.0
- Accelerate config: not found
- Datasets version: 3.0.2
- HF Hub version: 0.26.1
- TRL version: 0.12.0+14ef1ab
- bitsandbytes version: 0.44.1
- DeepSpeed version: 0.15.3
- Diffusers version: 0.30.3
- Liger-Kernel version: 0.3.0
- LLM-Blender version: 0.0.2
- OpenAI version: 1.46.0
- PEFT version: 0.13.2

by @qgallouedec in https://github.com/huggingface/trl/pull/2104

Sequence-Level KD

From GKD paper:

Sequence-Level KD (Kim & Rush, 2016). SeqKD maximizes the likelihood of high probability sequences generated by the teacher, and can be viewed as supervised FT on teacher-generated outputs.

SeqKD is taken as a baseline in the paper. It is now possible to use Sequence-Level KD in the GKDTrainer by setting seq_kd=True in the GKDConfig.

training_args = GKDConfig(..., seq_kd=True)

by @mst272 in https://github.com/huggingface/trl/pull/2220

Default dataset_text_field to "text"

Since many users use "text" as the column name for textual data in datasets, we've made it the default (previously a required argument) in SFTConfig. Now, specifying dataset_text_field="text" is no longer necessary.

  SFTConfig(
      ...,
-     dataset_text_field="text",
  )

by @qgallouedec in https://github.com/huggingface/trl/pull/2078

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.0...v0.12.0

Oct 15, 2024

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.3...v0.11.4
