releases.sh preview

v0.22.0

$ npx -y @buildinternet/releases show rel_UVAiKeoaxivE5NrnGWo6Z

Major

🔮 Native VLM support for SFTTrainer

SFTTrainer now natively supports Vision-Language Models (VLMs). This includes support for both language modeling and prompt-completion data, as well as training on completions only.

<img width="1136" height="586" alt="Group 291-6" src="https://github.com/user-attachments/assets/2629b8e7-d853-4b7c-91d5-f4c128287e04" />
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None),
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/3862, https://github.com/huggingface/trl/pull/3907 and https://github.com/huggingface/trl/pull/3908
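The two supported data formats can be sketched as plain Python structures (an illustrative sketch following TRL's documented conversational formats; the image entries are placeholders, not real PIL images):

```python
# Language-modeling format: a full conversation under a "messages" key.
# For VLMs, user turns can mix image and text content parts.
lm_example = {
    "images": ["<PIL.Image>"],  # placeholder for an actual PIL image
    "messages": [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "A llama sitting in a field."},
        ]},
    ],
}

# Prompt-completion format: prompt and completion are kept separate, which
# is what makes completion-only loss possible.
pc_example = {
    "images": ["<PIL.Image>"],
    "prompt": [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the image."},
    ]}],
    "completion": [{"role": "assistant", "content": [
        {"type": "text", "text": "A llama in a field."},
    ]}],
}
```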

🔥 RLOOTrainer refactor

RLOOTrainer has been refactored to align with the design principles of the other trainers in the library. You can now use it exactly like GRPOTrainer.

from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer

dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    train_dataset=dataset,
)
trainer.train()

by @shirinyamani in https://github.com/huggingface/trl/pull/3801
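Under the hood, RLOO uses a leave-one-out baseline: each completion's advantage is its reward minus the mean reward of the other completions sampled for the same prompt. A minimal sketch of that computation (illustrative only, not the trainer's actual code):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k sampled completions of one prompt."""
    k = len(rewards)
    total = sum(rewards)
    # Baseline for completion i is the mean of the other k-1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]

# Example: 4 completions for one prompt, with rewards 1, 2, 3, 6.
advs = rloo_advantages([1.0, 2.0, 3.0, 6.0])
```

Because every completion's baseline excludes its own reward, the baseline stays unbiased, and the advantages always sum to zero across the group.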

🧭 HF jobs x TRL guide

You can now leverage Hugging Face Jobs to easily train and deploy your models with TRL.

hf jobs uv run --flavor a100-large --secrets HF_TOKEN "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara

A guide is available in the docs.

by @sergiopaniego in https://github.com/huggingface/trl/pull/3890

๐ŸŒ๏ธ DAPO loss type

GRPOTrainer now supports the DAPO loss type, which aggregates token-level losses by normalizing by the number of active tokens in the global accumulated batch. This method was introduced to eliminate length bias. Simply use:

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    loss_type="dapo",
    ...
)

by @qgallouedec in https://github.com/huggingface/trl/pull/3938
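The difference from the default aggregation can be sketched in a few lines (an illustrative toy, not TRL's implementation): the default averages each sequence's token losses before averaging across sequences, while DAPO divides the summed token loss by the total active-token count, so every token carries equal weight regardless of sequence length.

```python
def per_sequence_loss(token_losses):
    """Default: mean over tokens within each sequence, then mean over sequences."""
    seq_means = [sum(seq) / len(seq) for seq in token_losses]
    return sum(seq_means) / len(seq_means)

def dapo_loss(token_losses):
    """DAPO: sum all token losses, normalize by the total active-token count."""
    total = sum(sum(seq) for seq in token_losses)
    n_tokens = sum(len(seq) for seq in token_losses)
    return total / n_tokens

# A short and a long completion: per-sequence averaging underweights the
# long one's tokens; DAPO weights every token equally.
batch = [[1.0, 1.0], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]
```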

🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch

The authors of Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO) find that the combination of:

  1. scaling rewards by the standard deviation computed over the entire batch and
  2. aggregating loss over the total number of tokens

can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

You can apply these findings in TRL when training with GRPO:

from trl import GRPOConfig

training_args = GRPOConfig(
    scale_rewards="batch",
    loss_type="dapo",
    ...
)

by @pramodith in https://github.com/huggingface/trl/pull/3935

🎢 [Callbacks] BEMA

Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can now use the BEMACallback:

from trl import BEMACallback, SFTTrainer

trainer = SFTTrainer(
    ...
    callbacks=[BEMACallback()],
)

by @kashif in https://github.com/huggingface/trl/pull/3855
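As rough intuition, the bias correction is the same idea as the Adam-style 1/(1 - beta^t) term: a zero-initialized EMA is biased toward zero early in training, and dividing by that factor removes the bias (a sketch of the general idea only; the BEMA paper's exact weight-averaging update may differ):

```python
def bias_corrected_ema(values, beta=0.9):
    """Running EMA over a stream of values, with zero-init bias correction."""
    ema, corrected = 0.0, []
    for t, x in enumerate(values, start=1):
        ema = beta * ema + (1 - beta) * x
        corrected.append(ema / (1 - beta**t))  # undo the pull toward zero
    return corrected
```

On a constant stream the corrected estimate recovers the true value from the very first step, whereas the raw EMA would start near zero and converge slowly.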

Minor

Deprecations

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.21.0...v0.22.0

Fetched April 7, 2026