## SFTTrainer

`SFTTrainer` now natively supports Vision-Language Models (VLMs). This includes support for both language-modeling and prompt-completion data, as well as training on completions only.
```python
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None),
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()
```
by @qgallouedec in https://github.com/huggingface/trl/pull/3862, https://github.com/huggingface/trl/pull/3907 and https://github.com/huggingface/trl/pull/3908
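For prompt-completion training with a VLM, each sample separates the prompt messages from the completion messages. The sketch below shows what a conversational prompt-completion vision sample may look like; the field names follow TRL's documented dataset formats, but verify them against your TRL version.

```python
# Hypothetical prompt-completion vision sample (field names assumed from
# TRL's dataset-format conventions; check your TRL version's docs).
sample = {
    "prompt": [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": "What is shown in this image?"},
            ],
        }
    ],
    "completion": [
        {
            "role": "assistant",
            "content": [{"type": "text", "text": "A small dog on a couch."}],
        }
    ],
    # An "images" field would hold the actual PIL image(s); omitted here.
}
```

With completion-only training, the loss is computed only on the `completion` tokens, so the prompt does not contribute gradient signal.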
## RLOOTrainer refactor

`RLOOTrainer` has been refactored to align with the design principles of the other trainers in the library. You can now use this trainer exactly like GRPO.
```python
from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer

dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    train_dataset=dataset,
)
trainer.train()
```
by @shirinyamani in https://github.com/huggingface/trl/pull/3801
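To see what such a reward function produces, you can run it directly on dummy completions in TRL's conversational format (a list of chat messages per completion):

```python
# The reward function from the snippet above, run standalone on dummy data:
# the reward is the number of unique characters in the assistant's reply.
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

completions = [
    [{"role": "assistant", "content": "aaaa"}],  # 1 unique character
    [{"role": "assistant", "content": "abcd"}],  # 4 unique characters
]
rewards = reward_num_unique_letters(completions)
print(rewards)  # [1.0, 4.0]
```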
You can now leverage Hugging Face Jobs to easily train and deploy your models with TRL.
```shell
hf jobs uv run --flavor a100-large --secrets HF_TOKEN "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara
```
A guide is available in the docs.
by @sergiopaniego in https://github.com/huggingface/trl/pull/3890
`GRPOTrainer` now supports the DAPO loss type, which aggregates token-level losses by normalizing with the number of active tokens in the global accumulated batch. This method was introduced to eliminate length bias. Simply set:
```python
from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    loss_type="dapo",
    ...
)
```
by @qgallouedec in https://github.com/huggingface/trl/pull/3938
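The length bias is easiest to see with a toy calculation. Averaging per sequence first gives every completion the same weight regardless of length, while DAPO-style aggregation normalizes the summed token losses by the total number of active tokens, so every token counts equally. This is a plain-Python sketch of the idea, not TRL's implementation:

```python
# Two completions: a short one with high token losses and a long one with
# low token losses.
token_losses = [[1.0, 1.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0]]

# Per-sequence averaging: each completion weighs the same.
per_seq = [sum(seq) / len(seq) for seq in token_losses]
seq_mean_loss = sum(per_seq) / len(per_seq)   # (1.0 + 0.0) / 2 = 0.5

# DAPO-style aggregation: normalize by the total active-token count,
# so tokens in long completions are not down-weighted.
total_tokens = sum(len(seq) for seq in token_losses)
dapo_loss = sum(sum(seq) for seq in token_losses) / total_tokens  # 2.0 / 8 = 0.25

print(seq_mean_loss, dapo_loss)
```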
The authors of *Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning* (Lite PPO) find that the combination of batch-level advantage normalization and token-level loss aggregation can unlock the learning capability of critic-free policies using the vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.
TRL supports applying these findings when training a GRPO model:
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    scale_rewards="batch",
    loss_type="dapo",
    ...
)
```
by @pramodith in https://github.com/huggingface/trl/pull/3935
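The batch-level scaling can be sketched numerically: rewards are still centered per prompt group, but divided by the standard deviation of the whole batch rather than of each group. This is a plain-Python illustration of the idea, not TRL's implementation (in particular, whether the sample or population standard deviation is used may differ):

```python
import statistics

# Rewards for two prompt groups (e.g. two generations per prompt).
groups = [[1.0, 3.0], [10.0, 14.0]]

# Batch-level std: computed over all rewards in the batch at once.
all_rewards = [r for g in groups for r in g]
batch_std = statistics.pstdev(all_rewards)

# Center per group, scale by the shared batch-level std.
advantages = [
    [(r - statistics.mean(g)) / batch_std for r in g]
    for g in groups
]
print(advantages)
```

With per-group scaling, both groups would be normalized to the same magnitude; with batch-level scaling, the group whose rewards spread more (the second one here) keeps proportionally larger advantages.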
Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can now use `BEMACallback`:
```python
from trl import BEMACallback, SFTTrainer

trainer = SFTTrainer(
    ...
    callbacks=[BEMACallback()],
)
```
by @kashif in https://github.com/huggingface/trl/pull/3855
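The core idea behind bias correction is that a plain EMA initialized at zero is biased toward zero in early steps. A minimal sketch of the classic Adam-style correction, shown only to illustrate the idea (the actual `BEMACallback` follows the schedule from the BEMA paper, which differs):

```python
def bias_corrected_ema(values, beta=0.9):
    """Track an EMA of `values`, dividing out the bias toward the zero init."""
    ema, corrected = 0.0, []
    for t, v in enumerate(values, start=1):
        ema = beta * ema + (1 - beta) * v    # plain EMA, biased toward 0 early
        corrected.append(ema / (1 - beta ** t))  # divide out the bias term
    return corrected

# With a constant input, the corrected average recovers the input exactly,
# whereas the raw EMA would start near 0 and converge only gradually.
out = bias_corrected_ema([1.0, 1.0, 1.0])
print(out)  # [1.0, 1.0, 1.0] up to floating-point error
```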
- `gradient_checkpointing=True` by @qgallouedec in https://github.com/huggingface/trl/pull/3510
- `position_ids` for `flash_attention_3` by @jue-jue-zi in https://github.com/huggingface/trl/pull/3942
- `setup_chat_format` by @qgallouedec in https://github.com/huggingface/trl/pull/3929
- `IterativeSFTTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/3905
- `AutoModelForVision2Seq` to `AutoModelForImageTextToText` by @qgallouedec in https://github.com/huggingface/trl/pull/3836
- `--bf16` value in scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/3869
- `vllm_mode` param in GRPO by @sergiopaniego in https://github.com/huggingface/trl/pull/3866
- `unittest.TestCase` with `TrlTestCase` that handles tmp dir by @qgallouedec in https://github.com/huggingface/trl/pull/3863
- `SFTTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/3862
- `use_cache` should be set in the forward pass by @qgallouedec in https://github.com/huggingface/trl/pull/3891
- `logger.warning` instead of `warnings.warn` by @qgallouedec in https://github.com/huggingface/trl/pull/3923
- `SFTTrainer` in `GRPOTrainer` by @MengAiDev in https://github.com/huggingface/trl/pull/3919
- `trackio` to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/3965

Full Changelog: https://github.com/huggingface/trl/compare/v0.21.0...v0.22.0