releases.sh preview

v0.22.0

$ npx -y @buildinternet/releases show rel_UVAiKeoaxivE5NrnGWo6Z

Major

🔮 Native VLM support for SFTTrainer

SFTTrainer now natively supports Vision-Language Models (VLMs). This includes support for both language modeling and prompt-completion data, as well as training on completions only.

<img width="1136" height="586" alt="Group 291-6" src="https://github.com/user-attachments/assets/2629b8e7-d853-4b7c-91d5-f4c128287e04" />
from trl import SFTConfig, SFTTrainer
from datasets import load_dataset

trainer = SFTTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=SFTConfig(max_length=None),
    train_dataset=load_dataset("trl-lib/llava-instruct-mix", split="train"),
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/3862, https://github.com/huggingface/trl/pull/3907 and https://github.com/huggingface/trl/pull/3908
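The two supported data formats can be sketched as plain Python structures (an illustrative sketch following TRL's documented conversational formats; the image entries are placeholders, not real PIL images):

```python
# Language-modeling format: a full conversation under a "messages" key.
# For VLMs, user turns can mix image and text content parts.
lm_example = {
    "images": ["<PIL.Image>"],  # placeholder for an actual PIL image
    "messages": [
        {"role": "user", "content": [
            {"type": "image"},
            {"type": "text", "text": "What is shown in this image?"},
        ]},
        {"role": "assistant", "content": [
            {"type": "text", "text": "A llama sitting in a field."},
        ]},
    ],
}

# Prompt-completion format: prompt and completion are kept separate, which
# is what makes completion-only loss possible.
pc_example = {
    "images": ["<PIL.Image>"],
    "prompt": [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe the image."},
    ]}],
    "completion": [{"role": "assistant", "content": [
        {"type": "text", "text": "A llama in a field."},
    ]}],
}
```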

🔥 RLOOTrainer refactor

RLOOTrainer has been refactored to align with the design principles of the other trainers in the library. You can now use it exactly like GRPOTrainer.

from datasets import load_dataset
from trl import RLOOConfig, RLOOTrainer

dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")

# Dummy reward function for demonstration purposes
def reward_num_unique_letters(completions, **kwargs):
    """Reward function that rewards completions with more unique letters."""
    completion_contents = [completion[0]["content"] for completion in completions]
    return [float(len(set(content))) for content in completion_contents]

trainer = RLOOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=reward_num_unique_letters,
    train_dataset=dataset,
)
trainer.train()

by @shirinyamani in https://github.com/huggingface/trl/pull/3801
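Under the hood, RLOO uses a leave-one-out baseline: each completion's advantage is its reward minus the mean reward of the other completions sampled for the same prompt. A minimal sketch of that computation (illustrative only, not the trainer's actual code):

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k sampled completions of one prompt."""
    k = len(rewards)
    total = sum(rewards)
    # Baseline for completion i is the mean of the other k-1 rewards.
    return [r - (total - r) / (k - 1) for r in rewards]

# Example: 4 completions for one prompt, with rewards 1, 2, 3, 6.
advs = rloo_advantages([1.0, 2.0, 3.0, 6.0])
```

Because every completion's baseline excludes its own reward, the baseline stays unbiased, and the advantages always sum to zero across the group.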

🧭 HF jobs x TRL guide

You can now leverage Hugging Face Jobs to easily train and deploy your models with TRL.

hf jobs uv run --flavor a100-large --secrets HF_TOKEN "https://raw.githubusercontent.com/huggingface/trl/main/trl/scripts/sft.py" --model_name_or_path Qwen/Qwen2-0.5B --dataset_name trl-lib/Capybara

A guide is available in the docs.

by @sergiopaniego in https://github.com/huggingface/trl/pull/3890

๐ŸŒ๏ธ DAPO loss type

GRPOTrainer now supports the DAPO loss type, which aggregates token-level losses by normalizing by the number of active tokens in the global accumulated batch. This method was introduced to eliminate length bias. Simply use:

from trl import GRPOConfig, GRPOTrainer

training_args = GRPOConfig(
    loss_type="dapo",
    ...
)

by @qgallouedec in https://github.com/huggingface/trl/pull/3938
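The difference from the default aggregation can be sketched in a few lines (an illustrative toy, not TRL's implementation): the default averages each sequence's token losses before averaging across sequences, while DAPO divides the summed token loss by the total active-token count, so every token carries equal weight regardless of sequence length.

```python
def per_sequence_loss(token_losses):
    """Default: mean over tokens within each sequence, then mean over sequences."""
    seq_means = [sum(seq) / len(seq) for seq in token_losses]
    return sum(seq_means) / len(seq_means)

def dapo_loss(token_losses):
    """DAPO: sum all token losses, normalize by the total active-token count."""
    total = sum(sum(seq) for seq in token_losses)
    n_tokens = sum(len(seq) for seq in token_losses)
    return total / n_tokens

# A short and a long completion: per-sequence averaging underweights the
# long one's tokens; DAPO weights every token equally.
batch = [[1.0, 1.0], [2.0, 2.0, 2.0, 2.0, 2.0, 2.0]]
```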

🪶 [GRPO] PPO Lite: Scale rewards by Std of Batch

The authors of Part I: Tricks or Traps? A Deep Dive into RL for LLM Reasoning (Lite PPO) find that the combination of:

  1. scaling rewards by the standard deviation computed over the entire batch and
  2. aggregating loss over the total number of tokens

can unlock the learning capability of critic-free policies using vanilla PPO loss. Their results demonstrate that this simple combination consistently improves performance, surpassing strategies like GRPO and DAPO.

You can apply these findings in TRL when training with GRPO:

from trl import GRPOConfig

training_args = GRPOConfig(
    scale_rewards="batch",
    loss_type="dapo",
    ...
)

by @pramodith in https://github.com/huggingface/trl/pull/3935

🎢 [Callbacks] BEMA

Bias-Corrected Exponential Moving Average (BEMA) improves the stability and efficiency of language model fine-tuning by reducing stochasticity and eliminating bias. To use BEMA with SFT as described in the paper, you can now use the BEMACallback:

from trl import BEMACallback, SFTTrainer

trainer = SFTTrainer(
    ...
    callbacks=[BEMACallback()],
)

by @kashif in https://github.com/huggingface/trl/pull/3855
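As rough intuition, the bias correction is the same idea as the Adam-style 1/(1 - beta^t) term: a zero-initialized EMA is biased toward zero early in training, and dividing by that factor removes the bias (a sketch of the general idea only; the BEMA paper's exact weight-averaging update may differ):

```python
def bias_corrected_ema(values, beta=0.9):
    """Running EMA over a stream of values, with zero-init bias correction."""
    ema, corrected = 0.0, []
    for t, x in enumerate(values, start=1):
        ema = beta * ema + (1 - beta) * x
        corrected.append(ema / (1 - beta**t))  # undo the pull toward zero
    return corrected
```

On a constant stream the corrected estimate recovers the true value from the very first step, whereas the raw EMA would start near zero and converge slowly.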

Minor

Deprecations

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.21.0...v0.22.0

Fetched April 7, 2026