
v1.4.0


Features

Chunked cross-entropy loss for SFT (up to –50% VRAM)

<img width="2704" height="1455" alt="chunked_loss_idea" src="https://github.com/user-attachments/assets/3957f39b-3e71-4465-949a-22b2cf894d03" />

A new loss_type="chunked_nll" option drastically reduces peak activation memory in SFT by avoiding the full [batch × seq × vocab] logits tensor. Ignored-label tokens are dropped before the lm_head matmul, and the cross-entropy is computed over the remaining tokens in checkpointed chunks (default chunk_size=256, a sweet spot that held consistently across model sizes and sequence lengths).

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # any SFT-formatted dataset works

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(loss_type="chunked_nll"),
    train_dataset=dataset,
)
trainer.train()
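
Under the hood, the idea is roughly the following. This is a minimal sketch, not TRL's actual implementation; the function name, tensor shapes, and the use of torch.utils.checkpoint are assumptions:

import torch
from torch.utils.checkpoint import checkpoint

def chunked_nll(hidden, labels, lm_head, chunk_size=256):
    # Cross-entropy without materializing the full [batch*seq, vocab] logits tensor.
    hidden = hidden.flatten(0, 1)   # [batch*seq, hidden_dim]
    labels = labels.flatten()       # [batch*seq]
    keep = labels != -100           # drop ignored-label tokens *before* the lm_head matmul
    hidden, labels = hidden[keep], labels[keep]

    def chunk_loss(h, y):           # activations recomputed in backward via checkpointing
        return torch.nn.functional.cross_entropy(lm_head(h), y, reduction="sum")

    total = sum(
        checkpoint(chunk_loss, hidden[i : i + chunk_size], labels[i : i + chunk_size], use_reentrant=False)
        for i in range(0, hidden.size(0), chunk_size)
    )
    return total / labels.numel()   # mean NLL over the kept tokens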

Peak GPU memory, AdamW fp32:

| Model | Hardware | Seq len | nll | chunked_nll |
|---|---|---|---|---|
| Qwen3-1.7B + LoRA | 1×H100 80GB | 2048 | 47.9 GB | 12.3 GB (3.9× less) |
| Qwen3-4B | 1×H100 80GB | 16384 | OOM | 63.8 GB |
| Qwen3-14B | 8×H100 FSDP2 | 16384 | 58.9 GB | 38.9 GB (1.5× less) |
| Qwen3-32B | 8×H100 FSDP2 | 8192 | OOM | 71.2 GB |

End-to-end, chunked NLL is consistently as fast or faster than nll — and it unlocks sequence lengths that don't fit at all under the standard path.

The chunked path also supports VLMs (https://github.com/huggingface/trl/pull/5684).

by @qgallouedec in https://github.com/huggingface/trl/pull/5575, https://github.com/huggingface/trl/pull/5676 and https://github.com/huggingface/trl/pull/5684

OpenReward Standard environment adapter (experimental)

A new trl.experimental.openreward adapter plugs any environment speaking the Open Reward Standard (ORS) protocol into any TRL trainer accepting an environment_factory (GRPOTrainer, AsyncGRPOTrainer). One identifier wires all three trainer slots — dataset, factory, reward_func:

from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardEnv

env = OpenRewardEnv("Eigent/SETA")  # or "http://localhost:8000"

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(...),
    train_dataset=env.dataset,
    environment_factory=env.factory,
    reward_funcs=env.reward_func,
)

Tools are bound dynamically from JSON Schema at construction (no per-env wrapper code), and env.dataset autoderives task lists from the ORS task endpoints. The same code path works for envs hosted on the OpenReward platform, self-hosted on any container service, or running locally on localhost. A SETA training example is included.

by @adithya-s-k in https://github.com/huggingface/trl/pull/5696

Training-invariance test suite

Unit tests don't catch trainer-level numerical drift such as gradient-accumulation normalization bugs or attention-implementation divergence (eager ↔ FA2 ↔ kernels): these bugs silently shift the loss trajectory, and users only notice when their run no longer reproduces. (Cf. last year's transformers grad-accum bug, or the "We found two bugs in DeepSpeed" paper.)

A new opt-in pytest -m invariant suite asserts the loss / grad_norm trajectory of short end-to-end SFT/DPO runs against committed reference snapshots, with equivalence classes for configs that should produce identical trajectories (e.g. per_device_batch_size=1 with grad-accum 8 ≡ the default batching; eager ≡ FA2 ≡ kernels). Runs are hardware-pinned to an H100 80GB with a real pretrained model, full_determinism, and a fixed seed. Initial coverage: 2 trainers × 2 invariance axes (grad-accum, attn-impl) × gradient-checkpointing equivalence.
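
In spirit, each test looks something like the following sketch (the run_short_sft fixture and snapshot path are hypothetical; the real suite's helpers may differ):

import json

import pytest

@pytest.mark.invariant
def test_grad_accum_invariance(run_short_sft):
    # hypothetical fixture: runs a seeded, fully deterministic short SFT job and
    # returns per-step {"loss": [...], "grad_norm": [...]} trajectories
    default = run_short_sft(per_device_train_batch_size=8, gradient_accumulation_steps=1)
    accum = run_short_sft(per_device_train_batch_size=1, gradient_accumulation_steps=8)
    with open("tests/snapshots/sft_default.json") as f:  # committed reference snapshot
        reference = json.load(f)
    for run in (default, accum):
        assert run["loss"] == pytest.approx(reference["loss"], rel=1e-4)
        assert run["grad_norm"] == pytest.approx(reference["grad_norm"], rel=1e-4)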

by @qgallouedec in https://github.com/huggingface/trl/pull/5686, https://github.com/huggingface/trl/pull/5688 and https://github.com/huggingface/trl/pull/5689

MFU helpers

Three new pure helpers in trl.trainer.utils for measuring training efficiency:

  • compute_flops_per_token(config, seq_len) — handles dense and MoE (Mixtral, Qwen3-MoE, DeepSeek-V2)
  • compute_mfu(flops_per_token, tps, world_size, peak_flops) — Model FLOPs Utilization as a percentage
  • adjusted_mfu(mfu, config, seq_len) — non-causal → causal-corrected (Llama / DS Ulysses convention)
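
A usage sketch chaining the three helpers (names and signatures are from the list above; the throughput value and the H100 BF16 dense peak of 989 TFLOP/s are illustrative assumptions):

from transformers import AutoConfig
from trl.trainer.utils import adjusted_mfu, compute_flops_per_token, compute_mfu

config = AutoConfig.from_pretrained("Qwen/Qwen3-4B")
flops_per_token = compute_flops_per_token(config, seq_len=8192)
mfu = compute_mfu(
    flops_per_token,
    tps=11_500,          # measured training throughput in tokens/s (illustrative)
    world_size=8,
    peak_flops=989e12,   # H100 SXM BF16 dense peak (illustrative)
)
print(f"MFU: {mfu:.1f}% (causal-adjusted: {adjusted_mfu(mfu, config, seq_len=8192):.1f}%)")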

by @AmineDiro in https://github.com/huggingface/trl/pull/5698

GRPO Liger kernel update (Liger 0.8.0)

GRPO's Liger-kernel integration is updated for Liger 0.8.0: delta two-sided clipping, use_bias_correction_kl, and SAPO/VESPO parameters are now forwarded into LigerFusedLinearGRPOLoss. The previous delta + use_liger_kernel guard is removed — both can be combined.
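
A minimal sketch of the newly allowed combination, using the parameter names from the note above (the delta value is illustrative, and other GRPOConfig arguments are omitted):

from trl import GRPOConfig

# previously the delta + use_liger_kernel guard made this pair raise
args = GRPOConfig(
    use_liger_kernel=True,  # route the loss through LigerFusedLinearGRPOLoss
    delta=2.0,              # two-sided clipping threshold (illustrative value)
)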

by @kashif in https://github.com/huggingface/trl/pull/5690

Length-normalized DPO sigmoid loss

A new loss_type="sigmoid_norm" option for DPOConfig implements the per-token (length-normalized) DPO loss used by Tülu 3 / OLMo (paper §5.1.2 eq. 6) to mitigate length bias.
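
For reference, the objective normalizes each sequence log-ratio by its response length before the sigmoid. A sketch of eq. 6 in standard DPO notation ($y_w$/$y_l$ the chosen/rejected responses, $\beta$ the DPO temperature; the paper's exact notation may differ):

$$\mathcal{L}_{\text{sigmoid\_norm}} = -\log \sigma\!\left(\frac{\beta}{|y_w|}\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \frac{\beta}{|y_l|}\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$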

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # any preference dataset works

trainer = DPOTrainer(
    model="Qwen/Qwen3-4B",
    args=DPOConfig(loss_type="sigmoid_norm"),
    train_dataset=dataset,
)
trainer.train()

by @BrownianNotion in https://github.com/huggingface/trl/pull/5406

Even more training chat templates

Four more model families gain training-compatible chat templates with {% generation %} markers (assistant-only loss masking) and/or response schemas (tool-calling parsing).

get_training_chat_template now also accepts a processor (not just a tokenizer) — useful for VLMs (https://github.com/huggingface/trl/pull/5560).
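
A hedged sketch of the processor path (only "accepts a processor" is stated above; the import path, exact signature, and string return value are assumptions):

from transformers import AutoProcessor
from trl import get_training_chat_template  # import path assumed

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# assumed to return a template string carrying {% generation %} markers
processor.chat_template = get_training_chat_template(processor)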

KTO ↔ DPO alignment: closing in on graduation

Another batch of alignment PRs this cycle. KTO and DPO are now structurally aligned across PEFT handling, model initialization, training-arg grouping, ref-logp precomputation, and metric handling — promotion of KTO out of experimental is imminent.

PRs (all by @albertvillanova): #5659, #5660, #5661, #5679, #5701, #5702, #5703, #5704, #5705, #5714.

Other

Fixes

Documentation and Examples

CI

New Contributors

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v1.3.0...v1.4.0
