A new loss_type="chunked_nll" option drastically reduces peak activation memory in SFT by avoiding the full [batch × seq × vocab] logits tensor. Tokens with ignored labels are dropped before the lm_head matmul, and the cross-entropy is computed over the remaining tokens in checkpointed chunks (default chunk_size=256, a sweet spot that holds across model sizes and sequence lengths).
```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(loss_type="chunked_nll"),
    train_dataset=dataset,
)
trainer.train()
```
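Under the hood, the idea looks roughly like the following. This is a minimal sketch of the technique, not TRL's implementation; the function and variable names are illustrative, and it assumes hidden states and labels are already shifted into alignment.

```python
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint


def chunked_nll(hidden, labels, lm_head, chunk_size=256):
    # Drop tokens whose label is ignored (-100) *before* the lm_head matmul.
    mask = labels != -100
    hidden, labels = hidden[mask], labels[mask]

    def chunk_loss(h, y):
        # Project only this chunk to the vocabulary and sum its NLL, so the
        # full [batch x seq x vocab] logits tensor is never materialized.
        return F.cross_entropy(lm_head(h), y, reduction="sum")

    total = hidden.new_zeros(())
    for start in range(0, hidden.size(0), chunk_size):
        h = hidden[start : start + chunk_size]
        y = labels[start : start + chunk_size]
        # Checkpointing recomputes the chunk's logits during backward, keeping
        # activation memory at O(chunk_size x vocab).
        total = total + checkpoint(chunk_loss, h, y, use_reentrant=False)
    return total / mask.sum()
```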
Peak GPU memory, AdamW fp32:
| Model | Hardware | Seq | nll | chunked_nll |
|---|---|---|---|---|
| Qwen3-1.7B + LoRA | 1×H100 80GB | 2048 | 47.9 GB | 12.3 GB (3.9× less) |
| Qwen3-4B | 1×H100 80GB | 16384 | OOM | 63.8 GB |
| Qwen3-14B | 8×H100 FSDP2 | 16384 | 58.9 GB | 38.9 GB (1.5× less) |
| Qwen3-32B | 8×H100 FSDP2 | 8192 | OOM | 71.2 GB |
End-to-end, chunked NLL is consistently as fast or faster than nll — and it unlocks sequence lengths that don't fit at all under the standard path.
The chunked path also supports VLMs (https://github.com/huggingface/trl/pull/5684).
by @qgallouedec in https://github.com/huggingface/trl/pull/5575, https://github.com/huggingface/trl/pull/5676 and https://github.com/huggingface/trl/pull/5684
A new trl.experimental.openreward adapter plugs any environment speaking the Open Reward Standard (ORS) protocol into any TRL trainer accepting an environment_factory (GRPOTrainer, AsyncGRPOTrainer). One identifier wires all three trainer slots — dataset, factory, reward_func:
```python
from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardEnv

env = OpenRewardEnv("Eigent/SETA")  # or "http://localhost:8000"

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(...),
    train_dataset=env.dataset,
    environment_factory=env.factory,
    reward_funcs=env.reward_func,
)
```
Tools are bound dynamically from JSON Schema at construction (no per-env wrapper code), and env.dataset auto-derives task lists from the ORS task endpoints. The same code path works for envs hosted on the OpenReward platform, self-hosted on any container service, or running locally on localhost. A SETA training example is included.
by @adithya-s-k in https://github.com/huggingface/trl/pull/5696
Unit tests don't catch trainer-level numerical drift (gradient-accumulation normalization bugs, attention-impl divergence between eager, FA2, and kernels): these bugs silently shift the loss trajectory, and users only notice when their run no longer reproduces. (Cf. last year's transformers grad-accum bug, or the "We found two bugs in DeepSpeed" paper.)
A new opt-in pytest -m invariant suite asserts the loss / grad_norm trajectory of short end-to-end SFT/DPO runs against committed reference snapshots, with equivalence classes for configs that should produce identical trajectories (e.g. pdb=1, gas=8 ≡ default; eager ≡ FA2 ≡ kernels). Hardware-pinned to H100 80GB, real pretrained model, full_determinism, fixed seed. Initial coverage: 2 trainers × 2 invariance axes (grad-accum, attn-impl) × gradient-checkpointing equivalence.
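The shape of such a test, as a rough sketch (not TRL's actual test code): run_short_sft below is a hypothetical helper standing in for a fixed-seed, few-step training run that returns the per-step loss trajectory.

```python
import pytest

from tests.helpers import run_short_sft  # hypothetical helper, for illustration only


@pytest.mark.invariant
def test_grad_accum_equivalence():
    # Same effective batch size, so both runs should trace identical losses.
    default = run_short_sft(per_device_train_batch_size=8, gradient_accumulation_steps=1)
    accumulated = run_short_sft(per_device_train_batch_size=1, gradient_accumulation_steps=8)
    assert default == pytest.approx(accumulated, rel=1e-4)
```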
by @qgallouedec in https://github.com/huggingface/trl/pull/5686, https://github.com/huggingface/trl/pull/5688 and https://github.com/huggingface/trl/pull/5689
Three new pure helpers in trl.trainer.utils for measuring training efficiency:
- compute_flops_per_token(config, seq_len) — handles dense and MoE (Mixtral, Qwen3-MoE, DeepSeek-V2)
- compute_mfu(flops_per_token, tps, world_size, peak_flops) — Model FLOPs Utilization as a percentage
- adjusted_mfu(mfu, config, seq_len) — non-causal → causal-corrected (Llama / DS Ulysses convention)

by @AmineDiro in https://github.com/huggingface/trl/pull/5698
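A minimal usage sketch, assuming the helper names and argument order listed above; the model choice, throughput (tps), and peak_flops numbers are placeholders, not measurements.

```python
from transformers import AutoConfig
from trl.trainer.utils import adjusted_mfu, compute_flops_per_token, compute_mfu

config = AutoConfig.from_pretrained("Qwen/Qwen3-4B")  # model choice is illustrative
seq_len = 4096

# Theoretical FLOPs spent per trained token for this architecture and sequence length.
flops_per_token = compute_flops_per_token(config, seq_len)

# MFU from an observed throughput; 12_000 tok/s and ~989 TFLOPS (H100 dense BF16 peak)
# are assumed values for illustration.
mfu = compute_mfu(flops_per_token, tps=12_000, world_size=8, peak_flops=989e12)
print(f"MFU: {mfu:.1f}%  causal-adjusted: {adjusted_mfu(mfu, config, seq_len):.1f}%")
```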
GRPO's Liger-kernel integration is updated for Liger 0.8.0: delta two-sided clipping, use_bias_correction_kl, and SAPO/VESPO parameters are now forwarded into LigerFusedLinearGRPOLoss. The previous delta + use_liger_kernel guard is removed — both can be combined.
by @kashif in https://github.com/huggingface/trl/pull/5690
A new loss_type="sigmoid_norm" option for DPOConfig implements the per-token (length-normalized) DPO loss used by Tülu 3 / OLMo (paper §5.1.2 eq. 6) to mitigate length bias.
```python
from trl import DPOConfig, DPOTrainer

trainer = DPOTrainer(
    model="Qwen/Qwen3-4B",
    args=DPOConfig(loss_type="sigmoid_norm"),
    train_dataset=dataset,
)
```
by @BrownianNotion in https://github.com/huggingface/trl/pull/5406
Four more model families gain training-compatible chat templates with {% generation %} markers (assistant-only loss masking) and/or response schemas (tool-calling parsing):
- {% generation %} markers by @qgallouedec in https://github.com/huggingface/trl/pull/5675
- get_training_chat_template now also accepts a processor (not just a tokenizer) — useful for VLMs (https://github.com/huggingface/trl/pull/5560).
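A rough sketch of why the markers matter for training, assuming get_training_chat_template is importable from trl and returns a template string (both the import path and the return value are assumptions here, not confirmed API):

```python
from transformers import AutoTokenizer

from trl import get_training_chat_template  # import path assumed for illustration

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-4B")  # model choice is illustrative
# Assumed usage: swap in a training template that carries {% generation %} markers.
tokenizer.chat_template = get_training_chat_template(tokenizer)

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "4."},
]
# With {% generation %} markers present, transformers can return an assistant-token
# mask, which is what enables assistant-only loss masking during SFT.
enc = tokenizer.apply_chat_template(messages, return_dict=True, return_assistant_tokens_mask=True)
print(enc["assistant_masks"])  # 1 on assistant tokens, 0 everywhere else
```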
Another batch of alignment PRs this cycle. KTO and DPO are now structurally aligned across PEFT handling, model initialization, training-arg grouping, ref-logp precomputation, and metric handling — promotion of KTO out of experimental is imminent.
PRs (all by @albertvillanova): #5659, #5660, #5661, #5679, #5701, #5702, #5703, #5704, #5705, #5714.
- parallelism_config with cp_size>1 or sp_size>1 in GRPO/RLOO — fail fast at config init with a clear error instead of mid-training crash. By @kashif in https://github.com/huggingface/trl/pull/5699
- model_accepts_loss_kwargs=False in DPO and Reward by @albertvillanova in https://github.com/huggingface/trl/pull/5710
- _tokenizer attribute in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5566
- peft_config handling in core / experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5673 and https://github.com/huggingface/trl/pull/5674
- isinstance with is_peft_model / drop redundant is_peft_available by @albertvillanova in https://github.com/huggingface/trl/pull/5682 and https://github.com/huggingface/trl/pull/5683
- parse_response by @qgallouedec in https://github.com/huggingface/trl/pull/5561
- OffloadActivations.__exit__ now syncs the compute/offload streams and clears the stash dictionaries, preventing orphaned offload tensors from leaking onto a dead stream (~0.2 GiB/step accumulation observed during QLoRA vision training before the fix). By @butterwecksolutions in https://github.com/huggingface/trl/pull/5694 and https://github.com/huggingface/trl/pull/5700
- DistillationTrainer by @k1064190 in https://github.com/huggingface/trl/pull/5594
- GKDTrainer: fix return_outputs in the Liger kernel path by @roycho96 in https://github.com/huggingface/trl/pull/4688
- GKDTrainer: fix seq-KD wasted teacher forward by @roycho96 in https://github.com/huggingface/trl/pull/5726
- GKDTrainer: fix Liger fused JSD path computing wrong loss by @roycho96 in https://github.com/huggingface/trl/pull/5731
- peft_config to core / experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5664 and https://github.com/huggingface/trl/pull/5665
- peft_config type hint in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5666
- DistillationTrainer by @cmpatino in https://github.com/huggingface/trl/pull/5615
- Qwen3-4B-Instruct-2507 by @qgallouedec in https://github.com/huggingface/trl/pull/5586
- Qwen/Qwen3-30B-A3B by @qgallouedec in https://github.com/huggingface/trl/pull/5716
- {% generation %} markers for Cohere2 chat template by @qgallouedec in https://github.com/huggingface/trl/pull/5675
- get_training_chat_template by @qgallouedec in https://github.com/huggingface/trl/pull/5560

Full Changelog: https://github.com/huggingface/trl/compare/v1.3.0...v1.4.0