Features
Chunked cross-entropy loss for SFT (up to –50% VRAM)
<img width="2704" height="1455" alt="chunked_loss_idea" src="https://github.com/user-attachments/assets/3957f39b-3e71-4465-949a-22b2cf894d03" />

A new loss_type="chunked_nll" option drastically reduces peak activation memory in SFT by avoiding the full [batch × seq × vocab] logits tensor. Ignored-label tokens are dropped before the lm_head matmul, and the cross-entropy is computed over the remaining tokens in checkpointed chunks (default chunk_size=256, the sweet spot consistent across model sizes and sequence lengths).
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(loss_type="chunked_nll"),
    train_dataset=dataset,
)
trainer.train()
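To make the idea concrete, here is a minimal pure-Python sketch of the forward arithmetic only. It is not TRL's implementation: the function names are hypothetical, hidden states and labels are assumed to be already shifted and aligned, and the activation checkpointing that the real path applies to each chunk is left out.

```python
import math

IGNORE_INDEX = -100  # label value conventionally skipped by the loss


def token_nll(logits, target):
    """Negative log-likelihood of `target` under softmax(logits), via log-sum-exp."""
    peak = max(logits)
    log_z = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_z - logits[target]


def project(hidden, lm_head):
    """One row of the lm_head matmul: hidden vector -> vocab-sized logits."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in lm_head]


def chunked_nll(hidden_states, labels, lm_head, chunk_size=256):
    """Mean NLL over non-ignored tokens, materializing logits one chunk at a time."""
    # 1. Drop ignored positions before any logits are computed.
    kept = [(h, y) for h, y in zip(hidden_states, labels) if y != IGNORE_INDEX]
    total = 0.0
    # 2. Only `chunk_size` rows of logits exist at any one time.
    for start in range(0, len(kept), chunk_size):
        chunk = kept[start:start + chunk_size]
        logits = [project(h, lm_head) for h, _ in chunk]
        total += sum(token_nll(row, y) for row, (_, y) in zip(logits, chunk))
    return total / max(len(kept), 1)
```

The result matches a single-pass computation up to floating-point summation order; what changes is that the largest live logits buffer is chunk_size × vocab rather than batch × seq × vocab.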
Peak GPU memory, AdamW fp32:
| Model | Hardware | Seq | nll | chunked_nll |
|---|---|---|---|---|
| Qwen3-1.7B + LoRA | 1×H100 80GB | 2048 | 47.9 GB | 12.3 GB (3.9× less) |
| Qwen3-4B | 1×H100 80GB | 16384 | OOM | 63.8 GB |
| Qwen3-14B | 8×H100 FSDP2 | 16384 | 58.9 GB | 38.9 GB (1.5× less) |
| Qwen3-32B | 8×H100 FSDP2 | 8192 | OOM | 71.2 GB |
End-to-end, chunked_nll is consistently as fast as or faster than nll, and it unlocks sequence lengths that don't fit at all under the standard path.
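A rough back-of-the-envelope calculation shows where the savings come from. The vocabulary size below (151,936) is an assumption for Qwen3-family models, logits are assumed to be held in fp32, and the batch is a single sequence; real peak memory also includes gradients and temporary copies, so treat these as order-of-magnitude figures rather than a reproduction of the table.

```python
def logits_bytes(num_tokens, vocab_size, bytes_per_element=4):
    """Size of a [num_tokens x vocab_size] logits buffer in bytes (fp32 by default)."""
    return num_tokens * vocab_size * bytes_per_element


VOCAB_SIZE = 151_936  # assumed vocabulary size, see the note above

full = logits_bytes(16_384, VOCAB_SIZE)  # one 16k-token sequence, all positions
chunk = logits_bytes(256, VOCAB_SIZE)    # one chunk at the default chunk_size

print(f"full logits:  {full / 2**30:.2f} GiB")   # full logits:  9.27 GiB
print(f"one chunk:    {chunk / 2**20:.1f} MiB")  # one chunk:    148.4 MiB
```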
The chunked path also supports VLMs (https://github.com/huggingface/trl/pull/5684).
by @qgallouedec in https://github.com/huggingface/trl/pull/5575, https://github.com/huggingface/trl/pull/5676 and https://github.com/huggingface/trl/pull/5684
OpenReward Standard environment adapter (experimental)
A new trl.experimental.openreward adapter plugs any environment speaking the Open Reward Standard (ORS) protocol into any TRL trainer accepting an environment_factory (GRPOTrainer, AsyncGRPOTrainer). One identifier wires all three trainer slots — dataset, factory, reward_func:
from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardEnv

env = OpenRewardEnv("Eigent/SETA")  # or "http://localhost:8000"

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(...),
    train_dataset=env.dataset,
    environment_factory=env.factory,
    reward_funcs=env.reward_func,
)
Tools are bound dynamically from JSON Schema at construction (no per-env wrapper code), and env.dataset auto-derives task lists from the ORS task endpoints. The same code path works for envs hosted on the OpenReward platform, self-hosted on any container service, or running locally. A SETA training example is included.
by @adithya-s-k in https://github.com/huggingface/trl/pull/5696
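As an illustration of what binding a tool from JSON Schema can look like, here is a small standard-library sketch. Everything in it is hypothetical: the spec layout, the bind_tool helper, and the send callback are assumptions made for the example and are not the adapter's code or the ORS wire format.

```python
import inspect
from typing import Any

# Hypothetical tool description, loosely shaped like a JSON Schema object.
SPEC = {
    "name": "run_command",
    "description": "Run a shell command in the sandbox.",
    "parameters": {
        "type": "object",
        "properties": {
            "command": {"type": "string"},
            "timeout": {"type": "integer"},
        },
        "required": ["command"],
    },
}

_JSON_TO_PY = {"string": str, "integer": int, "number": float, "boolean": bool}


def bind_tool(spec, send):
    """Build a keyword-only Python callable from `spec`; calls are forwarded to `send`."""
    props = spec["parameters"].get("properties", {})
    required = set(spec["parameters"].get("required", []))
    signature = inspect.Signature([
        inspect.Parameter(
            name,
            inspect.Parameter.KEYWORD_ONLY,
            default=inspect.Parameter.empty if name in required else None,
            annotation=_JSON_TO_PY.get(schema.get("type"), Any),
        )
        for name, schema in props.items()
    ])

    def tool(**kwargs):
        # bind() raises TypeError for missing required or unknown arguments.
        bound = signature.bind(**kwargs)
        return send(spec["name"], dict(bound.arguments))

    tool.__name__ = spec["name"]
    tool.__doc__ = spec["description"]
    tool.__signature__ = signature  # lets inspect-based tooling see the real parameters
    return tool
```

Generating the signature from the schema means no per-environment wrapper has to be written by hand, which is the property the adapter description highlights.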
Training-invariance test suite
Unit tests don't catch trainer-level numerical drift, such as gradient-accumulation normalization bugs or attention-implementation divergence (eager ↔ FA2 / kernels). Such bugs silently shift the loss trajectory, and users only notice when their run no longer reproduces. (Cf. last year's transformers grad-accum bug, or the "We found two bugs in DeepSpeed" paper.)
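For intuition on how a gradient-accumulation normalization bug shifts the loss without raising any error, consider this small worked example. It is a simplified illustration with made-up per-token losses, not code from any trainer: averaging each micro-batch first and then averaging those means gives a different number from averaging over all tokens whenever micro-batches contain different token counts.

```python
def mean_of_means(micro_batches):
    """Average each micro-batch's mean loss: the normalization that causes drift."""
    return sum(sum(mb) / len(mb) for mb in micro_batches) / len(micro_batches)


def global_token_mean(micro_batches):
    """Average over every token, which is what one large batch would compute."""
    losses = [loss for mb in micro_batches for loss in mb]
    return sum(losses) / len(losses)


# Two accumulation steps with different numbers of non-ignored tokens.
micro_batches = [[2.0, 2.0, 2.0, 2.0, 2.0, 2.0], [5.0, 5.0]]

print(mean_of_means(micro_batches))      # 3.5
print(global_token_mean(micro_batches))  # 2.75
```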
A new opt-in pytest -m invariant suite asserts the loss / grad_norm trajectory of short end-to-end SFT/DPO runs against committed reference snapshots, with equivalence classes for configs that should produce identical trajectories (e.g. pdb=1, gas=8 ≡ default; eager ≡ FA2 ≡ kernels). Hardware-pinned to H100 80GB, real pretrained model, full_determinism, fixed seed. Initial coverage: 2 trainers × 2 invariance axes (grad-accum, attn-impl) × gradient-checkpointing equivalence.
by @qgallouedec in https://github.com/huggingface/trl/pull/5686, https://github.com/huggingface/trl/pull/5688 and https://github.com/huggingface/trl/pull/5689
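The kind of check such a suite performs can be sketched as follows. This is a hypothetical illustration, not the suite's code: the helper name, the snapshot layout, the tolerance, and all numbers are assumptions chosen for the example.

```python
import math


def assert_trajectory_matches(trajectory, reference, rel_tol=1e-4):
    """Compare per-step loss / grad_norm logs against a stored reference."""
    assert len(trajectory) == len(reference), "different number of logged steps"
    for step, (got, want) in enumerate(zip(trajectory, reference)):
        for key in ("loss", "grad_norm"):
            if not math.isclose(got[key], want[key], rel_tol=rel_tol):
                raise AssertionError(
                    f"step {step}: {key}={got[key]!r} drifted from reference {want[key]!r}"
                )


# Hypothetical reference snapshot and two runs from the same equivalence class.
reference = [{"loss": 2.310, "grad_norm": 1.050}, {"loss": 2.105, "grad_norm": 0.980}]
runs = {
    "default": [{"loss": 2.310, "grad_norm": 1.050}, {"loss": 2.105, "grad_norm": 0.980}],
    "gas=8": [{"loss": 2.31001, "grad_norm": 1.050}, {"loss": 2.105, "grad_norm": 0.98001}],
}
for name, trajectory in runs.items():
    assert_trajectory_matches(trajectory, reference)  # every member must match
```

Comparing every member of an equivalence class against one committed snapshot catches both drift over time and divergence between configurations that are supposed to be interchangeable.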
MFU helpers
Three new pure helpers in trl.trainer.utils for measuring training efficiency:
- compute_flops_per_token(config, seq_len): handles dense and MoE (Mixtral, Qwen3-MoE, DeepSeek-V2)
- compute_mfu(flops_per_token, tps, world_size, peak_flops): Model FLOPs Utilization as a percentage
- adjusted_mfu(mfu, config, seq_len): non-causal → causal-corrected (Llama / DS Ulysses convention)
by @AmineDiro in https://github.com/huggingface/trl/pull/5698
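For readers unfamiliar with the metric, the sketch below shows the usual MFU arithmetic. It deliberately uses its own hypothetical function names rather than the TRL helpers, whose internals are not described here. The assumptions: the common 6·N estimate for weight matmuls plus a non-causal attention term, tokens per second measured across all devices, and a per-device peak of 989e12 FLOP/s (an assumed dense BF16 figure for an H100 SXM). The model shape and throughput are illustrative, not measurements.

```python
def dense_flops_per_token(num_params, num_layers, hidden_size, seq_len):
    """Approximate training FLOPs per token for a dense decoder.

    6 * N covers forward + backward through the weights; the second term
    counts non-causal attention score and value matmuls.
    """
    return 6 * num_params + 12 * num_layers * hidden_size * seq_len


def mfu_percent(flops_per_token, tokens_per_second, world_size, peak_flops_per_device):
    """Achieved model FLOP/s as a percentage of aggregate hardware peak."""
    achieved = flops_per_token * tokens_per_second
    return 100.0 * achieved / (world_size * peak_flops_per_device)


# Illustrative inputs only.
flops = dense_flops_per_token(num_params=4e9, num_layers=36, hidden_size=2560, seq_len=4096)
mfu = mfu_percent(flops, tokens_per_second=15_000, world_size=1, peak_flops_per_device=989e12)
print(f"{mfu:.1f}% MFU")
```

Because the attention term above counts the full (non-causal) score matrix, it overstates the work a causal model does, which is the gap a causal correction is meant to close.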
GRPO Liger kernel update (Liger 0.8.0)
GRPO's Liger-kernel integration is updated for Liger 0.8.0: delta two-sided clipping, use_bias_correction_kl, and SAPO/VESPO parameters are now forwarded into LigerFusedLinearGRPOLoss. The previous delta + use_liger_kernel guard is removed — both can be combined.
by @kashif in https://github.com/huggingface/trl/pull/5690
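A per-token sketch of what two-sided clipping changes, written in plain Python for clarity. This assumes the variant in which the importance ratio is capped at delta before the usual clipped surrogate is formed; the function name and default epsilon values are hypothetical, and this is not the Liger kernel's fused implementation.

```python
def grpo_token_loss(ratio, advantage, eps_low=0.2, eps_high=0.2, delta=None):
    """Clipped surrogate loss for one token, with an optional upper cap on the ratio."""
    # Two-sided clipping: also bound the unclipped term from above by delta.
    coef_1 = ratio if delta is None else min(ratio, delta)
    # Standard PPO-style clip of the ratio to [1 - eps_low, 1 + eps_high].
    coef_2 = min(max(ratio, 1 - eps_low), 1 + eps_high)
    return -min(coef_1 * advantage, coef_2 * advantage)


# With a negative advantage and a very large ratio, the plain objective is unbounded...
print(grpo_token_loss(ratio=10.0, advantage=-1.0))             # 10.0
# ...while the delta cap limits how much one token can contribute.
print(grpo_token_loss(ratio=10.0, advantage=-1.0, delta=2.0))  # 2.0
```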
Length-normalized DPO sigmoid loss
A new loss_type="sigmoid_norm" option for DPOConfig implements the per-token (length-normalized) DPO loss used by Tülu 3 / OLMo (paper §5.1.2 eq. 6) to mitigate length bias.
from trl import DPOConfig, DPOTrainer

trainer = DPOTrainer(
    model="Qwen/Qwen3-4B",
    args=DPOConfig(loss_type="sigmoid_norm"),
    train_dataset=dataset,
)
by @BrownianNotion in https://github.com/huggingface/trl/pull/5406
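The arithmetic behind a length-normalized sigmoid loss can be sketched in a few lines. This is an illustration under stated assumptions, not the trainer's code: inputs are per-token log-probabilities for one chosen and one rejected completion, the helper names are hypothetical, and the beta default is arbitrary.

```python
import math


def neg_log_sigmoid(x):
    """Numerically stable -log(sigmoid(x))."""
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def mean_log_ratio(policy_logps, ref_logps):
    """Average per-token log(pi / pi_ref) over one completion."""
    return sum(p - r for p, r in zip(policy_logps, ref_logps)) / len(policy_logps)


def length_normalized_dpo_loss(chosen, rejected, beta=0.1):
    """`chosen` and `rejected` are (policy_logps, ref_logps) pairs of per-token values."""
    margin = beta * (mean_log_ratio(*chosen) - mean_log_ratio(*rejected))
    return neg_log_sigmoid(margin)
```

Dividing each summed log-ratio by its completion length means a longer completion no longer gains or loses margin simply by containing more tokens, which is how the normalization addresses length bias.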
Even more training chat templates
Four more model families gain training-compatible chat templates with {% generation %} markers (assistant-only loss masking) and/or response schemas (tool-calling parsing):
- Cohere training template by @dschulmeist in https://github.com/huggingface/trl/pull/5627
- Cohere2 {% generation %} markers by @qgallouedec in https://github.com/huggingface/trl/pull/5675
- Gemma 3 training template by @hwanython in https://github.com/huggingface/trl/pull/5685
- Qwen3-2507 training template by @SwayamInSync in https://github.com/huggingface/trl/pull/5574
- Qwen2.5 response schema by @aazizyan in https://github.com/huggingface/trl/pull/5728
get_training_chat_template now also accepts a processor (not just a tokenizer) — useful for VLMs (https://github.com/huggingface/trl/pull/5560).
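To show what assistant-only loss masking amounts to once a template has marked the generated spans, here is a minimal sketch. The token ids and mask are made up, the helper name is hypothetical, and -100 is assumed as the ignore index, which is the conventional value in PyTorch-based trainers.

```python
IGNORE_INDEX = -100  # positions with this label contribute nothing to the loss


def mask_non_assistant(input_ids, assistant_mask):
    """Keep labels only where the template marked tokens as assistant-generated."""
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


input_ids = [11, 12, 13, 21, 22, 23, 24]  # prompt tokens, then the assistant reply
assistant_mask = [0, 0, 0, 1, 1, 1, 1]    # 1 inside a generation-marked span

print(mask_non_assistant(input_ids, assistant_mask))
# [-100, -100, -100, 21, 22, 23, 24]
```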
KTO ↔ DPO alignment: closing in on graduation
Another batch of alignment PRs this cycle. KTO and DPO are now structurally aligned across PEFT handling, model initialization, training-arg grouping, ref-logp precomputation, and metric handling — promotion of KTO out of experimental is imminent.
PRs (all by @albertvillanova): #5659, #5660, #5661, #5679, #5701, #5702, #5703, #5704, #5705, #5714.
Other
- Reject parallelism_config with cp_size>1 or sp_size>1 in GRPO/RLOO: fail fast at config init with a clear error instead of a mid-training crash. By @kashif in https://github.com/huggingface/trl/pull/5699
- Fail early for unsupported PEFT + Liger Kernel in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/5709
- Explicitly set model_accepts_loss_kwargs=False in DPO and Reward by @albertvillanova in https://github.com/huggingface/trl/pull/5710
- Set _tokenizer attribute in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5566
- Simplify peft_config handling in core / experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5673 and https://github.com/huggingface/trl/pull/5674
- Replace isinstance with is_peft_model / drop redundant is_peft_available by @albertvillanova in https://github.com/huggingface/trl/pull/5682 and https://github.com/huggingface/trl/pull/5683
- Reduce inconsistency across trainer test files by @qgallouedec in https://github.com/huggingface/trl/pull/5678
- Refactor tiny-model generation scripts by @qgallouedec in https://github.com/huggingface/trl/pull/5637
- Revert VLM support in parse_response by @qgallouedec in https://github.com/huggingface/trl/pull/5561
Fixes
- 5 GB+ CUDA memory leak in activation offloading: OffloadActivations.__exit__ now syncs the compute/offload streams and clears the stash dictionaries, preventing orphaned offload tensors from leaking onto a dead stream (~0.2 GiB/step accumulation observed during QLoRA vision training before the fix). By @butterwecksolutions in https://github.com/huggingface/trl/pull/5694 and https://github.com/huggingface/trl/pull/5700
- Fix reverse-KL server path NaN on variable completion length in DistillationTrainer by @k1064190 in https://github.com/huggingface/trl/pull/5594
- GKDTrainer: fix return_outputs in the Liger kernel path by @roycho96 in https://github.com/huggingface/trl/pull/4688
- GKDTrainer: fix seq-KD wasted teacher forward by @roycho96 in https://github.com/huggingface/trl/pull/5726
- GKDTrainer: fix Liger fused JSD path computing wrong loss by @roycho96 in https://github.com/huggingface/trl/pull/5731
- Fix missing PEFT validation when passing peft_config to core / experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5664 and https://github.com/huggingface/trl/pull/5665
- Fix peft_config type hint in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5666
- Fix discarded assertion message in trainer parameter checks by @qgallouedec in https://github.com/huggingface/trl/pull/5677
- Fix typo in model name in README by @qgallouedec in https://github.com/huggingface/trl/pull/5711
Documentation and Examples
- Upload testing suite for DistillationTrainer by @cmpatino in https://github.com/huggingface/trl/pull/5615
CI
- Fix OOM in CI by reducing batch size in VLM SFT tests by @albertvillanova in https://github.com/huggingface/trl/pull/5687
- Fix OOM in CI by reducing image size of tiny Gemma 3 model by @albertvillanova in https://github.com/huggingface/trl/pull/5680
- Fix OOM in CI test reruns due to GPU memory leak from traceback frame locals by @albertvillanova in https://github.com/huggingface/trl/pull/5681
- Add tiny Qwen3-4B-Instruct-2507 by @qgallouedec in https://github.com/huggingface/trl/pull/5586
- Align tiny Qwen3 MoE config with Qwen/Qwen3-30B-A3B by @qgallouedec in https://github.com/huggingface/trl/pull/5716
- Fix GRPO VLM tests: multimodal training requires conversational prompts by @kaixuanliu in https://github.com/huggingface/trl/pull/5550
- Regenerate invariance data + relax the tolerance by @qgallouedec in https://github.com/huggingface/trl/pull/5688
New Contributors
- @dschulmeist made their first contribution in https://github.com/huggingface/trl/pull/5627
- @k1064190 made their first contribution in https://github.com/huggingface/trl/pull/5594
- @butterwecksolutions made their first contribution in https://github.com/huggingface/trl/pull/5694
- @hwanython made their first contribution in https://github.com/huggingface/trl/pull/5685
- @BrownianNotion made their first contribution in https://github.com/huggingface/trl/pull/5406
- @adithya-s-k made their first contribution in https://github.com/huggingface/trl/pull/5696
- @roycho96 made their first contribution in https://github.com/huggingface/trl/pull/4688
- @aazizyan made their first contribution in https://github.com/huggingface/trl/pull/5728
What's Changed
- ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/5648
- Align KTO with DPO: Remove model_init parameter by @albertvillanova in https://github.com/huggingface/trl/pull/5659
- Align KTO with DPO: Remove preprocess_logits_for_metrics parameter by @albertvillanova in https://github.com/huggingface/trl/pull/5660
- Add tiny Qwen3-4B-Instruct-2507 by @qgallouedec in https://github.com/huggingface/trl/pull/5586
- Chunked cross-entropy loss for SFT (up to –50% VRAM) by @qgallouedec in https://github.com/huggingface/trl/pull/5575
- Fix missing PEFT validation when passing peft_config to core trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5664
- Fix missing PEFT availability check when passing peft_config to experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5665
- Align KTO with DPO: Align PEFT handling by @albertvillanova in https://github.com/huggingface/trl/pull/5661
- Set _tokenizer attribute in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5566
- Fix peft_config type hint in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5666
- Add Cohere training chat template by @dschulmeist in https://github.com/huggingface/trl/pull/5627
- Simplify peft_config handling in core trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5673
- Simplify peft_config handling in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5674
- fix(distillation): reverse-KL server path NaN on variable completion length by @k1064190 in https://github.com/huggingface/trl/pull/5594
- Fix discarded assertion message in trainer parameter checks by @qgallouedec in https://github.com/huggingface/trl/pull/5677
- Align KTO with DPO: Replace direct type check with is_peft_model by @albertvillanova in https://github.com/huggingface/trl/pull/5679
- Remove redundant is_peft_available from core trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5682
- Replace isinstance with is_peft_model in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5683
- Upload testing suite for DistillationTrainer by @cmpatino in https://github.com/huggingface/trl/pull/5615
- Fix OOM in CI by reducing batch size in VLM SFT tests by @albertvillanova in https://github.com/huggingface/trl/pull/5687
- Fix OOM in CI by reducing image size of tiny Gemma3 model by @albertvillanova in https://github.com/huggingface/trl/pull/5680
- Fix OOM in CI test reruns due to GPU memory leak from traceback frame locals by @albertvillanova in https://github.com/huggingface/trl/pull/5681
- Add training-invariance tests by @qgallouedec in https://github.com/huggingface/trl/pull/5686
- Regenerate invariance data + relax the tolerance by @qgallouedec in https://github.com/huggingface/trl/pull/5688
- fix: prevent RuntimeError crash in activation offloading for non-contiguous tensors by @butterwecksolutions in https://github.com/huggingface/trl/pull/5694
- [GRPO] update Liger-kernel grpo loss (delta, vespo, KL bias correction) by @kashif in https://github.com/huggingface/trl/pull/5690
- Extend invariant suite with gradient-checkpointing equivalence by @qgallouedec in https://github.com/huggingface/trl/pull/5689
- Add Gemma 3 training chat template by @hwanython in https://github.com/huggingface/trl/pull/5685
- Add {% generation %} markers for Cohere2 chat template by @qgallouedec in https://github.com/huggingface/trl/pull/5675
- Add length-normalized sigmoid loss type to DPO trainer by @BrownianNotion in https://github.com/huggingface/trl/pull/5406
- Add training chat template for Qwen3-2507 by @SwayamInSync in https://github.com/huggingface/trl/pull/5574
- Align KTO with DPO: Remove enforcement of causal language models by @albertvillanova in https://github.com/huggingface/trl/pull/5701
- Align KTO with DPO: Remove duplicate import of PreTrainedModel by @albertvillanova in https://github.com/huggingface/trl/pull/5702
- Align KTO with DPO: Simplify max_length init logic by @albertvillanova in https://github.com/huggingface/trl/pull/5703
- Align KTO with DPO: Group training arguments by @albertvillanova in https://github.com/huggingface/trl/pull/5704
- Align KTO with DPO: Use _metrics attribute by @albertvillanova in https://github.com/huggingface/trl/pull/5705
- Reduce inconsistency across trainer test files by @qgallouedec in https://github.com/huggingface/trl/pull/5678
- Refactor tiny-model generation scripts by @qgallouedec in https://github.com/huggingface/trl/pull/5637
- Accept processor in get_training_chat_template by @qgallouedec in https://github.com/huggingface/trl/pull/5560
- Enable chunked NLL loss with PEFT in SFT by @qgallouedec in https://github.com/huggingface/trl/pull/5676
- Fix GRPO VLM tests: Multimodal training requires conversational prompts by @kaixuanliu in https://github.com/huggingface/trl/pull/5550
- [experimental] Add OpenReward Standard environment adapter by @adithya-s-k in https://github.com/huggingface/trl/pull/5696
- GKDTrainer: Fix return_outputs in Liger kernel path and update tests by @roycho96 in https://github.com/huggingface/trl/pull/4688
- Reject parallelism_config with cp_size>1 or sp_size>1 in GRPO/RLOO by @kashif in https://github.com/huggingface/trl/pull/5699
- Fix typo in model name in README by @qgallouedec in https://github.com/huggingface/trl/pull/5711
- Explicitly set model_accepts_loss_kwargs=False in DPO and Reward by @albertvillanova in https://github.com/huggingface/trl/pull/5710
- Fail early for unsupported PEFT + Liger Kernel in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/5709
- Revert VLM support in parse_response by @qgallouedec in https://github.com/huggingface/trl/pull/5561
- Align KTO with DPO: Align _precompute_ref_logps by @albertvillanova in https://github.com/huggingface/trl/pull/5714
- fix: prevent 5 GB+ CUDA memory leak in activation offloading by syncing streams and clear stashes in OffloadActivations.__exit__ by @butterwecksolutions in https://github.com/huggingface/trl/pull/5700
- Align tiny Qwen3 MoE config with Qwen/Qwen3-30B-A3B by @qgallouedec in https://github.com/huggingface/trl/pull/5716
- Add MFU helpers by @AmineDiro in https://github.com/huggingface/trl/pull/5698
- [GKD] Fix seq kd wasted teacher forward by @roycho96 in https://github.com/huggingface/trl/pull/5726
- Add Qwen2.5 response schema by @aazizyan in https://github.com/huggingface/trl/pull/5728
- Enable chunked NLL loss with VLM in SFT by @qgallouedec in https://github.com/huggingface/trl/pull/5684
- [GKD] Fix Liger fused JSD path computing wrong loss by @roycho96 in https://github.com/huggingface/trl/pull/5731
- Release: v1.4 by @qgallouedec in https://github.com/huggingface/trl/pull/5732
Full Changelog: https://github.com/huggingface/trl/compare/v1.3.0...v1.4.0
