
v1.4.0


Features

Chunked cross-entropy loss for SFT (up to –50% VRAM)

<img width="2704" height="1455" alt="chunked_loss_idea" src="https://github.com/user-attachments/assets/3957f39b-3e71-4465-949a-22b2cf894d03" />

A new loss_type="chunked_nll" option drastically reduces peak activation memory in SFT by avoiding the full [batch × seq × vocab] logits tensor. Ignored-label tokens are dropped before the lm_head matmul, and the cross-entropy is computed over the remaining tokens in checkpointed chunks (default chunk_size=256, a sweet spot that held consistently across model sizes and sequence lengths).

from datasets import load_dataset
from trl import SFTConfig, SFTTrainer

dataset = load_dataset("trl-lib/Capybara", split="train")  # any SFT-formatted dataset works

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(loss_type="chunked_nll"),
    train_dataset=dataset,
)
trainer.train()
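
Under the hood, the idea is roughly the following. This is a minimal sketch, not TRL's actual implementation; the function name, tensor shapes, and the use of torch.utils.checkpoint are assumptions:

import torch
from torch.utils.checkpoint import checkpoint

def chunked_nll(hidden, labels, lm_head, chunk_size=256):
    # Cross-entropy without materializing the full [batch*seq, vocab] logits tensor.
    hidden = hidden.flatten(0, 1)   # [batch*seq, hidden_dim]
    labels = labels.flatten()       # [batch*seq]
    keep = labels != -100           # drop ignored-label tokens *before* the lm_head matmul
    hidden, labels = hidden[keep], labels[keep]

    def chunk_loss(h, y):           # activations recomputed in backward via checkpointing
        return torch.nn.functional.cross_entropy(lm_head(h), y, reduction="sum")

    total = sum(
        checkpoint(chunk_loss, hidden[i : i + chunk_size], labels[i : i + chunk_size], use_reentrant=False)
        for i in range(0, hidden.size(0), chunk_size)
    )
    return total / labels.numel()   # mean NLL over the kept tokens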

Peak GPU memory, AdamW fp32:

| Model | Hardware | Seq len | nll | chunked_nll |
|---|---|---|---|---|
| Qwen3-1.7B + LoRA | 1×H100 80GB | 2048 | 47.9 GB | 12.3 GB (3.9× less) |
| Qwen3-4B | 1×H100 80GB | 16384 | OOM | 63.8 GB |
| Qwen3-14B | 8×H100 FSDP2 | 16384 | 58.9 GB | 38.9 GB (1.5× less) |
| Qwen3-32B | 8×H100 FSDP2 | 8192 | OOM | 71.2 GB |

End-to-end, chunked NLL is consistently as fast or faster than nll — and it unlocks sequence lengths that don't fit at all under the standard path.

The chunked path also supports VLMs (https://github.com/huggingface/trl/pull/5684).

by @qgallouedec in https://github.com/huggingface/trl/pull/5575, https://github.com/huggingface/trl/pull/5676 and https://github.com/huggingface/trl/pull/5684

OpenReward Standard environment adapter (experimental)

A new trl.experimental.openreward adapter plugs any environment speaking the Open Reward Standard (ORS) protocol into any TRL trainer accepting an environment_factory (GRPOTrainer, AsyncGRPOTrainer). One identifier wires all three trainer slots — dataset, factory, reward_func:

from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardEnv

env = OpenRewardEnv("Eigent/SETA")  # or "http://localhost:8000"

trainer = GRPOTrainer(
    model="Qwen/Qwen3-4B",
    args=GRPOConfig(...),
    train_dataset=env.dataset,
    environment_factory=env.factory,
    reward_funcs=env.reward_func,
)

Tools are bound dynamically from JSON Schema at construction (no per-env wrapper code), and env.dataset autoderives task lists from the ORS task endpoints. The same code path works for envs hosted on the OpenReward platform, self-hosted on any container service, or running locally on localhost. A SETA training example is included.

by @adithya-s-k in https://github.com/huggingface/trl/pull/5696

Training-invariance test suite

Unit tests don't catch trainer-level numerical drift such as gradient-accumulation normalization bugs or attention-implementation divergence (eager ↔ FA2 ↔ kernels): these bugs silently shift the loss trajectory, and users only notice when their run no longer reproduces. (Cf. last year's transformers grad-accum bug, or the "We found two bugs in DeepSpeed" paper.)

A new opt-in pytest -m invariant suite asserts the loss / grad_norm trajectory of short end-to-end SFT/DPO runs against committed reference snapshots, with equivalence classes for configs that should produce identical trajectories (e.g. per_device_batch_size=1 with grad-accum 8 ≡ the default batching; eager ≡ FA2 ≡ kernels). Runs are hardware-pinned to an H100 80GB with a real pretrained model, full_determinism, and a fixed seed. Initial coverage: 2 trainers × 2 invariance axes (grad-accum, attn-impl) × gradient-checkpointing equivalence.
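
In spirit, each test looks something like the following sketch (the run_short_sft fixture and snapshot path are hypothetical; the real suite's helpers may differ):

import json

import pytest

@pytest.mark.invariant
def test_grad_accum_invariance(run_short_sft):
    # hypothetical fixture: runs a seeded, fully deterministic short SFT job and
    # returns per-step {"loss": [...], "grad_norm": [...]} trajectories
    default = run_short_sft(per_device_train_batch_size=8, gradient_accumulation_steps=1)
    accum = run_short_sft(per_device_train_batch_size=1, gradient_accumulation_steps=8)
    with open("tests/snapshots/sft_default.json") as f:  # committed reference snapshot
        reference = json.load(f)
    for run in (default, accum):
        assert run["loss"] == pytest.approx(reference["loss"], rel=1e-4)
        assert run["grad_norm"] == pytest.approx(reference["grad_norm"], rel=1e-4)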

by @qgallouedec in https://github.com/huggingface/trl/pull/5686, https://github.com/huggingface/trl/pull/5688 and https://github.com/huggingface/trl/pull/5689

MFU helpers

Three new pure helpers in trl.trainer.utils for measuring training efficiency:

  • compute_flops_per_token(config, seq_len) — handles dense and MoE (Mixtral, Qwen3-MoE, DeepSeek-V2)
  • compute_mfu(flops_per_token, tps, world_size, peak_flops) — Model FLOPs Utilization as a percentage
  • adjusted_mfu(mfu, config, seq_len) — non-causal → causal-corrected (Llama / DS Ulysses convention)
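
A usage sketch chaining the three helpers (names and signatures are from the list above; the throughput value and the H100 BF16 dense peak of 989 TFLOP/s are illustrative assumptions):

from transformers import AutoConfig
from trl.trainer.utils import adjusted_mfu, compute_flops_per_token, compute_mfu

config = AutoConfig.from_pretrained("Qwen/Qwen3-4B")
flops_per_token = compute_flops_per_token(config, seq_len=8192)
mfu = compute_mfu(
    flops_per_token,
    tps=11_500,          # measured training throughput in tokens/s (illustrative)
    world_size=8,
    peak_flops=989e12,   # H100 SXM BF16 dense peak (illustrative)
)
print(f"MFU: {mfu:.1f}% (causal-adjusted: {adjusted_mfu(mfu, config, seq_len=8192):.1f}%)")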

by @AmineDiro in https://github.com/huggingface/trl/pull/5698

GRPO Liger kernel update (Liger 0.8.0)

GRPO's Liger-kernel integration is updated for Liger 0.8.0: delta two-sided clipping, use_bias_correction_kl, and SAPO/VESPO parameters are now forwarded into LigerFusedLinearGRPOLoss. The previous delta + use_liger_kernel guard is removed — both can be combined.
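
A minimal sketch of the newly allowed combination, using the parameter names from the note above (the delta value is illustrative, and other GRPOConfig arguments are omitted):

from trl import GRPOConfig

# previously the delta + use_liger_kernel guard made this pair raise
args = GRPOConfig(
    use_liger_kernel=True,  # route the loss through LigerFusedLinearGRPOLoss
    delta=2.0,              # two-sided clipping threshold (illustrative value)
)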

by @kashif in https://github.com/huggingface/trl/pull/5690

Length-normalized DPO sigmoid loss

A new loss_type="sigmoid_norm" option for DPOConfig implements the per-token (length-normalized) DPO loss used by Tülu 3 / OLMo (paper §5.1.2 eq. 6) to mitigate length bias.
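
For reference, the objective normalizes each sequence log-ratio by its response length before the sigmoid. A sketch of eq. 6 in standard DPO notation ($y_w$/$y_l$ the chosen/rejected responses, $\beta$ the DPO temperature; the paper's exact notation may differ):

$$\mathcal{L}_{\text{sigmoid\_norm}} = -\log \sigma\!\left(\frac{\beta}{|y_w|}\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \frac{\beta}{|y_l|}\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)$$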

from datasets import load_dataset
from trl import DPOConfig, DPOTrainer

dataset = load_dataset("trl-lib/ultrafeedback_binarized", split="train")  # any preference dataset works

trainer = DPOTrainer(
    model="Qwen/Qwen3-4B",
    args=DPOConfig(loss_type="sigmoid_norm"),
    train_dataset=dataset,
)
trainer.train()

by @BrownianNotion in https://github.com/huggingface/trl/pull/5406

Even more training chat templates

Four more model families gain training-compatible chat templates with {% generation %} markers (assistant-only loss masking) and/or response schemas (tool-calling parsing).

get_training_chat_template now also accepts a processor (not just a tokenizer) — useful for VLMs (https://github.com/huggingface/trl/pull/5560).
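
A hedged sketch of the processor path (only "accepts a processor" is stated above; the import path, exact signature, and string return value are assumptions):

from transformers import AutoProcessor
from trl import get_training_chat_template  # import path assumed

processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")
# assumed to return a template string carrying {% generation %} markers
processor.chat_template = get_training_chat_template(processor)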

KTO ↔ DPO alignment: closing in on graduation

Another batch of alignment PRs this cycle. KTO and DPO are now structurally aligned across PEFT handling, model initialization, training-arg grouping, ref-logp precomputation, and metric handling — promotion of KTO out of experimental is imminent.

PRs (all by @albertvillanova): #5659, #5660, #5661, #5679, #5701, #5702, #5703, #5704, #5705, #5714.

Other

Fixes

Documentation and Examples

CI

New Contributors

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v1.3.0...v1.4.0
