What's Changed
- 🔒 Gate trainer telemetry on an explicit class-name allowlist by @qgallouedec in https://github.com/huggingface/trl/pull/5851
Full Changelog: https://github.com/huggingface/trl/compare/v1.5.0...v1.5.1
Libraries for efficient model fine-tuning and alignment
Three more model families gain training-compatible templates with {% generation %} markers (so assistant_only_loss=True just works):
The chunked LM-head path used by AsyncGRPOTrainer now supports models that use final_logit_softcapping (notably Gemma 2). _ChunkedLogProbFunction applies logit_scale, optional tanh-based softcapping, and temperature consistently in both forward and backward — softcapped models are no longer rejected.
by @mlarnouhet in https://github.com/huggingface/trl/pull/5691
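The per-logit transform described above can be sketched in plain Python. This is only an illustration: the helper name `transform_logit` is hypothetical, and the order (scale, then soft-capping, then temperature) is an assumption read off the sentence above, not a copy of `_ChunkedLogProbFunction`.

```python
import math

def transform_logit(raw, logit_scale=1.0, softcap=None, temperature=1.0):
    """Apply scale, optional tanh soft-capping, then temperature to one logit."""
    z = raw * logit_scale
    if softcap is not None:
        # tanh soft-capping squashes z smoothly into (-softcap, softcap)
        z = softcap * math.tanh(z / softcap)
    return z / temperature

capped = transform_logit(100.0, softcap=30.0)  # large logit is pulled below the cap
small = transform_logit(0.5, softcap=30.0)     # small logit is left almost unchanged
```

Because tanh is differentiable, the same transform can be applied in the backward pass through the chain rule, which is what lets soft-capped models share the chunked path.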
Two more cycles closer to KTO graduation:
- compute_loss flow by @albertvillanova in https://github.com/huggingface/trl/pull/5810
- _compute_loss_liger flow by @albertvillanova in https://github.com/huggingface/trl/pull/5816

_BaseTrainer.__init__ now emits a single anonymous huggingface_hub.send_telemetry ping per trainer instantiation, so we can finally see which trainers / model families / distributed backends are actually being used in practice and prioritize accordingly.
The payload is intentionally minimal — TRL version, trainer class name, model architecture, PEFT yes/no, distributed backend (deepspeed/fsdp/ddp/none), bucketed world size, device type, GPU model when available. No user data, no dataset names, no model paths, no hyperparameter values, never sent in CI / offline / HF_HUB_DISABLE_TELEMETRY mode.
See usage_stats.md for what's collected and how to opt out.
by @qgallouedec in https://github.com/huggingface/trl/pull/5758
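As a minimal sketch of opting out, the environment variable named above can be set from Python before anything else is imported. Setting it to "1" follows the usual huggingface_hub convention; usage_stats.md remains the authoritative reference for the opt-out options.

```python
import os

# Set the flag before importing trl / huggingface_hub so it is seen at import time.
os.environ["HF_HUB_DISABLE_TELEMETRY"] = "1"

telemetry_disabled = os.environ.get("HF_HUB_DISABLE_TELEMETRY") == "1"
```

Exporting the same variable in the shell or job launcher has the same effect and avoids depending on import order.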
- OpenRewardSpec: fix omitting task-scoped tools during rollout binding (fixes #5727) by @rycerzes in https://github.com/huggingface/trl/pull/5729
- GRPOTrainer was hanging indefinitely on truncated <tool_call> blocks (a degenerate case that happens naturally when generation hits max_completion_length mid-tool-call). Rewrote the regex to be non-backtracking: worst case goes from O(2ⁿ) to O(n). By @xodn348 in https://github.com/huggingface/trl/pull/5798
- OffloadActivations: follow-up to v1.4's activation-offloading leak fix. By @butterwecksolutions in https://github.com/huggingface/trl/pull/5730
- add_hooks by @roycho96 in https://github.com/huggingface/trl/pull/4693
- vocab_size for DistillationTrainer and GOLDTrainer by @Beichen-Ma in https://github.com/huggingface/trl/pull/5592
- empty_cache() by @jamie-peterson-ml in https://github.com/huggingface/trl/pull/5799
- metric_for_best_model for trainer-specific eval metrics by @qgallouedec in https://github.com/huggingface/trl/pull/5811
- generate_batch: inference tensors blocking inplace ops in background thread by @albertvillanova in https://github.com/huggingface/trl/pull/5818
- torch_dtype with dtype across examples, docs, notebooks, tests, and experimental distillation / gold trainers by @qgallouedec in https://github.com/huggingface/trl/pull/5717
- Glm4MoeForCausalLM / Cohere / Cohere2 / Qwen2.5-VL configs with their reference models by @qgallouedec in https://github.com/huggingface/trl/pull/5638, https://github.com/huggingface/trl/pull/5706, https://github.com/huggingface/trl/pull/5707 and https://github.com/huggingface/trl/pull/5739
- deepstack_visual_indexes and drop the test skip by @qgallouedec in https://github.com/huggingface/trl/pull/5779
- fullatt_block_indexes out of range for depth=2 by @albertvillanova in https://github.com/huggingface/trl/pull/5805
- num_heads key in Qwen VL tiny model scripts by @matdou in https://github.com/huggingface/trl/pull/5792
- model.visual. skip in GRPO/RLOO Qwen2.5-VL tests by @qgallouedec in https://github.com/huggingface/trl/pull/5780
- AttributeError: 'GptOssConfig' object has no attribute 'num_experts' by @albertvillanova in https://github.com/huggingface/trl/pull/5756
- apply_model_revisions by removing _commit_hash kwarg by @albertvillanova in https://github.com/huggingface/trl/pull/5762
- model.visual params by @albertvillanova in https://github.com/huggingface/trl/pull/5806
- torch < 2.12.0 (later reverted) by @albertvillanova in https://github.com/huggingface/trl/pull/5769
- pytest --only-rerun by @albertvillanova in https://github.com/huggingface/trl/pull/5784
- tests_latest.yml by @hf-security-analysis[bot] in https://github.com/huggingface/trl/pull/5733
- torch_dtype -> dtype by @qgallouedec in https://github.com/huggingface/trl/pull/5717
- model.visual. skip in GRPO / RLOO Qwen2.5-VL tests by @qgallouedec in https://github.com/huggingface/trl/pull/5780
- deepstack_visual_indexes and drop the test skip by @qgallouedec in https://github.com/huggingface/trl/pull/5779
- metric_for_best_model for trainer-specific eval metrics by @qgallouedec in https://github.com/huggingface/trl/pull/5811
- OpenRewardSpec omitting task‑scoped tools during rollout binding (fixes #5727) by @rycerzes in https://github.com/huggingface/trl/pull/5729

Full Changelog: https://github.com/huggingface/trl/compare/v1.4.0...v1.5.0
A new loss_type="chunked_nll" option drastically reduces peak activation memory in SFT by avoiding the full [batch × seq × vocab] logits tensor. Ignored-label tokens are dropped before the lm_head matmul, and the cross-entropy is computed over the remaining tokens in checkpointed chunks (default chunk_size=256, the sweet spot consistent across model sizes and sequence lengths).
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
model="Qwen/Qwen3-4B",
args=SFTConfig(loss_type="chunked_nll"),
train_dataset=dataset,
)
trainer.train()
Peak GPU memory, AdamW fp32:
| Model | Hardware | Seq | nll | chunked_nll |
|---|---|---|---|---|
| Qwen3-1.7B + LoRA | 1×H100 80GB | 2048 | 47.9 GB | 12.3 GB (3.9× less) |
| Qwen3-4B | 1×H100 80GB | 16384 | OOM | 63.8 GB |
| Qwen3-14B | 8×H100 FSDP2 | 16384 | 58.9 GB | 38.9 GB (1.5× less) |
| Qwen3-32B | 8×H100 FSDP2 | 8192 | OOM | 71.2 GB |
End-to-end, chunked NLL is consistently as fast or faster than nll — and it unlocks sequence lengths that don't fit at all under the standard path.
The chunked path also supports VLMs (https://github.com/huggingface/trl/pull/5684).
by @qgallouedec in https://github.com/huggingface/trl/pull/5575, https://github.com/huggingface/trl/pull/5676 and https://github.com/huggingface/trl/pull/5684
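To make the two steps concrete (drop ignored labels first, then score the remaining tokens chunk by chunk), here is a dependency-free sketch. It is not TRL's implementation: `lm_head`, `token_nll` and `chunked_nll` are hypothetical helpers on plain lists, and the activation checkpointing of each chunk is not shown.

```python
import math

IGNORE_INDEX = -100  # label value marking tokens excluded from the loss

def lm_head(hidden, weight):
    """Toy LM head: one logit per vocabulary row of `weight`."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]

def token_nll(logits, label):
    """Negative log-likelihood of `label` under softmax(logits), computed stably."""
    m = max(logits)
    logsumexp = m + math.log(sum(math.exp(x - m) for x in logits))
    return logsumexp - logits[label]

def chunked_nll(hidden_states, labels, weight, chunk_size=256):
    # Step 1: drop ignored positions before the LM head is applied.
    kept = [(h, y) for h, y in zip(hidden_states, labels) if y != IGNORE_INDEX]
    total = 0.0
    # Step 2: project and score one chunk at a time, so at most
    # chunk_size rows of logits exist at any moment.
    for start in range(0, len(kept), chunk_size):
        chunk = kept[start:start + chunk_size]
        chunk_logits = [lm_head(h, weight) for h, _ in chunk]
        total += sum(token_nll(z, y) for z, (_, y) in zip(chunk_logits, chunk))
    return total / max(len(kept), 1)

weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # vocab of 3, hidden size 2
hidden_states = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [1.0, 1.0]]
labels = [IGNORE_INDEX, 2, 0, 1]               # first token is masked out
loss_small_chunks = chunked_nll(hidden_states, labels, weight, chunk_size=2)
loss_one_chunk = chunked_nll(hidden_states, labels, weight, chunk_size=256)
```

The chunk size only changes how many logit rows are alive at once, not the loss value, which is why a single default can be shared across model sizes.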
A new trl.experimental.openreward adapter plugs any environment speaking the Open Reward Standard (ORS) protocol into any TRL trainer accepting an environment_factory (GRPOTrainer, AsyncGRPOTrainer). One identifier wires all three trainer slots — dataset, factory, reward_func:
from trl import GRPOConfig, GRPOTrainer
from trl.experimental.openreward import OpenRewardEnv
env = OpenRewardEnv("Eigent/SETA") # or "http://localhost:8000"
trainer = GRPOTrainer(
model="Qwen/Qwen3-4B",
args=GRPOConfig(...),
train_dataset=env.dataset,
environment_factory=env.factory,
reward_funcs=env.reward_func,
)
Tools are bound dynamically from JSON Schema at construction (no per-env wrapper code), and env.dataset autoderives task lists from the ORS task endpoints. The same code path works for envs hosted on the OpenReward platform, self-hosted on any container service, or running locally on localhost. A SETA training example is included.
by @adithya-s-k in https://github.com/huggingface/trl/pull/5696
Unit tests don't catch trainer-level numerical drift, such as gradient-accumulation normalization bugs or attention-implementation divergence (eager ↔ FA2 / kernels). These bugs silently shift the loss trajectory, and users only notice when their run no longer reproduces. (Cf. last year's transformers grad-accum bug, or the "We found two bugs in DeepSpeed" paper.)
A new opt-in pytest -m invariant suite asserts the loss / grad_norm trajectory of short end-to-end SFT/DPO runs against committed reference snapshots, with equivalence classes for configs that should produce identical trajectories (e.g. pdb=1, gas=8 ≡ default; eager ≡ FA2 ≡ kernels). Hardware-pinned to H100 80GB, real pretrained model, full_determinism, fixed seed. Initial coverage: 2 trainers × 2 invariance axes (grad-accum, attn-impl) × gradient-checkpointing equivalence.
by @qgallouedec in https://github.com/huggingface/trl/pull/5686, https://github.com/huggingface/trl/pull/5688 and https://github.com/huggingface/trl/pull/5689
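The core check behind such a suite can be sketched as a comparison of logged metrics against a stored reference. The function name, the record layout, and the tolerance below are all assumptions for illustration; the actual suite's snapshot format and thresholds are not described here.

```python
import math

def check_trajectory(observed, reference, rel_tol=1e-5):
    """Return a list of drift messages (an empty list means the run matches)."""
    if len(observed) != len(reference):
        return [f"expected {len(reference)} logged steps, got {len(observed)}"]
    problems = []
    for step, (obs, ref) in enumerate(zip(observed, reference)):
        for key in ("loss", "grad_norm"):
            if not math.isclose(obs[key], ref[key], rel_tol=rel_tol):
                problems.append(f"step {step}: {key}={obs[key]} differs from reference {ref[key]}")
    return problems

# Hypothetical snapshot and two runs: one identical, one with a shifted loss.
reference = [{"loss": 2.31, "grad_norm": 1.20}, {"loss": 2.05, "grad_norm": 0.95}]
same_run = [dict(step) for step in reference]
drifted_run = [{"loss": 2.31, "grad_norm": 1.20}, {"loss": 2.11, "grad_norm": 0.95}]
```

An equivalence class (for example eager versus FA2) then amounts to running the same check with one member's trajectory as the reference for the others.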
Three new pure helpers in trl.trainer.utils for measuring training efficiency:
- compute_flops_per_token(config, seq_len): handles dense and MoE (Mixtral, Qwen3-MoE, DeepSeek-V2)
- compute_mfu(flops_per_token, tps, world_size, peak_flops): Model FLOPs Utilization as a percentage
- adjusted_mfu(mfu, config, seq_len): non-causal → causal-corrected (Llama / DS Ulysses convention)

by @AmineDiro in https://github.com/huggingface/trl/pull/5698
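For intuition, MFU is the ratio of achieved to available FLOP/s. The sketch below uses hypothetical stand-in functions, not the TRL helpers: it assumes the common 6 × parameters rule of thumb for dense training FLOPs per token, and that throughput is measured globally across all devices.

```python
def dense_flops_per_token(num_params):
    # Rule of thumb: about 6 FLOPs per parameter per token for forward + backward
    # of a dense model, ignoring the attention term (assumption for illustration).
    return 6 * num_params

def mfu_percent(flops_per_token, tokens_per_second, world_size, peak_flops_per_device):
    achieved = flops_per_token * tokens_per_second   # FLOP/s actually performed
    available = world_size * peak_flops_per_device   # FLOP/s the hardware could deliver
    return 100.0 * achieved / available

# 1B-parameter model, 20k tokens/s on one device with a 1e15 FLOP/s peak.
example = mfu_percent(dense_flops_per_token(1_000_000_000), 20_000, 1, 1e15)
```

With these made-up numbers the result is 12%; real peak FLOP/s values depend on the accelerator and the precision used.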
GRPO's Liger-kernel integration is updated for Liger 0.8.0: delta two-sided clipping, use_bias_correction_kl, and SAPO/VESPO parameters are now forwarded into LigerFusedLinearGRPOLoss. The previous delta + use_liger_kernel guard is removed — both can be combined.
by @kashif in https://github.com/huggingface/trl/pull/5690
A new loss_type="sigmoid_norm" option for DPOConfig implements the per-token (length-normalized) DPO loss used by Tülu 3 / OLMo (paper §5.1.2 eq. 6) to mitigate length bias.
from trl import DPOConfig, DPOTrainer
trainer = DPOTrainer(
model="Qwen/Qwen3-4B",
args=DPOConfig(loss_type="sigmoid_norm"),
train_dataset=dataset,
)
by @BrownianNotion in https://github.com/huggingface/trl/pull/5406
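The idea of length normalization can be sketched on plain lists of per-token log-probabilities. This is an illustrative version under the assumption that each policy/reference log-ratio is averaged over the completion's tokens before the usual sigmoid loss is applied; the function name is hypothetical and TRL's implementation may differ in detail.

```python
import math

def sigmoid_norm_loss(policy_chosen, ref_chosen, policy_rejected, ref_rejected, beta=0.1):
    """Each argument is a list of per-token log-probs for one completion."""
    # Average (not sum) the log-ratio over tokens, so long completions get no extra weight.
    chosen = (sum(policy_chosen) - sum(ref_chosen)) / len(policy_chosen)
    rejected = (sum(policy_rejected) - sum(ref_rejected)) / len(policy_rejected)
    margin = beta * (chosen - rejected)
    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

short = sigmoid_norm_loss([-0.5, -0.5], [-1.0, -1.0], [-2.0], [-1.0])
# Same per-token behaviour, chosen completion twice as long: the loss is unchanged.
long = sigmoid_norm_loss([-0.5] * 4, [-1.0] * 4, [-2.0], [-1.0])
```

With summed log-ratios, the longer completion would have doubled its contribution to the margin; averaging removes that dependence on length.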
Four more model families gain training-compatible chat templates with {% generation %} markers (assistant-only loss masking) and/or response schemas (tool-calling parsing):
- {% generation %} markers by @qgallouedec in https://github.com/huggingface/trl/pull/5675

get_training_chat_template now also accepts a processor (not just a tokenizer), which is useful for VLMs (https://github.com/huggingface/trl/pull/5560).
Another batch of alignment PRs this cycle. KTO and DPO are now structurally aligned across PEFT handling, model initialization, training-arg grouping, ref-logp precomputation, and metric handling — promotion of KTO out of experimental is imminent.
PRs (all by @albertvillanova): #5659, #5660, #5661, #5679, #5701, #5702, #5703, #5704, #5705, #5714.
- parallelism_config with cp_size>1 or sp_size>1 in GRPO/RLOO: fail fast at config init with a clear error instead of mid-training crash. By @kashif in https://github.com/huggingface/trl/pull/5699
- model_accepts_loss_kwargs=False in DPO and Reward by @albertvillanova in https://github.com/huggingface/trl/pull/5710
- _tokenizer attribute in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5566
- peft_config handling in core / experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5673 and https://github.com/huggingface/trl/pull/5674
- isinstance with is_peft_model / drop redundant is_peft_available by @albertvillanova in https://github.com/huggingface/trl/pull/5682 and https://github.com/huggingface/trl/pull/5683
- parse_response by @qgallouedec in https://github.com/huggingface/trl/pull/5561
- OffloadActivations.__exit__ now syncs the compute/offload streams and clears the stash dictionaries, preventing orphaned offload tensors from leaking onto a dead stream (~0.2 GiB/step accumulation observed during QLoRA vision training before the fix). By @butterwecksolutions in https://github.com/huggingface/trl/pull/5694 and https://github.com/huggingface/trl/pull/5700
- DistillationTrainer by @k1064190 in https://github.com/huggingface/trl/pull/5594
- GKDTrainer: fix return_outputs in the Liger kernel path by @roycho96 in https://github.com/huggingface/trl/pull/4688
- GKDTrainer: fix seq-KD wasted teacher forward by @roycho96 in https://github.com/huggingface/trl/pull/5726
- GKDTrainer: fix Liger fused JSD path computing wrong loss by @roycho96 in https://github.com/huggingface/trl/pull/5731
- peft_config to core / experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5664 and https://github.com/huggingface/trl/pull/5665
- peft_config type hint in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5666
- DistillationTrainer by @cmpatino in https://github.com/huggingface/trl/pull/5615
- Qwen3-4B-Instruct-2507 by @qgallouedec in https://github.com/huggingface/trl/pull/5586
- Qwen/Qwen3-30B-A3B by @qgallouedec in https://github.com/huggingface/trl/pull/5716
- DistillationTrainer by @cmpatino in https://github.com/huggingface/trl/pull/5615
- {% generation %} markers for Cohere2 chat template by @qgallouedec in https://github.com/huggingface/trl/pull/5675
- get_training_chat_template by @qgallouedec in https://github.com/huggingface/trl/pull/5560
- parse_response by @qgallouedec in https://github.com/huggingface/trl/pull/5561

Full Changelog: https://github.com/huggingface/trl/compare/v1.3.0...v1.4.0
TRL v1.3 ships training support for the new Qwen 3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B). Qwen 3.6 reuses the Qwen3_5Moe* architecture but ships a slightly different chat template (adds a preserve_thinking flag, tweaks tool-arg stringification), so exact-string template matching needed updates across the stack.
What landed:
- qwen3_6.jinja (verbatim from upstream) and qwen3_6_training.jinja (prefix-preserving + {% generation %} markers for assistant_only_loss=True)
- qwen3_5_schema for tool-call parsing; output format unchanged
- tiny-Qwen3_5MoeForConditionalGeneration-3.6 (with MoE-specific shrinking)
- test_(train|training)_vlm cases

from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
model="Qwen/Qwen3.6-27B",
args=SFTConfig(assistant_only_loss=True), # works out of the box
train_dataset=dataset,
)
trainer.train()
Tool-calling agent training also works end-to-end via the existing Qwen 3.5 response schema:
from trl import GRPOConfig, GRPOTrainer
def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b

trainer = GRPOTrainer(
model="Qwen/Qwen3.6-27B",
reward_funcs=my_reward_fn,
args=GRPOConfig(...),
train_dataset=dataset,
tools=[multiply],
)
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/5642
A new experimental TPOTrainer implements Triple Preference Optimization, which augments DPO with a reference (gold) completion alongside chosen/rejected. The paper reports +7-19 points over DPO/SimPO on Arena-Hard, MixEval-Hard, MMLU-Pro and GSM8K, with less data.
from trl.experimental.tpo import TPOConfig, TPOTrainer
trainer = TPOTrainer(
model="Qwen/Qwen3-0.6B",
args=TPOConfig(output_dir="Qwen3-0.6B-TPO"),
train_dataset=load_dataset("tpo-alignment/triple-preference-ultrafeedback-40K", split="train"),
)
trainer.train()
by @kashif in https://github.com/huggingface/trl/pull/5506
trl vllm-serve

A new --speculative_config JSON flag exposes vLLM's speculative decoding directly through trl vllm-serve. It works with native MTP heads (Qwen3 Next), Eagle3 drafts, etc., without forking the serve script.
# Qwen3 native MTP (no extra draft model)
trl vllm-serve --model Qwen/Qwen3-Next-80B-A3B-Instruct \
--speculative_config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 5}'
# Eagle3 draft model
trl vllm-serve --model Qwen/Qwen3-32B \
--speculative_config '{"model": "RedHatAI/Qwen3-32B-speculator.eagle3", "method": "eagle3", "num_speculative_tokens": 3}'
by @Ofir408 in https://github.com/huggingface/trl/pull/5605
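When the server is launched from a script, hand-quoting JSON on the command line is easy to get wrong. A small sketch using only the standard library builds the same Eagle3 command as above; it only assembles and prints the command, it does not start the server.

```python
import json
import shlex

# Same speculative-decoding settings as the Eagle3 example above.
spec = {
    "model": "RedHatAI/Qwen3-32B-speculator.eagle3",
    "method": "eagle3",
    "num_speculative_tokens": 3,
}

cmd = [
    "trl", "vllm-serve",
    "--model", "Qwen/Qwen3-32B",
    "--speculative_config", json.dumps(spec),  # serialize once, no manual quoting
]

print(shlex.join(cmd))  # shell-safe string; pass `cmd` itself to subprocess.run
```

Passing the list form to a process launcher avoids the shell entirely, so the JSON reaches the flag unmodified.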
Twelve more alignment PRs this cycle, bringing KTOTrainer and DPOTrainer essentially into structural parity. Notable shifts include moving completion assembly out of _prepare_dataset into a new DataCollatorForKTO, inlining the two-pass tokenization into a single pass, removing BOS/EOS handling, and supporting IterableDataset and dict eval_dataset. The goal — promoting KTO out of experimental and into stable — is now within reach for an upcoming release.
PRs (all by @albertvillanova): #5582, #5578, #5579, #5583, #5587, #5599, #5601, #5600, #5606, #5612, #5632, #5635
{% generation %} training chat templates

Three more model families gain training-compatible chat templates with {% generation %} markers, so assistant_only_loss=True works out of the box:
- maybe_apply_chat_template by @albertvillanova in https://github.com/huggingface/trl/pull/5567
- is_chat_template_prefix_preserving by @qgallouedec in https://github.com/huggingface/trl/pull/5558
- forward_masked_logits by @qgallouedec in https://github.com/huggingface/trl/pull/5626
- _tokenizer as trainer attribute by @albertvillanova in https://github.com/huggingface/trl/pull/5489
- PreTrainedTokenizerBase for tokenizer type hints by @qgallouedec in https://github.com/huggingface/trl/pull/5629
- async_reward_X to async_X by @qgallouedec in https://github.com/huggingface/trl/pull/5616
- attention_mask instead of label != -100), and wrong cross-rank aggregation (unweighted mean instead of sum/count). The reported entropy under completion_only_loss=True and sequence parallelism is now correct. Same fix applied to DPO entropy logging. By @qgallouedec in https://github.com/huggingface/trl/pull/5620
- AsyncGRPOTrainer's processing_class to AsyncRolloutWorker by @xuanduy04 in https://github.com/huggingface/trl/pull/5538
- generate_tiny_models for gpt-oss by @albertvillanova in https://github.com/huggingface/trl/pull/5622
- TestSupportsToolCalling for improved coverage by @qgallouedec in https://github.com/huggingface/trl/pull/5537
- is_chat_template_prefix_preserving by @qgallouedec in https://github.com/huggingface/trl/pull/5558
- TestSupportsToolCalling for improved coverage by @qgallouedec in https://github.com/huggingface/trl/pull/5537
- {% generation %} markers for training chat template by @casinca in https://github.com/huggingface/trl/pull/5519
- forward_masked_logits by @qgallouedec in https://github.com/huggingface/trl/pull/5626
- PreTrainedTokenizerBase for tokenizer type hints by @qgallouedec in https://github.com/huggingface/trl/pull/5629
- async_reward_X to async_X by @qgallouedec in https://github.com/huggingface/trl/pull/5616

Full Changelog: https://github.com/huggingface/trl/compare/v1.2.0...v1.3.0
SSDTrainer: Simple Self-Distillation

A new experimental SSDTrainer implements the method described in Embarrassingly Simple Self-Distillation Improves Code Generation. SSD samples completions from the model itself at a training-time temperature/truncation setting, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. No reward model, verifier, teacher model, or RL: just prompts and the model.
from datasets import Dataset
from trl.experimental.ssd import SSDConfig, SSDTrainer
dataset = Dataset.from_dict({
"prompt": [
[{"role": "user", "content": "Write a function to add two numbers."}],
[{"role": "user", "content": "Write a function to check if a number is prime."}],
],
})
trainer = SSDTrainer(
model="Qwen/Qwen3-4B-Instruct",
args=SSDConfig(
output_dir="ssd-model",
temperature=0.6, # T_train from the paper
top_k=20,
top_p=0.95,
learning_rate=5e-6,
),
train_dataset=dataset,
)
trainer.train()
by @kashif in https://github.com/huggingface/trl/pull/5505
GRPOTrainer

When tool calls produce more tokens than max_completion_length allows, GRPOTrainer now rolls back the tool messages/images added in the current iteration instead of trying to truncate them. This removes ~80 lines of fragile, image-boundary-aware bookkeeping in favor of a ~15-line snapshot-and-rollback. Since overlong samples almost always get rewarded as failures anyway, the learning signal is effectively unchanged, but the code is dramatically simpler and no longer needs per-VLM-family vision-token lookup tables.
by @qgallouedec in https://github.com/huggingface/trl/pull/5521
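The snapshot-and-rollback pattern itself is small. The sketch below is a generic illustration, not the trainer's code: `apply_tool_results`, `count_words`, and the message layout are hypothetical, and a whitespace word count stands in for real tokenization.

```python
def apply_tool_results(messages, tool_messages, count_tokens, max_tokens):
    """Append this iteration's tool messages; undo them all if the budget is exceeded."""
    snapshot = len(messages)      # remember where this iteration started
    messages.extend(tool_messages)
    if count_tokens(messages) > max_tokens:
        del messages[snapshot:]   # roll back everything added in this iteration
        return False
    return True

def count_words(msgs):
    # Stand-in for a tokenizer: counts whitespace-separated words.
    return sum(len(m["content"].split()) for m in msgs)

history = [{"role": "assistant", "content": "calling the tool"}]
too_long = [{"role": "tool", "content": "one two three four five six"}]
fits = [{"role": "tool", "content": "42"}]

rolled_back = not apply_tool_results(history, too_long, count_words, max_tokens=5)
kept = apply_tool_results(history, fits, count_words, max_tokens=5)
```

Because the rollback is all-or-nothing per iteration, no knowledge of where an image or tool message could be safely cut is needed.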
Continuing the effort from v1.1:
- {% generation %} markers, enabling assistant-only loss masking for DeepSeek-V3 models. By @RudrenduPaul in https://github.com/huggingface/trl/pull/5527

As a result of tightened detection (see fixes below), the list of templates reported as tool-calling capable is now correct. Notably, the basic Llama 3 template is no longer falsely classified as tool-calling capable.
A major cleanup sweep keeps KTOTrainer and DPOTrainer in lockstep, same initialization patterns, same config surface, same precompute behavior:
- precompute_ref_batch_size to KTO (https://github.com/huggingface/trl/pull/5530)
- ref_model initialization (https://github.com/huggingface/trl/pull/5534)
- None args (https://github.com/huggingface/trl/pull/5531)
- generate_during_eval (https://github.com/huggingface/trl/pull/5551)
- ref_model when precompute_ref_log_probs is set in DPO/KTO (https://github.com/huggingface/trl/pull/5542)

All by @albertvillanova.
- prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5474
- prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5508
- supports_tool_calling falsely accepting templates that drop assistant tool_calls by @qgallouedec in https://github.com/huggingface/trl/pull/5517
- add_response_schema for VLM processors: the schema was being set on the outer processor instead of the inner tokenizer, so it had no effect. This also collapses a handful of __init__/decode-gate workarounds. By @qgallouedec in https://github.com/huggingface/trl/pull/5520
- use_transformers_paged in GRPOConfig and RLOOConfig (and remove entirely from experimental OnlineDPOConfig, GOLDConfig, SelfDistillationConfig). Will be removed from the remaining configs in v2.0.0. In a small A/B benchmark (Qwen3-0.6B GRPO), the paged path is ~20% slower and uses ~6x more peak VRAM than the default; it's also superseded by transformers continuous batching. By @qgallouedec in https://github.com/huggingface/trl/pull/5544
- chat_templates/README by @qgallouedec in https://github.com/huggingface/trl/pull/5545
- supports_tool_calling falsely accepting templates that drop assistant tool_calls by @qgallouedec in https://github.com/huggingface/trl/pull/5517
- use_transformers_paged by @qgallouedec in https://github.com/huggingface/trl/pull/5544
- add_response_schema for VLM processors by @qgallouedec in https://github.com/huggingface/trl/pull/5520
- chat_templates/README by @qgallouedec in https://github.com/huggingface/trl/pull/5545

Full Changelog: https://github.com/huggingface/trl/compare/v1.1.0...v1.2.0
A small patch release containing these fixes:
Full Changelog: https://github.com/huggingface/peft/compare/v0.19.0...v0.19.1
This PEFT release contains no fewer than nine new PEFT methods, described below. It also contains numerous enhancements that should make PEFT more useful to many users.
@yeonjoon-jung01 added "GraLoRA: Granular Low-Rank Adaptation for Parameter-Efficient Fine-Tuning" to PEFT (#2851). This method subdivides the base weight into smaller blocks and applies LoRA to those. This more granular adaptation promises to increase expressiveness and improve performance, especially at higher ranks (64+), closing the gap to full fine-tuning.
@Conzel contributed BD-LoRA: "Block-Diagonal LoRA for Eliminating Communication Overhead in Tensor Parallel LoRA Serving" (#2895). With BD-LoRA, the LoRA weights are implemented in a block-diagonal way. This reduces communication overhead when using tensor parallelism (TP) and thus enables faster serving.
There is an experimental branch for BD-LoRA support in vLLM: vllm-project/vllm#28136.
Thanks to @kashif, PEFT now also supports Cartridges (#2953). The main purpose of this method is to train a prefix to compress a long context to a short size and thus save on tokens. On a low level, this is similar to prefix tuning. The PR also added an example recipe to quickly get started.
"PVeRA: Probabilistic Vector-Based Random Matrix Adaptation" was added to PEFT by @leofillioux in #2952. It is an extension of VeRA, a PEFT method that uses weight sharing between layers to be especially parameter efficient. PVeRA builds on top of that by adding a probabilistic element, sampling from the shared parameters and promising better performance overall.
@fei407 added PSOFT, "Efficient Orthogonal Fine-Tuning with Principal Subspace Adaptation", to PEFT in #3037. Orthogonal fine-tuning techniques like OFT and BOFT are good at preserving the structure and thus capabilities of the underlying base model. PSOFT improves the efficiency of this technique by constraining the adaptation to a low-rank principal subspace.
@yibozhong added Lily: "Low-Rank Interconnected Adaptation across Layers" to PEFT in #2563. Lily is on the surface similar to LoRA but has a sophisticated parameter sharing scheme. The A parameters are shared blockwise (e.g. 4 consecutive q_proj layers share the same A). There is a pool of B parameters that is shared globally; the actual B's are chosen in a data-dependent way through a router. This allows Lily to use higher ranks than LoRA while maintaining a low trainable parameter count.
In #3084, "PEANuT: Parameter-Efficient Adaptation with Weight-aware Neural Tweakers" was added to PEFT, again by @yibozhong. PEANuT adds a small neural net (a so-called weight-aware neural tweaker) to the base model. Compared to LoRA, this increases expressivity for the same trainable parameter count, or allows the parameter count to be lowered greatly without sacrificing expressivity. This comes at the expense of a higher memory requirement for the same parameter count and decreased speed.
We have another serial contributor in @kashif, who also contributed TinyLoRA: "Learning to Reason in 13 Parameters" in #3024. This PEFT method allows training an extremely small number of parameters, far fewer than could be achieved even with LoRA rank 1. The paper shows that, in particular with reinforcement learning, it can often be enough to train just a few parameters to achieve good results.
@LonglongaaaGo added "AdaMSS: Adaptive Multi-Subspace Approach for Parameter-Efficient Fine-Tuning" to PEFT. This method segments the base weights of the model into smaller subspaces that are targeted for fine-tuning. Moreover, it's possible to dynamically assign a lower parameter budget to less important subspaces during training, similar to what AdaLoRA does. This promises to provide higher expressiveness and better generalization than similar PEFT methods.
In #2939, we added functions to PEFT to allow converting checkpoints of many non-LoRA methods into LoRA checkpoints. This can be useful because many other packages, e.g. Diffusers and vLLM, support only LoRA but not other PEFT methods. With the new conversion tools, more PEFT methods than just LoRA can thus be used with those packages. Conversion is lossy, but empirical testing showed that with a sufficiently high LoRA rank, the error can be quite low.
@sambhavnoobcoder added a new way to initialize LoRA weights with "LoRA-GA: Low-Rank Adaptation with Gradient Approximation" (#2926). This allows you to initialize the LoRA weights in a way that aligns the gradients with full fine-tuning and should lead to faster training convergence.
In "LoRA vs Full Fine-tuning: An Illusion of Equivalence", the authors showed that LoRA fine-tuning can introduce so-called "intruder dimensions" which contribute to forgetting. PEFT now has a utility function to remove intruder dimensions, reduce_intruder_dimension. When calling this on a fine-tuned LoRA model, forgetting should be reduced while the fine-tuned task performance should remain almost the same.
In #3048, @balvisio added support for Transformer Engine, a quantization method by NVIDIA, to PEFT.
In a series of PRs (#3079, #3091, #3096), @michaelbenayoun added support for Tensor Parallelism to LoRA.
In many LLMs, the embedding and the LM head have tied weights to save on parameter count. This can, however, lead to tricky situations when trying to fine-tune those layers. Through a series of PRs (#2803, #2922, #2870, #2879, #3126), we improved the user experience when doing so. Most notably, users can now pass ensure_weight_tying=True to their PEFT config to force weight tying to be upheld. Please check the PEFT weight tying docs for how weight tying is now being handled. Thanks to @romitjain, @sambhavnoobcoder, and @Cursx for their contributions.
#3055 makes LoRA work with base models that use very low precision floats like torch.float8_e4m3fn. An example of that would be MiniMax-M2.5.
#3128 introduces zero init to Prefix Tuning which, according to our benchmarks, reduced the result variance significantly and yielded good task accuracy without the need for prompt engineering.
With #3088 the LoftQ implementation now supports correcting errors for int8 quantization without utilizing activation thresholding alongside the already existing nf4 quantization.
The Bone PEFT method was removed in #3115. Users are directed to use MiSS instead, which is the improved replacement for Bone. Use this Bone-to-MiSS conversion script if you want to port old Bone checkpoints.
These two quantization methods now use GPTQModel as their backend (#2932) thanks to @ZX-ModelCloud.
requires_grad in modules_to_save

Previously, PEFT would enable requires_grad on the original module if the corresponding modules_to_save was disabled. This is almost never desirable and was thus fixed. Although this change is technically backwards-incompatible, it's an extremely niche case, so we don't expect any user to be negatively affected by it.
- no_split_modules now captures values recursively by @githubnemo in https://github.com/huggingface/peft/pull/3032
- inference_mode when setting adapters with modules_to_save (Issue #2928) by @ada-ggf25 in https://github.com/huggingface/peft/pull/2931

Full Changelog: https://github.com/huggingface/peft/compare/v0.18.1...v0.19.0
DistillationTrainer for efficient on-policy distillation

Read the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-distillation-trainer
The new DistillationTrainer implements on-policy knowledge distillation as described in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. It extends the ideas from the GKDTrainer with three key optimizations: a generation buffer that decouples the training microbatch size from the generation batch size (up to 40x speedup), external teacher server support so the teacher doesn't need to fit on training GPUs, and binary-encoded logprob payloads that shrink transfer payloads by ~5x.
from datasets import load_dataset
from trl.experimental.distillation import DistillationConfig, DistillationTrainer
dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
lambda x: {"messages": [{"role": "user", "content": x["question"]}]},
remove_columns=dataset.column_names,
)
trainer = DistillationTrainer(
model="Qwen/Qwen2.5-1.5B-Instruct",
teacher_model="Qwen/Qwen2.5-7B-Instruct",
args=DistillationConfig(
output_dir="results/distill-qwen-gsm8k",
lmbda=1.0, # fully on-policy (student generates)
beta=1.0, # reverse KL
teacher_model_init_kwargs={"torch_dtype": "bfloat16"},
),
train_dataset=dataset,
)
trainer.train()
by @cmpatino in https://github.com/huggingface/trl/pull/5407, https://github.com/huggingface/trl/pull/5500 and https://github.com/huggingface/trl/pull/5501
AsyncGRPOTrainer

AsyncGRPOTrainer now supports a chunked LM-head path that computes per-token log-probs and entropy via online logsumexp without materializing the full [N, V] logits tensor. Combined with completion_mask filtering to skip prompt tokens, this brings massive memory savings on long sequences, with up to 44x lower peak-allocated memory on an 8192-token sequence:
| chunk_lm_head_size | Peak Alloc (GB) | Reduction | Wall Time (ms) |
|---|---|---|---|
| None (baseline) | 18.55 | 1.00x | 808.7 |
| 4096 | 0.42 | 44.32x | 459.0 |
| 8192 | 0.76 | 24.34x | 393.0 |
Enable it via the new chunk_lm_head_size option in AsyncGRPOConfig:
from trl.experimental.async_grpo import AsyncGRPOConfig, AsyncGRPOTrainer

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=AsyncGRPOConfig(chunk_lm_head_size=4096),
    ...
)
Note: mutually exclusive with use_liger_kernel (both replace the LM head forward pass).
by @AmineDiro in https://github.com/huggingface/trl/pull/5349
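The online logsumexp idea can be sketched in plain Python for a single token's logits. This is a toy under stated assumptions: it scans a ready-made list of logits in vocabulary chunks, whereas the real path works on tensors and never builds the full logits in the first place; the function name `chunked_logprob_and_entropy` is made up for this example.

```python
import math


def chunked_logprob_and_entropy(logits, target, chunk_size):
    """Log-prob of `target` and entropy of softmax(logits), scanning `chunk_size` logits at a time."""
    running_max = -math.inf  # largest logit seen so far
    sum_exp = 0.0            # sum of exp(logit - running_max)
    sum_exp_logit = 0.0      # sum of exp(logit - running_max) * logit
    for start in range(0, len(logits), chunk_size):
        chunk = logits[start:start + chunk_size]
        new_max = max(running_max, max(chunk))
        # Rescale the accumulators so they stay relative to the new maximum.
        scale = math.exp(running_max - new_max)
        sum_exp *= scale
        sum_exp_logit *= scale
        for value in chunk:
            weight = math.exp(value - new_max)
            sum_exp += weight
            sum_exp_logit += weight * value
        running_max = new_max
    logsumexp = running_max + math.log(sum_exp)
    logprob = logits[target] - logsumexp
    # Entropy = logsumexp - E[logit] under the softmax distribution.
    entropy = logsumexp - sum_exp_logit / sum_exp
    return logprob, entropy


logits = [0.5, -1.0, 2.0, 3.5, -0.25, 1.0, 0.0]
logprob, entropy = chunked_logprob_and_entropy(logits, target=3, chunk_size=3)
print(logprob, entropy)
```

Only three running scalars are kept per token, so memory no longer grows with the vocabulary size; the chunk size trades peak memory against the number of passes.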
{% generation %} support in training chat templates
SFT with assistant_only_loss=True requires chat templates to include {% generation %} / {% endgeneration %} markers so that return_assistant_tokens_mask=True produces correct masks. Very few models ship these markers natively, so users hit a cryptic error when enabling assistant-only loss with models like Qwen3, Llama 3 or GPT-OSS.
SFTTrainer now automatically swaps in a patched training chat template when the original template lacks generation markers, so no manual template surgery is required. Training templates are shipped for Qwen2.5, Qwen3, Llama 3 and GPT-OSS, stored as standalone .jinja files under trl/chat_templates/ for readability, diffability, and editor syntax highlighting.
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(assistant_only_loss=True),  # now just works
    train_dataset=dataset,
)
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/5459, https://github.com/huggingface/trl/pull/5470, by @RudrenduPaul in https://github.com/huggingface/trl/pull/5493 and https://github.com/huggingface/trl/pull/5522, and by @casinca in https://github.com/huggingface/trl/pull/5484
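As a rough picture of what the assistant mask is for, the sketch below turns a per-token mask into training labels. The helper `mask_non_assistant_labels` and the token ids are invented for illustration and this is not SFTTrainer's own code; the one fixed fact it relies on is that -100 is the default ignore index of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by PyTorch's cross-entropy loss by default


def mask_non_assistant_labels(input_ids, assistant_mask):
    """Keep labels only where assistant_mask is 1; everything else is ignored by the loss."""
    if len(input_ids) != len(assistant_mask):
        raise ValueError("input_ids and assistant_mask must have the same length")
    return [tok if keep else IGNORE_INDEX for tok, keep in zip(input_ids, assistant_mask)]


# Toy sequence: 4 prompt tokens followed by 3 assistant tokens (ids are made up).
input_ids = [101, 7592, 2129, 102, 2023, 2003, 103]
assistant_mask = [0, 0, 0, 0, 1, 1, 1]
labels = mask_non_assistant_labels(input_ids, assistant_mask)
print(labels)  # [-100, -100, -100, -100, 2023, 2003, 103]
```

Without generation markers in the template, the mask cannot mark any token as assistant output, which is why a template lacking them cannot support assistant-only loss.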
Agent training now supports a broader family of models via native tool-call response schemas:
A new supports_tool_calling() utility detects whether a tokenizer/processor can render a full tool-calling turn, and GRPOTrainer now validates tool support at initialization — raising a clear error upfront instead of failing cryptically mid-training.
by @qgallouedec in https://github.com/huggingface/trl/pull/5462, https://github.com/huggingface/trl/pull/5464, https://github.com/huggingface/trl/pull/5463, https://github.com/huggingface/trl/pull/5469 and https://github.com/huggingface/trl/pull/5454
environment_factory tool methods can now return multimodal content blocks (images + text) for VLM training. Previously, tool responses were always converted to str(result), discarding any visual information. Now tools can return content block lists with images, and the trainer handles them end-to-end through tokenization, generation, and the forward pass — including correct pixel_values plumbing.
class ScreenshotEnv:
    def take_screenshot(self) -> list[dict]:
        return [
            {"type": "image", "image": self.browser.screenshot()},
            {"type": "text", "text": "Current page state"},
        ]
The OpenEnv browsergym.py example has been migrated to this pattern, and a new carla_vlm.py example demonstrates VLM training against CARLA with camera-image tool responses.
by @sergiopaniego in https://github.com/huggingface/trl/pull/5323 and https://github.com/huggingface/trl/pull/5437, and by @qgallouedec in https://github.com/huggingface/trl/pull/5448
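The difference between the old str(result) behaviour and content blocks can be sketched with a small normalizer. The function `to_content_blocks` is hypothetical and only mirrors the block shape used in the example above; the trainer's actual handling may differ.

```python
def to_content_blocks(result):
    """Turn a tool's return value into a list of content blocks (hypothetical helper)."""
    is_block_list = (
        isinstance(result, list)
        and len(result) > 0
        and all(isinstance(block, dict) and "type" in block for block in result)
    )
    if is_block_list:
        return result  # already multimodal blocks: images are kept as-is
    # Anything else falls back to a single text block, like the previous str(result) path.
    return [{"type": "text", "text": str(result)}]


fake_image = object()  # stands in for a screenshot image
blocks = to_content_blocks([
    {"type": "image", "image": fake_image},
    {"type": "text", "text": "Current page state"},
])
plain = to_content_blocks(42)
print(len(blocks), plain)
```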
accuracy_reward and reasoning_accuracy_reward now emit extra diagnostic columns (solution, gold_parsed, answer_parsed) via the log_extra callback introduced in v1.0.0. These show up in the rich completions table, making it much easier to debug why a reward was (or wasn't) assigned.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=GRPOConfig(log_completions=True),
    train_dataset=dataset,
)
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/5308
KTOConfig by @albertvillanova in https://github.com/huggingface/trl/pull/5477
prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5424
prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5475
pixel_position_ids with image_position_ids for Gemma 4 support by @qgallouedec in https://github.com/huggingface/trl/pull/5452
isinstance(part, dict) checks in image extraction by @qgallouedec in https://github.com/huggingface/trl/pull/5439
_get_tool_suffix_ids by @qgallouedec in https://github.com/huggingface/trl/pull/5440
prepare_deepspeed by @albertvillanova in https://github.com/huggingface/trl/pull/5414
ImportError with vllm-0.10.2 in OnlineDPO and OpenEnv by @albertvillanova in https://github.com/huggingface/trl/pull/5423
_get_per_token_logps_and_entropies return type by @kashif in https://github.com/huggingface/trl/pull/5456
prepare_multimodal_messages not normalizing empty string content for assistant/tool roles by @albertvillanova in https://github.com/huggingface/trl/pull/5496
pad_token_id by @albertvillanova in https://github.com/huggingface/trl/pull/5487
huggingface-cli references with hf by @hanouticelina in https://github.com/huggingface/trl/pull/5486
truncation_mode from experimental truncate_dataset by @albertvillanova in https://github.com/huggingface/trl/pull/5467
keep_end truncation mode in DPOConfig and SFTConfig; will be removed in v2.0.0. Use keep_start instead. By @albertvillanova in https://github.com/huggingface/trl/pull/5465
pad_token config parameter in DPOConfig, SFTConfig, and RewardConfig; will be removed in v2.0.0. Set tokenizer.pad_token directly on the processing_class instead. By @albertvillanova in https://github.com/huggingface/trl/pull/5480
trl.experimental.judges module and all judge support from trainers. Judges were experimental, unused in practice, and llm-blender (backing PairRMJudge) was unmaintained and incompatible with transformers v5, actively blocking v5 adoption. Everything judges did can be achieved with reward_funcs. OnlineDPOTrainer, NashMDTrainer, and XPOTrainer are now unified on reward-model scoring only. By @qgallouedec in https://github.com/huggingface/trl/pull/5485
carla_vlm OpenEnv example by @sergiopaniego in https://github.com/huggingface/trl/pull/5437
completion_only_loss in SFT trainer docs by @RudrenduPaul in https://github.com/huggingface/trl/pull/5494
DistillationTrainer by @cmpatino in https://github.com/huggingface/trl/pull/5500
prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5476
make precommit to fix docstring style by @albertvillanova in https://github.com/huggingface/trl/pull/5436
test_rloo[fsdp2]: replace non-deterministic xfail with skipif for transformers 5.4.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5403
input_ids or inputs_embeds by @albertvillanova in https://github.com/huggingface/trl/pull/5422
eval_strategy by @SunMarc in https://github.com/huggingface/trl/pull/5426
TypeError: 'NoneType' object is not iterable by @albertvillanova in https://github.com/huggingface/trl/pull/5427
TypeError: 'NoneType' object is not iterable by @albertvillanova in https://github.com/huggingface/trl/pull/5438
test_rloo[fsdp2] after transformers 5.5.0 release by @albertvillanova in https://github.com/huggingface/trl/pull/5442
eval_strategy by @SunMarc in https://github.com/huggingface/trl/pull/5426
environment_factory for VLM training by @sergiopaniego in https://github.com/huggingface/trl/pull/5323
isinstance(part, dict) checks in image extraction by @qgallouedec in https://github.com/huggingface/trl/pull/5439
pixel_position_ids with image_position_ids for Gemma4 support by @qgallouedec in https://github.com/huggingface/trl/pull/5452
_get_tool_suffix_ids by @qgallouedec in https://github.com/huggingface/trl/pull/5440
.jinja files by @qgallouedec in https://github.com/huggingface/trl/pull/5459
supports_tool_calling utility and validate tool support at init by @qgallouedec in https://github.com/huggingface/trl/pull/5462
{% generation %} support to training chat templates by @qgallouedec in https://github.com/huggingface/trl/pull/5470
DistillationTrainer for efficient on-policy distillation by @cmpatino in https://github.com/huggingface/trl/pull/5407
huggingface-cli references with hf by @hanouticelina in https://github.com/huggingface/trl/pull/5486
{% generation %} markers for training chat template by @casinca in https://github.com/huggingface/trl/pull/5484
trl.experimental.judges module and all judge support from trainers by @qgallouedec in https://github.com/huggingface/trl/pull/5485
DistillationTrainer by @cmpatino in https://github.com/huggingface/trl/pull/5501
DistillationTrainer by @cmpatino in https://github.com/huggingface/trl/pull/5500
Full Changelog: https://github.com/huggingface/trl/compare/v1.0.0...v1.1.0
Read our blog post for an overview of TRL v1.
Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.
from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/5293
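The decoupling described above is essentially a producer/consumer arrangement. The sketch below imitates it with two threads and a bounded queue from the standard library; the function names and the fake "rollouts" are placeholders, and nothing here reflects how AsyncGRPOTrainer talks to the vLLM server.

```python
import queue
import threading


def generate_rollouts(prompts, rollout_queue):
    """Producer: stands in for the generation server; pushes one fake rollout per prompt."""
    for prompt in prompts:
        rollout_queue.put({"prompt": prompt, "completion": prompt.upper()})
    rollout_queue.put(None)  # sentinel: no more rollouts


def train_on_rollouts(rollout_queue, seen):
    """Consumer: stands in for the training loop; handles rollouts as soon as they arrive."""
    while True:
        rollout = rollout_queue.get()
        if rollout is None:
            break
        seen.append(rollout["completion"])  # a real trainer would run a gradient step here


prompts = ["a", "b", "c", "d"]
rollout_queue = queue.Queue(maxsize=2)  # bounded buffer limits how far generation runs ahead
seen = []
producer = threading.Thread(target=generate_rollouts, args=(prompts, rollout_queue))
consumer = threading.Thread(target=train_on_rollouts, args=(rollout_queue, seen))
producer.start()
consumer.start()
producer.join()
consumer.join()
print(seen)  # ['A', 'B', 'C', 'D']
```

The bounded queue is one simple way to cap how stale the consumed rollouts can become relative to the current policy.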
VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)
by @casinca in https://github.com/huggingface/trl/pull/5199
DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
by @LeonEricsson in https://github.com/huggingface/trl/pull/5117
SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.
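from datasets import load_dataset
from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

# Same dataset as in the other examples in these notes.
dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)
trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()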
by @MengAiDev in https://github.com/huggingface/trl/pull/4935
Reward functions can log extra values alongside the reward: per-sample columns through the optional log_extra callback and scalar metrics through log_metric. This makes it easier to track intermediate signals without writing custom callbacks.
def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" />
<img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />
by @manueldeprada in https://github.com/huggingface/trl/pull/5233
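To try such a reward function outside a trainer, stand-in callbacks that record what gets logged are enough. In the sketch below, `extract_answer` is a made-up parser and the `(name, value)` callback signature is inferred from the example above; how the trainer wires and displays these values is not shown.

```python
def extract_answer(completion):
    """Hypothetical parser: take the text after the last '=' sign."""
    return completion.rsplit("=", 1)[-1].strip()


def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]
    if log_extra:
        log_extra("extracted_answer", extracted)
    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))
    return rewards


# Stand-in callbacks that just record what the reward function reports.
extra_columns, metrics = {}, {}
rewards = my_reward_fn(
    completions=["2 + 2 = 4", "3 * 3 = 6"],
    answer=["4", "9"],
    log_extra=lambda name, values: extra_columns.update({name: values}),
    log_metric=lambda name, value: metrics.update({name: value}),
)
print(rewards, extra_columns, metrics)
```

Because both callbacks default to None, the same function still works when called without them.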
VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.
by @kansalaman in https://github.com/huggingface/trl/pull/4889
BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.
by @mariosasko in https://github.com/huggingface/trl/pull/5189
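For readers unfamiliar with the term, BFD stands for best-fit decreasing bin packing. The sketch below is a textbook version over sequence lengths, written for clarity rather than speed; it is not TRL's implementation, and it simply raises on sequences longer than the capacity instead of modelling what "bfd_split" does with them.

```python
def best_fit_decreasing(lengths, capacity):
    """Pack sequence lengths into bins of size `capacity` with best-fit decreasing."""
    bins = []       # each bin is a list of sequence lengths
    remaining = []  # free space left in each bin
    for length in sorted(lengths, reverse=True):
        if length > capacity:
            raise ValueError(f"sequence of length {length} exceeds capacity {capacity}")
        # Pick the bin with the least free space that can still hold this sequence.
        best = None
        for i, free in enumerate(remaining):
            if length <= free and (best is None or free < remaining[best]):
                best = i
        if best is None:
            bins.append([length])
            remaining.append(capacity - length)
        else:
            bins[best].append(length)
            remaining[best] -= length
    return bins


packed = best_fit_decreasing([5, 3, 4, 2, 6], capacity=8)
print(packed)  # [[6, 2], [5, 3], [4]]
```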
The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.
by @cmpatino in https://github.com/huggingface/trl/pull/5137 and https://github.com/huggingface/trl/pull/5388
A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.
by @qgallouedec in https://github.com/huggingface/trl/pull/5255
vllm_mode to "colocate" by @qgallouedec in https://github.com/huggingface/trl/pull/5255
truncation_mode in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306
max_length in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284
pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180
print_prompt_completions_sample to include reasoning content by @qgallouedec in https://github.com/huggingface/trl/pull/5327
pixel_position_ids vision key by @qgallouedec in https://github.com/huggingface/trl/pull/5374
None to apply_chat_template when it is an empty list by @rabinadk1 in https://github.com/huggingface/trl/pull/5380
accuracy_reward crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281
RewardFunc type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246
prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
AGENTS.md) by @qgallouedec in https://github.com/huggingface/trl/pull/5236
environment_factory by @sergiopaniego in https://github.com/huggingface/trl/pull/5235
.ai by @qgallouedec in https://github.com/huggingface/trl/pull/5268
pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180
prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
AGENTS.md) by @qgallouedec in https://github.com/huggingface/trl/pull/5236
prompts in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225
truncate_prompt_tokens for vLLM 0.17.0 by @winglian in https://github.com/huggingface/trl/pull/5248
rollout_func from _generate_single_turn to _generate by @qgallouedec in https://github.com/huggingface/trl/pull/5232
RewardFunc type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246
_generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5239
_generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5240
bfd-requeue to bfd_split by @mariosasko in https://github.com/huggingface/trl/pull/5189
vllm_mode to "colocate" and add v0→v1 migration guide by @qgallouedec in https://github.com/huggingface/trl/pull/5255
grpo_trainer.py): Variational Sequence-Level Soft Policy Optimization (VESPO) by @casinca in https://github.com/huggingface/trl/pull/5199
accuracy_reward crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281
hasattr and getattr with defaults in AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5294
RewardFunc type annotation to allow None values in reward list by @qgallouedec in https://github.com/huggingface/trl/pull/5297
Json() type for tool calling dataset format by @lhoestq in https://github.com/huggingface/trl/pull/5307
GRPOTrainer/async: fix prefix EOS slicing for tool suffix (with Qwen3/3.5 type of chat templates) by @casinca in https://github.com/huggingface/trl/pull/5330
grpo_trainer.py by @casinca in https://github.com/huggingface/trl/pull/5332
environment_factory by @sergiopaniego in https://github.com/huggingface/trl/pull/5235
print_prompt_completions_sample to include reasoning content by @qgallouedec in https://github.com/huggingface/trl/pull/5327
AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5280
pixel_position_ids vision key by @qgallouedec in https://github.com/huggingface/trl/pull/5374
.ai by @qgallouedec in https://github.com/huggingface/trl/pull/5268
apply_chat_template when it is an empty list by @rabinadk1 in https://github.com/huggingface/trl/pull/5380
TRACKIO_SPACE_ID env var from all scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/5365
pr_template_check.yml by @qgallouedec in https://github.com/huggingface/trl/pull/5393
disable_config=True from generate to GenerationConfig by @qgallouedec in https://github.com/huggingface/trl/pull/5384
Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0
vllm_mode to "colocate" by @qgallouedec in https://github.com/huggingface/trl/pull/5255
truncation_mode in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306
max_length in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284
pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180
accuracy_reward crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281
RewardFunc type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246
prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
AGENTS.md) by @qgallouedec in https://github.com/huggingface/trl/pull/5236
pad_to_multiple_of to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180
prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
AGENTS.md) by @qgallouedec in https://github.com/huggingface/trl/pull/5236
prompts in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225
truncate_prompt_tokens for vLLM 0.17.0 by @winglian in https://github.com/huggingface/trl/pull/5248
rollout_func from _generate_single_turn to _generate by @qgallouedec in https://github.com/huggingface/trl/pull/5232
RewardFunc type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246
_generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5239
_generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5240
bfd-requeue to bfd_split by @mariosasko in https://github.com/huggingface/trl/pull/5189
vllm_mode to "colocate" and add v0→v1 migration guide by @qgallouedec in https://github.com/huggingface/trl/pull/5255
grpo_trainer.py): Variational Sequence-Level Soft Policy Optimization (VESPO) by @casinca in https://github.com/huggingface/trl/pull/5199
accuracy_reward crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281
hasattr and getattr with defaults in AGENTS.md by @qgallouedec in https://github.com/huggingface/trl/pull/5294
RewardFunc type annotation to allow None values in reward list by @qgallouedec in https://github.com/huggingface/trl/pull/5297
Json() type for tool calling dataset format by @lhoestq in https://github.com/huggingface/trl/pull/5307
Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0rc1
prepare_multimodal_messages to support tool_calls and tool role by @alvarobartt in https://github.com/huggingface/trl/pull/5212
prompts in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225
rollout_func from _generate_single_turn to _generate by @qgallouedec in https://github.com/huggingface/trl/pull/5232
_generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5239
_generate_single_turn by @qgallouedec in https://github.com/huggingface/trl/pull/5240
Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v0.29.1
GRPOTrainer now accepts an environment_factory argument, allowing users to specify a custom environment class for training. This enables more flexible and diverse training scenarios by letting users define their own environments with specific dynamics and reward structures.
from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": f"Increment the counter by {i}."}] for i in range(1, 7)]
})

def reward_func(environments, **kwargs):
    return [env.counter for env in environments]

class IncrementEnv:
    def reset(self):
        self.counter = 0

    def increment(self, step: int) -> int:
        """
        Increment the internal counter.

        Args:
            step: Value to add to the counter.

        Returns:
            The updated counter value.
        """
        self.counter += step
        return self.counter

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(chat_template_kwargs={"enable_thinking": False}),
    train_dataset=dataset,
    reward_funcs=reward_func,
    environment_factory=IncrementEnv,
)
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/5093
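To see how the environment and the reward function fit together without launching training, the sketch below replays scripted tool calls against fresh environments. `run_fake_rollouts` is a hypothetical stand-in written for this example; it assumes one environment per rollout and says nothing about how GRPOTrainer actually parses or dispatches tool calls.

```python
class IncrementEnv:
    def reset(self):
        self.counter = 0

    def increment(self, step: int) -> int:
        """Increment the internal counter and return the new value."""
        self.counter += step
        return self.counter


def reward_func(environments, **kwargs):
    return [env.counter for env in environments]


def run_fake_rollouts(environment_factory, tool_calls_per_rollout):
    """Hypothetical stand-in for the trainer: one fresh environment per rollout."""
    environments = []
    for tool_calls in tool_calls_per_rollout:
        env = environment_factory()
        env.reset()
        for name, arguments in tool_calls:
            getattr(env, name)(**arguments)  # dispatch the tool call to the env method
        environments.append(env)
    return environments


# Pretend the model emitted these tool calls for two rollouts.
scripted_calls = [
    [("increment", {"step": 2})],
    [("increment", {"step": 1}), ("increment", {"step": 3})],
]
envs = run_fake_rollouts(IncrementEnv, scripted_calls)
print(reward_func(environments=envs))  # [2, 4]
```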
TRL introduces agent-native CLI Integration: trl-training, a first-class Agent Skill that exposes TRL’s training workflows (SFT, DPO, GRPO, etc.) in a structured, agent-readable format. The skill is packaged directly with the trl library and can be installed via the CLI:
# Install into the project's agent directory (default scope=project), by agent name: claude, codex, opencode
trl skills install trl-training --target <agent>
This enables AI agents to safely and reproducibly execute TRL training workflows using a well-defined interface.
Skills can be installed at the project or global scope, and support explicit targets and overwrite controls.
launch_args for all trainers by @qgallouedec in https://github.com/huggingface/trl/pull/5059
num_labels to 1 in causal model initialization for RewardTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5066
None in get_trackio_space_url() to prevent errors by @qgallouedec in https://github.com/huggingface/trl/pull/5115
trl <command> --help TypeError caused by unescaped % in TrainingArguments help strings by @albertvillanova in https://github.com/huggingface/trl/pull/5135
SFTTrainer support for single-image data by @qgallouedec in https://github.com/huggingface/trl/pull/5132
grpo_trainer.md by @casinca in https://github.com/huggingface/trl/pull/5047
RewardTrainer collator from chosen/rejected_input_ids to chosen/rejected_ids by @qgallouedec in https://github.com/huggingface/trl/pull/5179
RewardTrainer tests by @qgallouedec in https://github.com/huggingface/trl/pull/5060
get_training_chat_template by @qgallouedec in https://github.com/huggingface/trl/pull/5108
train_dataset is required by @albertvillanova in https://github.com/huggingface/trl/pull/5171
grpo_trainer.md by @casinca in https://github.com/huggingface/trl/pull/5047
launch_args for all trainers by @qgallouedec in https://github.com/huggingface/trl/pull/5059
num_labels to 1 in causal model initialization for RewardTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5066
RewardTrainer tests by @qgallouedec in https://github.com/huggingface/trl/pull/5060
get_training_chat_template by @qgallouedec in https://github.com/huggingface/trl/pull/5108
None in get_trackio_space_url() to prevent errors by @qgallouedec in https://github.com/huggingface/trl/pull/5115
trl <command> --help TypeError caused by unescaped % in TrainingArguments help strings by @albertvillanova in https://github.com/huggingface/trl/pull/5135
environment_factory to GRPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5093
SFTTrainer support for single-image data by @qgallouedec in https://github.com/huggingface/trl/pull/5132
train_dataset is required by @albertvillanova in https://github.com/huggingface/trl/pull/5171
RewardTrainer collator from chosen/rejected_input_ids to chosen/rejected_ids by @qgallouedec in https://github.com/huggingface/trl/pull/5179
Full Changelog: https://github.com/huggingface/trl/compare/v0.28.0...v0.29.0
is_conversational by @qgallouedec in https://github.com/huggingface/trl/pull/4923
Fix type hint in openenv/utils.py: fallback for no vLLM installed case by @Datta0 in https://github.com/huggingface/trl/pull/4868
Fix: undefined current_gradient_accumulation_steps by @qgallouedec in https://github.com/huggingface/trl/pull/4852
Fix import path for get_open_port based on vLLM version by @qgallouedec in https://github.com/huggingface/trl/pull/4883
device_map init consistency in GRPO/RLOO/KTO by @qgallouedec in https://github.com/huggingface/trl/pull/4909
warnings_issued by @qgallouedec in https://github.com/huggingface/trl/pull/4960
DPOConfig by @qgallouedec in https://github.com/huggingface/trl/pull/4969
warmup_ratio with warmup_steps by @qgallouedec in https://github.com/huggingface/trl/pull/4983
Test distributed training for RewardTrainer, RLOOTrainer and GRPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4823
Transformers v5 release: extend xfail condition for TestGRPOTrainer.test_training_vlm_and_liger and update version checks by @qgallouedec in https://github.com/huggingface/trl/pull/4898
compute_metrics in SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4950
RewardTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4959
compute_metrics in RewardTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4958
Update CITATION.cff by @qgallouedec in https://github.com/huggingface/trl/pull/4856
Rearrange variable assignments in DataCollatorForVisionLanguageModeling by @qgallouedec in https://github.com/huggingface/trl/pull/4911
Fix help text formatting for max_length in RewardConfig and SFTConfig by @qgallouedec in https://github.com/huggingface/trl/pull/4910
sync_ref_model in GRPOTrainer and RLOOTrainer when using PEFT models by @qgallouedec in https://github.com/huggingface/trl/pull/4912
⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/4835
Support triggering CI via push to ci-* branches by @albertvillanova in https://github.com/huggingface/trl/pull/4840
Revert CI hotfix pinning transformers 4.57.4 after tiny model regeneration by @albertvillanova in https://github.com/huggingface/trl/pull/4833
Use pytest-datadir in CI tests by @albertvillanova in https://github.com/huggingface/trl/pull/4836
Refactor KTO coordinated with DPO [c/N]: Remove ref_model_init_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/4837
Fix _patch_transformers_hybrid_cache for peft by @albertvillanova in https://github.com/huggingface/trl/pull/4844
Refactor KTO [4/N]: Remove unused padding_value by @albertvillanova in https://github.com/huggingface/trl/pull/4839
Remove unused padding_value from BCO by @albertvillanova in https://github.com/huggingface/trl/pull/4846
Fix CI with dev dependencies: Mark Qwen3-VL tests as xfail by @albertvillanova in https://github.com/huggingface/trl/pull/4851
Fix: undefined current_gradient_accumulation_steps by @qgallouedec in https://github.com/huggingface/trl/pull/4852
Remove deprecated parameters by @qgallouedec in https://github.com/huggingface/trl/pull/4847
Add Nash Learning from Human Feedback paper to paper index by @kansalaman in https://github.com/huggingface/trl/pull/4860
Use pytest-datadir for accelerate config files by @albertvillanova in https://github.com/huggingface/trl/pull/4861
Update OpenEnv dependency to new version for hf jobs scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4843
Update CITATION.cff by @qgallouedec in https://github.com/huggingface/trl/pull/4856
[GRPOTrainer]: Agent Training Supports Async Tool Calls by @pramodith in https://github.com/huggingface/trl/pull/4742
Enhance GRPO documentation with scaling notes by @javadtaghia in https://github.com/huggingface/trl/pull/4849
Add retry strategy to vLLM Client for increased robustness by @apalmas-saifh in https://github.com/huggingface/trl/pull/4845
Update generate_tiny_models.py: CohereForAI -> CohereLabs by @Michellehbn in https://github.com/huggingface/trl/pull/4877
Refactor KTO coordinated with DPO [e/N]: Remove label_pad_token_id by @albertvillanova in https://github.com/huggingface/trl/pull/4875
Refactor KTO coordinated with DPO [d/N]: Remove base_model_attribute_name by @albertvillanova in https://github.com/huggingface/trl/pull/4862
Fix type hint in openenv/utils.py: fallback for no vLLM installed case by @Datta0 in https://github.com/huggingface/trl/pull/4868
Update transformer version checks and documentation for lr_scheduler_kwargs workaround by @qgallouedec in https://github.com/huggingface/trl/pull/4876
fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in https://github.com/huggingface/trl/pull/4857
Remove label_pad_token_id from experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4878
Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in https://github.com/huggingface/trl/pull/4880
Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in https://github.com/huggingface/trl/pull/4873
Enable vLLM sleep mode for generation in Online DPO by @winglian in https://github.com/huggingface/trl/pull/4882
Test distributed training for RewardTrainer, RLOOTrainer and GRPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4823
Mark ZeRO 2 as xfail in distributed tests due to current failure by @qgallouedec in https://github.com/huggingface/trl/pull/4885
Fix import path for get_open_port based on vLLM version by @qgallouedec in https://github.com/huggingface/trl/pull/4883
Fix RewardTrainer's results not reproducible by @liyc-ai in https://github.com/huggingface/trl/pull/4887
GOLD training speed up by @141forever in https://github.com/huggingface/trl/pull/4888
Transformers v5 release: extend xfail condition for TestGRPOTrainer.test_training_vlm_and_liger and update version checks by @qgallouedec in https://github.com/huggingface/trl/pull/4898
Fix CI NotImplementedError for bfloat16 by @albertvillanova in https://github.com/huggingface/trl/pull/4902
Fix CI AssertionError: Parameter has not changed by @albertvillanova in https://github.com/huggingface/trl/pull/4904
Refactor vLLM generation [1/N]: Extract vLLM generation by @albertvillanova in https://github.com/huggingface/trl/pull/4700
Created new PTT integration docs as requested by @adityachallapally in https://github.com/huggingface/trl/pull/4907
Fix CI TypeError in llm-blender tests by @albertvillanova in https://github.com/huggingface/trl/pull/4919
Rearrange variable assignments in DataCollatorForVisionLanguageModeling by @qgallouedec in https://github.com/huggingface/trl/pull/4911
Fix help text formatting for max_length in RewardConfig and SFTConfig by @qgallouedec in https://github.com/huggingface/trl/pull/4910
device_map init consistency in GRPO/RLOO/KTO by @qgallouedec in https://github.com/huggingface/trl/pull/4909
Comment about overriding prediction_step in GRPOTrainer and RLOOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4913
Remove gradient checkpointing option from various training scripts by @qgallouedec in https://github.com/huggingface/trl/pull/4905
docs: add DoRA (2402.09353) to Paper Index by @billycrapediem in https://github.com/huggingface/trl/pull/4892
Fix CI AssertionError: assert not True by @albertvillanova in https://github.com/huggingface/trl/pull/4921
Fix CI ValueError for 0 temperature by @albertvillanova in https://github.com/huggingface/trl/pull/4916
Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in https://github.com/huggingface/trl/pull/4908
Remove chat template setup in dpo_vlm.py by @qgallouedec in https://github.com/huggingface/trl/pull/4906
Update learning rate comments and add assertions for reference model parameters in GRPO and RLOO tests by @qgallouedec in https://github.com/huggingface/trl/pull/4914
Add validation for sync_ref_model in GRPOTrainer and RLOOTrainer when using PEFT models by @qgallouedec in https://github.com/huggingface/trl/pull/4912
Support tool call data in is_conversational by @qgallouedec in https://github.com/huggingface/trl/pull/4923
Set model dtype to float32 in tests of trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4924
Require transformers<5 with PairRMJudge by @albertvillanova in https://github.com/huggingface/trl/pull/4926
Move VLLMClient to generation module by @albertvillanova in https://github.com/huggingface/trl/pull/4928
Set model dtype to float32 in experimental tests of trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4925
Fix profiling of VLLMGeneration.sync_weights by @albertvillanova in https://github.com/huggingface/trl/pull/4931
Fix import statement for import_utils in vllm_client.py by @qgallouedec in https://github.com/huggingface/trl/pull/4932
Set default top_k to 0 in VLLMClient by @albertvillanova in https://github.com/huggingface/trl/pull/4927
[GRPO] Add parquet logging for completions with individual rewards by @qgallouedec in https://github.com/huggingface/trl/pull/4818
Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in https://github.com/huggingface/trl/pull/4942
Remove ref_model_init_kwargs from experimental BCO by @albertvillanova in https://github.com/huggingface/trl/pull/4946
Update wordle.py example with masking of env tokens by @sergiopaniego in https://github.com/huggingface/trl/pull/4895
Fix PPO run_name parameter not taking effect by @mel3c in https://github.com/huggingface/trl/pull/4945
Minor fix docs style by @albertvillanova in https://github.com/huggingface/trl/pull/4953
Add test for training with compute_metrics in SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4950
Remove access to warnings_issued by @qgallouedec in https://github.com/huggingface/trl/pull/4960
NeMo-Gym Integration by @cmunley1 in https://github.com/huggingface/trl/pull/4848
Add test for tool call data in RewardTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4959
Add test for training with compute_metrics in RewardTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4958
Remove max_prompt_length from experimental PRM by @albertvillanova in https://github.com/huggingface/trl/pull/4963
Remove max_prompt_length from experimental BCO by @albertvillanova in https://github.com/huggingface/trl/pull/4964
Remove max_prompt_length from experimental CPO by @albertvillanova in https://github.com/huggingface/trl/pull/4965
Remove max_prompt_length from experimental ORPO by @albertvillanova in https://github.com/huggingface/trl/pull/4966
Revert change in GRPO from NeMo-Gym Integration by @qgallouedec in https://github.com/huggingface/trl/pull/4970
Fix test_train_with_chat_template_kwargs by @qgallouedec in https://github.com/huggingface/trl/pull/4971
Remove padding_value from experimental CPO and use pad_token_id by @albertvillanova in https://github.com/huggingface/trl/pull/4962
Remove truncation from tokenizer calls if no max_length by @albertvillanova in https://github.com/huggingface/trl/pull/4972
Set specific OpenEnv version when installed by @sergiopaniego in https://github.com/huggingface/trl/pull/4978
Fix add_column in test_train_with_chat_template_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/4979
Support truncated completions in GRPO multi-turn training by @albertvillanova in https://github.com/huggingface/trl/pull/4976
Replace torch.allclose with torch.testing.assert_close by @qgallouedec in https://github.com/huggingface/trl/pull/4977
Simplify instructions of installation of OpenEnv by @sergiopaniego in https://github.com/huggingface/trl/pull/4980
Deprecate parameters in DPOConfig by @qgallouedec in https://github.com/huggingface/trl/pull/4969
[CI] Disallow installation of transformers 5.1.0 due to compatibility issues with DeepSpeed by @qgallouedec in https://github.com/huggingface/trl/pull/4982
Replace warmup_ratio with warmup_steps by @qgallouedec in https://github.com/huggingface/trl/pull/4983
Pin transformers!=5.1.0 in deepspeed extra due to incompatibility by @albertvillanova in https://github.com/huggingface/trl/pull/4985
Fix passing tokenizer in test_train_with_chat_template_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/4987
Update dataset configuration name in toolcall dataset loading by @qgallouedec in https://github.com/huggingface/trl/pull/4984
Use local variable instead of attribute in collator tests by @qgallouedec in https://github.com/huggingface/trl/pull/4957
Fix import of AutoModelForCausalLMWithValueHead from experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4990
Assert chat_template is applied in test_train_with_chat_template_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/4991
Fix deprecation of DPOConfig.max_completion_length by @albertvillanova in https://github.com/huggingface/trl/pull/4992
Fix post_init warning stacklevel to 3 by @albertvillanova in https://github.com/huggingface/trl/pull/4993
Fix ZeRO-3 + PEFT + gradient checkpointing by @qgallouedec in https://github.com/huggingface/trl/pull/4951
Add GitHub Actions workflow for testing against Transformers branch by @qgallouedec in https://github.com/huggingface/trl/pull/4995
Add distributed smoke tests workflow for Transformers branch by @qgallouedec in https://github.com/huggingface/trl/pull/4996
Update NeMo-Gym to use env_mask by @cmunley1 in https://github.com/huggingface/trl/pull/4986
Update sampling mode to token level for safety by @sergiopaniego in https://github.com/huggingface/trl/pull/4989
perf: Qwen SAPO loss optimization by @casinca in https://github.com/huggingface/trl/pull/4956
Fix GRPO tool calling for corrupted tool calls by @akshayballal95 in https://github.com/huggingface/trl/pull/4890
Add sanitize_logprob function for NaN handling in vLLM log probabilities by @qgallouedec in https://github.com/huggingface/trl/pull/5001
[tests] Remove xfail for transformers version >= 5.0.0 due to upstream bug resolution by @qgallouedec in https://github.com/huggingface/trl/pull/5000
docs: add CGPO/Mixture of Judges (2409.20370) to Paper Index + link ref to AllTrueJudge by @nabin2004 in https://github.com/huggingface/trl/pull/5002
Filter CI SWIG deprecation warnings by @albertvillanova in https://github.com/huggingface/trl/pull/5004
Fix CI TRLExperimentalWarning in regular tests by @albertvillanova in https://github.com/huggingface/trl/pull/5007
Add support for nested_gather in OnlineDPOTrainer for transformers v5.2.0 and above by @qgallouedec in https://github.com/huggingface/trl/pull/4981
Fix CI FutureWarning: ref_model_init_kwargs is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5009
Fix typo in DPO max_prompt_length deprecation warning message by @albertvillanova in https://github.com/huggingface/trl/pull/5020
Fix vision model prompt truncation bug in DPOTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/5023
Pin transformers < 5 in judges extra due to incompatibility by @albertvillanova in https://github.com/huggingface/trl/pull/5024
Fix CI FutureWarning: generate_during_eval is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5017
Fix typo in xfail test reason by @albertvillanova in https://github.com/huggingface/trl/pull/5028
Fix CI FutureWarning: rpo_alpha is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5011
Fix CI FutureWarning: use_logits_to_keep is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5013
Mark Qwen3VL tests as xfail for transformers 5.0.x by @albertvillanova in https://github.com/huggingface/trl/pull/5029
[CI] Silence PyTorch JIT and DataLoader deprecation warnings by @qgallouedec in https://github.com/huggingface/trl/pull/4999
Add length-unbiased GRPO loss (LUSPO) by @Haseebasif7 in https://github.com/huggingface/trl/pull/4988
Fix CI FutureWarning: tools is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5015
Filter max_prompt_length UserWarning in all test cases by @albertvillanova in https://github.com/huggingface/trl/pull/5035
Fix CI FutureWarning: max_prompt_length is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5019
Allow testing with transformers 5.1.0 via xfail marks by @albertvillanova in https://github.com/huggingface/trl/pull/5034
Rename AOT loss type 'aot_pair' to 'aot_unpaired' in DPO by @qgallouedec in https://github.com/huggingface/trl/pull/5038
Deprecate string usage for ref_model in DPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5040
Deprecate FDivergenceType in DPOConfig; update f_divergence_type to use string values by @qgallouedec in https://github.com/huggingface/trl/pull/5039
Fix multiprocessing start method to 'spawn' for test compatibility with Python 3.12+ by @qgallouedec in https://github.com/huggingface/trl/pull/5036
Add Online Direct Preference Optimization section to paper index by @qgallouedec in https://github.com/huggingface/trl/pull/5037
Release: 0.28 by @albertvillanova in https://github.com/huggingface/trl/pull/5043
Full Changelog: https://github.com/huggingface/trl/compare/v0.27.0...v0.28.0
Remove access to warnings_issued by @qgallouedec in #4960
Full Changelog: https://github.com/huggingface/trl/compare/v0.27.1...v0.27.2
current_gradient_accumulation_steps by @qgallouedec in https://github.com/huggingface/trl/pull/4852
Full Changelog: https://github.com/huggingface/trl/compare/v0.27.0...v0.27.1
vllm_group_port argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in https://github.com/huggingface/trl/pull/4545
forward_masked_logits function by @qgallouedec in https://github.com/huggingface/trl/pull/4729
AutoModelForCausalLMWithValueHead and AutoModelForSeq2SeqLMWithValueHead to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4654
experimental.utils by @qgallouedec in https://github.com/huggingface/trl/pull/4667
DataCollatorForChatML to experimental.utils by @qgallouedec in https://github.com/huggingface/trl/pull/4668
add_bos_token_if_needed and add_eos_token_if_needed to experimental.utils by @qgallouedec in https://github.com/huggingface/trl/pull/4674
truncate_right and SIMPLE_CHAT_TEMPLATE to experimental.utils by @qgallouedec in https://github.com/huggingface/trl/pull/4677
prepare_model_for_kbit_training, enable_gradient_checkpointing, prepare_peft_model to experimental.utils by @qgallouedec in https://github.com/huggingface/trl/pull/4704
get_reward function to experimental.utils by @qgallouedec in https://github.com/huggingface/trl/pull/4683
num_generations_eval=1 in the calculation of the advantage by @qgallouedec in https://github.com/huggingface/trl/pull/4662
num_generations_eval is specified and different than num_generations by @apalmas-saifh in https://github.com/huggingface/trl/pull/4682
generation_config for tiny model uploads by @qgallouedec in https://github.com/huggingface/trl/pull/4643
qwen3_schema by @mattbui in https://github.com/huggingface/trl/pull/4709
HybridCache in Liger-Kernel with transformers v5 by @qgallouedec in https://github.com/huggingface/trl/pull/4798
args by @carlyou in https://github.com/huggingface/trl/pull/4801
grpo_trainer.md): Added Qwen SAPO details under Loss Types by @casinca in https://github.com/huggingface/trl/pull/4681
MergeModelCallback from import structure by @qgallouedec in https://github.com/huggingface/trl/pull/4664
ChatMlSpecialTokens by @qgallouedec in https://github.com/huggingface/trl/pull/4666
_win_rate_completions_df function from callbacks by @qgallouedec in https://github.com/huggingface/trl/pull/4672
DbrxForCausalLM support by @qgallouedec in https://github.com/huggingface/trl/pull/4799
compute_accuracy to PRM Trainer file by @qgallouedec in https://github.com/huggingface/trl/pull/4656
clone_chat_template to chat_template_utils by @qgallouedec in https://github.com/huggingface/trl/pull/4653
GeometricMixtureWrapper to nash_md_trainer.py by @qgallouedec in https://github.com/huggingface/trl/pull/4670
exact_div, print_rich_table, truncate_response, forward to ppo_trainer by @qgallouedec in https://github.com/huggingface/trl/pull/4676
OnPolicyConfig and PPOConfig and move OnlineTrainerState by @qgallouedec in https://github.com/huggingface/trl/pull/4671
AutoModelForCausalLMWithValueHead to test_ppo_trainer by @qgallouedec in https://github.com/huggingface/trl/pull/4678
generate and batch_generation to ppo_trainer.py by @qgallouedec in https://github.com/huggingface/trl/pull/4675
TrainerCallback from top-level transformers by @qgallouedec in https://github.com/huggingface/trl/pull/4694
top_k parameter in OnlineDPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4714
PeftModel + peft_config in trainers by @qgallouedec in https://github.com/huggingface/trl/pull/4713
TestParseResponse by @qgallouedec in https://github.com/huggingface/trl/pull/4736
GuidedDecodingParams with StructuredOutputsParams in sampling parameter configuration by @qgallouedec in https://github.com/huggingface/trl/pull/4797
Full Changelog: https://github.com/huggingface/trl/compare/v0.26.0...v0.27.0
Small patch release containing the following changes:
Full Changelog: https://github.com/huggingface/trl/compare/v0.26.1...v0.26.2
num_generations_eval is specified and different than num_generations by @apalmas-saifh in https://github.com/huggingface/trl/pull/4682
Full Changelog: https://github.com/huggingface/trl/compare/v0.26.0...v0.26.1
GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.
from datasets import Dataset
from trl import GRPOTrainer


def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


dataset = Dataset.from_list(
    [
        {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
        {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
        {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
        {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
        {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
        {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
    ]
)


def accuracy(completions, answer, **kwargs):
    predictions = [completion[-1]["content"] for completion in completions]
    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
    return rewards


trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    tools=[multiply],
    reward_funcs=accuracy,
)
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/4300
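To see what the accuracy reward in the example above returns, the sketch below calls it on two hand-written completions. The completion texts are hypothetical and were not produced by a model run; only the function body comes from the example.

```python
def accuracy(completions, answer, **kwargs):
    # Same logic as the reward function above: take the last message of each
    # completion and check whether the expected answer appears in its text.
    predictions = [completion[-1]["content"] for completion in completions]
    return [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]


# Hypothetical completions, written by hand for illustration.
completions = [
    [{"role": "assistant", "content": "3 multiplied by 4 is 12."}],
    [{"role": "assistant", "content": "7 times 8 is 54."}],
]
rewards = accuracy(completions, answer=[12, 56])
print(rewards)  # [1.0, 0.0]
```

Note that the check is a plain substring match, so an expected answer of 12 would also be rewarded for a completion containing "120". A stricter parser may be preferable outside of a toy setup.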
CISPO loss was first introduced in the MiniMax-M1 paper. The ScaleRL paper subsequently showed that CISPO scales best in terms of performance and efficiency as models are trained for longer.
GRPOTrainer now supports the CISPO loss using loss_type="cispo" in the GRPOConfig.
by @pramodith in https://github.com/huggingface/trl/pull/4495
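As a rough illustration of the idea behind CISPO, the sketch below computes a per-token loss value in plain Python. It assumes the formulation described in the MiniMax-M1 paper, where the importance-sampling ratio is clipped from above and then used as a constant weight on the log-probability term. The function name, the max_weight parameter and its default, and the numbers are hypothetical; this is not TRL's implementation.

```python
import math


def cispo_token_loss(logp_new, logp_old, advantage, max_weight=5.0):
    """CISPO-style loss value for a single token (illustrative sketch only)."""
    # Importance-sampling ratio between the current and the sampling policy.
    ratio = math.exp(logp_new - logp_old)
    # Clip the ratio from above. In an autograd framework this weight would be
    # detached (stop-gradient), so gradients flow only through logp_new.
    weight = min(ratio, max_weight)
    return -weight * advantage * logp_new


# On-policy token: the ratio is 1, so the loss reduces to -advantage * logp_new.
on_policy = cispo_token_loss(logp_new=-0.5, logp_old=-0.5, advantage=1.0)
print(on_policy)  # 0.5

# Strongly off-policy token: the weight is capped at max_weight instead of
# growing with the ratio.
off_policy = cispo_token_loss(logp_new=-0.1, logp_old=-3.0, advantage=1.0)
```

The design point, as far as the sketch goes, is that a token with a large ratio still contributes a gradient with bounded weight, whereas PPO-style clipping would drop its gradient entirely.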
When the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.
by @sergiopaniego in https://github.com/huggingface/trl/pull/4496
TRL now includes a reasoning reward function:
from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
        }
    ],
]
reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0]
Like any other reward function, it can be used in GRPOTrainer or RLOOTrainer.
from trl import GRPOTrainer
from trl.rewards import reasoning_accuracy_reward

trainer = GRPOTrainer(
    ...,
    reward_funcs=reasoning_accuracy_reward,
)
by @lewtun in https://github.com/huggingface/trl/pull/4563
shuffle_dataset option to SFTTrainer
You can now shuffle the dataset in SFTTrainer by setting the shuffle_dataset argument to True in SFTConfig. This is useful when the dataset features high similarity between consecutive samples.
from trl import SFTTrainer, SFTConfig
SFTConfig(shuffle_dataset=True)
by @qgallouedec in https://github.com/huggingface/trl/pull/4564
Soft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.
You can now use SAPO loss in GRPOTrainer by setting loss_type="sapo" in the GRPOConfig.
by @pramodith in https://github.com/huggingface/trl/pull/4600
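The soft gate described for SAPO above can be sketched in a few lines. This assumes the sigmoid form given in the SAPO paper, gate(r) = (4 / τ) · σ(τ (r − 1)), where r is the importance ratio and τ a temperature. The function names and the temperature values are illustrative only and are not part of TRL's API.

```python
import math


def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def soft_gate(ratio, temperature=1.0):
    """Smooth replacement for the clipped ratio (illustrative sketch only)."""
    return 4.0 / temperature * _sigmoid(temperature * (ratio - 1.0))


def gate_gradient_weight(ratio, temperature=1.0):
    """Derivative of soft_gate with respect to the ratio: 4 * s * (1 - s)."""
    s = _sigmoid(temperature * (ratio - 1.0))
    return 4.0 * s * (1.0 - s)


# At ratio == 1 (on-policy) the update keeps its full weight.
print(gate_gradient_weight(1.0))  # 1.0

# The weight decays smoothly as the ratio moves away from 1, and a higher
# temperature makes the decay sharper.
weights = [gate_gradient_weight(r) for r in (1.0, 1.5, 2.0, 3.0)]
sharper = gate_gradient_weight(2.0, temperature=2.0)
```

With hard clipping, the gradient weight would instead drop abruptly to zero once the ratio leaves the clipping band. Here it shrinks continuously, which is the "continuous trust region" mentioned above.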
num_generations_eval parameter for efficient evaluation by @mingxuetian in https://github.com/huggingface/trl/pull/4458
WinRateCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4558
device_map and dtype to "auto" by default by @qgallouedec in https://github.com/huggingface/trl/pull/4509
flash-attn to flash-attn2 by @qgallouedec in https://github.com/huggingface/trl/pull/4514
device_map=None for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in https://github.com/huggingface/trl/pull/4551
num_completions to num_generations by @pramodith in https://github.com/huggingface/trl/pull/4515
rnj_1_instruct notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4646
wandb_log_unique_prompts with log_unique_prompts by @taha-yassine in https://github.com/huggingface/trl/pull/4508
prepare_model_for_kbit_training by @sergiopaniego in https://github.com/huggingface/trl/pull/4457
lr_scheduler_kwargs dtype issue in Transformers 4.57.0 by @qgallouedec in https://github.com/huggingface/trl/pull/4513
shuffle_dataset option to SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4564
Full Changelog: https://github.com/huggingface/trl/compare/v0.25.0...v0.26.0