## DistillationTrainer for efficient on-policy distillation

Read the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-distillation-trainer
The new `DistillationTrainer` implements on-policy knowledge distillation as described in *On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes*. It extends the ideas behind `GKDTrainer` with three key optimizations: a generation buffer that decouples the training microbatch size from the generation batch size (up to a 40x speedup), support for an external teacher server so the teacher no longer needs to fit on the training GPUs, and binary-encoded logprob payloads that cut teacher-to-student transfer sizes by roughly 5x.
```python
from datasets import load_dataset
from trl.experimental.distillation import DistillationConfig, DistillationTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda x: {"messages": [{"role": "user", "content": x["question"]}]},
    remove_columns=dataset.column_names,
)

trainer = DistillationTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    teacher_model="Qwen/Qwen2.5-7B-Instruct",
    args=DistillationConfig(
        output_dir="results/distill-qwen-gsm8k",
        lmbda=1.0,  # fully on-policy (student generates)
        beta=1.0,   # reverse KL
        teacher_model_init_kwargs={"torch_dtype": "bfloat16"},
    ),
    train_dataset=dataset,
)
trainer.train()
```
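To see why binary-encoded logprob payloads help, here is a self-contained illustration of the general trick (this is not TRL's actual wire format): sending floats as raw little-endian float32 bytes instead of a JSON array costs 4 bytes per value rather than roughly 20 characters of text.

```python
import json
import random
import struct

# Hypothetical illustration (not TRL's actual wire format): pack teacher
# logprobs as raw float32 bytes instead of a JSON array of Python floats.
# float32 precision is ample for logprobs, and each value shrinks from
# ~20 characters of text to 4 bytes, roughly a 5x reduction.
random.seed(0)
logprobs = [random.uniform(-20.0, 0.0) for _ in range(4096)]

json_payload = json.dumps(logprobs).encode("utf-8")
binary_payload = struct.pack(f"<{len(logprobs)}f", *logprobs)

ratio = len(json_payload) / len(binary_payload)
print(f"JSON: {len(json_payload)} B, binary: {len(binary_payload)} B, ~{ratio:.1f}x smaller")
```

The same idea generalizes to top-k logprob lists per token: the token ids and values can each be packed as fixed-width arrays and decoded with a single `struct.unpack` on the other side.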
by @cmpatino in https://github.com/huggingface/trl/pull/5407, https://github.com/huggingface/trl/pull/5500 and https://github.com/huggingface/trl/pull/5501
## AsyncGRPOTrainer

`AsyncGRPOTrainer` now supports a chunked LM-head path that computes per-token log-probs and entropy via an online logsumexp, without materializing the full `[N, V]` logits tensor. Combined with `completion_mask` filtering to skip prompt tokens, this yields large memory savings on long sequences: up to 44x lower peak-allocated memory on an 8192-token sequence.
| `chunk_lm_head_size` | Peak alloc (GB) | Reduction | Wall time (ms) |
|---|---|---|---|
| None (baseline) | 18.55 | 1.00x | 808.7 |
| 4096 | 0.42 | 44.32x | 459.0 |
| 8192 | 0.76 | 24.34x | 393.0 |
Enable it via the new `chunk_lm_head_size` option in `AsyncGRPOConfig`:
```python
from trl.experimental.async_grpo import AsyncGRPOConfig, AsyncGRPOTrainer

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=AsyncGRPOConfig(chunk_lm_head_size=4096),
    ...
)
```
Note: `chunk_lm_head_size` is mutually exclusive with `use_liger_kernel`, since both replace the LM-head forward pass.
by @AmineDiro in https://github.com/huggingface/trl/pull/5349
## `{% generation %}` support in training chat templates

SFT with `assistant_only_loss=True` requires chat templates to include `{% generation %}` / `{% endgeneration %}` markers so that `return_assistant_tokens_mask=True` produces correct masks. Very few models ship these markers natively, so users hit a cryptic error when enabling assistant-only loss with models like Qwen3, Llama 3, or GPT-OSS.

SFTTrainer now automatically swaps in a patched training chat template when the original template lacks generation markers, so no manual template surgery is required. Training templates are shipped for Qwen2.5, Qwen3, Llama 3, and GPT-OSS, stored as standalone `.jinja` files under `trl/chat_templates/` for readability, diffability, and editor syntax highlighting.
```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(assistant_only_loss=True),  # now just works
    train_dataset=dataset,
)
trainer.train()
```
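For intuition, this is roughly what the markers look like inside a template. The fragment below is a simplified sketch, not the exact template TRL ships: the point is that wrapping only the assistant turn in `{% generation %}` / `{% endgeneration %}` is what lets `return_assistant_tokens_mask=True` mask out everything else.

```jinja
{#- Simplified chat-template fragment (not the exact template TRL ships) -#}
{%- for message in messages -%}
{%- if message['role'] == 'assistant' -%}
<|im_start|>assistant
{% generation %}{{ message['content'] }}<|im_end|>{% endgeneration %}
{%- else -%}
<|im_start|>{{ message['role'] }}
{{ message['content'] }}<|im_end|>
{%- endif -%}
{%- endfor -%}
```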
by @qgallouedec in https://github.com/huggingface/trl/pull/5459, https://github.com/huggingface/trl/pull/5470, by @RudrenduPaul in https://github.com/huggingface/trl/pull/5493 and https://github.com/huggingface/trl/pull/5522, and by @casinca in https://github.com/huggingface/trl/pull/5484
Agent training now supports a broader family of models via native tool-call response schemas:
A new `supports_tool_calling()` utility detects whether a tokenizer/processor can render a full tool-calling turn, and `GRPOTrainer` now validates tool support at initialization, raising a clear error upfront instead of failing cryptically mid-training.
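Conceptually, such a check works by attempting to render a representative tool-calling exchange and seeing whether the chat template accepts it. The sketch below is an illustration of that idea only, not TRL's actual implementation (the function name `can_render_tool_turn` and the probe messages are assumptions; the real utility is `supports_tool_calling()`):

```python
def can_render_tool_turn(tokenizer) -> bool:
    """Probe whether a tokenizer's chat template can render a full
    tool-calling exchange: user -> assistant tool call -> tool result.
    Sketch only, not TRL's actual supports_tool_calling() implementation."""
    messages = [
        {"role": "user", "content": "What's the weather in Paris?"},
        {
            "role": "assistant",
            "tool_calls": [
                {
                    "type": "function",
                    "function": {"name": "get_weather", "arguments": {"city": "Paris"}},
                }
            ],
        },
        {"role": "tool", "name": "get_weather", "content": "18 degrees C, sunny"},
    ]
    try:
        rendered = tokenizer.apply_chat_template(messages, tokenize=False)
    except Exception:
        return False
    # A template that renders but silently drops the tool call doesn't count.
    return "get_weather" in rendered
```

Running this kind of probe once at trainer initialization is what turns a cryptic mid-training template failure into an immediate, actionable error.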
by @qgallouedec in https://github.com/huggingface/trl/pull/5462, https://github.com/huggingface/trl/pull/5464, https://github.com/huggingface/trl/pull/5463, https://github.com/huggingface/trl/pull/5469 and https://github.com/huggingface/trl/pull/5454
`environment_factory` tool methods can now return multimodal content blocks (images + text) for VLM training. Previously, tool responses were always converted to `str(result)`, discarding any visual information. Now tools can return content-block lists with images, and the trainer handles them end to end through tokenization, generation, and the forward pass, including correct `pixel_values` plumbing.
```python
class ScreenshotEnv:
    def take_screenshot(self) -> list[dict]:
        return [
            {"type": "image", "image": self.browser.screenshot()},
            {"type": "text", "text": "Current page state"},
        ]
```
The OpenEnv `browsergym.py` example has been migrated to this pattern, and a new `carla_vlm.py` example demonstrates VLM training against CARLA with camera-image tool responses.
by @sergiopaniego in https://github.com/huggingface/trl/pull/5323 and https://github.com/huggingface/trl/pull/5437, and by @qgallouedec in https://github.com/huggingface/trl/pull/5448
`accuracy_reward` and `reasoning_accuracy_reward` now emit extra diagnostic columns (`solution`, `gold_parsed`, `answer_parsed`) via the `log_extra` callback introduced in v1.0.0. These show up in the rich completions table, making it much easier to debug why a reward was (or wasn't) assigned.
```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=GRPOConfig(log_completions=True),
    train_dataset=dataset,
)
trainer.train()
```
by @qgallouedec in https://github.com/huggingface/trl/pull/5308
## What's Changed

- `KTOConfig` by @albertvillanova in https://github.com/huggingface/trl/pull/5477
- `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5424
- `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5475
- Replace `pixel_position_ids` with `image_position_ids` for Gemma 4 support by @qgallouedec in https://github.com/huggingface/trl/pull/5452
- `isinstance(part, dict)` checks in image extraction by @qgallouedec in https://github.com/huggingface/trl/pull/5439
- `_get_tool_suffix_ids` by @qgallouedec in https://github.com/huggingface/trl/pull/5440
- `prepare_deepspeed` by @albertvillanova in https://github.com/huggingface/trl/pull/5414
- `ImportError` with vllm-0.10.2 in OnlineDPO and OpenEnv by @albertvillanova in https://github.com/huggingface/trl/pull/5423
- `_get_per_token_logps_and_entropies` return type by @kashif in https://github.com/huggingface/trl/pull/5456
- `prepare_multimodal_messages` not normalizing empty string content for assistant/tool roles by @albertvillanova in https://github.com/huggingface/trl/pull/5496
- `pad_token_id` by @albertvillanova in https://github.com/huggingface/trl/pull/5487
- Replace `huggingface-cli` references with `hf` by @hanouticelina in https://github.com/huggingface/trl/pull/5486
- `truncation_mode` from experimental `truncate_dataset` by @albertvillanova in https://github.com/huggingface/trl/pull/5467
- Deprecate `keep_end` truncation mode in `DPOConfig` and `SFTConfig` (will be removed in v2.0.0; use `keep_start` instead) by @albertvillanova in https://github.com/huggingface/trl/pull/5465
- Deprecate the `pad_token` config parameter in `DPOConfig`, `SFTConfig`, and `RewardConfig` (will be removed in v2.0.0; set `tokenizer.pad_token` directly on the `processing_class` instead) by @albertvillanova in https://github.com/huggingface/trl/pull/5480
- Remove the `trl.experimental.judges` module and all judge support from trainers. Judges were experimental, unused in practice, and llm-blender (backing `PairRMJudge`) was unmaintained and incompatible with transformers v5, actively blocking v5 adoption. Everything judges did can be achieved with `reward_funcs`. `OnlineDPOTrainer`, `NashMDTrainer`, and `XPOTrainer` are now unified on reward-model scoring only. By @qgallouedec in https://github.com/huggingface/trl/pull/5485
- `carla_vlm` OpenEnv example by @sergiopaniego in https://github.com/huggingface/trl/pull/5437
- `completion_only_loss` in SFT trainer docs by @RudrenduPaul in https://github.com/huggingface/trl/pull/5494
- `DistillationTrainer` by @cmpatino in https://github.com/huggingface/trl/pull/5500
- `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5476
- `make precommit` to fix docstring style by @albertvillanova in https://github.com/huggingface/trl/pull/5436
- `test_rloo[fsdp2]`: replace non-deterministic xfail with skipif for transformers 5.4.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5403
- `input_ids` or `inputs_embeds` by @albertvillanova in https://github.com/huggingface/trl/pull/5422
- `eval_strategy` by @SunMarc in https://github.com/huggingface/trl/pull/5426
- `TypeError: 'NoneType' object is not iterable` by @albertvillanova in https://github.com/huggingface/trl/pull/5427
- `TypeError: 'NoneType' object is not iterable` by @albertvillanova in https://github.com/huggingface/trl/pull/5438
- `test_rloo[fsdp2]` after transformers 5.5.0 release by @albertvillanova in https://github.com/huggingface/trl/pull/5442
- `environment_factory` for VLM training by @sergiopaniego in https://github.com/huggingface/trl/pull/5323
- `.jinja` files by @qgallouedec in https://github.com/huggingface/trl/pull/5459
- Add `supports_tool_calling` utility and validate tool support at init by @qgallouedec in https://github.com/huggingface/trl/pull/5462
- Add `{% generation %}` support to training chat templates by @qgallouedec in https://github.com/huggingface/trl/pull/5470
- `DistillationTrainer` for efficient on-policy distillation by @cmpatino in https://github.com/huggingface/trl/pull/5407
- `{% generation %}` markers for training chat template by @casinca in https://github.com/huggingface/trl/pull/5484
- `DistillationTrainer` by @cmpatino in https://github.com/huggingface/trl/pull/5501

**Full Changelog**: https://github.com/huggingface/trl/compare/v1.0.0...v1.1.0