## SSDTrainer — Simple Self-Distillation

A new experimental `SSDTrainer` implements the method described in *Embarrassingly Simple Self-Distillation Improves Code Generation*. SSD samples completions from the model itself at a training-time temperature/truncation setting, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. No reward model, verifier, teacher model, or RL: just prompts and the model.
```python
from datasets import Dataset
from trl.experimental.ssd import SSDConfig, SSDTrainer

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "Write a function to add two numbers."}],
        [{"role": "user", "content": "Write a function to check if a number is prime."}],
    ],
})

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(
        output_dir="ssd-model",
        temperature=0.6,  # T_train from the paper
        top_k=20,
        top_p=0.95,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,
)
trainer.train()
```
by @kashif in https://github.com/huggingface/trl/pull/5505
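The `temperature`/`top_k`/`top_p` settings above are standard sampling-time logit truncation. As a rough illustration of how they combine (a toy sketch, not TRL's internal sampling code):

```python
import math

def filter_and_normalize(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Toy version of temperature + top-k + top-p (nucleus) truncation.

    Returns a dict {token_index: probability} over the surviving tokens.
    """
    # Temperature scaling, then a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    ranked = sorted(
        ((i, e / total) for i, e in enumerate(exps)),
        key=lambda pair: pair[1],
        reverse=True,
    )
    # Keep only the top_k most likely tokens.
    ranked = ranked[:top_k]
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    kept, cum = [], 0.0
    for i, p in ranked:
        kept.append((i, p))
        cum += p
        if cum >= top_p:
            break
    # Renormalize over the surviving tokens.
    z = sum(p for _, p in kept)
    return {i: p / z for i, p in kept}

dist = filter_and_normalize([2.0, 1.0, 0.5, -1.0], temperature=0.6, top_k=3, top_p=0.95)
```

Lowering the temperature below 1.0 sharpens the distribution before the `top_k`/`top_p` cutoffs are applied, which is why `T_train` matters even with truncation in place.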
## GRPOTrainer

When tool calls produce more tokens than `max_completion_length` allows, `GRPOTrainer` now rolls back the tool messages/images added in the current iteration instead of trying to truncate them. This removes ~80 lines of fragile, image-boundary-aware bookkeeping in favor of a ~15-line snapshot-and-rollback. Since overlong samples almost always get rewarded as failures anyway, the learning signal is effectively unchanged — but the code is dramatically simpler and no longer needs per-VLM-family vision-token lookup tables.
by @qgallouedec in https://github.com/huggingface/trl/pull/5521
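The snapshot-and-rollback pattern amounts to simple list bookkeeping. A minimal sketch with hypothetical names (not the actual `GRPOTrainer` code):

```python
def append_tool_results(messages, new_messages, count_tokens, max_completion_length):
    """Append this iteration's tool messages, rolling back wholesale if the
    token budget would be exceeded (toy stand-in for the trainer's behavior).

    `count_tokens` maps a message to its token length (hypothetical helper).
    Returns True if the new messages were kept, False if rolled back.
    """
    snapshot_len = len(messages)  # snapshot: where this iteration starts
    used = sum(count_tokens(m) for m in messages)
    added = sum(count_tokens(m) for m in new_messages)
    messages.extend(new_messages)
    if used + added > max_completion_length:
        del messages[snapshot_len:]  # rollback: drop everything from this iteration
        return False
    return True
```

Because overlong rollouts are scored as failures anyway, dropping the partial tool output whole is no worse for the learning signal than truncating it mid-message — and it sidesteps the question of where a truncation may legally fall inside interleaved text and image tokens.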
Continuing the effort from v1.1:
- `{% generation %}` markers, enabling assistant-only loss masking for DeepSeek-V3 models. By @RudrenduPaul in https://github.com/huggingface/trl/pull/5527

As a result of a tightened detection (see fixes below), the list of templates reported as tool-calling capable is now correct — notably, the basic Llama 3 template is no longer falsely classified as tool-calling capable.
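For context, assistant-only masking works by wrapping the assistant turn in `{% generation %}` … `{% endgeneration %}` in the Jinja chat template, so that `apply_chat_template(..., return_assistant_tokens_mask=True)` can mark exactly those tokens for the loss. A schematic fragment (illustrative role markers, not the actual DeepSeek-V3 template):

```jinja
{%- for message in messages %}
{%- if message['role'] == 'user' %}
{{- '<|user|>' + message['content'] }}
{%- elif message['role'] == 'assistant' %}
{{- '<|assistant|>' }}
{% generation %}{{- message['content'] }}{% endgeneration %}
{%- endif %}
{%- endfor %}
```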
A major cleanup sweep keeps `KTOTrainer` and `DPOTrainer` in lockstep, with the same initialization patterns, config surface, and precompute behavior:
- `precompute_ref_batch_size` to KTO (https://github.com/huggingface/trl/pull/5530)
- `ref_model` initialization (https://github.com/huggingface/trl/pull/5534)
- `None` args (https://github.com/huggingface/trl/pull/5531)
- `generate_during_eval` (https://github.com/huggingface/trl/pull/5551)
- `ref_model` when `precompute_ref_log_probs` is set in DPO/KTO (https://github.com/huggingface/trl/pull/5542)

All by @albertvillanova.
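With the configs aligned, reference log-prob precomputation can be driven the same way in both trainers. A sketch assuming the harmonized parameter names above (check the current docs for exact defaults):

```python
from trl import DPOConfig, KTOConfig

# Precompute reference-model log probs once before training, in batches of 8,
# so the reference model does not need to stay resident during the main loop.
dpo_args = DPOConfig(
    output_dir="dpo-model",
    precompute_ref_log_probs=True,
    precompute_ref_batch_size=8,
)
kto_args = KTOConfig(
    output_dir="kto-model",
    precompute_ref_log_probs=True,
    precompute_ref_batch_size=8,  # mirrored from DPO by the PR above
)
```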
- `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5474
- `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5508
- `supports_tool_calling` falsely accepting templates that drop assistant `tool_calls` by @qgallouedec in https://github.com/huggingface/trl/pull/5517
- `add_response_schema` for VLM processors — the schema was being set on the outer processor instead of the inner tokenizer, so it had no effect. This also collapses a handful of `__init__`/decode-gate workarounds. By @qgallouedec in https://github.com/huggingface/trl/pull/5520
- `use_transformers_paged` in `GRPOConfig` and `RLOOConfig` (and removed entirely from experimental `OnlineDPOConfig`, `GOLDConfig`, `SelfDistillationConfig`). It will be removed from the remaining configs in v2.0.0. In a small A/B benchmark (Qwen3-0.6B GRPO), the paged path is ~20% slower and uses ~6x more peak VRAM than the default; it's also superseded by transformers continuous batching. By @qgallouedec in https://github.com/huggingface/trl/pull/5544
- `chat_templates/README` by @qgallouedec in https://github.com/huggingface/trl/pull/5545

**Full Changelog**: https://github.com/huggingface/trl/compare/v1.1.0...v1.2.0