
v1.2.0

April 17, 2026

Features

New SSDTrainer — Simple Self-Distillation

<img width="778" height="334" alt="Screenshot 2026-04-16 at 9 08 04 PM" src="https://github.com/user-attachments/assets/8ca223f0-6740-48a8-967c-ec10cb262a93" />

A new experimental SSDTrainer implements the method described in Embarrassingly Simple Self-Distillation Improves Code Generation. SSD samples completions from the model itself at a training-time temperature/truncation setting, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. No reward model, verifier, teacher model, or RL: just prompts and the model.

from datasets import Dataset
from trl.experimental.ssd import SSDConfig, SSDTrainer

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "Write a function to add two numbers."}],
        [{"role": "user", "content": "Write a function to check if a number is prime."}],
    ],
})

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(
        output_dir="ssd-model",
        temperature=0.6,      # T_train from the paper
        top_k=20,
        top_p=0.95,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,
)
trainer.train()

by @kashif in https://github.com/huggingface/trl/pull/5505

Drop, don't truncate, overlong tool results in GRPOTrainer

When tool calls produce more tokens than max_completion_length allows, GRPOTrainer now rolls back the tool messages/images added in the current iteration instead of trying to truncate them. This removes ~80 lines of fragile, image-boundary-aware bookkeeping in favor of a ~15-line snapshot-and-rollback. Since overlong samples almost always get rewarded as failures anyway, the learning signal is effectively unchanged — but the code is dramatically simpler and no longer needs per-VLM-family vision-token lookup tables.
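An illustrative sketch of the snapshot-and-rollback idea (not TRL's actual implementation; `count_tokens` is a hypothetical stand-in for tokenizing the conversation): record the list lengths before appending a tool's output, and if the rollout would blow past the token budget, restore them and end the rollout.

```python
def append_tool_results(messages, images, new_messages, new_images,
                        count_tokens, max_completion_length):
    """Append this iteration's tool output, rolling back if the budget is exceeded.

    Returns True if the tool results were kept, False if they were dropped
    (the caller then ends the rollout instead of truncating mid-message).
    """
    # Snapshot: just the current lengths of both lists.
    msg_snapshot, img_snapshot = len(messages), len(images)
    messages.extend(new_messages)
    images.extend(new_images)
    if count_tokens(messages) > max_completion_length:
        # Rollback: drop everything added in this iteration.
        del messages[msg_snapshot:]
        del images[img_snapshot:]
        return False
    return True
```

Because the rollback operates on whole messages, no vision-token lookup tables or image-boundary bookkeeping are needed: an image either survives with its message or is dropped with it.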

by @qgallouedec in https://github.com/huggingface/trl/pull/5521

Expanded tool-calling model support: LLaMA 3.1 / 3.2 & DeepSeek-V3

Continuing the effort from v1.1:

  • LLaMA 3.1 and 3.2 tool-calling response schemas, with dedicated templates for identity matching. Note that these templates only support a single tool call and no content alongside the tool call — limitations inherited from the models' native templates. By @qgallouedec in https://github.com/huggingface/trl/pull/5518
  • DeepSeek-V3 training chat template with {% generation %} markers, enabling assistant-only loss masking for DeepSeek-V3 models. By @RudrenduPaul in https://github.com/huggingface/trl/pull/5527

As a result of tightened detection logic (see fixes below), the list of templates reported as tool-calling capable is now accurate — notably, the basic Llama 3 template is no longer falsely classified as tool-calling capable.
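The `{% generation %}` markers are what let `apply_chat_template` report which tokens belong to assistant turns (via `return_assistant_tokens_mask=True` in transformers). A minimal sketch of how such a mask translates into assistant-only loss labels — positions outside assistant spans get `-100`, the index PyTorch cross-entropy ignores:

```python
IGNORE_INDEX = -100  # ignored by cross-entropy loss

def mask_labels(input_ids, assistant_masks):
    """Build labels so that loss is computed only on assistant tokens.

    `assistant_masks` is the 0/1 per-token mask that apply_chat_template
    can return when the chat template contains {% generation %} markers.
    """
    return [
        tok if is_assistant else IGNORE_INDEX
        for tok, is_assistant in zip(input_ids, assistant_masks)
    ]
```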

KTO/DPO alignment push

A major cleanup sweep keeps KTOTrainer and DPOTrainer in lockstep: same initialization patterns, same config surface, same precompute behavior.

All by @albertvillanova.

Other

Fixes

Deprecations

  • Deprecate use_transformers_paged in GRPOConfig and RLOOConfig (and remove entirely from experimental OnlineDPOConfig, GOLDConfig, SelfDistillationConfig). Will be removed from the remaining configs in v2.0.0. In a small A/B benchmark (Qwen3-0.6B GRPO), the paged path is ~20% slower and uses ~6x more peak VRAM than the default; it's also superseded by transformers continuous batching. By @qgallouedec in https://github.com/huggingface/trl/pull/5544
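Migration is just dropping the flag; a minimal sketch (hypothetical `output_dir` value):

```python
from trl import GRPOConfig

# Before (deprecated in v1.2.0, removed in v2.0.0):
# args = GRPOConfig(output_dir="grpo-model", use_transformers_paged=True)

# After: drop the flag and use the default generation path.
args = GRPOConfig(output_dir="grpo-model")
```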

Documentation and Examples

CI

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v1.1.0...v1.2.0
