TRL v1.3 ships training support for the new Qwen 3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B). Qwen 3.6 reuses the Qwen3_5Moe* architecture but comes with a slightly different chat template (it adds a preserve_thinking flag and tweaks tool-argument stringification), so exact-string template matching needed updates across the stack.
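Since extra keyword arguments to apply_chat_template are forwarded to the Jinja template, the new flag can be toggled directly at render time. A minimal sketch, assuming preserve_thinking behaves like other template flags (the exact semantics come from the upstream template, not from this snippet):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B")

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>2 + 2 = 4</think>4"},
]

# Default rendering vs. keeping prior reasoning blocks. The flag name comes from
# the upstream template; the behavior shown here is an assumption for illustration.
default_text = tokenizer.apply_chat_template(messages, tokenize=False)
thinking_text = tokenizer.apply_chat_template(messages, tokenize=False, preserve_thinking=True)
```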
What landed:
- qwen3_6.jinja (verbatim from upstream) and qwen3_6_training.jinja (prefix-preserving, with {% generation %} markers for assistant_only_loss=True)
- qwen3_5_schema for tool-call parsing (output format unchanged)
- tiny-Qwen3_5MoeForConditionalGeneration-3.6 (with MoE-specific shrinking)
- test_(train|training)_vlm cases

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3.6-27B",
    args=SFTConfig(assistant_only_loss=True),  # works out of the box
    train_dataset=dataset,
)
trainer.train()
```
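The dataset variable above is left undefined; as a rough sketch, assistant_only_loss expects the usual conversational format, i.e. a "messages" column of role/content turns (the example rows below are illustrative):

```python
from datasets import Dataset

# Minimal conversational dataset in the shape SFTTrainer consumes. With
# assistant_only_loss=True, the {% generation %} markers in the training
# template restrict the loss to the assistant turns.
dataset = Dataset.from_list(
    [
        {
            "messages": [
                {"role": "user", "content": "Name a prime number greater than 10."},
                {"role": "assistant", "content": "11 is prime."},
            ]
        },
    ]
)
```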
Tool-calling agent training also works end-to-end via the existing Qwen 3.5 response schema:
```python
from trl import GRPOConfig, GRPOTrainer


def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


trainer = GRPOTrainer(
    model="Qwen/Qwen3.6-27B",
    reward_funcs=my_reward_fn,
    args=GRPOConfig(...),
    train_dataset=dataset,
    tools=[multiply],
)
trainer.train()
```
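The snippet passes my_reward_fn without defining it. A minimal sketch following GRPOTrainer's reward-function convention (completions in, one float per completion out); the scoring rule itself is purely illustrative:

```python
# GRPOTrainer calls each reward function with the generated completions (plus
# dataset columns as keyword arguments) and expects a list of floats back.
def my_reward_fn(completions, **kwargs):
    rewards = []
    for completion in completions:
        # For conversational datasets, each completion is a list of message dicts.
        text = completion[-1]["content"] if isinstance(completion, list) else completion
        rewards.append(1.0 if "42" in text else 0.0)
    return rewards
```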
by @qgallouedec in https://github.com/huggingface/trl/pull/5642
A new experimental TPOTrainer implements Triple Preference Optimization, which augments DPO with a reference (gold) completion alongside chosen/rejected. The paper reports +7-19 points over DPO/SimPO on Arena-Hard, MixEval-Hard, MMLU-Pro and GSM8K, with less data.
```python
from datasets import load_dataset
from trl.experimental.tpo import TPOConfig, TPOTrainer

trainer = TPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=TPOConfig(output_dir="Qwen3-0.6B-TPO"),
    train_dataset=load_dataset("tpo-alignment/triple-preference-ultrafeedback-40K", split="train"),
)
trainer.train()
```
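For intuition, a triple-preference row pairs a prompt with a reference (gold) completion in addition to the usual chosen/rejected pair. The column names and values below are assumptions for illustration, not the exact schema of the dataset used above:

```python
from datasets import Dataset

# Hypothetical row layout for triple-preference data: prompt + reference (gold)
# + chosen + rejected. Column names here are assumed, not taken from the hub dataset.
toy_dataset = Dataset.from_list(
    [
        {
            "prompt": "Explain in one sentence why the sky is blue.",
            "reference": "Air molecules scatter sunlight, and shorter blue wavelengths scatter the most.",
            "chosen": "Blue light scatters more strongly in the atmosphere than other colors.",
            "rejected": "Because the ocean reflects onto the sky.",
        },
    ]
)
```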
by @kashif in https://github.com/huggingface/trl/pull/5506
trl vllm-serve

A new --speculative_config JSON flag exposes vLLM's speculative decoding directly through trl vllm-serve. It works with native MTP heads (Qwen3 Next), Eagle3 drafts, etc., without forking the serve script.
```shell
# Qwen3 native MTP (no extra draft model)
trl vllm-serve --model Qwen/Qwen3-Next-80B-A3B-Instruct \
    --speculative_config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 5}'

# Eagle3 draft model
trl vllm-serve --model Qwen/Qwen3-32B \
    --speculative_config '{"model": "RedHatAI/Qwen3-32B-speculator.eagle3", "method": "eagle3", "num_speculative_tokens": 3}'
```
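Nothing changes on the trainer side: speculative decoding is configured entirely on the server, and the trainer connects to it as usual. A minimal sketch, assuming GRPOConfig's existing server-mode vLLM options:

```python
from trl import GRPOConfig

# Sketch only: point the trainer at the standalone `trl vllm-serve` process.
# Host/port below assume the server defaults; adjust to wherever it is running.
args = GRPOConfig(
    use_vllm=True,
    vllm_mode="server",
    vllm_server_host="0.0.0.0",
    vllm_server_port=8000,
)
```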
by @Ofir408 in https://github.com/huggingface/trl/pull/5605
Twelve more alignment PRs landed this cycle, bringing KTOTrainer and DPOTrainer essentially into structural parity. Notable shifts include moving completion assembly out of _prepare_dataset into a new DataCollatorForKTO, inlining the two-pass tokenization into a single pass, removing BOS/EOS handling, and supporting IterableDataset and dict eval_dataset (see the usage sketch after the PR list below). The goal of promoting KTO out of experimental and into stable is now within reach for an upcoming release.
PRs (all by @albertvillanova): #5582, #5578, #5579, #5583, #5587, #5599, #5601, #5600, #5606, #5612, #5632, #5635
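For users, nothing changes functionally. A minimal usage sketch on KTO's unpaired prompt/completion/label format, exercising the newly supported dict eval_dataset; the model id, dataset rows, and split name are illustrative:

```python
from datasets import Dataset
from trl import KTOConfig, KTOTrainer

# KTO uses unpaired preference data: each row is a prompt, a completion, and a
# boolean label marking the completion as desirable or not.
train_dataset = Dataset.from_list(
    [
        {"prompt": "What is 3 * 7?", "completion": "21", "label": True},
        {"prompt": "What is 3 * 7?", "completion": "37", "label": False},
    ]
)

trainer = KTOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=KTOConfig(output_dir="Qwen3-0.6B-KTO"),
    train_dataset=train_dataset,
    # dict eval_dataset (one entry per split) is among the newly supported inputs
    eval_dataset={"toy": train_dataset},
)
trainer.train()
```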
{% generation %} training chat templates

Three more model families gain training-compatible chat templates with {% generation %} markers, so assistant_only_loss=True works out of the box for them.
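A rough illustration of what the markers buy you: when a template wraps assistant turns in {% generation %} ... {% endgeneration %}, the tokenizer can return a per-token mask over exactly those spans, which is what assistant-only loss masking keys off. The model id below is a placeholder; substitute one whose (training) chat template actually contains the markers:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/model-with-generation-markers")

messages = [
    {"role": "user", "content": "Say hi."},
    {"role": "assistant", "content": "Hi!"},
]

encoded = tokenizer.apply_chat_template(
    messages,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
print(encoded["assistant_masks"])  # 1 inside {% generation %} spans, 0 elsewhere
```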
Other changes

- maybe_apply_chat_template by @albertvillanova in https://github.com/huggingface/trl/pull/5567
- is_chat_template_prefix_preserving by @qgallouedec in https://github.com/huggingface/trl/pull/5558
- forward_masked_logits by @qgallouedec in https://github.com/huggingface/trl/pull/5626
- _tokenizer as trainer attribute by @albertvillanova in https://github.com/huggingface/trl/pull/5489
- PreTrainedTokenizerBase for tokenizer type hints by @qgallouedec in https://github.com/huggingface/trl/pull/5629
- async_reward_X to async_X by @qgallouedec in https://github.com/huggingface/trl/pull/5616
- Fixed entropy logging, which used the wrong mask (attention_mask instead of label != -100) and wrong cross-rank aggregation (unweighted mean instead of sum/count). The reported entropy under completion_only_loss=True and sequence parallelism is now correct; the same fix was applied to DPO entropy logging. By @qgallouedec in https://github.com/huggingface/trl/pull/5620
- AsyncGRPOTrainer's processing_class to AsyncRolloutWorker by @xuanduy04 in https://github.com/huggingface/trl/pull/5538
- generate_tiny_models for gpt-oss by @albertvillanova in https://github.com/huggingface/trl/pull/5622
- TestSupportsToolCalling for improved coverage by @qgallouedec in https://github.com/huggingface/trl/pull/5537
- {% generation %} markers for training chat template by @casinca in https://github.com/huggingface/trl/pull/5519

Full Changelog: https://github.com/huggingface/trl/compare/v1.2.0...v1.3.0