TRL v1.3 ships training support for the new Qwen 3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B). Qwen 3.6 reuses the Qwen3_5Moe* architecture but comes with a slightly different chat template (it adds a preserve_thinking flag and tweaks tool-argument stringification), so exact-string template matching needed updates across the stack.
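Since extra keyword arguments to apply_chat_template are forwarded to the Jinja template, the new flag can be toggled directly at render time. A minimal sketch, assuming preserve_thinking behaves like other template flags (the exact semantics come from the upstream template, not from this snippet):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-27B")

messages = [
    {"role": "user", "content": "What is 2 + 2?"},
    {"role": "assistant", "content": "<think>2 + 2 = 4</think>4"},
]

# Default rendering vs. keeping prior reasoning blocks. The flag name comes from
# the upstream template; the behavior shown here is an assumption for illustration.
default_text = tokenizer.apply_chat_template(messages, tokenize=False)
thinking_text = tokenizer.apply_chat_template(messages, tokenize=False, preserve_thinking=True)
```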
What landed:
- qwen3_6.jinja (verbatim from upstream) and qwen3_6_training.jinja (prefix-preserving, with {% generation %} markers for assistant_only_loss=True)
- qwen3_5_schema for tool-call parsing (output format unchanged)
- tiny-Qwen3_5MoeForConditionalGeneration-3.6 (with MoE-specific shrinking)
- test_(train|training)_vlm cases

```python
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3.6-27B",
    args=SFTConfig(assistant_only_loss=True),  # works out of the box
    train_dataset=dataset,
)
trainer.train()
```
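The dataset variable above is left undefined; as a rough sketch, assistant_only_loss expects the usual conversational format, i.e. a "messages" column of role/content turns (the example rows below are illustrative):

```python
from datasets import Dataset

# Minimal conversational dataset in the shape SFTTrainer consumes. With
# assistant_only_loss=True, the {% generation %} markers in the training
# template restrict the loss to the assistant turns.
dataset = Dataset.from_list(
    [
        {
            "messages": [
                {"role": "user", "content": "Name a prime number greater than 10."},
                {"role": "assistant", "content": "11 is prime."},
            ]
        },
    ]
)
```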
Tool-calling agent training also works end-to-end via the existing Qwen 3.5 response schema:
```python
from trl import GRPOConfig, GRPOTrainer


def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


trainer = GRPOTrainer(
    model="Qwen/Qwen3.6-27B",
    reward_funcs=my_reward_fn,
    args=GRPOConfig(...),
    train_dataset=dataset,
    tools=[multiply],
)
trainer.train()
```
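The snippet passes my_reward_fn without defining it. A minimal sketch following GRPOTrainer's reward-function convention (completions in, one float per completion out); the scoring rule itself is purely illustrative:

```python
# GRPOTrainer calls each reward function with the generated completions (plus
# dataset columns as keyword arguments) and expects a list of floats back.
def my_reward_fn(completions, **kwargs):
    rewards = []
    for completion in completions:
        # For conversational datasets, each completion is a list of message dicts.
        text = completion[-1]["content"] if isinstance(completion, list) else completion
        rewards.append(1.0 if "42" in text else 0.0)
    return rewards
```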
by @qgallouedec in https://github.com/huggingface/trl/pull/5642
A new experimental TPOTrainer implements Triple Preference Optimization, which augments DPO with a reference (gold) completion alongside chosen/rejected. The paper reports +7-19 points over DPO/SimPO on Arena-Hard, MixEval-Hard, MMLU-Pro and GSM8K, with less data.
```python
from datasets import load_dataset
from trl.experimental.tpo import TPOConfig, TPOTrainer

trainer = TPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=TPOConfig(output_dir="Qwen3-0.6B-TPO"),
    train_dataset=load_dataset("tpo-alignment/triple-preference-ultrafeedback-40K", split="train"),
)
trainer.train()
```
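For intuition, a triple-preference row pairs a prompt with a reference (gold) completion in addition to the usual chosen/rejected pair. The column names and values below are assumptions for illustration, not the exact schema of the dataset used above:

```python
from datasets import Dataset

# Hypothetical row layout for triple-preference data: prompt + reference (gold)
# + chosen + rejected. Column names here are assumed, not taken from the hub dataset.
toy_dataset = Dataset.from_list(
    [
        {
            "prompt": "Explain in one sentence why the sky is blue.",
            "reference": "Air molecules scatter sunlight, and shorter blue wavelengths scatter the most.",
            "chosen": "Blue light scatters more strongly in the atmosphere than other colors.",
            "rejected": "Because the ocean reflects onto the sky.",
        },
    ]
)
```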
by @kashif in https://github.com/huggingface/trl/pull/5506
trl vllm-serve

A new --speculative_config JSON flag exposes vLLM's speculative decoding directly through trl vllm-serve. It works with native MTP heads (Qwen3 Next), Eagle3 drafts, etc., without forking the serve script.
```shell
# Qwen3 native MTP (no extra draft model)
trl vllm-serve --model Qwen/Qwen3-Next-80B-A3B-Instruct \
    --speculative_config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 5}'

# Eagle3 draft model
trl vllm-serve --model Qwen/Qwen3-32B \
    --speculative_config '{"model": "RedHatAI/Qwen3-32B-speculator.eagle3", "method": "eagle3", "num_speculative_tokens": 3}'
```
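Nothing changes on the trainer side: speculative decoding is configured entirely on the server, and the trainer connects to it as usual. A minimal sketch, assuming GRPOConfig's existing server-mode vLLM options:

```python
from trl import GRPOConfig

# Sketch only: point the trainer at the standalone `trl vllm-serve` process.
# Host/port below assume the server defaults; adjust to wherever it is running.
args = GRPOConfig(
    use_vllm=True,
    vllm_mode="server",
    vllm_server_host="0.0.0.0",
    vllm_server_port=8000,
)
```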
by @Ofir408 in https://github.com/huggingface/trl/pull/5605
Twelve more alignment PRs landed this cycle, bringing KTOTrainer and DPOTrainer essentially into structural parity. Notable shifts include moving completion assembly out of _prepare_dataset into a new DataCollatorForKTO, inlining the two-pass tokenization into a single pass, removing BOS/EOS handling, and supporting IterableDataset and dict eval_dataset (see the usage sketch after the PR list below). The goal of promoting KTO out of experimental and into stable is now within reach for an upcoming release.
PRs (all by @albertvillanova): #5582, #5578, #5579, #5583, #5587, #5599, #5601, #5600, #5606, #5612, #5632, #5635
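For users, nothing changes functionally. A minimal usage sketch on KTO's unpaired prompt/completion/label format, exercising the newly supported dict eval_dataset; the model id, dataset rows, and split name are illustrative:

```python
from datasets import Dataset
from trl import KTOConfig, KTOTrainer

# KTO uses unpaired preference data: each row is a prompt, a completion, and a
# boolean label marking the completion as desirable or not.
train_dataset = Dataset.from_list(
    [
        {"prompt": "What is 3 * 7?", "completion": "21", "label": True},
        {"prompt": "What is 3 * 7?", "completion": "37", "label": False},
    ]
)

trainer = KTOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=KTOConfig(output_dir="Qwen3-0.6B-KTO"),
    train_dataset=train_dataset,
    # dict eval_dataset (one entry per split) is among the newly supported inputs
    eval_dataset={"toy": train_dataset},
)
trainer.train()
```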
{% generation %} training chat templates

Three more model families gain training-compatible chat templates with {% generation %} markers, so assistant_only_loss=True works out of the box for them.
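A rough illustration of what the markers buy you: when a template wraps assistant turns in {% generation %} ... {% endgeneration %}, the tokenizer can return a per-token mask over exactly those spans, which is what assistant-only loss masking keys off. The model id below is a placeholder; substitute one whose (training) chat template actually contains the markers:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("some-org/model-with-generation-markers")

messages = [
    {"role": "user", "content": "Say hi."},
    {"role": "assistant", "content": "Hi!"},
]

encoded = tokenizer.apply_chat_template(
    messages,
    return_dict=True,
    return_assistant_tokens_mask=True,
)
print(encoded["assistant_masks"])  # 1 inside {% generation %} spans, 0 elsewhere
```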
Other changes

- maybe_apply_chat_template by @albertvillanova in https://github.com/huggingface/trl/pull/5567
- is_chat_template_prefix_preserving by @qgallouedec in https://github.com/huggingface/trl/pull/5558
- forward_masked_logits by @qgallouedec in https://github.com/huggingface/trl/pull/5626
- _tokenizer as trainer attribute by @albertvillanova in https://github.com/huggingface/trl/pull/5489
- PreTrainedTokenizerBase for tokenizer type hints by @qgallouedec in https://github.com/huggingface/trl/pull/5629
- async_reward_X to async_X by @qgallouedec in https://github.com/huggingface/trl/pull/5616
- Fixed entropy logging, which used the wrong mask (attention_mask instead of label != -100) and wrong cross-rank aggregation (unweighted mean instead of sum/count). The reported entropy under completion_only_loss=True and sequence parallelism is now correct; the same fix was applied to DPO entropy logging. By @qgallouedec in https://github.com/huggingface/trl/pull/5620
- AsyncGRPOTrainer's processing_class to AsyncRolloutWorker by @xuanduy04 in https://github.com/huggingface/trl/pull/5538
- generate_tiny_models for gpt-oss by @albertvillanova in https://github.com/huggingface/trl/pull/5622
- TestSupportsToolCalling for improved coverage by @qgallouedec in https://github.com/huggingface/trl/pull/5537
- {% generation %} markers for training chat template by @casinca in https://github.com/huggingface/trl/pull/5519

Full Changelog: https://github.com/huggingface/trl/compare/v1.2.0...v1.3.0