v1.3.0
Features
Qwen 3.6 integration
<img width="1536" height="1024" alt="ChatGPT Image Apr 26, 2026 at 11_16_18 AM" src="https://github.com/user-attachments/assets/789aad15-03b2-4ece-9828-d5c1dfed1f1e" />TRL v1.3 ships training support for the new Qwen 3.6 family (Qwen/Qwen3.6-27B, Qwen/Qwen3.6-35B-A3B). Qwen 3.6 reuses the Qwen3_5Moe* architecture but ships a slightly different chat template (adds a preserve_thinking flag, tweaks tool-arg stringification), so exact-string template matching needed updates across the stack.
What landed:
- Chat templates:
qwen3_6.jinja(verbatim from upstream) andqwen3_6_training.jinja(prefix-preserving +{% generation %}markers forassistant_only_loss=True) - Response schema: routes to the existing
qwen3_5_schemafor tool-call parsing — output format unchanged - Tiny test models for VLM training:
tiny-Qwen3_5MoeForConditionalGeneration-3.6(with MoE-specific shrinking) - Test matrix updated across SFT/DPO/GRPO/RLOO
test_(train|training)_vlmcases
from trl import SFTConfig, SFTTrainer
trainer = SFTTrainer(
model="Qwen/Qwen3.6-27B",
args=SFTConfig(assistant_only_loss=True), # works out of the box
train_dataset=dataset,
)
trainer.train()
Tool-calling agent training also works end-to-end via the existing Qwen 3.5 response schema:
from trl import GRPOConfig, GRPOTrainer
def multiply(a: int, b: int) -> int:
"""
Multiplies two integers.
Args:
a: The first integer.
b: The second integer.
Returns:
The product of the two integers.
"""
return a * b
trainer = GRPOTrainer(
model="Qwen/Qwen3.6-27B",
reward_funcs=my_reward_fn,
args=GRPOConfig(...),
train_dataset=dataset,
tools=[multiply],
)
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/5642
New experimental TPO trainer
<img width="711" height="177" alt="Screenshot 2026-04-26 at 11 37 28 AM" src="https://github.com/user-attachments/assets/6090212e-5c95-45c1-b137-87333d91daa6" />A new experimental TPOTrainer implements Triple Preference Optimization, which augments DPO with a reference (gold) completion alongside chosen/rejected. The paper reports +7-19 points over DPO/SimPO on Arena-Hard, MixEval-Hard, MMLU-Pro and GSM8K, with less data.
from trl.experimental.tpo import TPOConfig, TPOTrainer
trainer = TPOTrainer(
model="Qwen/Qwen3-0.6B",
args=TPOConfig(output_dir="Qwen3-0.6B-TPO"),
train_dataset=load_dataset("tpo-alignment/triple-preference-ultrafeedback-40K", split="train"),
)
trainer.train()
by @kashif in https://github.com/huggingface/trl/pull/5506
Speculative decoding in trl vllm-serve
A new --speculative_config JSON flag exposes vLLM's speculative decoding directly through trl vllm-serve — works with native MTP heads (Qwen3 Next), Eagle3 drafts, etc. — without forking the serve script.
# Qwen3 native MTP (no extra draft model)
trl vllm-serve --model Qwen/Qwen3-Next-80B-A3B-Instruct \
--speculative_config '{"method": "qwen3_next_mtp", "num_speculative_tokens": 5}'
# Eagle3 draft model
trl vllm-serve --model Qwen/Qwen3-32B \
--speculative_config '{"model": "RedHatAI/Qwen3-32B-speculator.eagle3", "method": "eagle3", "num_speculative_tokens": 3}'
by @Ofir408 in https://github.com/huggingface/trl/pull/5605
KTO ↔ DPO alignment: nearing the finish line
Twelve more alignment PRs this cycle, bringing KTOTrainer and DPOTrainer essentially into structural parity. Notable shifts include moving completion assembly out of _prepare_dataset into a new DataCollatorForKTO, inlining the two-pass tokenization into a single pass, removing BOS/EOS handling, and supporting IterableDataset and dict eval_dataset. The goal — promoting KTO out of experimental and into stable — is now within reach for an upcoming release.
PRs (all by @albertvillanova): #5582, #5578, #5579, #5583, #5587, #5599, #5601, #5600, #5606, #5612, #5632, #5635
More {% generation %} training chat templates
Three more model families gain training-compatible chat templates with {% generation %} markers, so assistant_only_loss=True works out of the box:
- Gemma / Gemma 2 by @ps-abhi in https://github.com/huggingface/trl/pull/5523
- Phi-3 by @RudrenduPaul in https://github.com/huggingface/trl/pull/5526
- GLM-4-MoE by @casinca in https://github.com/huggingface/trl/pull/5519
Other
- Support processor in
maybe_apply_chat_templateby @albertvillanova in https://github.com/huggingface/trl/pull/5567 - Support VLM processors in
is_chat_template_prefix_preservingby @qgallouedec in https://github.com/huggingface/trl/pull/5558 - Check prefix preservation at the token level (not string level) by @qgallouedec in https://github.com/huggingface/trl/pull/5559
- Drop vLLM 0.11 support by @qgallouedec in https://github.com/huggingface/trl/pull/5549
- Remove
forward_masked_logitsby @qgallouedec in https://github.com/huggingface/trl/pull/5626 - Remove dead token attributes from experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5565
- Set
_tokenizeras trainer attribute by @albertvillanova in https://github.com/huggingface/trl/pull/5489 - Use
PreTrainedTokenizerBasefor tokenizer type hints by @qgallouedec in https://github.com/huggingface/trl/pull/5629 - Renaming of internal variables:
async_reward_Xtoasync_Xby @qgallouedec in https://github.com/huggingface/trl/pull/5616
Fixes
- Fix entropy calculation in SFT — three bugs at once: misaligned by one position (next-token shift), averaged over the wrong tokens (used
attention_maskinstead oflabel != -100), and wrong cross-rank aggregation (unweighted mean instead of sum/count). The reported entropy undercompletion_only_loss=Trueand sequence parallelism is now correct. Same fix applied to DPO entropy logging. By @qgallouedec in https://github.com/huggingface/trl/pull/5620 - Pass
AsyncGRPOTrainer'sprocessing_classtoAsyncRolloutWorkerby @xuanduy04 in https://github.com/huggingface/trl/pull/5538 - Fix
generate_tiny_modelsfor gpt-oss by @albertvillanova in https://github.com/huggingface/trl/pull/5622 - Fix docstring style in vllm-serve script by @albertvillanova in https://github.com/huggingface/trl/pull/5628
- Replace wrong comment about chat template with EOS by @albertvillanova in https://github.com/huggingface/trl/pull/5607
Documentation and Examples
- Add chat templates page to web docs by @sergiopaniego in https://github.com/huggingface/trl/pull/5581
- Update AsyncGRPO example with GSM8K and tested hyperparameters by @sergiopaniego in https://github.com/huggingface/trl/pull/5580
- Update RapidFire AI integration with FSDP and multi-backend tracking by @kamran-rapidfireAI in https://github.com/huggingface/trl/pull/5618
CI
- Add doc-builder style check to pre-commit and CI by @albertvillanova in https://github.com/huggingface/trl/pull/5630
- Align and update doc-builder commit hash in CI GitHub Actions by @albertvillanova in https://github.com/huggingface/trl/pull/5631
- Hotfix CI: Add ruff dependency to doc-builder style check by @albertvillanova in https://github.com/huggingface/trl/pull/5634
- Fix CI with dev dependencies for Llava models by @albertvillanova in https://github.com/huggingface/trl/pull/5499
- Add additional model parameters to
TestSupportsToolCallingfor improved coverage by @qgallouedec in https://github.com/huggingface/trl/pull/5537 - Differentiate Phi-3 and Phi-3.5 in tests by @qgallouedec in https://github.com/huggingface/trl/pull/5546
New Contributors
- @Ofir408 made their first contribution in https://github.com/huggingface/trl/pull/5605
- @ps-abhi made their first contribution in https://github.com/huggingface/trl/pull/5523
What's Changed
- ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/5577
- Support processor in maybe_apply_chat_template by @albertvillanova in https://github.com/huggingface/trl/pull/5567
- Remove dead token attributes from experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5565
- Support VLM processors in
is_chat_template_prefix_preservingby @qgallouedec in https://github.com/huggingface/trl/pull/5558 - Align KTO with DPO: Align add_model_tags by @albertvillanova in https://github.com/huggingface/trl/pull/5582
- Align KTO with DPO: Align processing_class initialization by @albertvillanova in https://github.com/huggingface/trl/pull/5578
- Align KTO with DPO: Align _prepare_dataset by @albertvillanova in https://github.com/huggingface/trl/pull/5579
- Align KTO with DPO: Align ref_model preparation for distributed training by @albertvillanova in https://github.com/huggingface/trl/pull/5583
- Align KTO with DPO: Make conditional prompt extraction and unpairing in _prepare_dataset by @albertvillanova in https://github.com/huggingface/trl/pull/5587
- Update AsyncGRPO example with GSM8K and tested hyperparameters by @sergiopaniego in https://github.com/huggingface/trl/pull/5580
- [docs] Add chat templates page to web docs by @sergiopaniego in https://github.com/huggingface/trl/pull/5581
- Add additional model parameters to
TestSupportsToolCallingfor improved coverage by @qgallouedec in https://github.com/huggingface/trl/pull/5537 - Fix CI with dev dependencies for Llava models by @albertvillanova in https://github.com/huggingface/trl/pull/5499
- Differentiate Phi-3 and Phi-3.5 in tests by @qgallouedec in https://github.com/huggingface/trl/pull/5546
- Set _tokenizer as trainer attribute by @albertvillanova in https://github.com/huggingface/trl/pull/5489
- Align KTO with DPO: Support dict eval_dataset by @albertvillanova in https://github.com/huggingface/trl/pull/5599
- Align KTO with DPO: Align tokenization by @albertvillanova in https://github.com/huggingface/trl/pull/5601
- Check prefix preservation at the token level by @qgallouedec in https://github.com/huggingface/trl/pull/5559
- Replace wrong comment about chat template with EOS by @albertvillanova in https://github.com/huggingface/trl/pull/5607
- Align KTO with DPO: Support IterableDataset by @albertvillanova in https://github.com/huggingface/trl/pull/5600
- Drop vLLM 0.11 support by @qgallouedec in https://github.com/huggingface/trl/pull/5549
- Align KTO with DPO: Remove maybe_apply_chat_template by @albertvillanova in https://github.com/huggingface/trl/pull/5606
- [TPO] experimental TPO trainer by @kashif in https://github.com/huggingface/trl/pull/5506
- fix: Pass AsyncGRPOTrainer's processing_class to AsyncRolloutWorker by @xuanduy04 in https://github.com/huggingface/trl/pull/5538
- docs: update RapidFire AI integration with FSDP and multi-backend tracking by @kamran-rapidfireAI in https://github.com/huggingface/trl/pull/5618
- Fix generate_tiny_models for gpt-oss by @albertvillanova in https://github.com/huggingface/trl/pull/5622
- Added speculative_config to vllm-serve by @Ofir408 in https://github.com/huggingface/trl/pull/5605
- feat(glm-4-moe): Add
{% generation %}markers for training chat template by @casinca in https://github.com/huggingface/trl/pull/5519 - Fix docstring style in vllm-serve script by @albertvillanova in https://github.com/huggingface/trl/pull/5628
- feat: add Gemma/Gemma2 training chat templates with generation markers by @ps-abhi in https://github.com/huggingface/trl/pull/5523
- Align KTO with DPO: Inline tokenization, new output format, DataCollatorForKTO by @albertvillanova in https://github.com/huggingface/trl/pull/5612
- feat: add Phi-3 training chat template with generation markers by @RudrenduPaul in https://github.com/huggingface/trl/pull/5526
- Remove
forward_masked_logitsby @qgallouedec in https://github.com/huggingface/trl/pull/5626 - Use
PreTrainedTokenizerBasefor tokenizer type hints by @qgallouedec in https://github.com/huggingface/trl/pull/5629 - Add doc-builder style check to pre-commit and CI by @albertvillanova in https://github.com/huggingface/trl/pull/5630
- Align and update doc-builder commit hash in CI GitHub Actions by @albertvillanova in https://github.com/huggingface/trl/pull/5631
- Align KTO with DPO: Move completion assembly from _prepare_dataset to data collator by @albertvillanova in https://github.com/huggingface/trl/pull/5632
- Hotfix CI: Add ruff dependency to doc-builder style check by @albertvillanova in https://github.com/huggingface/trl/pull/5634
- Fix entropy calculation in SFT by @qgallouedec in https://github.com/huggingface/trl/pull/5620
- Renaming of internal variables:
async_reward_Xtoasync_Xby @qgallouedec in https://github.com/huggingface/trl/pull/5616 - Align KTO with DPO: Remove BOS/EOS handling by @albertvillanova in https://github.com/huggingface/trl/pull/5635
- Qwen3.6 integration by @qgallouedec in https://github.com/huggingface/trl/pull/5642
- Release: v1.3 by @qgallouedec in https://github.com/huggingface/trl/pull/5647
New Contributors
- @Ofir408 made their first contribution in https://github.com/huggingface/trl/pull/5605
- @ps-abhi made their first contribution in https://github.com/huggingface/trl/pull/5523
Full Changelog: https://github.com/huggingface/trl/compare/v1.2.0...v1.3.0
Fetched April 26, 2026
