This release introduces a new experimental A2POTrainer for optimal advantage regression and adds vision-language model support to the KTO trainer. The AsyncRolloutWorker now runs in a separate process to avoid GIL contention and potential NCCL watchdog timeouts, and the release also fixes aiohttp retries and the handling of all-NaN reward columns. The Gold distillation trainer now aligns tokens via byte offsets, and SDFT/SDPO use the vLLM server for live teacher logprobs. Other features include bidirectional masked importance sampling for IcePop, support for NemotronH and Nemotron 3 Ultra, additional training chat templates, and decoupled self-distillation trainers.
TRL
Trainer telemetry is now gated on an explicit class-name allowlist, restricting which trainer classes can send telemetry.
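The allowlist gating described above can be pictured with a small sketch. Everything here is an assumption for illustration: the names `TELEMETRY_ALLOWLIST` and `telemetry_allowed`, the listed class names, and the choice to match on the exact class name are hypothetical and not taken from the TRL source.

```python
# Hypothetical allowlist; the real set of class names is not given in these notes.
TELEMETRY_ALLOWLIST = frozenset({"SFTTrainer", "DPOTrainer", "GRPOTrainer"})


def telemetry_allowed(trainer: object) -> bool:
    """Return True only if the trainer's exact class name is on the allowlist."""
    return type(trainer).__name__ in TELEMETRY_ALLOWLIST


# Stand-in classes for the demonstration (not the real TRL trainers).
class SFTTrainer:
    pass


class MyCustomTrainer(SFTTrainer):
    pass


print(telemetry_allowed(SFTTrainer()))       # allowlisted name
print(telemetry_allowed(MyCustomTrainer()))  # subclass with a different name
```

Under this assumed design, a user-defined subclass does not inherit telemetry from its parent, because the check looks at the class name rather than the class hierarchy.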
Fixed an exponential backtracking bug in Qwen3/Qwen3.5/GLM4MoE response parsing that caused GRPOTrainer to hang indefinitely on truncated tool-call blocks, reducing worst-case complexity from O(2ⁿ) to O(n). Also fixed a CUDA memory leak in BNB dequantization buffers and stale state in OffloadActivations. Added training chat templates for Phi-3.5, Qwen3-VL, and Qwen3.5 Think/NoThink, and final logits softcapping support for AsyncGRPOTrainer on models like Gemma 2.
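To illustrate why a truncated tool-call block does not have to be expensive to handle, here is a minimal sketch of a forward-only scan. It is not the parser used in TRL: the `<tool_call>` and `</tool_call>` delimiters and the function name `split_tool_calls` are assumptions made for the example.

```python
from typing import List, Optional, Tuple

OPEN_TAG = "<tool_call>"    # assumed delimiter, for illustration only
CLOSE_TAG = "</tool_call>"  # assumed delimiter, for illustration only


def split_tool_calls(text: str) -> Tuple[List[str], Optional[str]]:
    """Return (complete tool-call bodies, truncated tail or None).

    The scan position only moves forward, so a missing close tag ends the
    scan instead of triggering repeated re-matching of the same text.
    """
    calls: List[str] = []
    pos = 0
    while True:
        start = text.find(OPEN_TAG, pos)
        if start == -1:
            return calls, None
        body_start = start + len(OPEN_TAG)
        end = text.find(CLOSE_TAG, body_start)
        if end == -1:
            # Truncated block: report the tail once and stop.
            return calls, text[body_start:]
        calls.append(text[body_start:end].strip())
        pos = end + len(CLOSE_TAG)


completion = 'a <tool_call>{"name": "f"}</tool_call> b <tool_call>{"name": "g"'
complete, tail = split_tool_calls(completion)
print(complete)  # bodies of the closed blocks
print(tail)      # remainder of the unclosed block
```

A scan like this costs time roughly proportional to the length of the completion, whether or not the last block is closed.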
A new loss_type="chunked_nll" option for SFT drastically reduces peak activation memory by computing cross-entropy over tokens in checkpointed chunks instead of materializing the full [batch × seq × vocab] logits tensor, unlocking sequence lengths that previously caused out-of-memory errors. Also added OpenReward Standard environment adapter support, length-normalized DPO sigmoid loss, training chat templates for Cohere, Cohere2, Gemma 3, Qwen3, and Qwen2.5, and a training-invariance test suite to catch numerical drift across trainer configurations.
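The idea behind the chunked loss can be sketched in plain Python. This is a toy illustration under stated assumptions, not the TRL implementation: `project` is a hypothetical stand-in for the language-model head, the data are made up, and the gradient-checkpointing part is not shown. The point is only that logits are produced and consumed one chunk at a time, yet the mean loss matches the version that materializes all logits.

```python
import math
from typing import Callable, List, Sequence


def token_nll(logits: Sequence[float], target: int) -> float:
    """Negative log-likelihood of `target` under softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def chunked_nll(
    hidden: List[List[float]],
    targets: List[int],
    project: Callable[[List[float]], List[float]],
    chunk_size: int,
) -> float:
    """Mean NLL, holding logits for at most `chunk_size` tokens at a time."""
    total = 0.0
    for start in range(0, len(hidden), chunk_size):
        stop = start + chunk_size
        # Only this chunk's logits exist in memory at this point.
        chunk_logits = [project(h) for h in hidden[start:stop]]
        total += sum(
            token_nll(row, t) for row, t in zip(chunk_logits, targets[start:stop])
        )
    return total / len(hidden)


# Toy "LM head": a 3-word vocabulary over 2-dimensional hidden states.
W = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5]]


def project(h: List[float]) -> List[float]:
    return [sum(w * x for w, x in zip(row, h)) for row in W]


hidden = [[1.0, 2.0], [0.5, -1.0], [2.0, 0.0], [0.0, 1.0], [1.5, 1.5]]
targets = [0, 2, 1, 2, 0]

full = sum(token_nll(project(h), t) for h, t in zip(hidden, targets)) / len(hidden)
print(abs(chunked_nll(hidden, targets, project, chunk_size=2) - full) < 1e-9)
```

In a real trainer the saving comes from never holding the full logits tensor for the backward pass, which this sketch does not model.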
Features
Qwen 3.6 integration
Features
New SSDTrainer — Simple Self-Distillation
Features
DistillationTrainer for efficient on-policy distillation
Read the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-d...
Read our...
Features
Variational Sequence-Level Soft Policy Optimization (VESPO)
What's Changed
- Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in https://github.com/huggingface/trl/pull/5178...
Features
Add environment_factory to GRPOTrainer
GRPOTrainer now accepts an environment_factory argument, allowing users to specif...
Features
- [GRPOTrainer]: Agent Training Supports Async Tool Calls by @pramodith in https://github.com/huggingface/trl/pull/4742
- Add retry stra...
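As a rough picture of what asynchronous tool calls involve, the sketch below runs a mix of synchronous and coroutine tools concurrently with `asyncio.gather`. It is a generic illustration, not TRL's code: `run_tool_calls`, the `(name, kwargs)` call format, and the example tools are all hypothetical.

```python
import asyncio
import inspect
from typing import Any, Callable, Dict, List, Tuple


async def run_tool_calls(
    tools: Dict[str, Callable[..., Any]],
    calls: List[Tuple[str, Dict[str, Any]]],
) -> List[Any]:
    """Run all requested tool calls and return results in request order."""

    async def run_one(name: str, kwargs: Dict[str, Any]) -> Any:
        result = tools[name](**kwargs)
        # Coroutine tools return an awaitable; plain functions return a value.
        if inspect.isawaitable(result):
            result = await result
        return result

    return await asyncio.gather(*(run_one(name, kwargs) for name, kwargs in calls))


# Hypothetical example tools.
async def fetch(x: int) -> int:
    await asyncio.sleep(0)  # stands in for real I/O
    return x * 2


def add(a: int, b: int) -> int:
    return a + b


results = asyncio.run(
    run_tool_calls(
        {"fetch": fetch, "add": add},
        [("fetch", {"x": 2}), ("add", {"a": 1, "b": 2})],
    )
)
print(results)
```

In this sketch synchronous tools still run on the event loop thread, so only the coroutine tools actually overlap their waiting time.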
What's Changed
- Remove access to warnings_issued by @qgallouedec in #4960
- Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_...
What's Changed
- Fix: undefined current_gradient_accumulation_steps by @qgallouedec in https://github.com/huggingface/trl/pull/4852
- fix(Dee...
Features
- Add vllm_group_port argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in https://github.com/huggingface/trl/pull...
What's Changed
- Overwrite model default generation config used by model.generate by @albertvillanova in https://github.com/huggingface/trl/pull...
What's Changed
- Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in https://github.com/huggingface/trl...
Features
🕵️♂️ GRPO: Agent training
GRPOTrainer now supports training agents using tools. This allows language models to interact with...
What's Changed
- Replace accelerate logging with stdlib in CLI by @lewtun in https://github.com/huggingface/trl/pull/4512
- Add temporary worka...
Features
- 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in https://github.com/huggingface/trl/pull/4296 *...

