TRL

Releases: 10 · Avg: 3/mo · Versions: v0.29.1 to v1.6.0
v1.6.0

This release introduces a new experimental A2POTrainer for optimal advantage regression and adds vision-language model support to the KTO trainer. The AsyncRolloutWorker now runs in a separate process to avoid GIL contention and potential NCCL watchdog timeouts, and the release fixes aiohttp retries and the handling of all-NaN reward columns. The Gold distillation trainer now aligns tokens via byte offsets, and SDFT/SDPO use the vLLM server for live teacher logprobs. Other additions include bidirectional masked importance sampling for IcePop, support for NemotronH and Nemotron 3 Ultra, additional training chat templates, and decoupled self-distillation trainers.
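To make the byte-offset alignment idea concrete, here is a minimal sketch, not the Gold trainer's actual code. It assumes two tokenizations of the same text are available as lists of token strings; the names `byte_ends` and `align_by_bytes` are hypothetical. Tokens are grouped wherever the cumulative UTF-8 byte offsets of both sequences coincide.

```python
def byte_ends(tokens):
    """Cumulative UTF-8 byte offset at the end of each token."""
    ends, total = [], 0
    for tok in tokens:
        total += len(tok.encode("utf-8"))
        ends.append(total)
    return ends


def align_by_bytes(student_tokens, teacher_tokens):
    """Group token indices into spans that cover identical byte ranges.

    Hypothetical helper for illustration only.
    """
    s_ends = byte_ends(student_tokens)
    t_ends = byte_ends(teacher_tokens)
    s_total = s_ends[-1] if s_ends else 0
    t_total = t_ends[-1] if t_ends else 0
    if s_total != t_total:
        raise ValueError("token sequences cover different byte lengths")

    spans = []
    i = j = 0
    s_start = t_start = 0
    while i < len(s_ends) and j < len(t_ends):
        if s_ends[i] == t_ends[j]:
            # Both sequences end a token at the same byte: close the span.
            spans.append((list(range(s_start, i + 1)),
                          list(range(t_start, j + 1))))
            i += 1
            j += 1
            s_start, t_start = i, j
        elif s_ends[i] < t_ends[j]:
            i += 1  # student is behind, consume another student token
        else:
            j += 1  # teacher is behind, consume another teacher token
    return spans


student = ["Hel", "lo", " wor", "ld"]
teacher = ["Hello", " ", "world"]
spans = align_by_bytes(student, teacher)
```

Each span pairs a run of student token indices with the run of teacher token indices covering the same bytes, which is the unit over which per-token quantities from two different vocabularies could be compared.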

v1.5.0

Fixed an exponential backtracking bug in Qwen3/Qwen3.5/GLM4MoE response parsing that caused GRPOTrainer to hang indefinitely on truncated tool-call blocks, reducing worst-case complexity from O(2ⁿ) to O(n). Also fixed a CUDA memory leak in BNB dequantization buffers and stale state in OffloadActivations. Added training chat templates for Phi-3.5, Qwen3-VL, and Qwen3.5 Think/NoThink, and final logits softcapping support for AsyncGRPOTrainer on models like Gemma 2.
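As an illustration of how a parser can stay linear on truncated input, the sketch below scans for tool-call blocks with `str.find` instead of a backtracking regular expression. It is not the parser used in TRL: the tag names and the function `extract_tool_calls` are assumptions chosen for the example.

```python
OPEN, CLOSE = "<tool_call>", "</tool_call>"  # assumed tag names, for illustration


def extract_tool_calls(text):
    """Return (complete_blocks, truncated_tail) in one left-to-right pass.

    `truncated_tail` is None when every opened block is closed.
    """
    blocks = []
    pos = 0
    while True:
        start = text.find(OPEN, pos)
        if start == -1:
            return blocks, None
        body_start = start + len(OPEN)
        end = text.find(CLOSE, body_start)
        if end == -1:
            # Truncated block: report it once and stop. Nothing is retried,
            # so the total work stays proportional to len(text).
            return blocks, text[body_start:]
        blocks.append(text[body_start:end].strip())
        pos = end + len(CLOSE)


sample = 'hi <tool_call>{"a": 1}</tool_call> and <tool_call>{"b": '
blocks, tail = extract_tool_calls(sample)
```

Because the scan position only moves forward, an unclosed block at the end of a generation costs one extra pass over the remaining text rather than a search over every possible way to split it.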

v1.4.0

A new loss_type="chunked_nll" option for SFT drastically reduces peak activation memory by computing cross-entropy over tokens in checkpointed chunks instead of materializing the full [batch × seq × vocab] logits tensor, unlocking sequence lengths that previously caused out-of-memory errors. Also added OpenReward Standard environment adapter support, length-normalized DPO sigmoid loss, training chat templates for Cohere, Cohere2, Gemma 3, Qwen3, and Qwen2.5, and a training-invariance test suite to catch numerical drift across trainer configurations.
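The memory argument behind chunked cross-entropy can be sketched in plain Python. This is a toy illustration under stated assumptions, not TRL's implementation: the helper names are hypothetical, the "model" is a single projection matrix, and the gradient checkpointing mentioned above is not shown. It only demonstrates that computing the loss chunk by chunk gives the same mean NLL as materializing all logits first.

```python
import math


def token_nll(logits, target):
    """NLL of `target` for one row of logits, using a stable log-sum-exp."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return lse - logits[target]


def project(hidden, weight):
    """Map one hidden vector (size d) to vocab logits; weight is vocab x d."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weight]


def full_nll(hiddens, weight, targets):
    # Materializes the whole [tokens x vocab] logits table at once.
    logits = [project(h, weight) for h in hiddens]
    return sum(token_nll(l, t) for l, t in zip(logits, targets)) / len(targets)


def chunked_nll(hiddens, weight, targets, chunk_size=2):
    # Only `chunk_size` rows of logits exist at any moment.
    total = 0.0
    for start in range(0, len(hiddens), chunk_size):
        stop = start + chunk_size
        chunk_logits = [project(h, weight) for h in hiddens[start:stop]]
        total += sum(token_nll(l, t)
                     for l, t in zip(chunk_logits, targets[start:stop]))
    return total / len(targets)


# Toy data: 5 tokens, hidden size 3, vocabulary of 4.
hiddens = [[0.1 * (i + j) for j in range(3)] for i in range(5)]
weight = [[0.2 * (v - k) for k in range(3)] for v in range(4)]
targets = [0, 3, 1, 2, 3]

loss_full = full_nll(hiddens, weight, targets)
loss_chunked = chunked_nll(hiddens, weight, targets, chunk_size=2)
```

The two losses agree up to floating-point rounding. Peak logits storage drops from tokens × vocab to chunk_size × vocab, which is where the savings at long sequence lengths would come from.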


Features

New SSDTrainer — Simple Self-Distillation


<img width="1800" height="1013" alt="thumbnail-2" src="https://github.com/user-attachments/assets/5c55b86a-0600-4f70-bf37-41ab240af851" />

Read our...


Features

Variational Sequence-Level Soft Policy Optimization (VESPO)



Features

Add environment_factory to GRPOTrainer

GRPOTrainer now accepts an environment_factory argument, allowing users to specif...


What's Changed

  • Remove access to warnings_issued by @qgallouedec in #4960
  • Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_...

Features

🕵️‍♂️ GRPO: Agent training

GRPOTrainer now supports training agents using tools. This allows language models to interact with...

Latest: v1.6.0
Tracking since Jan 25, 2023