TRL
Apr 17, 2026

Features

New SSDTrainer — Simple Self-Distillation

<img width="778" height="334" alt="Screenshot 2026-04-16 at 9 08 04 PM" src="https://github.com/user-attachments/assets/8ca223f0-6740-48a8-967c-ec10cb262a93" />

A new experimental SSDTrainer implements the method described in Embarrassingly Simple Self-Distillation Improves Code Generation. SSD samples completions from the model itself at a training-time temperature/truncation setting, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. No reward model, verifier, teacher model, or RL: just prompts and the model.

from datasets import Dataset
from trl.experimental.ssd import SSDConfig, SSDTrainer

dataset = Dataset.from_dict({
    "prompt": [
        [{"role": "user", "content": "Write a function to add two numbers."}],
        [{"role": "user", "content": "Write a function to check if a number is prime."}],
    ],
})

trainer = SSDTrainer(
    model="Qwen/Qwen3-4B-Instruct",
    args=SSDConfig(
        output_dir="ssd-model",
        temperature=0.6,      # T_train from the paper
        top_k=20,
        top_p=0.95,
        learning_rate=5e-6,
    ),
    train_dataset=dataset,
)
trainer.train()

by @kashif in https://github.com/huggingface/trl/pull/5505

Drop, don't truncate, overlong tool results in GRPOTrainer

When tool calls produce more tokens than max_completion_length allows, GRPOTrainer now rolls back the tool messages/images added in the current iteration instead of trying to truncate them. This removes ~80 lines of fragile, image-boundary-aware bookkeeping in favor of a ~15-line snapshot-and-rollback. Since overlong samples almost always get rewarded as failures anyway, the learning signal is effectively unchanged — but the code is dramatically simpler and no longer needs per-VLM-family vision-token lookup tables.
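The rollback pattern is easy to picture. Here is an illustrative sketch of the idea, not TRL's actual code; `token_len` stands in for whatever tokenized-length check the trainer performs:

```python
def append_tool_results(messages, tool_results, token_len, max_completion_length):
    """Snapshot-and-rollback: append this iteration's tool messages, but if the
    conversation would exceed the token budget, drop them entirely instead of
    truncating them mid-message."""
    snapshot = len(messages)          # snapshot: remember where this turn starts
    messages.extend(tool_results)
    if token_len(messages) > max_completion_length:
        del messages[snapshot:]       # roll back everything added this iteration
        return False                  # signal the rollout to stop here
    return True
```

Because overlong rollouts are dropped whole, there is no need to reason about where an image or special token boundary falls inside a truncated message.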

by @qgallouedec in https://github.com/huggingface/trl/pull/5521

Expanded tool-calling model support: LLaMA 3.1 / 3.2 & DeepSeek-V3

Continuing the effort from v1.1:

  • LLaMA 3.1 and 3.2 tool-calling response schemas, with dedicated templates for identity matching. Note that these templates only support a single tool call and no content alongside the tool call — limitations inherited from the models' native templates. By @qgallouedec in https://github.com/huggingface/trl/pull/5518
  • DeepSeek-V3 training chat template with {% generation %} markers, enabling assistant-only loss masking for DeepSeek-V3 models. By @RudrenduPaul in https://github.com/huggingface/trl/pull/5527

As a result of tightened detection (see Fixes below), the list of templates reported as tool-calling capable is now correct — notably, the basic Llama 3 template is no longer falsely classified as tool-calling capable.

KTO/DPO alignment push

A major cleanup sweep keeps KTOTrainer and DPOTrainer in lockstep: same initialization patterns, same config surface, same precompute behavior.

All by @albertvillanova.

Other

Fixes

Deprecations

  • Deprecate use_transformers_paged in GRPOConfig and RLOOConfig (and remove entirely from experimental OnlineDPOConfig, GOLDConfig, SelfDistillationConfig). Will be removed from the remaining configs in v2.0.0. In a small A/B benchmark (Qwen3-0.6B GRPO), the paged path is ~20% slower and uses ~6x more peak VRAM than the default; it's also superseded by transformers continuous batching. By @qgallouedec in https://github.com/huggingface/trl/pull/5544

Documentation and Examples

CI

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v1.1.0...v1.2.0

Apr 12, 2026

Features

DistillationTrainer for efficient on-policy distillation

Read the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-distillation-trainer

The new DistillationTrainer implements on-policy knowledge distillation as described in On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes. It extends the ideas from the GKDTrainer with three key optimizations: a generation buffer that decouples the training microbatch size from the generation batch size (up to 40x speedup), external teacher server support so the teacher doesn't need to fit on training GPUs, and binary-encoded logprob payloads that shrink transfer payloads by ~5x.

from datasets import load_dataset
from trl.experimental.distillation import DistillationConfig, DistillationTrainer

dataset = load_dataset("openai/gsm8k", "main", split="train")
dataset = dataset.map(
    lambda x: {"messages": [{"role": "user", "content": x["question"]}]},
    remove_columns=dataset.column_names,
)

trainer = DistillationTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    teacher_model="Qwen/Qwen2.5-7B-Instruct",
    args=DistillationConfig(
        output_dir="results/distill-qwen-gsm8k",
        lmbda=1.0,                   # fully on-policy (student generates)
        beta=1.0,                    # reverse KL
        teacher_model_init_kwargs={"torch_dtype": "bfloat16"},
    ),
    train_dataset=dataset,
)
trainer.train()

by @cmpatino in https://github.com/huggingface/trl/pull/5407, https://github.com/huggingface/trl/pull/5500 and https://github.com/huggingface/trl/pull/5501

Chunked LM head for memory-efficient log-prob computation in AsyncGRPOTrainer

AsyncGRPOTrainer now supports a chunked LM-head path that computes per-token log-probs and entropy via online logsumexp without materializing the full [N, V] logits tensor. Combined with completion_mask filtering to skip prompt tokens, this brings massive memory savings on long sequences — up to 44x lower peak-allocated memory on an 8192-token sequence:

| chunk_lm_head_size | Peak Alloc (GB) | Reduction | Wall Time (ms) |
|--------------------|-----------------|-----------|----------------|
| None (baseline)    | 18.55           | 1.00x     | 808.7          |
| 4096               | 0.42            | 44.32x    | 459.0          |
| 8192               | 0.76            | 24.34x    | 393.0          |
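The online-logsumexp trick can be sketched in a few lines. This is an illustrative reimplementation, not TRL's code: only one `[N, chunk_size]` logits slice exists at a time, yet the result matches a full-vocabulary log-softmax.

```python
import torch

def chunked_token_logprobs(hidden, lm_head_weight, target_ids, chunk_size=4096):
    """Per-token log-probs via online logsumexp over vocab chunks,
    without materializing the full [N, V] logits tensor."""
    n_tokens = hidden.size(0)
    vocab_size = lm_head_weight.size(0)
    running_lse = torch.full((n_tokens,), float("-inf"), dtype=hidden.dtype)
    target_logit = torch.empty(n_tokens, dtype=hidden.dtype)
    for start in range(0, vocab_size, chunk_size):
        w = lm_head_weight[start : start + chunk_size]   # [C, H]
        logits = hidden @ w.T                            # [N, C] — only slice in memory
        # online logsumexp update over the vocab dimension
        running_lse = torch.logaddexp(running_lse, torch.logsumexp(logits, dim=-1))
        # pick up the logits of targets that fall inside this chunk
        in_chunk = (target_ids >= start) & (target_ids < start + chunk_size)
        if in_chunk.any():
            idx = in_chunk.nonzero(as_tuple=True)[0]
            target_logit[idx] = logits[idx, target_ids[idx] - start]
    return target_logit - running_lse   # log p(target) per token
```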

Enable it via the new chunk_lm_head_size option in AsyncGRPOConfig:

from trl.experimental.async_grpo import AsyncGRPOConfig, AsyncGRPOTrainer

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    args=AsyncGRPOConfig(chunk_lm_head_size=4096),
    ...
)

Note: mutually exclusive with use_liger_kernel (both replace the LM head forward pass).

by @AmineDiro in https://github.com/huggingface/trl/pull/5349

{% generation %} support in training chat templates

SFT with assistant_only_loss=True requires chat templates to include {% generation %} / {% endgeneration %} markers so that return_assistant_tokens_mask=True produces correct masks. Very few models ship these markers natively, so users hit a cryptic error when enabling assistant-only loss with models like Qwen3, Llama 3 or GPT-OSS.

SFTTrainer now automatically swaps in a patched training chat template when the original template lacks generation markers — no manual template surgery required. Training templates are shipped for Qwen2.5, Qwen3, Llama 3 and GPT-OSS, stored as standalone .jinja files under trl/chat_templates/ for readability, diffability, and editor syntax highlighting.

from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model="Qwen/Qwen3-4B",
    args=SFTConfig(assistant_only_loss=True),  # now just works
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5459, https://github.com/huggingface/trl/pull/5470, by @RudrenduPaul in https://github.com/huggingface/trl/pull/5493 and https://github.com/huggingface/trl/pull/5522, and by @casinca in https://github.com/huggingface/trl/pull/5484

Expanded tool-calling model support

Agent training now supports a broader family of models via native tool-call response schemas:

A new supports_tool_calling() utility detects whether a tokenizer/processor can render a full tool-calling turn, and GRPOTrainer now validates tool support at initialization — raising a clear error upfront instead of failing cryptically mid-training.

by @qgallouedec in https://github.com/huggingface/trl/pull/5462, https://github.com/huggingface/trl/pull/5464, https://github.com/huggingface/trl/pull/5463, https://github.com/huggingface/trl/pull/5469 and https://github.com/huggingface/trl/pull/5454

Multimodal tool responses for VLM training

environment_factory tool methods can now return multimodal content blocks (images + text) for VLM training. Previously, tool responses were always converted to str(result), discarding any visual information. Now tools can return content block lists with images, and the trainer handles them end-to-end through tokenization, generation, and the forward pass — including correct pixel_values plumbing.

class ScreenshotEnv:
    def take_screenshot(self) -> list[dict]:
        return [
            {"type": "image", "image": self.browser.screenshot()},
            {"type": "text", "text": "Current page state"},
        ]

The OpenEnv browsergym.py example has been migrated to this pattern, and a new carla_vlm.py example demonstrates VLM training against CARLA with camera-image tool responses.

by @sergiopaniego in https://github.com/huggingface/trl/pull/5323 and https://github.com/huggingface/trl/pull/5437, and by @qgallouedec in https://github.com/huggingface/trl/pull/5448

Built-in reward functions now log extra columns

accuracy_reward and reasoning_accuracy_reward now emit extra diagnostic columns (solution, gold_parsed, answer_parsed) via the log_extra callback introduced in v1.0.0. These show up in the rich completions table, making it much easier to debug why a reward was (or wasn't) assigned.

<img width="1627" alt="accuracy reward with extra columns" src="https://github.com/user-attachments/assets/d7f6e9c2-4d7b-4886-ba7a-f58f0ccfcb9b" />
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer
from trl.rewards import accuracy_reward

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = GRPOTrainer(
    model="Qwen/Qwen2-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    args=GRPOConfig(log_completions=True),
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5308

Other

Fixes

Deprecations and Removals

  • Deprecate keep_end truncation mode in DPOConfig and SFTConfig — will be removed in v2.0.0. Use keep_start instead. By @albertvillanova in https://github.com/huggingface/trl/pull/5465
  • Deprecate pad_token config parameter in DPOConfig, SFTConfig, and RewardConfig — will be removed in v2.0.0. Set tokenizer.pad_token directly on the processing_class instead. By @albertvillanova in https://github.com/huggingface/trl/pull/5480
  • Remove trl.experimental.judges module and all judge support from trainers. Judges were experimental, unused in practice, and llm-blender (backing PairRMJudge) was unmaintained and incompatible with transformers v5 — actively blocking v5 adoption. Everything judges did can be achieved with reward_funcs. OnlineDPOTrainer, NashMDTrainer, and XPOTrainer are now unified on reward-model scoring only. By @qgallouedec in https://github.com/huggingface/trl/pull/5485

Documentation and Examples

CI

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v1.0.0...v1.1.0

Mar 31, 2026
<img width="1800" height="1013" alt="thumbnail-2" src="https://github.com/user-attachments/assets/5c55b86a-0600-4f70-bf37-41ab240af851" />

Read our blog post for an overview of TRL v1.

Features

Asynchronous GRPO

Asynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.

from trl.experimental.async_grpo import AsyncGRPOTrainer
from trl.rewards import accuracy_reward
from datasets import load_dataset

dataset = load_dataset("trl-lib/DeepMath-103K", split="train")

trainer = AsyncGRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=accuracy_reward,
    train_dataset=dataset,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5293

Variational Sequence-Level Soft Policy Optimization (VESPO)

<img width="465" height="279" alt="Screenshot 2026-03-20 at 5 49 50 PM" src="https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa" />

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in https://github.com/huggingface/trl/pull/5199

Divergence Proximal Policy Optimization (DPPO)

<img width="3180" height="1187" alt="z_TXYw37xZqsQ21YiDkYL" src="https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b" /> <img width="1189" height="490" alt="SfgWotuuuRKPkg-0bxWv1" src="https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689" />

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.
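A minimal usage sketch, assuming the trainer follows the usual experimental-module pattern; the exact import path and config fields may differ, so check the DPPO docs:

```python
from trl.experimental.dppo import DPPOConfig, DPPOTrainer  # import path assumed

trainer = DPPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=DPPOConfig(output_dir="dppo-model"),
    ...
)
```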

by @LeonEricsson in https://github.com/huggingface/trl/pull/5117

Self-Distillation Policy Optimization (SDPO)

SDPO is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.

from trl.experimental import SDPOTrainer, SDPOConfig
from trl.rewards import accuracy_reward

config = SDPOConfig(
    output_dir="./results",
    num_generations=8,
    success_reward_threshold=1.0,
    use_successful_as_teacher=True,
)

trainer = SDPOTrainer(
    model="Qwen/Qwen2.5-Math-1.5B-Instruct",
    reward_funcs=[accuracy_reward],
    args=config,
    train_dataset=dataset,
)
trainer.train()

by @MengAiDev in https://github.com/huggingface/trl/pull/4935

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" /> <img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />

by @manueldeprada in https://github.com/huggingface/trl/pull/5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in https://github.com/huggingface/trl/pull/4889

35% faster packing

BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.

<img width="1784" height="732" alt="benchmark_results" src="https://github.com/user-attachments/assets/8f0a35ad-cf1a-4fe1-a1f4-9b102637bdca" />
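Packing remains opt-in via SFTConfig. A minimal sketch; the strategy parameter name here is an assumption, so check the SFT docs for the exact field:

```python
from trl import SFTConfig

training_args = SFTConfig(
    packing=True,                  # enable BFD packing
    packing_strategy="bfd_split",  # parameter name assumed; value renamed from "bfd-requeue"
    ...
)
```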

by @mariosasko in https://github.com/huggingface/trl/pull/5189

[GKD] Buffer implementation and vLLM inference for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.

by @cmpatino in https://github.com/huggingface/trl/pull/5137 and https://github.com/huggingface/trl/pull/5388

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in https://github.com/huggingface/trl/pull/5255

Other

Fixes

Documentation and Examples

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0

Mar 20, 2026

Features

Variational Sequence-Level Soft Policy Optimization (VESPO)

<img width="465" height="279" alt="Screenshot 2026-03-20 at 5 49 50 PM" src="https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa" />

VESPO addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the loss_type parameter of GRPOConfig:

from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="vespo"),
    ...
)

by @casinca in https://github.com/huggingface/trl/pull/5199

Divergence Proximal Policy Optimization (DPPO)

<img width="3180" height="1187" alt="z_TXYw37xZqsQ21YiDkYL" src="https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b" /> <img width="1189" height="490" alt="SfgWotuuuRKPkg-0bxWv1" src="https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689" />

DPPO is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.

by @LeonEricsson in https://github.com/huggingface/trl/pull/5117

Reward functions can now log extra columns and scalar metrics

Reward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.

def my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):
    extracted = [extract_answer(c) for c in completions]
    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]

    if log_extra:
        log_extra("golden_answer", list(answer))
        log_extra("extracted_answer", extracted)

    if log_metric:
        log_metric("accuracy", sum(rewards) / len(rewards))

    return rewards
<img width="1400" height="407" alt="image" src="https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4" /> <img width="1353" height="673" alt="image" src="https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299" />

by @manueldeprada in https://github.com/huggingface/trl/pull/5233

Tool calling support in VLLMClient.chat()

VLLMClient.chat() now supports tool calling, enabling agentic workflows directly through the vLLM client interface.

by @kansalaman in https://github.com/huggingface/trl/pull/4889

35% faster packing

BFD packing is 35% faster. The "bfd-requeue" packing strategy has also been renamed to "bfd_split". See MIGRATION.md for details.

<img width="1784" height="732" alt="benchmark_results" src="https://github.com/user-attachments/assets/8f0a35ad-cf1a-4fe1-a1f4-9b102637bdca" />

by @mariosasko in https://github.com/huggingface/trl/pull/5189

[GKD] Buffer implementation for distillation trainer

The GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.

by @cmpatino in https://github.com/huggingface/trl/pull/5137

v0 → v1 migration guide

A MIGRATION.md guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.

by @qgallouedec in https://github.com/huggingface/trl/pull/5255

Other

Fixes

Documentation and Examples

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0rc1

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.29.0...v0.29.1

Feb 25, 2026

Features

Add environment_factory to GRPOTrainer

GRPOTrainer now accepts an environment_factory argument, allowing users to specify a custom environment class for training. This enables more flexible and diverse training scenarios by letting users define their own environments with specific dynamics and reward structures.

from datasets import Dataset
from trl import GRPOConfig, GRPOTrainer

dataset = Dataset.from_dict({
    "prompt": [[{"role": "user", "content": f"Increment the counter by {i}."}] for i in range(1, 7)]
})

def reward_func(environments, **kwargs):
    return [env.counter for env in environments]

class IncrementEnv:
    def reset(self):
        self.counter = 0

    def increment(self, step: int) -> int:
        """
        Increment the internal counter.

        Args:
            step: Value to add to the counter.

        Returns:
            The updated counter value.
        """
        self.counter += step
        return self.counter

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(chat_template_kwargs={"enable_thinking": False}),
    train_dataset=dataset,
    reward_funcs=reward_func,
    environment_factory=IncrementEnv,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/5093

Skills

TRL introduces agent-native CLI integration: trl-training, a first-class Agent Skill that exposes TRL's training workflows (SFT, DPO, GRPO, etc.) in a structured, agent-readable format. The skill ships directly with the trl library and can be installed via the CLI:

# Install into the project's agent directory (default scope=project), by agent name: claude, codex, opencode
trl skills install trl-training --target <agent>

This enables AI agents to safely and reproducibly execute TRL training workflows using a well-defined interface.

Skills can be installed at the project or global scope, and support explicit targets and overwrite controls.

Other

Fixes

Documentation and Examples

Deprecations

CI Improvements

Miscellaneous

Refactor CLI

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.28.0...v0.29.0

Feb 10, 2026
v0.28.0

Features

Experimental

Fixes

Documentation and Examples

Deprecations

CI Improvements

Miscellaneous

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.27.0...v0.28.0

Feb 3, 2026

What's Changed

  • Remove access to warnings_issued by @qgallouedec in #4960
  • Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942
  • Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in https://github.com/huggingface/trl/pull/4908

Full Changelog: https://github.com/huggingface/trl/compare/v0.27.1...v0.27.2

Jan 24, 2026

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.27.0...v0.27.1

Jan 16, 2026

Features

Experimental

Fixes

Documentation and Examples

Deprecations

CI Improvements

Miscellaneous

--

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.26.0...v0.27.0

Dec 18, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.26.1...v0.26.2

Dec 12, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.26.0...v0.26.1

Dec 9, 2025

Features

🕵️‍♂️ GRPO: Agent training

GRPOTrainer now supports training agents using tools. This allows language models to interact with external functions or APIs during training.

from datasets import Dataset
from trl import GRPOTrainer

def multiply(a: int, b: int) -> int:
    """
    Multiplies two integers.

    Args:
        a: The first integer.
        b: The second integer.

    Returns:
        The product of the two integers.
    """
    return a * b


dataset = Dataset.from_list(
    [
        {"prompt": [{"role": "user", "content": "What is 3 multiplied by 4?"}], "answer": 12},
        {"prompt": [{"role": "user", "content": "Calculate 7 times 8."}], "answer": 56},
        {"prompt": [{"role": "user", "content": "Find the product of 5 and 6."}], "answer": 30},
        {"prompt": [{"role": "user", "content": "What do you get when you multiply 9 by 9?"}], "answer": 81},
        {"prompt": [{"role": "user", "content": "Compute 12 multiplied by 11."}], "answer": 132},
        {"prompt": [{"role": "user", "content": "What is 15 times 14?"}], "answer": 210},
    ]
)

def accuracy(completions, answer, **kwargs):
    predictions = [completion[-1]["content"] for completion in completions]
    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]
    return rewards

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    train_dataset=dataset,
    tools=[multiply],
    reward_funcs=accuracy,
)
trainer.train()

by @qgallouedec in https://github.com/huggingface/trl/pull/4300

ScaleRL: Add CISPO Loss

CISPO loss was first introduced in the MiniMax-M1 paper; the ScaleRL paper subsequently showed that CISPO scales best in terms of performance and efficiency as models are trained for longer.

GRPOTrainer now supports the CISPO loss using loss_type="cispo" in the GRPOConfig.
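Mirroring the other loss variants, enabling it is a one-line config change:

```python
from trl import GRPOConfig, GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen3-0.6B",
    args=GRPOConfig(loss_type="cispo"),
    ...
)
```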

by @pramodith in https://github.com/huggingface/trl/pull/4495

Add vLLM quantization option for colocate

When the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.

by @sergiopaniego in https://github.com/huggingface/trl/pull/4496

Reasoning reward

TRL now includes a reasoning reward function:

from trl.rewards import reasoning_accuracy_reward

solutions = [r"\frac{1}{3}", r"\frac{1}{3}", r"\frac{1}{3}"]
completions = [
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{3}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content </think> The final answer is \boxed{\frac{1}{2}}",
        }
    ],
    [
        {
            "role": "assistant",
            "content": r"<think> Reasoning content with partial answers \boxed{\frac{1}{3}} but no final answer",
        }
    ],
]
reasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0] 

Like any other reward function, it can be used in GRPOTrainer or RLOOTrainer.

from trl import GRPOTrainer
from trl.rewards import reasoning_accuracy_reward

trainer = GRPOTrainer(
    ...,
    reward_funcs=reasoning_accuracy_reward,
)

by @lewtun in https://github.com/huggingface/trl/pull/4563

Add shuffle_dataset option to SFTTrainer

You can now shuffle the dataset in SFTTrainer by setting the shuffle_dataset argument to True in SFTConfig. This is useful when the dataset features high similarity between consecutive samples.

from trl import SFTTrainer, SFTConfig

SFTConfig(shuffle_dataset=True)

by @qgallouedec in https://github.com/huggingface/trl/pull/4564

Add SAPO Loss in GRPO

Soft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.

You can now use SAPO loss in GRPOTrainer by setting loss_type="sapo" in the GRPOConfig.
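As with the other loss variants:

```python
from trl import GRPOConfig

training_args = GRPOConfig(loss_type="sapo", ...)
```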

by @pramodith in https://github.com/huggingface/trl/pull/4600

Other Features

Experimental

Fixes

Documentation and Examples

Deprecations

Miscellaneous

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.25.0...v0.26.0

Nov 12, 2025

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.25.0...0.25.1

Nov 6, 2025

Features

Experimental

Fixes

Documentation and Examples

Deprecations

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.24.0...v0.25.0

Oct 16, 2025

Features

Fixes

Documentation

Deprecations

Experimental

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.23.0...v0.24.0

Oct 2, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.23.0...v0.23.1

Sep 10, 2025

Major

🥓 Context Parallelism

SFT now supports Context Parallelism (CP) for training large language models on very long sequences, enabling training with arbitrarily long sequence lengths.

<img width="844" height="336" alt="Screenshot 2025-09-09 at 10 39 30 PM" src="https://github.com/user-attachments/assets/f1dfc349-440a-4e05-aac9-439a3c286f08" />

by @kashif in https://github.com/huggingface/trl/pull/3994

🧨 Dynamic Fine-Tuning

Dynamic Fine-Tuning (DFT) is now supported in TRL.

from trl import SFTConfig

training_args = SFTConfig(
    loss_type="dft",
    ...
)
<img width="692" height="472" alt="Screenshot 2025-09-09 at 10 37 36 PM" src="https://github.com/user-attachments/assets/4ee2b4ab-7cc6-4578-bfac-c38124891510" />
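Conceptually, DFT rescales each token's cross-entropy by the model's own (detached) probability of that token. A sketch of the loss, assuming the formulation from the DFT paper rather than TRL's exact implementation:

```python
import torch
import torch.nn.functional as F

def dft_loss(logits, labels):
    """DFT sketch: cross-entropy reweighted by the (detached) probability the
    model assigns to each target token, down-weighting unlikely tokens."""
    logp = F.log_softmax(logits, dim=-1)                       # [N, V]
    tok_logp = logp.gather(-1, labels.unsqueeze(-1)).squeeze(-1)  # [N]
    return -(tok_logp.exp().detach() * tok_logp).mean()
```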

by @qgallouedec in https://github.com/huggingface/trl/pull/4042

🪵 Truncated Importance Sampling (TIS) to address rollout-training mismatch

Rollout generation (vLLM) and model training use different implementations, and this gap implicitly turns on-policy RL into off-policy RL. Truncated Importance Sampling (TIS) is a simple yet effective importance-sampling technique for handling this discrepancy, and it is now implemented in GRPO.

from trl import GRPOConfig

training_args = GRPOConfig(
    ...
    use_vllm=True,
    vllm_importance_sampling_correction=True, # default True
    vllm_importance_sampling_cap=2.0, # hyper-parameter C
)
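The correction itself is small. An illustrative sketch of the truncated weight, not TRL's exact implementation:

```python
import torch

def tis_weights(logp_train, logp_rollout, cap=2.0):
    """Per-token importance ratio between the training policy and the vLLM
    rollout policy, truncated at C (the cap) to bound variance."""
    ratio = torch.exp(logp_train - logp_rollout)
    return torch.clamp(ratio, max=cap)
```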

by @LeonEricsson in https://github.com/huggingface/trl/pull/3867

🥣 [SFTTrainer]: Add Aux Loss for MoE models

Mixture of Experts (MoE) models require an auxiliary loss to ensure that the different experts are used evenly. This auxiliary loss is now supported in SFTTrainer.

training_args = SFTConfig(
    model_init_kwargs={"output_router_logits": True},
    ...
)

by @pramodith in https://github.com/huggingface/trl/pull/4012

💤 [GRPO/RLOO] Adds an option to sleep vllm when running in colocated mode

When running GRPO (or RLOO) with vLLM in colocated mode, the vLLM engine consumes VRAM during optimization while sitting idle. There is now an option to put vLLM to sleep during optimization to free up that VRAM.

from trl import GRPOConfig

training_args = GRPOConfig(..., vllm_sleep_enabled=True)

by @edbeeching in https://github.com/huggingface/trl/pull/3968

⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer

You can now use vLLM server mode with OnlineDPOTrainer. Additionally, VLM models are now supported.
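A minimal configuration sketch; the parameter names are assumed from the GRPO-style vLLM integration, so check the OnlineDPO docs for the exact fields:

```python
from trl import OnlineDPOConfig

training_args = OnlineDPOConfig(
    use_vllm=True,
    vllm_mode="server",  # assumed; connects to a separately launched vLLM server
    ...
)
```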

by @vaelev in https://github.com/huggingface/trl/pull/3783

Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations

The paper index has been significantly enhanced with the addition of 9+ new algorithm implementations, providing a more comprehensive resource for users.

by @behroozazarkhalili in https://github.com/huggingface/trl/pull/3990

Other Notable Changes

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.22.0...v0.23.0

Sep 3, 2025

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.22.1...v0.22.2

Aug 29, 2025

What's Changed

  • Refactor version retrieval to use importlib.metadata by @qgallouedec
  • Release: 0.22.1 by @qgallouedec

Full Changelog: https://github.com/huggingface/trl/compare/v0.22.0...v0.22.1
