A2PO trainer debuts; VLM KTO support; Async GRPO spawns process
v1.6.0
Features
AsyncGRPO rollout worker now runs in a separate process
AsyncRolloutWorker is no longer a thread — it's a spawned child process with its own GIL. The trainer's autograd engine no longer competes with recursive_parse / accuracy_reward for the GIL, which was causing 1-5s stalls in real Qwen3-30B-A3B @ 16k runs and ultimately NCCL watchdog timeouts on other ranks.
Architectural changes:
- AsyncRolloutWorker (parent) owns the child process and the shared mp.Queue / mp.Value / mp.Event.
- _AsyncRolloutLoop (child-only) handles tokenization, dataset iteration, reward funcs, and asyncio loops.
- A new WeightTransferClient owns the NCCL group with vLLM (/pause, /resume, /init_weight_transfer_engine, /update_weights); the rollout child only talks to /v1/completions.
Two correctness fixes shipped alongside (they would have conflicted otherwise): broader aiohttp retry (now catches ClientPayloadError) with bounded exponential backoff, and all-NaN reward columns are now preserved — np.nansum was silently returning 0, giving unscorable completions a real advantage signal and pushing the policy away from correct answers (~30% of DeepMath / OpenR1-Math rows).
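The all-NaN pitfall is easy to reproduce in isolation. The sketch below is not TRL's code: combine_rewards is a hypothetical helper, and it assumes rewards are laid out as one row per completion and one column per reward function, with NaN meaning "this function could not score the completion".

```python
import numpy as np

def combine_rewards(rewards: np.ndarray) -> np.ndarray:
    """Sum reward functions per completion, keeping NaN where nothing scored it."""
    total = np.nansum(rewards, axis=1)        # rows that are entirely NaN collapse to 0.0 here
    unscored = np.isnan(rewards).all(axis=1)  # completions no reward function could score
    return np.where(unscored, np.nan, total)

rewards = np.array([
    [1.0, np.nan],     # scored by the first function only
    [np.nan, np.nan],  # unscorable completion
    [0.0, 0.5],
])

naive = np.nansum(rewards, axis=1)
fixed = combine_rewards(rewards)
print(naive)  # the unscorable row appears as a genuine reward of 0.0
print(fixed)  # the unscorable row stays NaN and can be excluded downstream
```

With the naive sum, the unscorable completion looks like a wrong answer and drags the baseline; keeping the NaN lets the advantage computation skip it.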
Note
reward_funcs / tools / environment_factory must now be picklable, and the child runs CPU-only (CUDA_VISIBLE_DEVICES="").
by @AmineDiro in https://github.com/huggingface/trl/pull/5749
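The picklability requirement in the note above can be checked before launching a run. This is a minimal sketch using only the standard library; is_picklable, length_reward, and inline_reward are hypothetical names, and the toy reward signature is an assumption rather than TRL's exact contract.

```python
import pickle

def is_picklable(obj) -> bool:
    """Return True if obj survives pickle.dumps, as passing it to a spawned child requires."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

def length_reward(completions, **kwargs):
    """Toy reward defined at module level, so pickle can refer to it by name."""
    return [float(len(c)) for c in completions]

# A lambda has no importable name, so pickle cannot serialize it.
inline_reward = lambda completions, **kwargs: [0.0 for _ in completions]

print("module-level function:", is_picklable(length_reward))
print("lambda:", is_picklable(inline_reward))
```

Lambdas, closures, and functions defined inside other functions are the usual offenders; moving them to module level of an importable file is normally enough.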
New experimental A2PO trainer (Optimal Advantage Regression)
A new A2POTrainer implements A*-PO from "Accelerating RL for LLM Reasoning with Optimal Advantage Regression". Two stages: an offline V* estimation pass from reference policy samples (with optional filter_all_incorrect to drop prompts where every reference completion fails), then on-policy training with one generation per prompt and a plain least-squares loss on β₂·log(π/π_ref) vs r − V*. No group, no critic, no clipping, no reward normalization.
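from trl.experimental.a2po import A2POConfig, A2POTrainer
from trl.rewards import accuracy_reward

# `dataset` is assumed to be a prompt dataset with verifiable answers, loaded beforehand
trainer = A2POTrainer(
    model="Qwen/Qwen3-4B",
    args=A2POConfig(num_value_samples=8, filter_all_incorrect=True),
    train_dataset=dataset,
    reward_funcs=accuracy_reward,
)
trainer.train()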
Designed for binary verifiable rewards (math/code), not open-ended problems.
by @raghulchandramouli in https://github.com/huggingface/trl/pull/5940
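To make the two stages concrete, here is a minimal numerical sketch of the objective on scalar inputs. It is not the trainer's implementation. The regression target follows the description above; the V* estimator is an assumption taken from the paper (a log-mean-exp over reference rewards with a temperature called beta1 there), and the function names and the beta values are hypothetical.

```python
import numpy as np

def estimate_v_star(ref_rewards: np.ndarray, beta1: float) -> float:
    """Log-mean-exp estimate of V*(x) from rewards of reference-policy samples."""
    scaled = ref_rewards / beta1
    m = scaled.max()  # subtract the max for numerical stability
    return float(beta1 * (m + np.log(np.mean(np.exp(scaled - m)))))

def a2po_loss(logp: float, ref_logp: float, reward: float, v_star: float, beta2: float) -> float:
    """Squared error between beta2 * log(pi / pi_ref) and the target r - V*."""
    return (beta2 * (logp - ref_logp) - (reward - v_star)) ** 2

# Stage 1 (offline): 8 reference completions for one prompt, 2 of them correct.
ref_rewards = np.array([1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0])
v_star = estimate_v_star(ref_rewards, beta1=0.5)

# Stage 2 (on-policy): a single sampled completion that happens to be correct.
loss = a2po_loss(logp=-12.0, ref_logp=-12.5, reward=1.0, v_star=v_star, beta2=1e-3)
print(f"V* = {v_star:.4f}, loss = {loss:.4f}")
```

If every reference completion fails, the estimate is 0 and the prompt carries little signal for binary rewards, which is the case filter_all_incorrect is meant to drop.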
KTO now supports VLMs + big alignment push
The biggest KTO ↔ DPO alignment cycle yet — KTOTrainer now supports vision-language models, plus a deep restructuring of compute_loss, KL dataset generation, ref-logp precomputation, activation offloading, sampler strategy, metrics, and more. KTO graduation is very close.
from trl.experimental.kto import KTOConfig, KTOTrainer

trainer = KTOTrainer(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    args=KTOConfig(...),
    train_dataset=vision_kto_dataset,
)
VLM support: by @albertvillanova in https://github.com/huggingface/trl/pull/5939. Plus ~20 alignment PRs all by @albertvillanova: #5820, #5849, #5852, #5850, #5866, #5864, #5856, #5872, #5875, #5900, #5901, #5899, #5906, #5909, #5914, #5982, #5936, #5996, #5998, #5999.
Cross-tokenizer alignment in GOLD via byte offsets
The GOLD distillation trainer used to align student/teacher tokens by extending two decoded strings and flushing on equality. It silently broke on any byte-level disagreement — including the common case of one tokenizer prepending BOS while the other doesn't (Llama-3 ↔ Qwen-3). The X-Token paper called this out by name.
Each side now carries (start_byte, end_byte) spans derived once from the fast tokenizer's char offsets, and the walker syncs on cumulative byte boundaries. On the on-policy path, spans come from piece_byte_len over the sampled token ids (not from re-encoding the decoded completion — BPE makes that round-trip non-injective).
Two related fixes shipped: long rows no longer lose the completion (now keeping the last max_length tokens), and the vLLM on-policy original_prompt_text is now decoded from the truncated ids the student actually consumed.
by @kashif in https://github.com/huggingface/trl/pull/5885
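The byte-boundary idea can be illustrated without any tokenizer. The sketch below is a simplified stand-in, not the GOLD trainer's walker: token pieces are given directly as strings (the trainer derives spans from tokenizer offsets instead), and byte_spans and align_by_bytes are hypothetical names.

```python
def byte_spans(pieces):
    """Return (start_byte, end_byte) for each piece, measured in UTF-8 bytes."""
    spans, pos = [], 0
    for piece in pieces:
        n = len(piece.encode("utf-8"))
        spans.append((pos, pos + n))
        pos += n
    return spans

def align_by_bytes(student_pieces, teacher_pieces):
    """Group token indices of both sides wherever their cumulative byte ends coincide."""
    s_spans, t_spans = byte_spans(student_pieces), byte_spans(teacher_pieces)
    groups, i, j, s_start, t_start = [], 0, 0, 0, 0
    while i < len(s_spans) and j < len(t_spans):
        s_end, t_end = s_spans[i][1], t_spans[j][1]
        if s_end == t_end:
            # Both sides have consumed the same number of bytes: close a group.
            groups.append((list(range(s_start, i + 1)), list(range(t_start, j + 1))))
            i, j = i + 1, j + 1
            s_start, t_start = i, j
        elif s_end < t_end:
            i += 1  # student is behind, consume another student token
        else:
            j += 1  # teacher is behind, consume another teacher token
    return groups

# The teacher prepends an empty BOS-like piece; the student does not.
student = ["Hel", "lo", " wör", "ld"]
teacher = ["", "Hello", " w", "örld"]
groups = align_by_bytes(student, teacher)
print(groups)  # [([0, 1], [0, 1]), ([2, 3], [2, 3])]
```

Because a zero-length piece does not move the byte cursor, the extra BOS is folded into the first group instead of derailing the alignment, and multi-byte characters such as "ö" are handled by counting bytes rather than characters.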
SDFT / SDPO: live teacher logprobs from the vLLM server
When teacher_model_kind="live" and vllm_mode="server", the vLLM generation server already holds the current student weights (synced every step for rollouts). The new use_teacher_server=True flag scores the teacher's log-probs on that same server instead of running a separate local teacher forward — removing the teacher from the training step entirely.
Supported modes: sampled_token (reverse KL on the realized token) and topk_logits. When buffered batches reuse steps (num_iterations > 1), weights are re-synced before scoring so the teacher never scores stale.
by @kashif in https://github.com/huggingface/trl/pull/5989
Bidirectional masked importance sampling (MIS) for IcePop
vLLM importance sampling in GRPO now uses a two-sided band [C_min, C_max] instead of a single upper cap, aligning TIS/MIS with IcePop's bidirectional handling of train–inference ratio outliers.
from trl import GRPOConfig

config = GRPOConfig(
    vllm_importance_sampling_clip_min=0.5,
    vllm_importance_sampling_clip_max=2.0,
    vllm_importance_sampling_correction="mask",  # or "truncate"
)
The old vllm_importance_sampling_cap is deprecated and maps to clip_max.
by @casinca in https://github.com/huggingface/trl/pull/4732
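The difference between the two correction modes under a two-sided band can be sketched on a few ratios. This illustrates the idea only, under the assumption that "mask" zeroes out ratios outside [C_min, C_max] and "truncate" clamps them into the band; correct_is_ratio is a hypothetical helper, and the granularity TRL applies it at is not modeled here.

```python
import numpy as np

def correct_is_ratio(ratio, clip_min: float, clip_max: float, mode: str = "mask") -> np.ndarray:
    """Apply a two-sided band to train/inference importance ratios."""
    ratio = np.asarray(ratio, dtype=float)
    if mode == "mask":
        inside = (ratio >= clip_min) & (ratio <= clip_max)
        return np.where(inside, ratio, 0.0)       # outliers contribute nothing
    if mode == "truncate":
        return np.clip(ratio, clip_min, clip_max)  # outliers are pulled to the band edge
    raise ValueError(f"unknown mode: {mode!r}")

# Ratios of trainer probability to vLLM probability for three samples.
ratios = np.array([0.2, 1.0, 3.0])
masked = correct_is_ratio(ratios, 0.5, 2.0, mode="mask")
truncated = correct_is_ratio(ratios, 0.5, 2.0, mode="truncate")
print(masked)     # [0. 1. 0.]
print(truncated)  # [0.5 1.  2. ]
```

A single upper cap would have left the 0.2 ratio untouched; the lower bound is what makes the handling bidirectional.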
NemotronH and Nemotron 3 Ultra support
Day-zero training support for NVIDIA's new model families.
- NemotronH integration by @qgallouedec in https://github.com/huggingface/trl/pull/5938
- Nemotron 3 Ultra support by @qgallouedec in https://github.com/huggingface/trl/pull/5942
- Enable gradient checkpointing in Nemotron 3 SFT example by @sergiopaniego in https://github.com/huggingface/trl/pull/5944
Even more training chat templates
Three more model families with {% generation %} markers (assistant-only loss out of the box):
- Qwen2.5-VL by @aazizyan in https://github.com/huggingface/trl/pull/5838
- Qwen2-VL by @aazizyan in https://github.com/huggingface/trl/pull/5839
- Llava-Next by @aazizyan in https://github.com/huggingface/trl/pull/5959
Distributed backend boilerplate, hidden
A new trl/distributed.py introduces a single DistributedBackend class that detects ZeRO stage and FSDP version once, then exposes two context managers (gather_params, summon_full_params) used everywhere. Replaces the scattered getattr(state, "fsdp_plugin", None) / gather_if_zero3 / summon_full_params if ... else nullcontext() boilerplate spread across vllm_generation.py, models/utils.py, and the main trainers. Future deprecations land in one place.
by @albertvillanova in https://github.com/huggingface/trl/pull/6000
Decoupled self-distillation trainers
A two-PR refactor that disentangles SDPO, SDFT, and other self-distillation trainers from their shared base, making each one self-contained and consistent with the rest of the codebase.
by @LeonEricsson in https://github.com/huggingface/trl/pull/5862 and https://github.com/huggingface/trl/pull/5883
Heads-up: SFT default loss_type will change in 1.7
Setting SFTConfig.loss_type is now optional, and leaving it unset emits a FutureWarning: in TRL 1.7 the default will switch from "nll" to "chunked_nll". No action needed — you'll just get the new default automatically on upgrade — unless you want to pin the current behavior (e.g. for custom models) with loss_type="nll".
by @qgallouedec in https://github.com/huggingface/trl/pull/5997
Other
- Support 'None' as CLI value for Optional[T] fields by @qgallouedec in https://github.com/huggingface/trl/pull/5843
- Support non-lm_head output projections in chunked SFT loss (GPTNeoX) by @qgallouedec in https://github.com/huggingface/trl/pull/5857
- SFTTrainer: merge entropy and accuracy computation to eliminate redundant logits copy by @flutist in https://github.com/huggingface/trl/pull/5897
- Remove redundant .contiguous() calls in DPOTrainer to reduce peak memory by @flutist in https://github.com/huggingface/trl/pull/5926
- Remove unnecessary explicit .contiguous() before entropy_from_logits by @qgallouedec in https://github.com/huggingface/trl/pull/5930
- Exclude None reward completions from GRPO/RLOO advantage baseline by @AmineDiro in https://github.com/huggingface/trl/pull/5902
- Support multimodal config in PPO ValueHead by @albertvillanova in https://github.com/huggingface/trl/pull/5907
- Support vision datasets for Liger in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/5943
- Raise if precompute_ref_log_probs with vision datasets in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/5867
- 🔒 Gate trainer telemetry on an explicit class-name allowlist by @qgallouedec in https://github.com/huggingface/trl/pull/5851
- Update vLLM version support to 0.19.0 by @sergiopaniego in https://github.com/huggingface/trl/pull/5879
- Improve error message when image tokens are truncated by max_length by @lxk8998 in https://github.com/huggingface/trl/pull/5927
- Padding-free invariance test by @qgallouedec in https://github.com/huggingface/trl/pull/5842
- Per-field invariance tolerances, calibrated by @qgallouedec in https://github.com/huggingface/trl/pull/5844
Fixes
- Fix loss_type="chunked_nll" under DeepSpeed ZeRO-3 by @qgallouedec in https://github.com/huggingface/trl/pull/5873
- Fix GRPO use_liger_kernel under DeepSpeed ZeRO-3 by @kashif in https://github.com/huggingface/trl/pull/5891
- async_grpo: don't return on queue.Empty by @AmineDiro in https://github.com/huggingface/trl/pull/5751
- Don't treat ROCm GPUs as Ampere by @kashif in https://github.com/huggingface/trl/pull/5917
- Route liger student forward through DDP wrapper in GKD, GOLD, and Distillation trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5934
- Fix backbone access in GRPO by aligning with SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5949
- Fix priority order in PPO ValueHead and raise ValueError for unsupported config by @albertvillanova in https://github.com/huggingface/trl/pull/5908
- Fix generate_batch: inference tensors block inplace ops in background thread by @albertvillanova in https://github.com/huggingface/trl/pull/5818 (cross-listed from v1.5 changelog window)
- Fix SFT padding-free test config by @kashif in https://github.com/huggingface/trl/pull/5923
- Specify encoding="utf-8" when reading .jinja chat templates on Windows by @ColebyPearson in https://github.com/huggingface/trl/pull/5869
- Fix ValueError by pinning kernels < 0.15.1 by @albertvillanova in https://github.com/huggingface/trl/pull/5880
- Set kernels optional dependency via transformers by @albertvillanova in https://github.com/huggingface/trl/pull/5884
- Support kernels extra for transformers < 5.1.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5928
- Add missing use_liger_kernel guard to SDPO teacher-server validation by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5994
- Flash Attention capitalization fix by @qgallouedec in https://github.com/huggingface/trl/pull/5855
Documentation and Examples
- Remove NeMo Gym Integration Guide (broken) by @cmunley1 in https://github.com/huggingface/trl/pull/5840
- docs(GRPOTrainer): remove duplicate sentence by @zafstojano in https://github.com/huggingface/trl/pull/5957
- docs(RLOOTrainer): fix blockquote math not rendering by @zafstojano in https://github.com/huggingface/trl/pull/5958
- docs: highlight the role of KL in RLOO compared to GRPO by @zafstojano in https://github.com/huggingface/trl/pull/5966
- docs: clarify PPO entropy metrics in PPO trainer docs by @biefan in https://github.com/huggingface/trl/pull/5289
- docs: update OpenEnv GitHub org references and package name by @sergiopaniego in https://github.com/huggingface/trl/pull/5919
- docs: update OpenEnv doc URLs to huggingface.co/docs/openenv by @sergiopaniego in https://github.com/huggingface/trl/pull/5929
- docs: sync SDFT/SDPO config docstrings with their fields by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5992
- docs: sync Distillation/GOLD/OnlineDPO config docstrings with their fields by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5995
- docs: Document bnb_4bit_quant_storage and normalize docstring param headers by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5993
- docs: fix rendering typos by @zafstojano in https://github.com/huggingface/trl/pull/5991
- fix(docs): correct broken GKD Trainer link in MiniLLM docs by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5960 and https://github.com/huggingface/trl/pull/5961
- fix(docs): drop duplicate "a" in online_dpo_vlm example description by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5978
- Fix broken doc links by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5971
- Fix broken code examples in docs (RLOO syntax, SFTConfig max_length) by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5970
- Fix malformed ScaleRL paper link in GRPOConfig epsilon_high help by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5972
- fix(cli): drop duplicate "to" in trl skills install description by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/6008
- Remove invalid max_prompt_length argument from GRPO example by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5964
CI
- Refresh sft.json/dpo.json snapshots after transformers num_items_in_batch fix by @qgallouedec in https://github.com/huggingface/trl/pull/5845
- Add testing for Olmo 3 by @qgallouedec in https://github.com/huggingface/trl/pull/5962
- Align trainer train tests by @qgallouedec in https://github.com/huggingface/trl/pull/5963
- Align trainers: Remove redundant else branch by @albertvillanova in https://github.com/huggingface/trl/pull/5983
- [CI] Check that training chat templates keep the stop token in the loss mask by @kashif in https://github.com/huggingface/trl/pull/5988
- Create CI workflow to sync TRL skill with huggingface/skills by @albertvillanova in https://github.com/huggingface/trl/pull/5950
- Simplify agent skills target and default to .agents by @albertvillanova in https://github.com/huggingface/trl/pull/5987
- chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in https://github.com/huggingface/trl/pull/5910
- Bump the actions group with 9 updates by @dependabot[bot] in https://github.com/huggingface/trl/pull/5913
- Bump the actions group with 4 updates by @dependabot[bot] in https://github.com/huggingface/trl/pull/5954
- chore: update docker-build.yml with version parsing by @hf-security-analysis[bot] in https://github.com/huggingface/trl/pull/5920
- ci: use GitHub App auth for doc preview comment bot by @sergiopaniego in https://github.com/huggingface/trl/pull/5915
New Contributors
- @ColebyPearson made their first contribution in https://github.com/huggingface/trl/pull/5869
- @hf-dependantbot-rollout[bot] made their first contribution in https://github.com/huggingface/trl/pull/5910
- @raghulchandramouli made their first contribution in https://github.com/huggingface/trl/pull/5940
- @zafstojano made their first contribution in https://github.com/huggingface/trl/pull/5957
- @DaoyuanLi2816 made their first contribution in https://github.com/huggingface/trl/pull/5964
- @lxk8998 made their first contribution in https://github.com/huggingface/trl/pull/5927
- @biefan made their first contribution in https://github.com/huggingface/trl/pull/5289
What's Changed
- ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/5836
- Add Qwen2.5-VL original and training chat template with generation markers by @aazizyan in https://github.com/huggingface/trl/pull/5838
- Align KTO with DPO: Simplify metrics from sum/count to direct averages by @albertvillanova in https://github.com/huggingface/trl/pull/5820
- async_grpo don't return on queue.Empty by @AmineDiro in https://github.com/huggingface/trl/pull/5751
- Align KTO with DPO: Refactor forward by @albertvillanova in https://github.com/huggingface/trl/pull/5849
- Per-field invariance tolerances, calibrated by @qgallouedec in https://github.com/huggingface/trl/pull/5844
- Add Qwen2-VL original and training chat template with generation markers by @aazizyan in https://github.com/huggingface/trl/pull/5839
- Remove NeMo Gym Integration Guide (broken) by @cmunley1 in https://github.com/huggingface/trl/pull/5840
- Align KTO with DPO: Align compute_ref_log_probs by @albertvillanova in https://github.com/huggingface/trl/pull/5852
- Align KTO with DPO: Align precompute_ref_logps by @albertvillanova in https://github.com/huggingface/trl/pull/5850
- Flash Attention capitalization fix by @qgallouedec in https://github.com/huggingface/trl/pull/5855
- 🔒 Gate trainer telemetry on an explicit class-name allowlist by @qgallouedec in https://github.com/huggingface/trl/pull/5851
- Align KTO with DPO: Support remove_unused_columns by @albertvillanova in https://github.com/huggingface/trl/pull/5866
- Raise if precompute_ref_log_probs with vision datasets in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/5867
- Support 'None' as CLI value for Optional[T] fields by @qgallouedec in https://github.com/huggingface/trl/pull/5843
- KTO: Replace _get_train_sampler with train_sampling_strategy for transformers >= 5.2.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5864
- Fix: specify encoding="utf-8" when reading .jinja chat templates on Windows by @ColebyPearson in https://github.com/huggingface/trl/pull/5869
- Align KTO with DPO: Align ref log probability names by @albertvillanova in https://github.com/huggingface/trl/pull/5856
- KTO: Support non-sequential train_sampling_strategy for apo_zero_unpaired by @albertvillanova in https://github.com/huggingface/trl/pull/5872
- Align KTO with DPO: Remove null_ref_context by @albertvillanova in https://github.com/huggingface/trl/pull/5875
- Fix ValueError by pinning kernels < 0.15.1 by @albertvillanova in https://github.com/huggingface/trl/pull/5880
- Update vLLM version support to 0.19.0 by @sergiopaniego in https://github.com/huggingface/trl/pull/5879
- Set kernels optional dependency via transformers by @albertvillanova in https://github.com/huggingface/trl/pull/5884
- Support non-lm_head output projections in chunked SFT loss (GPTNeoX) by @qgallouedec in https://github.com/huggingface/trl/pull/5857
- SFTTrainer: merge entropy and accuracy computation to eliminate redundant logits copy by @flutist in https://github.com/huggingface/trl/pull/5897
- Align KTO with DPO: Add disable_gradient_checkpointing to ref model forward passes by @albertvillanova in https://github.com/huggingface/trl/pull/5900
- Align KTO with DPO: Add activation offloading support by @albertvillanova in https://github.com/huggingface/trl/pull/5901
- Align KTO with DPO: Decouple KL dataset generation by @albertvillanova in https://github.com/huggingface/trl/pull/5899
- Fix GRPO use_liger_kernel under DeepSpeed ZeRO-3 by @kashif in https://github.com/huggingface/trl/pull/5891
- Replace custom numpy cache in precompute_ref_logps with native datasets by @albertvillanova in https://github.com/huggingface/trl/pull/5906
- [1/2] refactor: decoupled self distillation trainers (sdpo, sdft, ...) by @LeonEricsson in https://github.com/huggingface/trl/pull/5862
- Align KTO with DPO: Use datasets caching in precompute_ref_logps by @albertvillanova in https://github.com/huggingface/trl/pull/5909
- Support multimodal config in PPO ValueHead by @albertvillanova in https://github.com/huggingface/trl/pull/5907
- Fix priority order in PPO ValueHead and raise ValueError for unsupported config by @albertvillanova in https://github.com/huggingface/trl/pull/5908
- Fix loss_type="chunked_nll" under DeepSpeed ZeRO-3 by @qgallouedec in https://github.com/huggingface/trl/pull/5873
- chore: enable Dependabot weekly GitHub Actions bumps by @hf-dependantbot-rollout[bot] in https://github.com/huggingface/trl/pull/5910
- Exclude None reward completions from GRPO/RLOO advantage baseline by @AmineDiro in https://github.com/huggingface/trl/pull/5902
- Don't treat ROCm GPUs as Ampere by @kashif in https://github.com/huggingface/trl/pull/5917
- ci: use GitHub App auth for doc preview comment bot by @sergiopaniego in https://github.com/huggingface/trl/pull/5915
- Bump the actions group with 9 updates by @dependabot[bot] in https://github.com/huggingface/trl/pull/5913
- Align KTO with DPO: Replace completion_labels/get_batch_logps with completion_mask by @albertvillanova in https://github.com/huggingface/trl/pull/5914
- chore: update docker-build.yml with version parsing by @hf-security-analysis[bot] in https://github.com/huggingface/trl/pull/5920
- docs: update OpenEnv GitHub org references and package name by @sergiopaniego in https://github.com/huggingface/trl/pull/5919
- Support kernels extra for transformers < 5.1.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5928
- feat: move async rollout worker to separate process by @AmineDiro in https://github.com/huggingface/trl/pull/5749
- Remove redundant .contiguous() calls in DPOTrainer to reduce peak memory by @flutist in https://github.com/huggingface/trl/pull/5926
- docs: update OpenEnv doc URLs from meta-pytorch.org to huggingface.co/docs/openenv by @sergiopaniego in https://github.com/huggingface/trl/pull/5929
- Refresh sft.json/dpo.json snapshots after transformers num_items_in_batch fix by @qgallouedec in https://github.com/huggingface/trl/pull/5845
- NemotronH integration by @qgallouedec in https://github.com/huggingface/trl/pull/5938
- Nemotron 3 Ultra support by @qgallouedec in https://github.com/huggingface/trl/pull/5942
- Enable gradient checkpointing in Nemotron 3 SFT example (transformers>=5.7.0) by @sergiopaniego in https://github.com/huggingface/trl/pull/5944
- Fix SFT padding-free test config by @kashif in https://github.com/huggingface/trl/pull/5923
- Add experimental A2PO trainer (Optimal Advantage Regression) by @raghulchandramouli in https://github.com/huggingface/trl/pull/5940
- Align KTO with DPO: Inline _compute_logps into _compute_loss by @albertvillanova in https://github.com/huggingface/trl/pull/5936
- Fix: Route liger student forward through DDP wrapper in GKD, GOLD, and Distillation trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5934
- Fix backbone access in GRPO by aligning with SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5949
- fix(docs): Remove duplicate sentence in GRPOTrainer docs by @zafstojano in https://github.com/huggingface/trl/pull/5957
- fix(docs): Blockquote math not rendering in RLooTrainer docs by @zafstojano in https://github.com/huggingface/trl/pull/5958
- Remove invalid max_prompt_length argument from GRPO example by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5964
- Bump the actions group with 4 updates by @dependabot[bot] in https://github.com/huggingface/trl/pull/5954
- Add Llava-Next training templates support with generation markers by @aazizyan in https://github.com/huggingface/trl/pull/5959
- fix(docs): correct broken GKD Trainer link in MiniLLM docs by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5961
- Remove unnecessary explicit .contiguous() before entropy_from_logits by @qgallouedec in https://github.com/huggingface/trl/pull/5930
- Improve error message when image tokens are truncated by max_length by @lxk8998 in https://github.com/huggingface/trl/pull/5927
- fix(docs): correct broken GKD Trainer link in MiniLLM docs by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5960
- Cross-tokenizer alignment via byte offsets in GOLD trainer by @kashif in https://github.com/huggingface/trl/pull/5885
- [2/2] refactor: decoupled self distillation trainers; cleanup by @LeonEricsson in https://github.com/huggingface/trl/pull/5883
- Create CI workflow to sync TRL skill with huggingface/skills by @albertvillanova in https://github.com/huggingface/trl/pull/5950
- Support vision datasets for Liger in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/5943
- Align trainer train tests by @qgallouedec in https://github.com/huggingface/trl/pull/5963
- Fix malformed ScaleRL paper link in GRPOConfig epsilon_high help by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5972
- Add testing for Olmo3 by @qgallouedec in https://github.com/huggingface/trl/pull/5962
- chore(docs): Highlight the role of KL in RLOO compared to GRPO by @zafstojano in https://github.com/huggingface/trl/pull/5966
- fix(docs): drop duplicate "a" in online_dpo_vlm example description by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5978
- Fix broken doc links (CONTRIBUTING online DPO paths, async GRPO anchor) by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5971
- Align KTO with DPO: Support VLM by @albertvillanova in https://github.com/huggingface/trl/pull/5939
- Fix broken code examples in docs (RLOO syntax, SFTConfig max_length) by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5970
- Align KTO with DPO: Improve error message for VLM truncation by @albertvillanova in https://github.com/huggingface/trl/pull/5982
- Align trainers: Remove redundant else branch by @albertvillanova in https://github.com/huggingface/trl/pull/5983
- SDFT/SDPO: live teacher logprobs from the vLLM server by @kashif in https://github.com/huggingface/trl/pull/5989
- docs: sync SDFT/SDPO config docstrings with their fields by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5992
- fix(docs): Fix rendering typos by @zafstojano in https://github.com/huggingface/trl/pull/5991
- Add missing use_liger_kernel guard to SDPO teacher-server validation by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5994
- docs: sync Distillation/GOLD/OnlineDPO config docstrings with their fields by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5995
- feat: Bidirectional masked importance sampling ratio (MIS) for IcePop by @casinca in https://github.com/huggingface/trl/pull/4732
- Simplify agent skills target and default to .agents by @albertvillanova in https://github.com/huggingface/trl/pull/5987
- Align KTO with DPO: Remove unused use_dpo_data_collator attribute by @albertvillanova in https://github.com/huggingface/trl/pull/5996
- Align KTO with DPO: Rename kto_loss_fn to liger_loss_fn by @albertvillanova in https://github.com/huggingface/trl/pull/5998
- Align KTO with DPO: Inline kto_loss in _compute_loss by @albertvillanova in https://github.com/huggingface/trl/pull/5999
- Document bnb_4bit_quant_storage and normalize docstring param headers by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/5993
- [CI] Check that training chat templates keep the stop token in the loss mask by @kashif in https://github.com/huggingface/trl/pull/5988
- Announce upcoming SFT loss_type default change from 'nll' to 'chunked_nll' by @qgallouedec in https://github.com/huggingface/trl/pull/5997
- Padding-free invariance test by @qgallouedec in https://github.com/huggingface/trl/pull/5842
- Hide DeepSpeed/FSDP distributed backend boilerplate by @albertvillanova in https://github.com/huggingface/trl/pull/6000
- fix(cli): drop duplicate "to" in trl skills install description by @DaoyuanLi2816 in https://github.com/huggingface/trl/pull/6008
- docs: clarify PPO entropy metrics in PPO trainer docs by @biefan in https://github.com/huggingface/trl/pull/5289
- Release: v1.6 by @qgallouedec in https://github.com/huggingface/trl/pull/6009
Full Changelog: https://github.com/huggingface/trl/compare/v1.5.0...v1.6.0

