{"id":"src_g_3xbu2oRX52tM8mwXJzq","slug":"trl","name":"TRL","type":"github","url":"https://github.com/huggingface/trl","orgId":"org_GDdYeYynEgCEBNBwy-m6s","org":{"slug":"hugging-face","name":"Hugging Face"},"isPrimary":false,"metadata":"{\"evaluatedMethod\":\"github\",\"evaluatedAt\":\"2026-04-07T17:19:15.726Z\",\"changelogDetectedAt\":\"2026-04-07T17:28:23.856Z\"}","releaseCount":81,"releasesLast30Days":4,"avgReleasesPerWeek":0.7,"latestVersion":"v1.2.0","latestDate":"2026-04-17T01:13:05.000Z","changelogUrl":null,"hasChangelogFile":false,"lastFetchedAt":"2026-04-19T07:02:03.397Z","trackingSince":"2023-01-25T14:04:19.000Z","releases":[{"id":"rel_vSOkDQsjEJAbDUIKmPuqj","version":"v1.2.0","title":"v1.2.0","summary":"## Features\r\n\r\n### New `SSDTrainer` — Simple Self-Distillation\r\n\r\n<img width=\"778\" height=\"334\" alt=\"Screenshot 2026-04-16 at 9 08 04 PM\" src=\"https:/...","content":"## Features\r\n\r\n### New `SSDTrainer` — Simple Self-Distillation\r\n\r\n<img width=\"778\" height=\"334\" alt=\"Screenshot 2026-04-16 at 9 08 04 PM\" src=\"https://github.com/user-attachments/assets/8ca223f0-6740-48a8-967c-ec10cb262a93\" />\r\n\r\nA new experimental `SSDTrainer` implements the method described in [Embarrassingly Simple Self-Distillation Improves Code Generation](https://huggingface.co/papers/2604.01193). SSD samples completions from the model itself at a training-time temperature/truncation setting, then fine-tunes on those raw, unverified samples with standard cross-entropy loss. 
No reward model, verifier, teacher model, or RL: just prompts and the model.\r\n\r\n```python\r\nfrom datasets import Dataset\r\nfrom trl.experimental.ssd import SSDConfig, SSDTrainer\r\n\r\ndataset = Dataset.from_dict({\r\n    \"prompt\": [\r\n        [{\"role\": \"user\", \"content\": \"Write a function to add two numbers.\"}],\r\n        [{\"role\": \"user\", \"content\": \"Write a function to check if a number is prime.\"}],\r\n    ],\r\n})\r\n\r\ntrainer = SSDTrainer(\r\n    model=\"Qwen/Qwen3-4B-Instruct\",\r\n    args=SSDConfig(\r\n        output_dir=\"ssd-model\",\r\n        temperature=0.6,      # T_train from the paper\r\n        top_k=20,\r\n        top_p=0.95,\r\n        learning_rate=5e-6,\r\n    ),\r\n    train_dataset=dataset,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @kashif in https://github.com/huggingface/trl/pull/5505\r\n\r\n### Drop, don't truncate, overlong tool results in `GRPOTrainer`\r\n\r\nWhen tool calls produce more tokens than `max_completion_length` allows, `GRPOTrainer` now rolls back the tool messages/images added in the current iteration instead of trying to truncate them. This removes ~80 lines of fragile, image-boundary-aware bookkeeping in favor of a ~15-line snapshot-and-rollback. Since overlong samples almost always get rewarded as failures anyway, the learning signal is effectively unchanged — but the code is dramatically simpler and no longer needs per-VLM-family vision-token lookup tables.\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/5521\r\n\r\n### Expanded tool-calling model support: LLaMA 3.1 / 3.2 & DeepSeek-V3\r\n\r\nContinuing the effort from v1.1:\r\n\r\n- **LLaMA 3.1 and 3.2** tool-calling response schemas, with dedicated templates for identity matching. Note that these templates only support a single tool call and no content alongside the tool call — limitations inherited from the models' native templates. 
By @qgallouedec in https://github.com/huggingface/trl/pull/5518\r\n- **DeepSeek-V3** training chat template with `{% generation %}` markers, enabling assistant-only loss masking for `DeepSeek-V3` models. By @RudrenduPaul in https://github.com/huggingface/trl/pull/5527\r\n\r\nAs a result of the tightened detection (see fixes below), the list of templates reported as tool-calling capable is now correct — notably, the basic Llama 3 template is **no longer** falsely classified as tool-calling capable.\r\n\r\n### KTO/DPO alignment push\r\n\r\nA major cleanup sweep keeps `KTOTrainer` and `DPOTrainer` in lockstep (same initialization patterns, same config surface, same precompute behavior):\r\n\r\n- Add `precompute_ref_batch_size` to KTO (https://github.com/huggingface/trl/pull/5530)\r\n- Align `ref_model` initialization (https://github.com/huggingface/trl/pull/5534)\r\n- Align model initialization (https://github.com/huggingface/trl/pull/5533)\r\n- Support `None` args (https://github.com/huggingface/trl/pull/5531)\r\n- Remove `generate_during_eval` (https://github.com/huggingface/trl/pull/5551)\r\n- Remove model and ref adapter names (https://github.com/huggingface/trl/pull/5552)\r\n- Don't load `ref_model` when `precompute_ref_log_probs` is set in DPO/KTO (https://github.com/huggingface/trl/pull/5542)\r\n\r\nAll by @albertvillanova.\r\n\r\n### Other\r\n\r\n* Support messages with images in `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5474\r\n* Simplify role handling in `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5508\r\n* Update vLLM version support to 0.18.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5547\r\n\r\n## Fixes\r\n\r\n* Fix `supports_tool_calling` falsely accepting templates that drop assistant `tool_calls` by @qgallouedec in https://github.com/huggingface/trl/pull/5517\r\n* Fix `add_response_schema` for VLM processors — the schema was being set on 
the outer processor instead of the inner tokenizer, so it had no effect. This also collapses a handful of `__init__`/decode-gate workarounds. By @qgallouedec in https://github.com/huggingface/trl/pull/5520\r\n* Remove xfail condition for Gemma 4 response_schema regex bug by @qgallouedec in https://github.com/huggingface/trl/pull/5510\r\n* Remove unused dependencies for judges from dev requirements by @qgallouedec in https://github.com/huggingface/trl/pull/5515\r\n\r\n## Deprecations\r\n\r\n* **Deprecate `use_transformers_paged`** in `GRPOConfig` and `RLOOConfig` (and remove entirely from experimental `OnlineDPOConfig`, `GOLDConfig`, `SelfDistillationConfig`). Will be removed from the remaining configs in v2.0.0. In a small A/B benchmark (Qwen3-0.6B GRPO), the paged path is ~20% slower and uses ~6x more peak VRAM than the default; it's also superseded by `transformers` continuous batching. By @qgallouedec in https://github.com/huggingface/trl/pull/5544\r\n\r\n## Documentation and Examples\r\n\r\n* Add example script section to experimental trainer docs by @sergiopaniego in https://github.com/huggingface/trl/pull/5543\r\n* [Docs] Fix formatting in SSD training example script by @kashif in https://github.com/huggingface/trl/pull/5548\r\n* Nits in SSD docs by @sergiopaniego in https://github.com/huggingface/trl/pull/5554\r\n* [docs] Add LLaMA 3 / Qwen 2.5 entries to `chat_templates/README` by @qgallouedec in https://github.com/huggingface/trl/pull/5545\r\n* Update CARLA VLM example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/5557\r\n\r\n## CI\r\n\r\n* Fix CI dependency installs to use a single resolve by @qgallouedec in https://github.com/huggingface/trl/pull/5513\r\n* Set upper transformers version to skip distributed test_rloo after fixed by @albertvillanova in https://github.com/huggingface/trl/pull/5535\r\n* Update tests with zero3 for RLOO and GRPO once fixed in transformers 5.5.4 by @albertvillanova in 
https://github.com/huggingface/trl/pull/5541\r\n* Bump doc-builder SHA for PR upload workflow by @rtrompier in https://github.com/huggingface/trl/pull/5553\r\n\r\n## What's Changed\r\n\r\n* ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/5525\r\n* Simplify role handling in prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5508\r\n* Fix CI dependency installs to use a single resolve by @qgallouedec in https://github.com/huggingface/trl/pull/5513\r\n* Fix `supports_tool_calling` falsely accepting templates that drop assistant `tool_calls` by @qgallouedec in https://github.com/huggingface/trl/pull/5517\r\n* feat: add DeepSeek-V3 training chat template with generation markers by @RudrenduPaul in https://github.com/huggingface/trl/pull/5527\r\n* Drop, don't truncate, overlong tool results in GRPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5521\r\n* Set upper transformers version to skip distributed test_rloo after fixed by @albertvillanova in https://github.com/huggingface/trl/pull/5535\r\n* Align KTO with DPO: Add precompute_ref_batch_size by @albertvillanova in https://github.com/huggingface/trl/pull/5530\r\n* Update tests with zero3 for RLOO and GRPO once fixed in transformers 5.5.4 by @albertvillanova in https://github.com/huggingface/trl/pull/5541\r\n* Align KTO with DPO: Align ref_model initialization by @albertvillanova in https://github.com/huggingface/trl/pull/5534\r\n* Align KTO with DPO: Align model initialization by @albertvillanova in https://github.com/huggingface/trl/pull/5533\r\n* Remove unused dependencies for judges from dev requirements by @qgallouedec in https://github.com/huggingface/trl/pull/5515\r\n* Remove xfail condition for Gemma4 response_schema regex bug by @qgallouedec in https://github.com/huggingface/trl/pull/5510\r\n* Align KTO with DPO: Support None args by @albertvillanova in https://github.com/huggingface/trl/pull/5531\r\n* Add example script 
section to experimental trainer docs by @sergiopaniego in https://github.com/huggingface/trl/pull/5543\r\n* [SSD] Added SSD trainer in experimental by @kashif in https://github.com/huggingface/trl/pull/5505\r\n* [Docs] Fix formatting in SSD training example script by @kashif in https://github.com/huggingface/trl/pull/5548\r\n* Don't load ref_model when precompute_ref_log_probs in DPO/KTO by @albertvillanova in https://github.com/huggingface/trl/pull/5542\r\n* chore: bump doc-builder SHA for PR upload workflow by @rtrompier in https://github.com/huggingface/trl/pull/5553\r\n* Nits in SSD docs by @sergiopaniego in https://github.com/huggingface/trl/pull/5554\r\n* Deprecate `use_transformers_paged` by @qgallouedec in https://github.com/huggingface/trl/pull/5544\r\n* Update vLLM version support to 0.18.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5547\r\n* Align KTO with DPO: Remove generate_during_eval by @albertvillanova in https://github.com/huggingface/trl/pull/5551\r\n* Align KTO with DPO: Remove model and ref adapter names by @albertvillanova in https://github.com/huggingface/trl/pull/5552\r\n* Support messages with images in prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5474\r\n* Update CARLA VLM example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/5557\r\n* Fix `add_response_schema` for VLM processors by @qgallouedec in https://github.com/huggingface/trl/pull/5520\r\n* [docs] Add LLaMA 3 / Qwen 2.5 entries to `chat_templates/README` by @qgallouedec in https://github.com/huggingface/trl/pull/5545\r\n* Add LLaMA 3.1 and 3.2 tool calling support by @qgallouedec in https://github.com/huggingface/trl/pull/5518\r\n* Release: v1.2 by @qgallouedec in https://github.com/huggingface/trl/pull/5576\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/trl/compare/v1.1.0...v1.2.0\r\n","publishedAt":"2026-04-17T01:13:05.000Z","url":"https://github.com/huggingface/trl/releases/tag/v1.2.0","media":[]},{"id":"rel_-_Lo5QLOolB09xyuoL6kA","version":"v1.1.0","title":"v1.1.0","summary":"## Features\r\n\r\n### `DistillationTrainer` for efficient on-policy distillation\r\n\r\nRead the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-d...","content":"## Features\r\n\r\n### `DistillationTrainer` for efficient on-policy distillation\r\n\r\nRead the blog post: https://huggingface.co/spaces/HuggingFaceTB/trl-distillation-trainer\r\n\r\n![off_vs_on_policy_distillation yePX-mwe_1umXK5](https://github.com/user-attachments/assets/73b4d47c-adea-440e-ab39-9c5df40b9915)\r\n\r\nThe new `DistillationTrainer` implements on-policy knowledge distillation as described in [On-Policy Distillation of Language Models: Learning from Self-Generated Mistakes](https://huggingface.co/papers/2306.13649). It extends the ideas from the `GKDTrainer` with three key optimizations: a **generation buffer** that decouples the training microbatch size from the generation batch size (up to 40x speedup), **external teacher server support** so the teacher doesn't need to fit on training GPUs, and **binary-encoded logprob payloads** that shrink transfer payloads by ~5x.\r\n\r\n```python\r\nfrom datasets import load_dataset\r\nfrom trl.experimental.distillation import DistillationConfig, DistillationTrainer\r\n\r\ndataset = load_dataset(\"openai/gsm8k\", \"main\", split=\"train\")\r\ndataset = dataset.map(\r\n    lambda x: {\"messages\": [{\"role\": \"user\", \"content\": x[\"question\"]}]},\r\n    remove_columns=dataset.column_names,\r\n)\r\n\r\ntrainer = DistillationTrainer(\r\n    model=\"Qwen/Qwen2.5-1.5B-Instruct\",\r\n    teacher_model=\"Qwen/Qwen2.5-7B-Instruct\",\r\n    args=DistillationConfig(\r\n        output_dir=\"results/distill-qwen-gsm8k\",\r\n        lmbda=1.0,                   # fully on-policy (student 
generates)\r\n        beta=1.0,                    # reverse KL\r\n        teacher_model_init_kwargs={\"torch_dtype\": \"bfloat16\"},\r\n    ),\r\n    train_dataset=dataset,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @cmpatino in https://github.com/huggingface/trl/pull/5407, https://github.com/huggingface/trl/pull/5500 and https://github.com/huggingface/trl/pull/5501\r\n\r\n### Chunked LM head for memory-efficient log-prob computation in `AsyncGRPOTrainer`\r\n\r\n`AsyncGRPOTrainer` now supports a chunked LM-head path that computes per-token log-probs and entropy via online `logsumexp` without materializing the full `[N, V]` logits tensor. Combined with `completion_mask` filtering to skip prompt tokens, this brings massive memory savings on long sequences — up to **44x** lower peak-allocated memory on an 8192-token sequence:\r\n\r\n| `chunk_lm_head_size` | Peak Alloc (GB) | Reduction | Wall Time (ms) |\r\n| :--- | :--- | :--- | :--- |\r\n| `None` (baseline) | 18.55 | 1.00x | 808.7 |\r\n| `4096` | 0.42 | **44.32x** | 459.0 |\r\n| `8192` | 0.76 | **24.34x** | 393.0 |\r\n\r\nEnable it via the new `chunk_lm_head_size` option in `AsyncGRPOConfig`:\r\n\r\n```python\r\nfrom trl.experimental.async_grpo import AsyncGRPOConfig, AsyncGRPOTrainer\r\n\r\ntrainer = AsyncGRPOTrainer(\r\n    model=\"Qwen/Qwen2.5-0.5B-Instruct\",\r\n    args=AsyncGRPOConfig(chunk_lm_head_size=4096),\r\n    ...\r\n)\r\n```\r\n\r\nNote: mutually exclusive with `use_liger_kernel` (both replace the LM head forward pass).\r\n\r\nby @AmineDiro in https://github.com/huggingface/trl/pull/5349\r\n\r\n### `{% generation %}` support in training chat templates\r\n\r\nSFT with `assistant_only_loss=True` requires chat templates to include `{% generation %}` / `{% endgeneration %}` markers so that `return_assistant_tokens_mask=True` produces correct masks. 
Very few models ship these markers natively, so users hit a cryptic error when enabling assistant-only loss with models like Qwen3, Llama 3 or GPT-OSS.\r\n\r\n`SFTTrainer` now automatically swaps in a patched *training chat template* when the original template lacks generation markers — no manual template surgery required. Training templates are shipped for **Qwen2.5**, **Qwen3**, **Llama 3** and **GPT-OSS**, stored as standalone `.jinja` files under `trl/chat_templates/` for readability, diffability, and editor syntax highlighting.\r\n\r\n```python\r\nfrom trl import SFTConfig, SFTTrainer\r\n\r\ntrainer = SFTTrainer(\r\n    model=\"Qwen/Qwen3-4B\",\r\n    args=SFTConfig(assistant_only_loss=True),  # now just works\r\n    train_dataset=dataset,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/5459, https://github.com/huggingface/trl/pull/5470, by @RudrenduPaul in https://github.com/huggingface/trl/pull/5493 and https://github.com/huggingface/trl/pull/5522, and by @casinca in https://github.com/huggingface/trl/pull/5484\r\n\r\n### Expanded tool-calling model support\r\n\r\nAgent training now supports a broader family of models via native tool-call response schemas:\r\n\r\n- **GPT-OSS** (https://github.com/huggingface/trl/pull/5464)\r\n- **GLM-4-MoE** (https://github.com/huggingface/trl/pull/5463)\r\n- **Qwen3-VL** (https://github.com/huggingface/trl/pull/5469)\r\n- **Gemma 4** — the first model to natively ship a response schema (https://github.com/huggingface/trl/pull/5454)\r\n\r\nA new `supports_tool_calling()` utility detects whether a tokenizer/processor can render a full tool-calling turn, and `GRPOTrainer` now validates tool support at initialization — raising a clear error upfront instead of failing cryptically mid-training.\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/5462, https://github.com/huggingface/trl/pull/5464, https://github.com/huggingface/trl/pull/5463, 
https://github.com/huggingface/trl/pull/5469 and https://github.com/huggingface/trl/pull/5454\r\n\r\n### Multimodal tool responses for VLM training\r\n\r\n`environment_factory` tool methods can now return multimodal content blocks (images + text) for VLM training. Previously, tool responses were always converted to `str(result)`, discarding any visual information. Now tools can return content block lists with images, and the trainer handles them end-to-end through tokenization, generation, and the forward pass — including correct `pixel_values` plumbing.\r\n\r\n```python\r\nclass ScreenshotEnv:\r\n    def take_screenshot(self) -> list[dict]:\r\n        return [\r\n            {\"type\": \"image\", \"image\": self.browser.screenshot()},\r\n            {\"type\": \"text\", \"text\": \"Current page state\"},\r\n        ]\r\n```\r\n\r\nThe OpenEnv `browsergym.py` example has been migrated to this pattern, and a new `carla_vlm.py` example demonstrates VLM training against CARLA with camera-image tool responses.\r\n\r\nby @sergiopaniego in https://github.com/huggingface/trl/pull/5323 and https://github.com/huggingface/trl/pull/5437, and by @qgallouedec in https://github.com/huggingface/trl/pull/5448\r\n\r\n### Built-in reward functions now log extra columns\r\n\r\n`accuracy_reward` and `reasoning_accuracy_reward` now emit extra diagnostic columns (`solution`, `gold_parsed`, `answer_parsed`) via the `log_extra` callback introduced in v1.0.0. 
These show up in the rich completions table, making it much easier to debug why a reward was (or wasn't) assigned.\r\n\r\n<img width=\"1627\" alt=\"accuracy reward with extra columns\" src=\"https://github.com/user-attachments/assets/d7f6e9c2-4d7b-4886-ba7a-f58f0ccfcb9b\" />\r\n\r\n```python\r\nfrom datasets import load_dataset\r\nfrom trl import GRPOConfig, GRPOTrainer\r\nfrom trl.rewards import accuracy_reward\r\n\r\ndataset = load_dataset(\"trl-lib/DeepMath-103K\", split=\"train\")\r\n\r\ntrainer = GRPOTrainer(\r\n    model=\"Qwen/Qwen2-0.5B-Instruct\",\r\n    reward_funcs=accuracy_reward,\r\n    args=GRPOConfig(log_completions=True),\r\n    train_dataset=dataset,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/5308\r\n\r\n### Other\r\n\r\n* Align KTO with DPO: precompute reference log probs at init by @albertvillanova in https://github.com/huggingface/trl/pull/5447\r\n* Align KTO with DPO: reorganize `KTOConfig` by @albertvillanova in https://github.com/huggingface/trl/pull/5477\r\n* Use generic VLM key passthrough in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/5468\r\n* Make images optional in `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5424\r\n* Avoid image deepcopy in `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5475\r\n* Replace `pixel_position_ids` with `image_position_ids` for Gemma 4 support by @qgallouedec in https://github.com/huggingface/trl/pull/5452\r\n* Update vLLM minimum supported version to 0.11.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5443\r\n* Remove dead token attributes from trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5483\r\n* Remove unnecessary `isinstance(part, dict)` checks in image extraction by @qgallouedec in https://github.com/huggingface/trl/pull/5439\r\n* Simplify `_get_tool_suffix_ids` by @qgallouedec in 
https://github.com/huggingface/trl/pull/5440\r\n* Narrow prefix-preserving check to the actual requirement by @qgallouedec in https://github.com/huggingface/trl/pull/5458\r\n* Better test consistency RLOO vs GRPO by @qgallouedec in https://github.com/huggingface/trl/pull/5396\r\n* Remove duplicated `prepare_deepspeed` by @albertvillanova in https://github.com/huggingface/trl/pull/5414\r\n\r\n## Fixes\r\n\r\n* Fix targeting fused parameters with LoRA by @BenjaminBossan in https://github.com/huggingface/trl/pull/5430\r\n* Fix `ImportError` with vllm-0.10.2 in OnlineDPO and OpenEnv by @albertvillanova in https://github.com/huggingface/trl/pull/5423\r\n* Fix `_get_per_token_logps_and_entropies` return type by @kashif in https://github.com/huggingface/trl/pull/5456\r\n* Fix SFT deprecation warning by @albertvillanova in https://github.com/huggingface/trl/pull/5466\r\n* Fix broken validation of user-specified tokens by @albertvillanova in https://github.com/huggingface/trl/pull/5482\r\n* Fix `prepare_multimodal_messages` not normalizing empty string content for assistant/tool roles by @albertvillanova in https://github.com/huggingface/trl/pull/5496\r\n* Remove redundant alignment of `pad_token_id` by @albertvillanova in https://github.com/huggingface/trl/pull/5487\r\n* Replace deprecated `huggingface-cli` references with `hf` by @hanouticelina in https://github.com/huggingface/trl/pull/5486\r\n* Remove unused `truncation_mode` from experimental `truncate_dataset` by @albertvillanova in https://github.com/huggingface/trl/pull/5467\r\n* Fix PR template check bot reopen loop by @qgallouedec in https://github.com/huggingface/trl/pull/5488\r\n\r\n## Deprecations and Removals\r\n\r\n* **Deprecate `keep_end` truncation mode** in `DPOConfig` and `SFTConfig` — will be removed in v2.0.0. Use `keep_start` instead. 
By @albertvillanova in https://github.com/huggingface/trl/pull/5465\r\n* **Deprecate `pad_token` config parameter** in `DPOConfig`, `SFTConfig`, and `RewardConfig` — will be removed in v2.0.0. Set `tokenizer.pad_token` directly on the `processing_class` instead. By @albertvillanova in https://github.com/huggingface/trl/pull/5480\r\n* **Remove `trl.experimental.judges` module and all judge support from trainers.** Judges were experimental, unused in practice, and `llm-blender` (backing `PairRMJudge`) was unmaintained and incompatible with transformers v5 — actively blocking v5 adoption. Everything judges did can be achieved with `reward_funcs`. `OnlineDPOTrainer`, `NashMDTrainer`, and `XPOTrainer` are now unified on reward-model scoring only. By @qgallouedec in https://github.com/huggingface/trl/pull/5485\r\n\r\n## Documentation and Examples\r\n\r\n* Update \"What's New\": TRL v1 blog post by @qgallouedec in https://github.com/huggingface/trl/pull/5385\r\n* New `carla_vlm` OpenEnv example by @sergiopaniego in https://github.com/huggingface/trl/pull/5437\r\n* Add code example for `completion_only_loss` in SFT trainer docs by @RudrenduPaul in https://github.com/huggingface/trl/pull/5494\r\n* Add docs and good defaults for `DistillationTrainer` by @cmpatino in https://github.com/huggingface/trl/pull/5500\r\n* Add test and docs for multimodal tool responses by @qgallouedec in https://github.com/huggingface/trl/pull/5448\r\n* Add tests for Gemma pixel splitting by @qgallouedec in https://github.com/huggingface/trl/pull/5450\r\n* Update docstring about tool messages in `prepare_multimodal_messages` by @albertvillanova in https://github.com/huggingface/trl/pull/5476\r\n* Run `make precommit` to fix docstring style by @albertvillanova in https://github.com/huggingface/trl/pull/5436\r\n\r\n## CI\r\n\r\n* Pin GitHub Actions to commit SHAs by @paulinebm in https://github.com/huggingface/trl/pull/5435\r\n* Update GitHub Action to use specific version of github-script by 
@qgallouedec in https://github.com/huggingface/trl/pull/5491\r\n* Generic device support for CI tests by @kaixuanliu in https://github.com/huggingface/trl/pull/5357\r\n* CI: Gemma 4 support by @qgallouedec in https://github.com/huggingface/trl/pull/5453\r\n* Fix CI slow-tests cannot remove: No such file or directory by @albertvillanova in https://github.com/huggingface/trl/pull/5401\r\n* Remove xfail for Qwen3VL CI tests by @albertvillanova in https://github.com/huggingface/trl/pull/5402\r\n* Fix flaky CI `test_rloo[fsdp2]`: replace non-deterministic xfail with skipif for transformers 5.4.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5403\r\n* Mark as strict the xfail tests with zero3 for RLOO and GRPO by @albertvillanova in https://github.com/huggingface/trl/pull/5404\r\n* Hotfix CI: mark tests as xfail due to missing `input_ids` or `inputs_embeds` by @albertvillanova in https://github.com/huggingface/trl/pull/5422\r\n* Update tests to not pass `eval_strategy` by @SunMarc in https://github.com/huggingface/trl/pull/5426\r\n* Hotfix CI: mark tests as xfail with transformers dev due to `TypeError: 'NoneType' object is not iterable` by @albertvillanova in https://github.com/huggingface/trl/pull/5427\r\n* Revert hotfix CI for `TypeError: 'NoneType' object is not iterable` by @albertvillanova in https://github.com/huggingface/trl/pull/5438\r\n* Update tests with zero3 for RLOO and GRPO as xfail only with transformers >= v5 by @albertvillanova in https://github.com/huggingface/trl/pull/5420\r\n* Hotfix CI: update skipif for `test_rloo[fsdp2]` after transformers 5.5.0 release by @albertvillanova in https://github.com/huggingface/trl/pull/5442\r\n* Remove xfail for ZeRO 2 and 3 + SFT + PEFT test by @qgallouedec in https://github.com/huggingface/trl/pull/5383\r\n* Hotfix CI: mark tests as xfail with transformers dev for Llava models by @albertvillanova in https://github.com/huggingface/trl/pull/5504\r\n* Restrict VLM padding workaround to transformers 
5.3.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5503\r\n\r\n\r\n## What's Changed\r\n* ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/5410\r\n* Update \"What's New\": TRL v1 blog post by @qgallouedec in https://github.com/huggingface/trl/pull/5385\r\n* Fix CI slow-tests cannot remove: No such file or directory by @albertvillanova in https://github.com/huggingface/trl/pull/5401\r\n* Remove xfail for Qwen3VL CI tests by @albertvillanova in https://github.com/huggingface/trl/pull/5402\r\n* Fix flaky CI test_rloo[fsdp2]: Replace non-deterministic xfail with skipif for transformers 5.4.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5403\r\n* Mark as strict the xfail tests with zero3 for RLOO and GRPO by @albertvillanova in https://github.com/huggingface/trl/pull/5404\r\n* Remove duplicated prepare_deepspeed by @albertvillanova in https://github.com/huggingface/trl/pull/5414\r\n* Hotfix CI: Mark tests as xfail due to missing input_ids or inputs_embeds by @albertvillanova in https://github.com/huggingface/trl/pull/5422\r\n* Update tests to not pass `eval_strategy` by @SunMarc in https://github.com/huggingface/trl/pull/5426\r\n* Hotfix CI: Mark tests as xfail with transformers dev due to TypeError: 'NoneType' object is not iterable by @albertvillanova in https://github.com/huggingface/trl/pull/5427\r\n* FIX CI: Targeting fused parameters with LoRA by @BenjaminBossan in https://github.com/huggingface/trl/pull/5430\r\n* Support multimodal tool responses in `environment_factory` for VLM training by @sergiopaniego in https://github.com/huggingface/trl/pull/5323\r\n* 🔒 Pin GitHub Actions to commit SHAs by @paulinebm in https://github.com/huggingface/trl/pull/5435\r\n* New carla vlm example by @sergiopaniego in https://github.com/huggingface/trl/pull/5437\r\n* Revert hotfix CI for TypeError: 'NoneType' object is not iterable by @albertvillanova in https://github.com/huggingface/trl/pull/5438\r\n* Run make 
precommit to fix docstring style by @albertvillanova in https://github.com/huggingface/trl/pull/5436\r\n* Fix ImportError with vllm-0.10.2 in OnlineDPO and OpenEnv by @albertvillanova in https://github.com/huggingface/trl/pull/5423\r\n* Add chunked LM head for memory-efficient log-prob computation  for AsyncGRPOTrainer by @AmineDiro in https://github.com/huggingface/trl/pull/5349\r\n* Update tests with zero3 for RLOO and GRPO as xfail only with transformers >= v5 by @albertvillanova in https://github.com/huggingface/trl/pull/5420\r\n* Make images optional in prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5424\r\n* Hotfix CI: Update skipif for test_rloo[fsdp2] after transformers 5.5.0 release by @albertvillanova in https://github.com/huggingface/trl/pull/5442\r\n* Update vLLM minimum supported version to 0.11.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5443\r\n* Better test consistency RLOO vs GRPO by @qgallouedec in https://github.com/huggingface/trl/pull/5396\r\n* Align KTO with DPO: Precompute reference log probs at init by @albertvillanova in https://github.com/huggingface/trl/pull/5447\r\n* Add support for logging extra columns in reward functions and update related tests by @qgallouedec in https://github.com/huggingface/trl/pull/5308\r\n* Remove unnecessary `isinstance(part, dict)` checks in image extraction by @qgallouedec in https://github.com/huggingface/trl/pull/5439\r\n* Remove xfail for ZeRO 2 and 3 + SFT + PEFT test by @qgallouedec in https://github.com/huggingface/trl/pull/5383\r\n* Replace `pixel_position_ids` with `image_position_ids` for Gemma4 support by @qgallouedec in https://github.com/huggingface/trl/pull/5452\r\n* Add test and docs for multimodal tool responses by @qgallouedec in https://github.com/huggingface/trl/pull/5448\r\n* Add tests for Gemma pixel splitting by @qgallouedec in https://github.com/huggingface/trl/pull/5450\r\n* Generic device support for CI tests by 
@kaixuanliu in https://github.com/huggingface/trl/pull/5357\r\n* Revert speculative argument parsing and add Gemma4 agent support by @qgallouedec in https://github.com/huggingface/trl/pull/5454\r\n* fix _get_per_token_logps_and_entropies return type by @kashif in https://github.com/huggingface/trl/pull/5456\r\n* Deprecate keep_end truncation mode by @albertvillanova in https://github.com/huggingface/trl/pull/5465\r\n* Fix SFT deprecation warning by @albertvillanova in https://github.com/huggingface/trl/pull/5466\r\n* Remove unused truncation_mode from experimental truncate_dataset by @albertvillanova in https://github.com/huggingface/trl/pull/5467\r\n* Use generic VLM key passthrough in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/5468\r\n* Narrow prefix-preserving check to the actual requirement by @qgallouedec in https://github.com/huggingface/trl/pull/5458\r\n* Simplify `_get_tool_suffix_ids` by @qgallouedec in https://github.com/huggingface/trl/pull/5440\r\n* Update docstring about tool messages in prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5476\r\n* CI Gemma 4 support by @qgallouedec in https://github.com/huggingface/trl/pull/5453\r\n* Move chat templates from inline strings to `.jinja` files by @qgallouedec in https://github.com/huggingface/trl/pull/5459\r\n* Align KTO with DPO: Reorganize KTOConfig by @albertvillanova in https://github.com/huggingface/trl/pull/5477\r\n* Add `supports_tool_calling` utility and validate tool support at init by @qgallouedec in https://github.com/huggingface/trl/pull/5462\r\n* Add GPT-OSS tool calling support by @qgallouedec in https://github.com/huggingface/trl/pull/5464\r\n* Add `{% generation %}` support to training chat templates by @qgallouedec in https://github.com/huggingface/trl/pull/5470\r\n* Avoid image deepcopy in prepare_multimodal_messages by @albertvillanova in https://github.com/huggingface/trl/pull/5475\r\n* Remove dead token attributes from 
trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5483\r\n* Add `DistillationTrainer` for efficient on-policy distillation by @cmpatino in https://github.com/huggingface/trl/pull/5407\r\n* Replace deprecated `huggingface-cli` references with `hf` by @hanouticelina in https://github.com/huggingface/trl/pull/5486\r\n* Fix broken validation of user-specified tokens by @albertvillanova in https://github.com/huggingface/trl/pull/5482\r\n* Deprecate pad_token config parameter by @albertvillanova in https://github.com/huggingface/trl/pull/5480\r\n* Remove redundant alignment of pad_token_id by @albertvillanova in https://github.com/huggingface/trl/pull/5487\r\n* Fix PR template check bot reopen loop by @qgallouedec in https://github.com/huggingface/trl/pull/5488\r\n* feat(gpt-oss): Add `{% generation %}` markers for training chat template by @casinca in https://github.com/huggingface/trl/pull/5484\r\n* Remove the `trl.experimental.judges` module and all judge support from trainers by @qgallouedec in https://github.com/huggingface/trl/pull/5485\r\n* Hotfix CI: Mark tests as xfail with transformers dev for Llava models by @albertvillanova in https://github.com/huggingface/trl/pull/5504\r\n* Restrict VLM padding workaround to transformers 5.3.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5503\r\n* Update GitHub Action to use specific version of github-script by @qgallouedec in https://github.com/huggingface/trl/pull/5491\r\n* [docs] Add code example for completion_only_loss in SFT trainer docs by @RudrenduPaul in https://github.com/huggingface/trl/pull/5494\r\n* Fix prepare_multimodal_messages not normalizing empty string content for assistant/tool roles by @albertvillanova in https://github.com/huggingface/trl/pull/5496\r\n* Add trackio support to `DistillationTrainer` by @cmpatino in https://github.com/huggingface/trl/pull/5501\r\n* feat: add Llama 3 training chat template with generation markers by @RudrenduPaul in 
https://github.com/huggingface/trl/pull/5493\r\n* Add GLM-4-MoE tool calling support by @qgallouedec in https://github.com/huggingface/trl/pull/5463\r\n* Add Qwen3-VL tool calling support by @qgallouedec in https://github.com/huggingface/trl/pull/5469\r\n* Add docs and good defaults for `DistillationTrainer` by @cmpatino in https://github.com/huggingface/trl/pull/5500\r\n* feat: add Qwen2.5 training chat template with generation markers by @RudrenduPaul in https://github.com/huggingface/trl/pull/5522\r\n* Release: v1.1 by @qgallouedec in https://github.com/huggingface/trl/pull/5524\r\n\r\n## New Contributors\r\n* @BenjaminBossan made their first contribution in https://github.com/huggingface/trl/pull/5430\r\n* @hanouticelina made their first contribution in https://github.com/huggingface/trl/pull/5486\r\n* @RudrenduPaul made their first contribution in https://github.com/huggingface/trl/pull/5494\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v1.0.0...v1.1.0","publishedAt":"2026-04-12T02:15:56.000Z","url":"https://github.com/huggingface/trl/releases/tag/v1.1.0","media":[]},{"id":"rel_p4jBIBiR7l1UL1yoc7ZSF","version":"v1.0.0","title":"v1.0.0","summary":"<img width=\"1800\" height=\"1013\" alt=\"thumbnail-2\" src=\"https://github.com/user-attachments/assets/5c55b86a-0600-4f70-bf37-41ab240af851\" />\r\n\r\nRead our...","content":"<img width=\"1800\" height=\"1013\" alt=\"thumbnail-2\" src=\"https://github.com/user-attachments/assets/5c55b86a-0600-4f70-bf37-41ab240af851\" />\r\n\r\nRead our [blog post](https://hf.co/blog/trl-v1) for an overview of TRL v1.\r\n\r\n## Features\r\n\r\n### Asynchronous GRPO\r\n\r\nAsynchronous GRPO decouples generation from the gradient update loop by offloading rollouts to an external vLLM server. 
Generation runs in parallel while training continues, eliminating idle GPU time and improving hardware utilization.\r\n\r\n```python\r\nfrom trl.experimental.async_grpo import AsyncGRPOTrainer\r\nfrom trl.rewards import accuracy_reward\r\nfrom datasets import load_dataset\r\n\r\ndataset = load_dataset(\"trl-lib/DeepMath-103K\", split=\"train\")\r\n\r\ntrainer = AsyncGRPOTrainer(\r\n    model=\"Qwen/Qwen2.5-0.5B-Instruct\",\r\n    reward_funcs=accuracy_reward,\r\n    train_dataset=dataset,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/5293\r\n\r\n### Variational Sequence-Level Soft Policy Optimization (VESPO)\r\n\r\n<img width=\"465\" height=\"279\" alt=\"Screenshot 2026-03-20 at 5 49 50 PM\" src=\"https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa\" />\r\n\r\n[VESPO](https://huggingface.co/papers/2602.10693) addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. 
It can be enabled via the `loss_type` parameter of `GRPOConfig`:\r\n\r\n```python\r\nfrom trl import GRPOConfig, GRPOTrainer\r\n\r\ntrainer = GRPOTrainer(\r\n    model=\"Qwen/Qwen3-0.6B\",\r\n    args=GRPOConfig(loss_type=\"vespo\"),\r\n    ...\r\n)\r\n```\r\n\r\nby @casinca in https://github.com/huggingface/trl/pull/5199\r\n\r\n### Divergence Proximal Policy Optimization (DPPO)\r\n\r\n<img width=\"3180\" height=\"1187\" alt=\"z_TXYw37xZqsQ21YiDkYL\" src=\"https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b\" />\r\n<img width=\"1189\" height=\"490\" alt=\"SfgWotuuuRKPkg-0bxWv1\" src=\"https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689\" />\r\n\r\n[DPPO](https://huggingface.co/papers/2602.04879) is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.\r\n\r\nby @LeonEricsson in https://github.com/huggingface/trl/pull/5117\r\n\r\n### Self-Distillation Policy Optimization (SDPO)\r\n\r\n[SDPO](https://huggingface.co/papers/2601.20802) is a new experimental trainer that augments on-policy RL with self-distillation from the model's own high-reward trajectories. 
Instead of using an external teacher, SDPO treats the current model conditioned on feedback as a self-teacher, distilling its feedback-informed predictions back into the policy.\r\n\r\n```python\r\nfrom datasets import load_dataset\r\nfrom trl.experimental import SDPOTrainer, SDPOConfig\r\nfrom trl.rewards import accuracy_reward\r\n\r\n# Any prompt dataset works here; DeepMath-103K is used as in the Async GRPO example\r\ndataset = load_dataset(\"trl-lib/DeepMath-103K\", split=\"train\")\r\n\r\nconfig = SDPOConfig(\r\n    output_dir=\"./results\",\r\n    num_generations=8,\r\n    success_reward_threshold=1.0,\r\n    use_successful_as_teacher=True,\r\n)\r\n\r\ntrainer = SDPOTrainer(\r\n    model=\"Qwen/Qwen2.5-Math-1.5B-Instruct\",\r\n    reward_funcs=[accuracy_reward],\r\n    args=config,\r\n    train_dataset=dataset,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @MengAiDev in https://github.com/huggingface/trl/pull/4935\r\n\r\n### Reward functions can now log extra columns and scalar metrics\r\n\r\nReward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. This makes it easier to track intermediate signals without writing custom callbacks.\r\n\r\n```python\r\ndef my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):\r\n    extracted = [extract_answer(c) for c in completions]\r\n    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]\r\n\r\n    if log_extra:\r\n        log_extra(\"golden_answer\", list(answer))\r\n        log_extra(\"extracted_answer\", extracted)\r\n\r\n    if log_metric:\r\n        log_metric(\"accuracy\", sum(rewards) / len(rewards))\r\n\r\n    return rewards\r\n```\r\n\r\n<img width=\"1400\" height=\"407\" alt=\"image\" src=\"https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4\" />\r\n<img width=\"1353\" height=\"673\" alt=\"image\" src=\"https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299\" />\r\n\r\nby @manueldeprada in https://github.com/huggingface/trl/pull/5233\r\n\r\n### Tool calling support in `VLLMClient.chat()`\r\n\r\n`VLLMClient.chat()` now supports tool calling, enabling agentic workflows directly through the 
vLLM client interface.\r\n\r\nby @kansalaman in https://github.com/huggingface/trl/pull/4889\r\n\r\n### 35% faster packing\r\n\r\nBFD packing is 35% faster. The `\"bfd-requeue\"` packing strategy has also been renamed to `\"bfd_split\"`. See [MIGRATION.md](MIGRATION.md) for details.\r\n\r\n<img width=\"1784\" height=\"732\" alt=\"benchmark_results\" src=\"https://github.com/user-attachments/assets/8f0a35ad-cf1a-4fe1-a1f4-9b102637bdca\" />\r\n\r\nby @mariosasko in https://github.com/huggingface/trl/pull/5189\r\n\r\n### [GKD] Buffer implementation and vLLM inference for distillation trainer\r\n\r\nThe GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation. vLLM inference support has also been added to the base self-distillation trainer.\r\n\r\nby @cmpatino in https://github.com/huggingface/trl/pull/5137 and https://github.com/huggingface/trl/pull/5388\r\n\r\n### v0 → v1 migration guide\r\n\r\nA [`MIGRATION.md`](MIGRATION.md) guide has been added covering all breaking changes when upgrading from TRL v0 to v1. 
If you're already on v0.29, the changes are minimal.\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/5255\r\n\r\n### Other\r\n\r\n* Change default `vllm_mode` to `\"colocate\"` by @qgallouedec in https://github.com/huggingface/trl/pull/5255\r\n* Support `truncation_mode` in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306\r\n* Support `max_length` in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284\r\n* Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180\r\n* Support sequence sampling in Liger Kernel by @michaelroyzen in https://github.com/huggingface/trl/pull/5190\r\n* Add tool calling support to VLLMClient.chat() by @kansalaman in https://github.com/huggingface/trl/pull/4889\r\n* Add support for raw token IDs in vLLM client prompts by @qgallouedec in https://github.com/huggingface/trl/pull/5225\r\n* Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in https://github.com/huggingface/trl/pull/5227\r\n* Enhance `print_prompt_completions_sample` to include reasoning content by @qgallouedec in https://github.com/huggingface/trl/pull/5327\r\n* Add support for `pixel_position_ids` vision key by @qgallouedec in https://github.com/huggingface/trl/pull/5374\r\n* Add second version of Qwen 3.5 chat template by @apardyl in https://github.com/huggingface/trl/pull/5405\r\n* Pass tools as `None` to `apply_chat_template` when it is an empty list by @rabinadk1 in https://github.com/huggingface/trl/pull/5380\r\n\r\n## Fixes\r\n\r\n* Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in https://github.com/huggingface/trl/pull/5305\r\n* Prevent corruption of DPO VLM training if \"keep_end\" truncation_mode by @albertvillanova in https://github.com/huggingface/trl/pull/5286\r\n* Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in 
https://github.com/huggingface/trl/pull/5279\r\n* Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in https://github.com/huggingface/trl/pull/5295\r\n* Fix `accuracy_reward` crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281\r\n* Fix GRPOTrainer attribute access for vLLM model config by @falcondai in https://github.com/huggingface/trl/pull/5302\r\n* [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in https://github.com/huggingface/trl/pull/5242\r\n* [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in https://github.com/huggingface/trl/pull/4639\r\n* Fix `RewardFunc` type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246\r\n* fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in https://github.com/huggingface/trl/pull/5245\r\n* Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in https://github.com/huggingface/trl/pull/5212\r\n* Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5230\r\n* Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5274\r\n* Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5266\r\n* Sync entire prompt/completion token tensors before indexing by @shawnghu in https://github.com/huggingface/trl/pull/5218\r\n* Clean up model update group on worker exit by @AmineDiro in https://github.com/huggingface/trl/pull/5325\r\n* Fix prefix EOS slicing for tool suffix (with Qwen3/3.5 chat templates) by @casinca in https://github.com/huggingface/trl/pull/5330\r\n* Fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by 
@lailanelkoussy in https://github.com/huggingface/trl/pull/5353\r\n* Fix IDs shape mismatch in SFT for VLMs with text-only by @albertvillanova in https://github.com/huggingface/trl/pull/5354\r\n\r\n## Documentation and Examples\r\n\r\n* Add minimal CARLA example script by @sergiopaniego in https://github.com/huggingface/trl/pull/5161\r\n* Nemotron 3 examples added by @sergiopaniego in https://github.com/huggingface/trl/pull/5272\r\n* Align docs about tool calling in trainers with dataset format by @albertvillanova in https://github.com/huggingface/trl/pull/5311\r\n* Add repository-specific guidance for agents (`AGENTS.md`) by @qgallouedec in https://github.com/huggingface/trl/pull/5236\r\n* Align documentation with the intended public API by @qgallouedec in https://github.com/huggingface/trl/pull/5162\r\n* Update openenv examples to use `environment_factory` by @sergiopaniego in https://github.com/huggingface/trl/pull/5235\r\n* Add \"It Takes Two: Your GRPO Is Secretly DPO\" paper to GRPOTrainer by @DhruvvArora in https://github.com/huggingface/trl/pull/5347\r\n* Centralize AI agent templates in `.ai` by @qgallouedec in https://github.com/huggingface/trl/pull/5268\r\n\r\n## What's Changed\r\n\r\n* ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/5182\r\n* Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in https://github.com/huggingface/trl/pull/5178\r\n* Document parameters with differing default values in core configs by @albertvillanova in https://github.com/huggingface/trl/pull/5168\r\n* Make _BaseConfig and _BaseTrainer explicitly private by @albertvillanova in https://github.com/huggingface/trl/pull/5169\r\n* Refactor CLI [4/N]: Replace top-level TrlParser with ArgumentParser by @albertvillanova in https://github.com/huggingface/trl/pull/5170\r\n* Add minimal CARLA example script by @sergiopaniego in https://github.com/huggingface/trl/pull/5161\r\n* Align documentation with the intended public API 
by @qgallouedec in https://github.com/huggingface/trl/pull/5162\r\n* Fix deprecation warning of create_reference_model by @albertvillanova in https://github.com/huggingface/trl/pull/5184\r\n* Fix deprecation warning of fork in multi-threaded process by @albertvillanova in https://github.com/huggingface/trl/pull/5185\r\n* Refactor CLI [5/N]: Refactor TrainingCommand with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5186\r\n* Refactor CLI [6/N]: Refactor env/vllm-serve commands with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5187\r\n* Fix CI tests patching BaseTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/5192\r\n* Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180\r\n* Re-add liger-kernel to dev deps by @qgallouedec in https://github.com/huggingface/trl/pull/5164\r\n* Set CI PYTORCH_ALLOC_CONF env variable to avoid OOM by @albertvillanova in https://github.com/huggingface/trl/pull/5197\r\n* Support sequence sampling in Liger Kernel and pass importance_samplin… by @michaelroyzen in https://github.com/huggingface/trl/pull/5190\r\n* Mark CI test_training_vlm_and_liger as xfail by @albertvillanova in https://github.com/huggingface/trl/pull/5202\r\n* Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in https://github.com/huggingface/trl/pull/5122\r\n* CI: Add Qwen 3.5 tiny model to tests by @qgallouedec in https://github.com/huggingface/trl/pull/5204\r\n* Add support for Qwen3.5 for agent training by @qgallouedec in https://github.com/huggingface/trl/pull/5205\r\n* Update vLLM version support to include 0.13.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5206\r\n* feat: Add tool calling support to VLLMClient.chat() by @kansalaman in https://github.com/huggingface/trl/pull/4889\r\n* Refactor CLI [7/N]: Move patching to compat and import transformers 
conditionally by @albertvillanova in https://github.com/huggingface/trl/pull/5208\r\n* Update vLLM version support to include 0.14.0 and 0.14.1 by @qgallouedec in https://github.com/huggingface/trl/pull/5214\r\n* Refactor CLI [8/N]: Refactor scripts/utils with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5209\r\n* Simplify logic for structured outputs across vLLM versions by @albertvillanova in https://github.com/huggingface/trl/pull/5215\r\n* Refactor CLI [9/N]: Replace HfArgumentParser from transformers with local by @albertvillanova in https://github.com/huggingface/trl/pull/5210\r\n* Refactor CLI [10/N]: Refactor scripts with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5219\r\n* Refactor CLI [11/N]: Refactor scripts/vllm_serve with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5220\r\n* Refactor CLI [12/N]: Fix command name in scripts help usage by @albertvillanova in https://github.com/huggingface/trl/pull/5221\r\n* Refactor CLI [13/N]: Pass clean training args to scripts by @albertvillanova in https://github.com/huggingface/trl/pull/5223\r\n* Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in https://github.com/huggingface/trl/pull/5212\r\n* Fix link to Hugging Face Hub in OpenEnv documentation by @thesteve0 in https://github.com/huggingface/trl/pull/5229\r\n* Fix type for model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5230\r\n* Add repository-specific guidance for agents (`AGENTS.md`) by @qgallouedec in https://github.com/huggingface/trl/pull/5236\r\n* Add support for raw ids in `prompts` in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225\r\n* Deprecate `truncate_prompt_tokens` for vLLM 0.17.0 by @winglian in https://github.com/huggingface/trl/pull/5248\r\n* Add VLM support when passing raw token IDs to vLLM client 
by @qgallouedec in https://github.com/huggingface/trl/pull/5227\r\n* Move `rollout_func` from `_generate_single_turn` to `_generate` by @qgallouedec in https://github.com/huggingface/trl/pull/5232\r\n* Fix `RewardFunc` type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246\r\n* [GRPO] In-place temperature scaling operation by @winglian in https://github.com/huggingface/trl/pull/5254\r\n* Update vLLM version support to 0.15.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5251\r\n* Sync entire prompt/completion token tensors before indexing by @shawnghu in https://github.com/huggingface/trl/pull/5218\r\n* Update vLLM version support to 0.16.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5252\r\n* Update vLLM version support to 0.17.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5253\r\n* [GRPO/RLOO] Tokenize before vLLM generation call by @qgallouedec in https://github.com/huggingface/trl/pull/5238\r\n* Refactor CLI [14/N] : Remove TrainingArguments import from core trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5257\r\n* Support JSON string parsing of teacher_model_init_kwargs in MiniLLMConfig by @albertvillanova in https://github.com/huggingface/trl/pull/5259\r\n* Fix typo in docstring for teacher_model_init_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/5260\r\n* Remove extra_fields dead code [1/N]: Remove extra_fields handling from VLLMGeneration.generate by @albertvillanova in https://github.com/huggingface/trl/pull/5262\r\n* [GRPO/RLOO] Unify tokenization across all generation backends in `_generate_single_turn` by @qgallouedec in https://github.com/huggingface/trl/pull/5239\r\n* Remove extra_fields dead code [2/N]: Remove extra_fields from VLLMGeneration.generate return value by @albertvillanova in https://github.com/huggingface/trl/pull/5263\r\n* Remove extra_fields dead code [3/N]: Remove extra_fields from 
GRPOTrainer._generate_single_turn return value by @albertvillanova in https://github.com/huggingface/trl/pull/5264\r\n* fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in https://github.com/huggingface/trl/pull/5245\r\n* [GRPO/RLOO] Extract tokenize prompts from `_generate_single_turn` by @qgallouedec in https://github.com/huggingface/trl/pull/5240\r\n* [CPO/ORPO] Fix handling of different length chosen/rejected prompts. by @davmels in https://github.com/huggingface/trl/pull/4639\r\n* Fix type for teacher_model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5258\r\n* Align GOLDConfig docstrings for optional params with None default by @albertvillanova in https://github.com/huggingface/trl/pull/5261\r\n* Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5266\r\n* Update TRL banner to support light/dark mode by @qgallouedec in https://github.com/huggingface/trl/pull/5270\r\n* Fix error message in OnlineDPO by @qgallouedec in https://github.com/huggingface/trl/pull/5237\r\n* Fix title consistency from \"Transformer Reinforcement Learning\" to \"Transformers Reinforcement Learning\" by @qgallouedec in https://github.com/huggingface/trl/pull/5183\r\n* Nemotron 3 examples added by @sergiopaniego in https://github.com/huggingface/trl/pull/5272\r\n* Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5279\r\n* Simplify get_train_dataloader in GRPO and RLOO by @albertvillanova in https://github.com/huggingface/trl/pull/5276\r\n* Raise ValueError for None train_dataset in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5275\r\n* 35% faster packing + rename `bfd-requeue` to `bfd_split` by @mariosasko in https://github.com/huggingface/trl/pull/5189\r\n* Change default `vllm_mode` to 
`\"colocate\"` and add v0→v1 migration guide by @qgallouedec in https://github.com/huggingface/trl/pull/5255\r\n* Allow nullable logprobs in vLLM serve responses  by @LeonEricsson in https://github.com/huggingface/trl/pull/5203\r\n* feat(`grpo_trainer.py`): Variational Sequence-Level Soft Policy Optimization (VESPO) by @casinca in https://github.com/huggingface/trl/pull/5199\r\n* Simplify structured outputs logic across vLLM versions in scripts/vllm_serve by @albertvillanova in https://github.com/huggingface/trl/pull/5273\r\n* Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5274\r\n* Fix `accuracy_reward` crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281\r\n* Remove TrainingArguments import from experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5290\r\n* Remove custom get_train/eval_dataloader from OnlineDPO by @albertvillanova in https://github.com/huggingface/trl/pull/5291\r\n* [GKD] Buffer Implementation for Distillation Trainer by @cmpatino in https://github.com/huggingface/trl/pull/5137\r\n* Support max_length in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284\r\n* Prevent corruption of DPO VLM training if \"keep_end\" truncation_mode by @albertvillanova in https://github.com/huggingface/trl/pull/5286\r\n* Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in https://github.com/huggingface/trl/pull/5295\r\n* Apply docstyle by @qgallouedec in https://github.com/huggingface/trl/pull/5296\r\n* Add guidance to avoid `hasattr` and `getattr` with defaults in `AGENTS.md` by @qgallouedec in https://github.com/huggingface/trl/pull/5294\r\n* Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in https://github.com/huggingface/trl/pull/5305\r\n* Update `RewardFunc` type annotation 
to allow `None` values in reward list by @qgallouedec in https://github.com/huggingface/trl/pull/5297\r\n* Suggest the `Json()` type for tool calling dataset format by @lhoestq in https://github.com/huggingface/trl/pull/5307\r\n* Allow reward functions to log extra columns and scalar metrics by @manueldeprada in https://github.com/huggingface/trl/pull/5233\r\n* Fix GRPOTrainer attribute access for vLLM model config by @falcondai in https://github.com/huggingface/trl/pull/5302\r\n* Support truncation_mode in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306\r\n* 🔌 Asynchronous GRPO by @qgallouedec in https://github.com/huggingface/trl/pull/5293\r\n* Fix datasets version supporting Json dtype in docs about tool calling dataset format by @albertvillanova in https://github.com/huggingface/trl/pull/5310\r\n* Align docs about tool calling in trainers with dataset format by @albertvillanova in https://github.com/huggingface/trl/pull/5311\r\n* [GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs by @qgallouedec in https://github.com/huggingface/trl/pull/5242\r\n* feat(experimental): Divergence Proximal Policy Optimization by @LeonEricsson in https://github.com/huggingface/trl/pull/5117\r\n* Clean up model update group on worker exit by @AmineDiro in https://github.com/huggingface/trl/pull/5325\r\n* Fix style in DPPO docstrings by @albertvillanova in https://github.com/huggingface/trl/pull/5326\r\n* `GRPOTrainer/async`: fix prefix EOS slicing for tool suffix (with Qwen3/3.5 type of chat templates) by @casinca in https://github.com/huggingface/trl/pull/5330\r\n* refactor(async_rollout_worker): renamed tool variables to mirror `grpo_trainer.py` by @casinca in https://github.com/huggingface/trl/pull/5332\r\n* Add truncation to SFT DataCollatorForLanguageModeling by @albertvillanova in https://github.com/huggingface/trl/pull/5315\r\n* Add SDPO (Self-Distillation Policy Optimization) trainer by @MengAiDev in 
https://github.com/huggingface/trl/pull/4935\r\n* Update openenv examples to use `environment_factory` by @sergiopaniego in https://github.com/huggingface/trl/pull/5235\r\n* Enhance `print_prompt_completions_sample` to include reasoning content by @qgallouedec in https://github.com/huggingface/trl/pull/5327\r\n* Add Cursor Bugbot rules from `AGENTS.md` by @qgallouedec in https://github.com/huggingface/trl/pull/5280\r\n* Change model dtype from bfloat16 to float32 in AsyncGRPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5333\r\n* docs: Add \"It Takes Two: Your GRPO Is Secretly DPO\" paper to GRPOTrainer by @DhruvvArora in https://github.com/huggingface/trl/pull/5347\r\n* fix: apply reward_weights to logged reward/reward_std in GRPOTrainer by @lailanelkoussy in https://github.com/huggingface/trl/pull/5353\r\n* Remove post-collation truncation from DPO by @albertvillanova in https://github.com/huggingface/trl/pull/5350\r\n* Remove unused flush_right by @albertvillanova in https://github.com/huggingface/trl/pull/5358\r\n* Fix IDs shape mismatch in SFT for VLMs with text-only by @albertvillanova in https://github.com/huggingface/trl/pull/5354\r\n* Remove post-collation truncation from SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5359\r\n* Simplify DPO DataCollatorForPreference by @albertvillanova in https://github.com/huggingface/trl/pull/5362\r\n* Simplify SFT tokenization by @albertvillanova in https://github.com/huggingface/trl/pull/5363\r\n* Simplify SFT DataCollatorForLanguageModeling by @albertvillanova in https://github.com/huggingface/trl/pull/5360\r\n* Use BaseConfig post_init in experimental KTO and MiniLLM configs by @albertvillanova in https://github.com/huggingface/trl/pull/5371\r\n* Move truncate_dataset to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5370\r\n* Simplify DPO tokenization by @albertvillanova in https://github.com/huggingface/trl/pull/5369\r\n* Kd vllm generation by 
@cmpatino in https://github.com/huggingface/trl/pull/5351\r\n* Adds support for the `pixel_position_ids` vision key by @qgallouedec in https://github.com/huggingface/trl/pull/5374\r\n* Minor diff reduction between RLOO and GRPO by @qgallouedec in https://github.com/huggingface/trl/pull/5368\r\n* Remove requirements.txt by @albertvillanova in https://github.com/huggingface/trl/pull/5377\r\n* Remove dead truncation_mode from experimental BCO, CPO and ORPO by @albertvillanova in https://github.com/huggingface/trl/pull/5378\r\n* Centralize AI agent templates in `.ai` by @qgallouedec in https://github.com/huggingface/trl/pull/5268\r\n* Pass tools as None to `apply_chat_template` when it is an empty list by @rabinadk1 in https://github.com/huggingface/trl/pull/5380\r\n* Require datasets>=4.7.0 for Json dtype to prevent insertion of None values by @albertvillanova in https://github.com/huggingface/trl/pull/5376\r\n* Remove deprecated `TRACKIO_SPACE_ID` env var from all scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/5365\r\n* Mark test_rloo[fsdp2] as xfail for transformers 5.4.0 by @albertvillanova in https://github.com/huggingface/trl/pull/5387\r\n* Enforce PR template for first-time contributors and document AI usage policy by @qgallouedec in https://github.com/huggingface/trl/pull/5356\r\n* Enhance PR template check to exclude reopened PRs from first-time contributor validation by @qgallouedec in https://github.com/huggingface/trl/pull/5392\r\n* chore: update `pr_template_check.yml` by @qgallouedec in https://github.com/huggingface/trl/pull/5393\r\n* Move `disable_config=True` from `generate` to `GenerationConfig` by @qgallouedec in https://github.com/huggingface/trl/pull/5384\r\n* Add vLLM inference to the Base Self-Distillation Trainer by @cmpatino in https://github.com/huggingface/trl/pull/5388\r\n* Add HF_TOKEN environment variable to workflow files by @qgallouedec in https://github.com/huggingface/trl/pull/5397\r\n* Add second version of Qwen 
3.5 chat template to chat_template_utils by @apardyl in https://github.com/huggingface/trl/pull/5405\r\n* Release: v1.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5409\r\n\r\n## New Contributors\r\n\r\n* @czkkkkkk made their first contribution in https://github.com/huggingface/trl/pull/5180\r\n* @michaelroyzen made their first contribution in https://github.com/huggingface/trl/pull/5190\r\n* @thesteve0 made their first contribution in https://github.com/huggingface/trl/pull/5229\r\n* @s-zx made their first contribution in https://github.com/huggingface/trl/pull/5246\r\n* @shawnghu made their first contribution in https://github.com/huggingface/trl/pull/5218\r\n* @davmels made their first contribution in https://github.com/huggingface/trl/pull/4639\r\n* @manueldeprada made their first contribution in https://github.com/huggingface/trl/pull/5233\r\n* @falcondai made their first contribution in https://github.com/huggingface/trl/pull/5302\r\n* @AmineDiro made their first contribution in https://github.com/huggingface/trl/pull/5325\r\n* @DhruvvArora made their first contribution in https://github.com/huggingface/trl/pull/5347\r\n* @lailanelkoussy made their first contribution in https://github.com/huggingface/trl/pull/5353\r\n* @rabinadk1 made their first contribution in https://github.com/huggingface/trl/pull/5380\r\n* @apardyl made their first contribution in https://github.com/huggingface/trl/pull/5405\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0\r\n","publishedAt":"2026-03-31T14:15:06.000Z","url":"https://github.com/huggingface/trl/releases/tag/v1.0.0","media":[]},{"id":"rel_uH5Du5_0GLzI_59w8HPxR","version":"v1.0.0rc1","title":"v1.0.0rc1","summary":"## Features\r\n\r\n### Variational Sequence-Level Soft Policy Optimization (VESPO)\r\n\r\n<img width=\"465\" height=\"279\" alt=\"Screenshot 2026-03-20 at 5 49 50 ...","content":"## Features\r\n\r\n### Variational Sequence-Level Soft Policy Optimization 
(VESPO)\r\n\r\n<img width=\"465\" height=\"279\" alt=\"Screenshot 2026-03-20 at 5 49 50 PM\" src=\"https://github.com/user-attachments/assets/b60c9697-6eb7-498e-95b3-df78c367f5fa\" />\r\n\r\n[VESPO](https://huggingface.co/papers/2602.10693) addresses training instability in off-policy RL caused by policy staleness, asynchronous updates, and train-inference mismatches. Rather than relying on heuristic token-level clipping (GRPO) or sequence-length normalization (GSPO), VESPO derives a principled reshaping kernel from a variational framework. In practice, this yields a smooth, asymmetric Gamma weighting function that gracefully suppresses extreme sequence-level importance weights without introducing length bias. It can be enabled via the `loss_type` parameter of `GRPOConfig`:\r\n\r\n```python\r\nfrom trl import GRPOConfig, GRPOTrainer\r\n\r\ntrainer = GRPOTrainer(\r\n    model=\"Qwen/Qwen3-0.6B\",\r\n    args=GRPOConfig(loss_type=\"vespo\"),\r\n    ...\r\n)\r\n```\r\n\r\nby @casinca in https://github.com/huggingface/trl/pull/5199\r\n\r\n### Divergence Proximal Policy Optimization (DPPO)\r\n\r\n<img width=\"3180\" height=\"1187\" alt=\"z_TXYw37xZqsQ21YiDkYL\" src=\"https://github.com/user-attachments/assets/40f1d538-82b3-4097-91c6-119ea9f7797b\" />\r\n<img width=\"1189\" height=\"490\" alt=\"SfgWotuuuRKPkg-0bxWv1\" src=\"https://github.com/user-attachments/assets/2b090df3-0bfb-42e4-9f94-15943736e689\" />\r\n\r\n[DPPO](https://huggingface.co/papers/2602.04879) is a new experimental trainer that replaces the standard PPO clipping mechanism with divergence constraints, providing more principled trust-region updates.\r\n\r\nby @LeonEricsson in https://github.com/huggingface/trl/pull/5117\r\n\r\n### Reward functions can now log extra columns and scalar metrics\r\n\r\nReward functions can return a dictionary of extra values (scalars or per-sample columns) that will be logged alongside the reward. 
This makes it easier to track intermediate signals without writing custom callbacks.\r\n\r\n```python\r\ndef my_reward_fn(completions, answer, log_extra=None, log_metric=None, **kwargs):\r\n    extracted = [extract_answer(c) for c in completions]\r\n    rewards = [1.0 if e == a else 0.0 for e, a in zip(extracted, answer)]\r\n\r\n    if log_extra:\r\n        log_extra(\"golden_answer\", list(answer))\r\n        log_extra(\"extracted_answer\", extracted)\r\n\r\n    if log_metric:\r\n        log_metric(\"accuracy\", sum(rewards) / len(rewards))\r\n\r\n    return rewards\r\n```\r\n\r\n<img width=\"1400\" height=\"407\" alt=\"image\" src=\"https://github.com/user-attachments/assets/d345b0ac-0d3c-446f-9321-a26e73ee16b4\" />\r\n<img width=\"1353\" height=\"673\" alt=\"image\" src=\"https://github.com/user-attachments/assets/b4c0302b-f69a-4715-9aad-278b4ad13299\" />\r\n\r\nby @manueldeprada in https://github.com/huggingface/trl/pull/5233\r\n\r\n### Tool calling support in `VLLMClient.chat()`\r\n\r\n`VLLMClient.chat()` now supports tool calling, enabling agentic workflows directly through the vLLM client interface.\r\n\r\nby @kansalaman in https://github.com/huggingface/trl/pull/4889\r\n\r\n### 35% faster packing\r\n\r\nBFD packing is 35% faster. The `\"bfd-requeue\"` packing strategy has also been renamed to `\"bfd_split\"`. 
See [MIGRATION.md](MIGRATION.md) for details.\r\n\r\n<img width=\"1784\" height=\"732\" alt=\"benchmark_results\" src=\"https://github.com/user-attachments/assets/8f0a35ad-cf1a-4fe1-a1f4-9b102637bdca\" />\r\n\r\n\r\nby @mariosasko in https://github.com/huggingface/trl/pull/5189\r\n\r\n### [GKD] Buffer implementation for distillation trainer\r\n\r\nThe GKD/GOLD trainer now supports buffered rollout generation, decoupling generation from gradient updates for more efficient distillation.\r\n\r\nby @cmpatino in https://github.com/huggingface/trl/pull/5137\r\n\r\n### v0 → v1 migration guide\r\n\r\nA [`MIGRATION.md`](MIGRATION.md) guide has been added covering all breaking changes when upgrading from TRL v0 to v1. If you're already on v0.29, the changes are minimal.\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/5255\r\n\r\n### Other\r\n\r\n* Change default `vllm_mode` to `\"colocate\"` by @qgallouedec in https://github.com/huggingface/trl/pull/5255\r\n* Support `truncation_mode` in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306\r\n* Support `max_length` in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284\r\n* Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180\r\n* Support sequence sampling in Liger Kernel by @michaelroyzen in https://github.com/huggingface/trl/pull/5190\r\n* Add tool calling support to VLLMClient.chat() by @kansalaman in https://github.com/huggingface/trl/pull/4889\r\n* Add support for raw token IDs in vLLM client prompts by @qgallouedec in https://github.com/huggingface/trl/pull/5225\r\n* Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in https://github.com/huggingface/trl/pull/5227\r\n\r\n## Fixes\r\n\r\n* Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in https://github.com/huggingface/trl/pull/5305\r\n* Prevent corruption of DPO VLM training if 
\"keep_end\" truncation_mode by @albertvillanova in https://github.com/huggingface/trl/pull/5286\r\n* Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5279\r\n* Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in https://github.com/huggingface/trl/pull/5295\r\n* Fix `accuracy_reward` crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281\r\n* Fix GRPOTrainer attribute access for vLLM model config by @falcondai in https://github.com/huggingface/trl/pull/5302\r\n* [GRPO] Fix re-tokenization bug in tool-calling loop by @qgallouedec in https://github.com/huggingface/trl/pull/5242\r\n* [CPO/ORPO] Fix handling of different length chosen/rejected prompts by @davmels in https://github.com/huggingface/trl/pull/4639\r\n* Fix `RewardFunc` type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246\r\n* fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in https://github.com/huggingface/trl/pull/5245\r\n* Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in https://github.com/huggingface/trl/pull/5212\r\n* Fix support for model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5230\r\n* Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5274\r\n* Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5266\r\n* Sync entire prompt/completion token tensors before indexing by @shawnghu in https://github.com/huggingface/trl/pull/5218\r\n* Clean up model update group on worker exit by @AmineDiro in https://github.com/huggingface/trl/pull/5325\r\n\r\n## Documentation and 
Examples\r\n\r\n* Add minimal CARLA example script by @sergiopaniego in https://github.com/huggingface/trl/pull/5161\r\n* Nemotron 3 examples added by @sergiopaniego in https://github.com/huggingface/trl/pull/5272\r\n* Align docs about tool calling in trainers with dataset format by @albertvillanova in https://github.com/huggingface/trl/pull/5311\r\n* Add repository-specific guidance for agents (`AGENTS.md`) by @qgallouedec in https://github.com/huggingface/trl/pull/5236\r\n* Align documentation with the intended public API by @qgallouedec in https://github.com/huggingface/trl/pull/5162\r\n\r\n## What's Changed\r\n\r\n* ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/5182\r\n* Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in https://github.com/huggingface/trl/pull/5178\r\n* Document parameters with differing default values in core configs by @albertvillanova in https://github.com/huggingface/trl/pull/5168\r\n* Make _BaseConfig and _BaseTrainer explicitly private by @albertvillanova in https://github.com/huggingface/trl/pull/5169\r\n* Refactor CLI [4/N]: Replace top-level TrlParser with ArgumentParser by @albertvillanova in https://github.com/huggingface/trl/pull/5170\r\n* Add minimal CARLA example script by @sergiopaniego in https://github.com/huggingface/trl/pull/5161\r\n* Align documentation with the intended public API by @qgallouedec in https://github.com/huggingface/trl/pull/5162\r\n* Fix deprecation warning of create_reference_model by @albertvillanova in https://github.com/huggingface/trl/pull/5184\r\n* Fix deprecation warning of fork in multi-threaded process by @albertvillanova in https://github.com/huggingface/trl/pull/5185\r\n* Refactor CLI [5/N]: Refactor TrainingCommand with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5186\r\n* Refactor CLI [6/N]: Refactor env/vllm-serve commands with delayed imports by @albertvillanova in 
https://github.com/huggingface/trl/pull/5187\r\n* Fix CI tests patching BaseTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/5192\r\n* Add `pad_to_multiple_of` to GRPOTrainer and RLOOTrainer by @czkkkkkk in https://github.com/huggingface/trl/pull/5180\r\n* Re-add liger-kernel to dev deps by @qgallouedec in https://github.com/huggingface/trl/pull/5164\r\n* Set CI PYTORCH_ALLOC_CONF env variable to avoid OOM by @albertvillanova in https://github.com/huggingface/trl/pull/5197\r\n* Support sequence sampling in Liger Kernel and pass importance_samplin… by @michaelroyzen in https://github.com/huggingface/trl/pull/5190\r\n* Mark CI test_training_vlm_and_liger as xfail by @albertvillanova in https://github.com/huggingface/trl/pull/5202\r\n* Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in https://github.com/huggingface/trl/pull/5122\r\n* CI: Add Qwen 3.5 tiny model to tests by @qgallouedec in https://github.com/huggingface/trl/pull/5204\r\n* Add support for Qwen3.5 for agent training by @qgallouedec in https://github.com/huggingface/trl/pull/5205\r\n* Update vLLM version support to include 0.13.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5206\r\n* feat: Add tool calling support to VLLMClient.chat() by @kansalaman in https://github.com/huggingface/trl/pull/4889\r\n* Refactor CLI [7/N]: Move patching to compat and import transformers conditionally by @albertvillanova in https://github.com/huggingface/trl/pull/5208\r\n* Update vLLM version support to include 0.14.0 and 0.14.1 by @qgallouedec in https://github.com/huggingface/trl/pull/5214\r\n* Refactor CLI [8/N]: Refactor scripts/utils with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5209\r\n* Simplify logic for structured outputs across vLLM versions by @albertvillanova in https://github.com/huggingface/trl/pull/5215\r\n* Refactor CLI [9/N]: Replace HfArgumentParser from transformers with local by 
@albertvillanova in https://github.com/huggingface/trl/pull/5210\r\n* Refactor CLI [10/N]: Refactor scripts with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5219\r\n* Refactor CLI [11/N]: Refactor scripts/vllm_serve with delayed imports by @albertvillanova in https://github.com/huggingface/trl/pull/5220\r\n* Refactor CLI [12/N]: Fix command name in scripts help usage by @albertvillanova in https://github.com/huggingface/trl/pull/5221\r\n* Refactor CLI [13/N]: Pass clean training args to scripts by @albertvillanova in https://github.com/huggingface/trl/pull/5223\r\n* Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in https://github.com/huggingface/trl/pull/5212\r\n* Fix link to Hugging Face Hub in OpenEnv documentation by @thesteve0 in https://github.com/huggingface/trl/pull/5229\r\n* Fix type for model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5230\r\n* Add repository-specific guidance for agents (`AGENTS.md`) by @qgallouedec in https://github.com/huggingface/trl/pull/5236\r\n* Add support for raw ids in `prompts` in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225\r\n* Deprecate `truncate_prompt_tokens` for vLLM 0.17.0 by @winglian in https://github.com/huggingface/trl/pull/5248\r\n* Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in https://github.com/huggingface/trl/pull/5227\r\n* Move `rollout_func` from `_generate_single_turn` to `_generate` by @qgallouedec in https://github.com/huggingface/trl/pull/5232\r\n* Fix `RewardFunc` type alias to reflect actual calling convention by @s-zx in https://github.com/huggingface/trl/pull/5246\r\n* [GRPO] In-place temperature scaling operation by @winglian in https://github.com/huggingface/trl/pull/5254\r\n* Update vLLM version support to 0.15.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5251\r\n* Sync 
entire prompt/completion token tensors before indexing by @shawnghu in https://github.com/huggingface/trl/pull/5218\r\n* Update vLLM version support to 0.16.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5252\r\n* Update vLLM version support to 0.17.0 by @qgallouedec in https://github.com/huggingface/trl/pull/5253\r\n* [GRPO/RLOO] Tokenize before vLLM generation call by @qgallouedec in https://github.com/huggingface/trl/pull/5238\r\n* Refactor CLI [14/N] : Remove TrainingArguments import from core trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5257\r\n* Support JSON string parsing of teacher_model_init_kwargs in MiniLLMConfig by @albertvillanova in https://github.com/huggingface/trl/pull/5259\r\n* Fix typo in docstring for teacher_model_init_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/5260\r\n* Remove extra_fields dead code [1/N]: Remove extra_fields handling from VLLMGeneration.generate by @albertvillanova in https://github.com/huggingface/trl/pull/5262\r\n* [GRPO/RLOO] Unify tokenization across all generation backends in `_generate_single_turn` by @qgallouedec in https://github.com/huggingface/trl/pull/5239\r\n* Remove extra_fields dead code [2/N]: Remove extra_fields from VLLMGeneration.generate return value by @albertvillanova in https://github.com/huggingface/trl/pull/5263\r\n* Remove extra_fields dead code [3/N]: Remove extra_fields from GRPOTrainer._generate_single_turn return value by @albertvillanova in https://github.com/huggingface/trl/pull/5264\r\n* fix(ppo): add gradient_checkpointing_enable/disable to PolicyAndValueWrapper by @s-zx in https://github.com/huggingface/trl/pull/5245\r\n* [GRPO/RLOO] Extract tokenize prompts from `_generate_single_turn` by @qgallouedec in https://github.com/huggingface/trl/pull/5240\r\n* [CPO/ORPO] Fix handling of different length chosen/rejected prompts. 
by @davmels in https://github.com/huggingface/trl/pull/4639\r\n* Fix type for teacher_model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5258\r\n* Align GOLDConfig docstrings for optional params with None default by @albertvillanova in https://github.com/huggingface/trl/pull/5261\r\n* Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5266\r\n* Update TRL banner to support light/dark mode by @qgallouedec in https://github.com/huggingface/trl/pull/5270\r\n* Fix error message in OnlineDPO by @qgallouedec in https://github.com/huggingface/trl/pull/5237\r\n* Fix title consistency from \"Transformer Reinforcement Learning\" to \"Transformers Reinforcement Learning\" by @qgallouedec in https://github.com/huggingface/trl/pull/5183\r\n* Nemotron 3 examples added by @sergiopaniego in https://github.com/huggingface/trl/pull/5272\r\n* Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5279\r\n* Simplify get_train_dataloader in GRPO and RLOO by @albertvillanova in https://github.com/huggingface/trl/pull/5276\r\n* Raise ValueError for None train_dataset in experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5275\r\n* 35% faster packing + rename `bfd-requeue` to `bfd_split` by @mariosasko in https://github.com/huggingface/trl/pull/5189\r\n* Change default `vllm_mode` to `\"colocate\"` and add v0→v1 migration guide by @qgallouedec in https://github.com/huggingface/trl/pull/5255\r\n* Allow nullable logprobs in vLLM serve responses  by @LeonEricsson in https://github.com/huggingface/trl/pull/5203\r\n* feat(`grpo_trainer.py`): Variational Sequence-Level Soft Policy Optimization (VESPO) by @casinca in https://github.com/huggingface/trl/pull/5199\r\n* Simplify structured outputs logic across vLLM versions in scripts/vllm_serve by 
@albertvillanova in https://github.com/huggingface/trl/pull/5273\r\n* Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5274\r\n* Fix `accuracy_reward` crash when called from non-main thread by @qgallouedec in https://github.com/huggingface/trl/pull/5281\r\n* Remove TrainingArguments import from experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5290\r\n* Remove custom get_train/eval_dataloader from OnlineDPO by @albertvillanova in https://github.com/huggingface/trl/pull/5291\r\n* [GKD] Buffer Implementation for Distillation Trainer by @cmpatino in https://github.com/huggingface/trl/pull/5137\r\n* Support max_length in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5284\r\n* Prevent corruption of DPO VLM training if \"keep_end\" truncation_mode by @albertvillanova in https://github.com/huggingface/trl/pull/5286\r\n* Fix UNEXPECTED lm_head.weight warning when loading a CausalLM as a reward model by @albertvillanova in https://github.com/huggingface/trl/pull/5295\r\n* Apply docstyle by @qgallouedec in https://github.com/huggingface/trl/pull/5296\r\n* Add guidance to avoid `hasattr` and `getattr` with defaults in `AGENTS.md` by @qgallouedec in https://github.com/huggingface/trl/pull/5294\r\n* Fix DPOTrainer collators to truncate sequences before padding by @albertvillanova in https://github.com/huggingface/trl/pull/5305\r\n* Update `RewardFunc` type annotation to allow `None` values in reward list by @qgallouedec in https://github.com/huggingface/trl/pull/5297\r\n* Suggest the `Json()` type for tool calling dataset format by @lhoestq in https://github.com/huggingface/trl/pull/5307\r\n* Allow reward functions to log extra columns and scalar metrics by @manueldeprada in https://github.com/huggingface/trl/pull/5233\r\n* Fix GRPOTrainer attribute access for vLLM model config by @falcondai in 
https://github.com/huggingface/trl/pull/5302\r\n* Support truncation_mode in SFT by @albertvillanova in https://github.com/huggingface/trl/pull/5306\r\n* 🔌 Asynchronous GRPO by @qgallouedec in https://github.com/huggingface/trl/pull/5293\r\n* Fix datasets version supporting Json dtype in docs about tool calling dataset format by @albertvillanova in https://github.com/huggingface/trl/pull/5310\r\n* Align docs about tool calling in trainers with dataset format by @albertvillanova in https://github.com/huggingface/trl/pull/5311\r\n* [GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs by @qgallouedec in https://github.com/huggingface/trl/pull/5242\r\n* feat(experimental): Divergence Proximal Policy Optimization by @LeonEricsson in https://github.com/huggingface/trl/pull/5117\r\n* Clean up model update group on worker exit by @AmineDiro in https://github.com/huggingface/trl/pull/5325\r\n* Fix style in DPPO docstrings by @albertvillanova in https://github.com/huggingface/trl/pull/5326\r\n\r\n## New Contributors\r\n\r\n* @czkkkkkk made their first contribution in https://github.com/huggingface/trl/pull/5180\r\n* @michaelroyzen made their first contribution in https://github.com/huggingface/trl/pull/5190\r\n* @thesteve0 made their first contribution in https://github.com/huggingface/trl/pull/5229\r\n* @s-zx made their first contribution in https://github.com/huggingface/trl/pull/5246\r\n* @shawnghu made their first contribution in https://github.com/huggingface/trl/pull/5218\r\n* @davmels made their first contribution in https://github.com/huggingface/trl/pull/4639\r\n* @manueldeprada made their first contribution in https://github.com/huggingface/trl/pull/5233\r\n* @falcondai made their first contribution in https://github.com/huggingface/trl/pull/5302\r\n* @AmineDiro made their first contribution in https://github.com/huggingface/trl/pull/5325\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/trl/compare/v0.29.0...v1.0.0rc1","publishedAt":"2026-03-20T23:55:04.000Z","url":"https://github.com/huggingface/trl/releases/tag/v1.0.0rc1","media":[]},{"id":"rel_7TW8tyroMX0X7yLx91KP6","version":"v0.29.1","title":"v0.29.1","summary":"## What's Changed\r\n\r\n* Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in https://github.com/huggingface/trl/pull/5178...","content":"## What's Changed\r\n\r\n* Handle mm_token_type_ids in SFT/GRPO/RLOO to fix IndexError by @albertvillanova in https://github.com/huggingface/trl/pull/5178\r\n* Fix `prepare_multimodal_messages` to support `tool_calls` and `tool` role by @alvarobartt in https://github.com/huggingface/trl/pull/5212\r\n* Fix type for model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5230\r\n* Decouple rollout dispatch from vLLM backend in GRPO _generate_single_turn by @albertvillanova in https://github.com/huggingface/trl/pull/5122\r\n* Simplify logic for structured outputs across vLLM versions by @albertvillanova in https://github.com/huggingface/trl/pull/5215\r\n* Add support for raw ids in `prompts` in vLLM client and server by @qgallouedec in https://github.com/huggingface/trl/pull/5225\r\n* Add VLM support when passing raw token IDs to vLLM client by @qgallouedec in https://github.com/huggingface/trl/pull/5227\r\n* Move `rollout_func` from `_generate_single_turn` to `_generate` by @qgallouedec in https://github.com/huggingface/trl/pull/5232\r\n* [GRPO/RLOO] Tokenize before vLLM generation call by @qgallouedec in https://github.com/huggingface/trl/pull/5238\r\n* Support JSON string parsing of teacher_model_init_kwargs in MiniLLMConfig by @albertvillanova in https://github.com/huggingface/trl/pull/5259\r\n* [GRPO/RLOO] Unify tokenization across all generation backends in `_generate_single_turn` by @qgallouedec in https://github.com/huggingface/trl/pull/5239\r\n* [GRPO/RLOO] Extract tokenize 
prompts from `_generate_single_turn` by @qgallouedec in https://github.com/huggingface/trl/pull/5240\r\n* [CPO/ORPO] Fix handling of different length chosen/rejected prompts. by @davmels in https://github.com/huggingface/trl/pull/4639\r\n* Fix type for teacher_model_init_kwargs when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5258\r\n* Fix support for model_init_kwargs in GKD/GOLD when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5266\r\n* Fix mm_token_type_ids silently dropped in DPO VLM training by @albertvillanova in https://github.com/huggingface/trl/pull/5279\r\n* Fix support for model_init_kwargs in MiniLLM when passed as CLI JSON string by @albertvillanova in https://github.com/huggingface/trl/pull/5274\r\n* Fix GRPOTrainer attribute access for vLLM model config by @falcondai in https://github.com/huggingface/trl/pull/5302\r\n* [GRPO] Fix re-tokenization bug in tool-calling loop by concatenating token IDs by @qgallouedec in https://github.com/huggingface/trl/pull/5242\r\n\r\n## New Contributors\r\n\r\n* @davmels made their first contribution in https://github.com/huggingface/trl/pull/4639\r\n* @falcondai made their first contribution in https://github.com/huggingface/trl/pull/5302\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.29.0...v0.29.1\r\n","publishedAt":"2026-03-20T03:57:13.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.29.1","media":[]},{"id":"rel_sd0_rtFVZKLv1UVYmn474","version":"v0.29.0","title":"v0.29.0","summary":"## Features\r\n\r\n### Add `environment_factory` to `GRPOTrainer`\r\n\r\n`GRPOTrainer` now accepts an `environment_factory` argument, allowing users to specif...","content":"## Features\r\n\r\n### Add `environment_factory` to `GRPOTrainer`\r\n\r\n`GRPOTrainer` now accepts an `environment_factory` argument, allowing users to specify a custom environment class for training. 
This enables more flexible and diverse training scenarios by letting users define their own environments with specific dynamics and reward structures.\r\n\r\n```python\r\nfrom datasets import Dataset\r\nfrom trl import GRPOConfig, GRPOTrainer\r\n\r\ndataset = Dataset.from_dict({\r\n    \"prompt\": [[{\"role\": \"user\", \"content\": f\"Increment the counter by {i}.\"}] for i in range(1, 7)]\r\n})\r\n\r\ndef reward_func(environments, **kwargs):\r\n    return [env.counter for env in environments]\r\n\r\nclass IncrementEnv:\r\n    def reset(self):\r\n        self.counter = 0\r\n\r\n    def increment(self, step: int) -> int:\r\n        \"\"\"\r\n        Increment the internal counter.\r\n\r\n        Args:\r\n            step: Value to add to the counter.\r\n\r\n        Returns:\r\n            The updated counter value.\r\n        \"\"\"\r\n        self.counter += step\r\n        return self.counter\r\n\r\ntrainer = GRPOTrainer(\r\n    model=\"Qwen/Qwen3-0.6B\",\r\n    args=GRPOConfig(chat_template_kwargs={\"enable_thinking\": False}),\r\n    train_dataset=dataset,\r\n    reward_funcs=reward_func,\r\n    environment_factory=IncrementEnv,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/5093\r\n\r\n### Skills\r\n\r\nTRL introduces agent-native CLI Integration: trl-training, a first-class Agent Skill that exposes TRL’s training workflows (SFT, DPO, GRPO, etc.) in a structured, agent-readable format. 
The skill is packaged directly with the trl library and can be installed via the CLI:\r\n\r\n```bash\r\n# Install into the project's agent directory (default scope=project), by agent name: claude, codex, opencode\r\ntrl skills install trl-training --target <agent>\r\n```\r\n\r\nThis enables AI agents to safely and reproducibly execute TRL training workflows using a well-defined interface.\r\n\r\nSkills can be installed at the project or global scope, and support explicit targets and overwrite controls.\r\n\r\n* Implement Agent Skills [1/N]: Create training skill (MVP) by @albertvillanova in https://github.com/huggingface/trl/pull/5096\r\n* Implement Agent Skills [2/N]:  Create skills module by @albertvillanova in https://github.com/huggingface/trl/pull/5097\r\n* Implement Agent Skills [3/N]: Create skills installer by @albertvillanova in https://github.com/huggingface/trl/pull/5100\r\n* Implement Agent Skills [4/N]: Create skills CLI by @albertvillanova in https://github.com/huggingface/trl/pull/5103\r\n\r\n### Other \r\n* Pass vllm_is_ratio to LigerFusedLinearGRPOLoss in compute_liger_loss by @yukiu00 in https://github.com/huggingface/trl/pull/5031\r\n* feature: top_k selective_log_softmax by @LeonEricsson in https://github.com/huggingface/trl/pull/5104\r\n* Add Trackio integration for model card visualization by @qgallouedec in https://github.com/huggingface/trl/pull/5101\r\n* Update tool handling to support JSON string schemas in trainers by @qgallouedec in https://github.com/huggingface/trl/pull/5118\r\n* Refactor DPO by @qgallouedec in https://github.com/huggingface/trl/pull/3906\r\n* Add support for Python 3.14 by @albertvillanova in https://github.com/huggingface/trl/pull/4225\r\n* Fix default learning_rate in PPO according to paper by @albertvillanova in https://github.com/huggingface/trl/pull/5174\r\n* Fix default learning_rate in BCO according to paper by @albertvillanova in https://github.com/huggingface/trl/pull/5173\r\n* feature: Configurable num 
logprobs in vLLM generation by @LeonEricsson in https://github.com/huggingface/trl/pull/5107\r\n\r\n## Fixes\r\n\r\n* [GRPO] fix: remove SAPO temperature check by @LeonEricsson in https://github.com/huggingface/trl/pull/5042\r\n* fix: Use `launch_args` for all trainers by @qgallouedec in https://github.com/huggingface/trl/pull/5059\r\n* Fix GRPO multi-turn training with liger kernels by @albertvillanova in https://github.com/huggingface/trl/pull/4975\r\n* fix: Set `num_labels` to 1 in causal model initialization for RewardTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5066\r\n* [SFT] Fix high vRAM consumption during eval with liger kernel by @LoganVegnaSHOP in https://github.com/huggingface/trl/pull/5069\r\n* Fix BFD packing for SFT datasets by @albertvillanova in https://github.com/huggingface/trl/pull/5076\r\n* Fix DPO and RLOO incompatibility with FSDP2 by @flutist in https://github.com/huggingface/trl/pull/4838\r\n* Fix SFT loss type rewards being overwritten in dpo_loss() by @Mr-Neutr0n in https://github.com/huggingface/trl/pull/5079\r\n* Fix Qwen3 schema by @qgallouedec in https://github.com/huggingface/trl/pull/5111\r\n* Add check for `None` in `get_trackio_space_url()` to prevent errors by @qgallouedec in https://github.com/huggingface/trl/pull/5115\r\n* Fix `trl <command> --help` TypeError caused by unescaped `%` in `TrainingArguments` help strings by @albertvillanova in https://github.com/huggingface/trl/pull/5135\r\n* Fix PPOTrainer.save_model by @albertvillanova in https://github.com/huggingface/trl/pull/5151\r\n* Fix `SFTTrainer` support for single-image data by @qgallouedec in https://github.com/huggingface/trl/pull/5132\r\n* Fix structured_outputs handling and tool normalization in vLLM backend by @ehofm in https://github.com/huggingface/trl/pull/5155\r\n* fix: wake up vLLM weights before sync to prevent writes to freed memory by @bledden in https://github.com/huggingface/trl/pull/5147\r\n* Accept mm_token_type_ids in GRPO/RLOO 
_get_per_token_logps_and_entropies by @albertvillanova in https://github.com/huggingface/trl/pull/5176\r\n\r\n## Documentation and Examples\r\n\r\n* [minor] docs: typo in `grpo_trainer.md` by @casinca in https://github.com/huggingface/trl/pull/5047\r\n* docs: add DeepSeek-R1 training dynamics and GRPO example by @JenWei0312 in https://github.com/huggingface/trl/pull/5053\r\n* docs: Add INTELLECT-2 (2505.07291) to Paper Index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5061\r\n* docs: Add REINFORCE++ (2501.03262) to Paper Index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5062\r\n* docs: Add XPO (2405.21046) to Paper Index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5068\r\n* docs: Add RPO paper (2405.16436) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5070\r\n* docs: Add SimPO paper (2405.14734) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5071\r\n* docs: Add TR-DPO paper (2404.09656) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5078\r\n* docs: Add ORPO paper (2403.07691) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5080\r\n* docs: Add CPO paper (2401.08417) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5081\r\n* docs: Add GKD paper (2306.13649) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5082\r\n* docs: Add PRM paper (2211.14275) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5083\r\n* docs: Add T5 packing paper (1910.10683) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5084\r\n* docs: Add PPO paper (1707.06347) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5085\r\n* docs: Add MPO paper (2411.10442) to paper index by @behroozazarkhalili in 
https://github.com/huggingface/trl/pull/5089\r\n* docs: add Multi-Node Training subsection (#4384) by @nabin2004 in https://github.com/huggingface/trl/pull/5091\r\n* docs: Unify model examples to use trl-lib namespace by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4431\r\n* Add Tiny Aya tool calling examples (script/notebook) by @sergiopaniego in https://github.com/huggingface/trl/pull/5123\r\n* Fix wording in DPO and SFT trainer documentation for clarity by @qgallouedec in https://github.com/huggingface/trl/pull/5140\r\n* Fix type of TrainingArguments.logging_steps in docs by @albertvillanova in https://github.com/huggingface/trl/pull/5149\r\n* Fix Liquid syntax error in DPO trainer docs caused by double braces in LaTeX by @albertvillanova in https://github.com/huggingface/trl/pull/5153\r\n* Document parameters with differing default values in experimental configs by @albertvillanova in https://github.com/huggingface/trl/pull/5172\r\n\r\n## Deprecations\r\n\r\n* Remove deprecated BCO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5045\r\n* Remove deprecated CPO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5046\r\n* Remove deprecated Judges after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5048\r\n* Remove deprecated ORPO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5050\r\n* Remove deprecated PPO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5051\r\n* Remove deprecated PRM after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5052\r\n* Remove deprecated XPO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5055\r\n* Remove deprecated RLOOConfig.max_prompt_length by @albertvillanova in https://github.com/huggingface/trl/pull/5056\r\n* Remove deprecated 
classes moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5044\r\n* Remove deprecated mergekit_utils moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5057\r\n* Rename input keys in `RewardTrainer` collator from `chosen/rejected_input_ids` to `chosen/rejected_ids` by @qgallouedec in https://github.com/huggingface/trl/pull/5179\r\n\r\n## CI Improvements\r\n\r\n* Upgrade GitHub Actions to latest versions by @salmanmkc in https://github.com/huggingface/trl/pull/4893\r\n* Remove duplicated tests for SFT and add gradient checkpointing tests by @qgallouedec in https://github.com/huggingface/trl/pull/5054\r\n* Update model from SequenceClassification to CausalLM in `RewardTrainer` tests by @qgallouedec in https://github.com/huggingface/trl/pull/5060\r\n* Fix CI ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) by @albertvillanova in https://github.com/huggingface/trl/pull/5074\r\n* Add more tests for `get_training_chat_template` by @qgallouedec in https://github.com/huggingface/trl/pull/5108\r\n* Add test for Cohere2 models by @qgallouedec in https://github.com/huggingface/trl/pull/5116\r\n* Remove revision references in dataset loading for toolcall tests by @qgallouedec in https://github.com/huggingface/trl/pull/5133\r\n* Fix NameError: name 'importlib' is not defined by @albertvillanova in https://github.com/huggingface/trl/pull/5134\r\n* Fix CI by removing liger-kernel from dev deps by @qgallouedec in https://github.com/huggingface/trl/pull/5163\r\n* Fix experimental TestUpdateWithReplayBuffer: ValueError: `train_dataset` is required by @albertvillanova in https://github.com/huggingface/trl/pull/5171\r\n* Update upstream tracking info about CI PyTorch JIT deprecation warnings by @albertvillanova in https://github.com/huggingface/trl/pull/5166\r\n\r\n## Miscellaneous\r\n\r\n* Fix logging warning suppression with scoped override for seq-clf head key by @qgallouedec in 
https://github.com/huggingface/trl/pull/5058\r\n* Fix logging warning suppression for transformers 4.56.2 by @albertvillanova in https://github.com/huggingface/trl/pull/5077\r\n* Validate reward model has 1 num_labels by @albertvillanova in https://github.com/huggingface/trl/pull/5087\r\n* Fix style by @albertvillanova in https://github.com/huggingface/trl/pull/5106\r\n* Remove outdated liger-kernel compatibility checks and warnings in tests and SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5105\r\n* Add validation for conversational prompts in multimodal training by @qgallouedec in https://github.com/huggingface/trl/pull/5067\r\n* Update version check for transformers to 5.2.0 in online_dpo_trainer.py by @qgallouedec in https://github.com/huggingface/trl/pull/5110\r\n* Add GLM-4.5 model to tests by @qgallouedec in https://github.com/huggingface/trl/pull/5114\r\n* Fix import latency [1/N]: Extract _LazyModule to dedicated module by @albertvillanova in https://github.com/huggingface/trl/pull/5128\r\n* Fix import latency [2/N]: Implement native _is_package_available by @albertvillanova in https://github.com/huggingface/trl/pull/5129\r\n* refactor(gkd_trainer): small optim by @casinca in https://github.com/huggingface/trl/pull/5143\r\n* Move common fields from stable trainer configs to BaseConfig by @albertvillanova in https://github.com/huggingface/trl/pull/5136\r\n* Use BaseConfig in all experimental configs by @albertvillanova in https://github.com/huggingface/trl/pull/5148\r\n* Raise ValueError for None train_dataset in core trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5157\r\n* Revert changes in vLLM client/server by @qgallouedec in https://github.com/huggingface/trl/pull/5165\r\n\r\n### Refactor CLI\r\n\r\n* Refactor CLI [1/N]: Refactor into modular command architecture by @albertvillanova in https://github.com/huggingface/trl/pull/5124\r\n* Refactor CLI [2/N]: Move accelerate concerns into TrainingCommand by 
@albertvillanova in https://github.com/huggingface/trl/pull/5159\r\n* Refactor CLI [3/N]: Self-contain VllmServeCommand argument parsing by @albertvillanova in https://github.com/huggingface/trl/pull/5160\r\n\r\n## What's Changed\r\n\r\n* [minor] docs: typo in `grpo_trainer.md` by @casinca in https://github.com/huggingface/trl/pull/5047\r\n* ⬆️  Bump dev version by @albertvillanova in https://github.com/huggingface/trl/pull/5049\r\n* [GRPO] fix: remove SAPO temperature check by @LeonEricsson in https://github.com/huggingface/trl/pull/5042\r\n* Remove deprecated BCO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5045\r\n* Remove deprecated CPO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5046\r\n* Remove deprecated Judges after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5048\r\n* Upgrade GitHub Actions to latest versions by @salmanmkc in https://github.com/huggingface/trl/pull/4893\r\n* Remove deprecated ORPO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5050\r\n* Remove deprecated PPO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5051\r\n* Remove deprecated PRM after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5052\r\n* Remove deprecated XPO after moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5055\r\n* Remove deprecated RLOOConfig.max_prompt_length by @albertvillanova in https://github.com/huggingface/trl/pull/5056\r\n* Remove deprecated classes moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/5044\r\n* Remove duplicated tests for SFT and add gradient checkpointing tests by @qgallouedec in https://github.com/huggingface/trl/pull/5054\r\n* Remove deprecated mergekit_utils moved to experimental by @albertvillanova in 
https://github.com/huggingface/trl/pull/5057\r\n* docs: add DeepSeek-R1 training dynamics and GRPO example by @JenWei0312 in https://github.com/huggingface/trl/pull/5053\r\n* docs: Add INTELLECT-2 (2505.07291) to Paper Index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5061\r\n* docs: Add REINFORCE++ (2501.03262) to Paper Index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5062\r\n* docs: Add XPO (2405.21046) to Paper Index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5068\r\n* docs: Add RPO paper (2405.16436) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5070\r\n* docs: Add SimPO paper (2405.14734) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5071\r\n* Fix logging warning suppression with scoped override for seq-clf head key by @qgallouedec in https://github.com/huggingface/trl/pull/5058\r\n* fix: Use `launch_args` for all trainers by @qgallouedec in https://github.com/huggingface/trl/pull/5059\r\n* Fix GRPO multi-turn training with liger kernels by @albertvillanova in https://github.com/huggingface/trl/pull/4975\r\n* fix: Set `num_labels` to 1 in causal model initialization for RewardTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5066\r\n* Fix logging warning suppression for transformers 4.56.2 by @albertvillanova in https://github.com/huggingface/trl/pull/5077\r\n* Update model from SequenceClassification to CausalLM in `RewardTrainer` tests by @qgallouedec in https://github.com/huggingface/trl/pull/5060\r\n* Fix CI ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?) 
by @albertvillanova in https://github.com/huggingface/trl/pull/5074\r\n* [SFT] Fix high vRAM consumption during eval with liger kernel by @LoganVegnaSHOP in https://github.com/huggingface/trl/pull/5069\r\n* docs: Add TR-DPO paper (2404.09656) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5078\r\n* docs: Add ORPO paper (2403.07691) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5080\r\n* docs: Add CPO paper (2401.08417) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5081\r\n* docs: Add GKD paper (2306.13649) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5082\r\n* docs: Add PRM paper (2211.14275) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5083\r\n* docs: Add T5 packing paper (1910.10683) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5084\r\n* docs: Add PPO paper (1707.06347) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5085\r\n* Fix BFD packing for SFT datasets by @albertvillanova in https://github.com/huggingface/trl/pull/5076\r\n* Validate reward model has 1 num_labels by @albertvillanova in https://github.com/huggingface/trl/pull/5087\r\n* docs: Add MPO paper (2411.10442) to paper index by @behroozazarkhalili in https://github.com/huggingface/trl/pull/5089\r\n* docs: add Multi-Node Training subsection (#4384) by @nabin2004 in https://github.com/huggingface/trl/pull/5091\r\n* docs: Unify model examples to use trl-lib namespace by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4431\r\n* Implement Agent Skills [1/N]: Create training skill (MVP) by @albertvillanova in https://github.com/huggingface/trl/pull/5096\r\n* Pass vllm_is_ratio to LigerFusedLinearGRPOLoss in compute_liger_loss by @yukiu00 in https://github.com/huggingface/trl/pull/5031\r\n* Fix DPO and RLOO incompatibility with FSDP2 by @flutist in 
https://github.com/huggingface/trl/pull/4838\r\n* feature: top_k selective_log_softmax by @LeonEricsson in https://github.com/huggingface/trl/pull/5104\r\n* Implement Agent Skills [2/N]:  Create skills module by @albertvillanova in https://github.com/huggingface/trl/pull/5097\r\n* Fix style by @albertvillanova in https://github.com/huggingface/trl/pull/5106\r\n* Add Trackio integration for model card visualization by @qgallouedec in https://github.com/huggingface/trl/pull/5101\r\n* Fix SFT loss type rewards being overwritten in dpo_loss() by @Mr-Neutr0n in https://github.com/huggingface/trl/pull/5079\r\n* Remove outdated liger-kernel compatibility checks and warnings in tests and SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5105\r\n* Implement Agent Skills [3/N]: Create skills installer by @albertvillanova in https://github.com/huggingface/trl/pull/5100\r\n* Add validation for conversational prompts in multimodal training by @qgallouedec in https://github.com/huggingface/trl/pull/5067\r\n* Update version check for transformers to 5.2.0 in online_dpo_trainer.py by @qgallouedec in https://github.com/huggingface/trl/pull/5110\r\n* Add more tests for `get_training_chat_template` by @qgallouedec in https://github.com/huggingface/trl/pull/5108\r\n* Add test for Cohere2 models by @qgallouedec in https://github.com/huggingface/trl/pull/5116\r\n* Fix Qwen3 schema by @qgallouedec in https://github.com/huggingface/trl/pull/5111\r\n* Add check for `None` in `get_trackio_space_url()` to prevent errors by @qgallouedec in https://github.com/huggingface/trl/pull/5115\r\n* Add GLM-4.5 model to tests by @qgallouedec in https://github.com/huggingface/trl/pull/5114\r\n* Add Tiny Aya tool calling examples (script/notebook) by @sergiopaniego in https://github.com/huggingface/trl/pull/5123\r\n* Update tool handling to support JSON string schemas in trainers by @qgallouedec in https://github.com/huggingface/trl/pull/5118\r\n* Implement Agent Skills [4/N]: Create 
skills CLI by @albertvillanova in https://github.com/huggingface/trl/pull/5103\r\n* Refactor CLI [1/N]: Refactor into modular command architecture by @albertvillanova in https://github.com/huggingface/trl/pull/5124\r\n* Remove revision references in dataset loading for toolcall tests by @qgallouedec in https://github.com/huggingface/trl/pull/5133\r\n* Refactor DPO by @qgallouedec in https://github.com/huggingface/trl/pull/3906\r\n* Fix import latency [1/N]: Extract _LazyModule to dedicated module by @albertvillanova in https://github.com/huggingface/trl/pull/5128\r\n* Fix import latency [2/N]: Implement native _is_package_available by @albertvillanova in https://github.com/huggingface/trl/pull/5129\r\n* Fix NameError: name 'importlib' is not defined by @albertvillanova in https://github.com/huggingface/trl/pull/5134\r\n* Fix `trl <command> --help` TypeError caused by unescaped `%` in `TrainingArguments` help strings by @albertvillanova in https://github.com/huggingface/trl/pull/5135\r\n* Add `environment_factory` to `GRPOTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/5093\r\n* refactor(gkd_trainer): small optim by @casinca in https://github.com/huggingface/trl/pull/5143\r\n* Move common fields from stable trainer configs to BaseConfig by @albertvillanova in https://github.com/huggingface/trl/pull/5136\r\n* Fix wording in DPO and SFT trainer documentation for clarity by @qgallouedec in https://github.com/huggingface/trl/pull/5140\r\n* Fix PPOTrainer.save_model by @albertvillanova in https://github.com/huggingface/trl/pull/5151\r\n* Use BaseConfig in all experimental configs by @albertvillanova in https://github.com/huggingface/trl/pull/5148\r\n* Fix type of TrainingArguments.logging_steps in docs by @albertvillanova in https://github.com/huggingface/trl/pull/5149\r\n* Add support for Python 3.14 by @albertvillanova in https://github.com/huggingface/trl/pull/4225\r\n* Fix `SFTTrainer` support for single-image data by @qgallouedec in 
https://github.com/huggingface/trl/pull/5132\r\n* Fix CI by removing liger-kernel from dev deps by @qgallouedec in https://github.com/huggingface/trl/pull/5163\r\n* Fix structured_outputs handling and tool normalization in vLLM backend by @ehofm in https://github.com/huggingface/trl/pull/5155\r\n* fix: wake up vLLM weights before sync to prevent writes to freed memory by @bledden in https://github.com/huggingface/trl/pull/5147\r\n* Fix Liquid syntax error in DPO trainer docs caused by double braces in LaTeX by @albertvillanova in https://github.com/huggingface/trl/pull/5153\r\n* Raise ValueError for None train_dataset in core trainers by @albertvillanova in https://github.com/huggingface/trl/pull/5157\r\n* Refactor CLI [2/N]: Move accelerate concerns into TrainingCommand by @albertvillanova in https://github.com/huggingface/trl/pull/5159\r\n* Refactor CLI [3/N]: Self-contain VllmServeCommand argument parsing by @albertvillanova in https://github.com/huggingface/trl/pull/5160\r\n* Revert changes in vLLM client/server by @qgallouedec in https://github.com/huggingface/trl/pull/5165\r\n* Fix experimental TestUpdateWithReplayBuffer: ValueError: `train_dataset` is required by @albertvillanova in https://github.com/huggingface/trl/pull/5171\r\n* Fix default learning_rate in PPO according to paper by @albertvillanova in https://github.com/huggingface/trl/pull/5174\r\n* Accept mm_token_type_ids in GRPO/RLOO _get_per_token_logps_and_entropies by @albertvillanova in https://github.com/huggingface/trl/pull/5176\r\n* Fix default learning_rate in BCO according to paper by @albertvillanova in https://github.com/huggingface/trl/pull/5173\r\n* Document parameters with differing default values in experimental configs by @albertvillanova in https://github.com/huggingface/trl/pull/5172\r\n* Update upstream tracking info about CI PyTorch JIT deprecation warnings by @albertvillanova in https://github.com/huggingface/trl/pull/5166\r\n* Rename input keys in `RewardTrainer` collator from 
`chosen/rejected_input_ids` to `chosen/rejected_ids` by @qgallouedec in https://github.com/huggingface/trl/pull/5179\r\n* feature: Configurable num logprobs in vLLM generation by @LeonEricsson in https://github.com/huggingface/trl/pull/5107\r\n* Release: v0.29 by @qgallouedec in https://github.com/huggingface/trl/pull/5181\r\n\r\n## New Contributors\r\n\r\n* @LoganVegnaSHOP made their first contribution in https://github.com/huggingface/trl/pull/5069\r\n* @yukiu00 made their first contribution in https://github.com/huggingface/trl/pull/5031\r\n* @flutist made their first contribution in https://github.com/huggingface/trl/pull/4838\r\n* @Mr-Neutr0n made their first contribution in https://github.com/huggingface/trl/pull/5079\r\n* @ehofm made their first contribution in https://github.com/huggingface/trl/pull/5155\r\n* @bledden made their first contribution in https://github.com/huggingface/trl/pull/5147\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.28.0...v0.29.0\r\n","publishedAt":"2026-02-25T22:38:09.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.29.0","media":[]},{"id":"rel_jAjl59S0OzPM70FGxKmgd","version":"v0.28.0","title":" v0.28.0","summary":"## Features\r\n* [GRPOTrainer]: Agent Training Supports Async Tool Calls by @pramodith in https://github.com/huggingface/trl/pull/4742\r\n* Add retry stra...","content":"## Features\r\n* [GRPOTrainer]: Agent Training Supports Async Tool Calls by @pramodith in https://github.com/huggingface/trl/pull/4742\r\n* Add retry strategy to vLLM Client for increased robustness by @apalmas-saifh in https://github.com/huggingface/trl/pull/4845\r\n* Enable vLLM sleep mode for generation in Online DPO by @winglian in https://github.com/huggingface/trl/pull/4882\r\n* Support tool call data in `is_conversational` by @qgallouedec in https://github.com/huggingface/trl/pull/4923\r\n* [GRPO] Add parquet logging for completions with individual rewards by @qgallouedec in 
https://github.com/huggingface/trl/pull/4818\r\n* Update wordle.py example with masking of env tokens by @sergiopaniego in https://github.com/huggingface/trl/pull/4895\r\n* NeMo-Gym Integration by @cmunley1 in https://github.com/huggingface/trl/pull/4848\r\n\r\n## Experimental\r\n* Refactor KTO coordinated with DPO [c/N]: Remove ref_model_init_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/4837\r\n* Refactor KTO coordinated with DPO [e/N]: Remove label_pad_token_id by @albertvillanova in https://github.com/huggingface/trl/pull/4875\r\n* Refactor KTO coordinated with DPO [d/N]: Remove base_model_attribute_name by @albertvillanova in https://github.com/huggingface/trl/pull/4862\r\n* Fix type hint in `openenv/utils.py`: fallback for no vLLM installed case by @Datta0 in https://github.com/huggingface/trl/pull/4868\r\n* Remove label_pad_token_id from experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4878\r\n* GOLD training speed up by @141forever in https://github.com/huggingface/trl/pull/4888\r\n* Remove ref_model_init_kwargs from experimental BCO by @albertvillanova in https://github.com/huggingface/trl/pull/4946\r\n* Remove max_prompt_length from experimental PRM by @albertvillanova in https://github.com/huggingface/trl/pull/4963\r\n* Remove max_prompt_length from experimental BCO by @albertvillanova in https://github.com/huggingface/trl/pull/4964\r\n* Remove max_prompt_length from experimental CPO by @albertvillanova in https://github.com/huggingface/trl/pull/4965\r\n* Remove max_prompt_length from experimental ORPO by @albertvillanova in https://github.com/huggingface/trl/pull/4966\r\n* Remove padding_value from experimental CPO and use pad_token_id by @albertvillanova in https://github.com/huggingface/trl/pull/4962\r\n\r\n## Fixes\r\n* Fix _patch_transformers_hybrid_cache for peft by @albertvillanova in https://github.com/huggingface/trl/pull/4844\r\n* Refactor KTO [4/N]: Remove unused padding_value by 
@albertvillanova in https://github.com/huggingface/trl/pull/4839\r\n* Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in https://github.com/huggingface/trl/pull/4852\r\n* fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in https://github.com/huggingface/trl/pull/4857\r\n* Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in https://github.com/huggingface/trl/pull/4880\r\n* Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in https://github.com/huggingface/trl/pull/4873\r\n* Fix import path for `get_open_port` based on vLLM version by @qgallouedec in https://github.com/huggingface/trl/pull/4883\r\n* Fix RewardTrainer's results not reproducible by @liyc-ai in https://github.com/huggingface/trl/pull/4887\r\n* `device_map` init consistency in GRPO/RLOO/KTO by @qgallouedec in https://github.com/huggingface/trl/pull/4909\r\n* Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in https://github.com/huggingface/trl/pull/4908\r\n* Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in https://github.com/huggingface/trl/pull/4942\r\n* Fix PPO run_name parameter not taking effect by @mel3c in https://github.com/huggingface/trl/pull/4945\r\n* Remove access to `warnings_issued` by @qgallouedec in https://github.com/huggingface/trl/pull/4960\r\n* Revert change in GRPO from NeMo-Gym Integration by @qgallouedec in https://github.com/huggingface/trl/pull/4970\r\n\r\n## Documentation and Examples\r\n* Add Nash Learning from Human Feedback paper to paper index by @kansalaman in https://github.com/huggingface/trl/pull/4860\r\n* Update OpenEnv dependency to new version for hf jobs scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4843\r\n* Enhance GRPO documentation with scaling notes by @javadtaghia in https://github.com/huggingface/trl/pull/4849\r\n* Created new PTT 
integration docs as requested by @adityachallapally in https://github.com/huggingface/trl/pull/4907\r\n* docs: add DoRA (2402.09353) to Paper Index by @billycrapediem in https://github.com/huggingface/trl/pull/4892\r\n\r\n## Deprecations\r\n* Remove unused padding_value from BCO by @albertvillanova in https://github.com/huggingface/trl/pull/4846\r\n* Remove deprecated parameters by @qgallouedec in https://github.com/huggingface/trl/pull/4847\r\n* Deprecate parameters in `DPOConfig` by @qgallouedec in https://github.com/huggingface/trl/pull/4969\r\n* Replace `warmup_ratio` with `warmup_steps` by @qgallouedec in https://github.com/huggingface/trl/pull/4983\r\n\r\n## CI Improvements\r\n* Support triggering CI via push to ci-* branches by @albertvillanova in https://github.com/huggingface/trl/pull/4840\r\n* Revert CI hotfix pinning transformers 4.57.4 after tiny model regeneration by @albertvillanova in https://github.com/huggingface/trl/pull/4833\r\n* Use pytest-datadir in CI tests by @albertvillanova in https://github.com/huggingface/trl/pull/4836\r\n* Fix CI with dev dependencies: Mark Qwen3-VL tests as xfail by @albertvillanova in https://github.com/huggingface/trl/pull/4851\r\n* Use pytest-datadir for accelerate config files by @albertvillanova in https://github.com/huggingface/trl/pull/4861\r\n* Update transformer version checks and documentation for lr_scheduler_kwargs workaround by @qgallouedec in https://github.com/huggingface/trl/pull/4876\r\n* Test distributed training for `RewardTrainer`, `RLOOTrainer` and `GRPOTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4823\r\n* Mark ZeRO 2 as xfail in distributed tests due to current failure by @qgallouedec in https://github.com/huggingface/trl/pull/4885\r\n* Transformers v5 release: extend xfail condition for `TestGRPOTrainer.test_training_vlm_and_liger` and update version checks by @qgallouedec in https://github.com/huggingface/trl/pull/4898\r\n* Fix CI NotImplementedError for bfloat16 by 
@albertvillanova in https://github.com/huggingface/trl/pull/4902\r\n* Fix CI AssertionError: Parameter has not changed by @albertvillanova in https://github.com/huggingface/trl/pull/4904\r\n* Fix CI TypeError in llm-blender tests by @albertvillanova in https://github.com/huggingface/trl/pull/4919\r\n* Fix CI AssertionError: assert not True by @albertvillanova in https://github.com/huggingface/trl/pull/4921\r\n* Fix CI ValueError for 0 temperature by @albertvillanova in https://github.com/huggingface/trl/pull/4916\r\n* Set model dtype to float32 in tests of trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4924\r\n* Set model dtype to float32 in experimental tests of trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4925\r\n* Add test for training with `compute_metrics` in `SFTTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4950\r\n* Add test for tool call data in `RewardTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4959\r\n* Add test for training with `compute_metrics` in `RewardTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4958\r\n* Fix test_train_with_chat_template_kwargs by @qgallouedec in https://github.com/huggingface/trl/pull/4971\r\n\r\n## Miscellaneous\r\n* Update `CITATION.cff` by @qgallouedec in https://github.com/huggingface/trl/pull/4856\r\n* Update generate_tiny_models.py: CohereForAI -> CohereLabs by @Michellehbn in https://github.com/huggingface/trl/pull/4877\r\n* Refactor vLLM generation [1/N]: Extract vLLM generation by @albertvillanova in https://github.com/huggingface/trl/pull/4700\r\n* Rearrange variable assignments in `DataCollatorForVisionLanguageModeling` by @qgallouedec in https://github.com/huggingface/trl/pull/4911\r\n* Fix help text formatting for `max_length` in `RewardConfig` and `SFTConfig` by @qgallouedec in https://github.com/huggingface/trl/pull/4910\r\n* Comment about overriding prediction_step in GRPOTrainer and RLOOTrainer 
by @qgallouedec in https://github.com/huggingface/trl/pull/4913\r\n* Remove gradient checkpointing option from various training scripts  by @qgallouedec in https://github.com/huggingface/trl/pull/4905\r\n* Remove chat template setup in dpo_vlm.py by @qgallouedec in https://github.com/huggingface/trl/pull/4906\r\n* Update learning rate comments and add assertions for reference model parameters in GRPO and RLOO tests by @qgallouedec in https://github.com/huggingface/trl/pull/4914\r\n* Add validation for `sync_ref_model` in `GRPOTrainer` and `RLOOTrainer` when using PEFT models by @qgallouedec in https://github.com/huggingface/trl/pull/4912\r\n* Require transformers<5 with PairRMJudge by @albertvillanova in https://github.com/huggingface/trl/pull/4926\r\n* Move VLLMClient to generation module by @albertvillanova in https://github.com/huggingface/trl/pull/4928\r\n* Fix profiling of VLLMGeneration.sync_weights by @albertvillanova in https://github.com/huggingface/trl/pull/4931\r\n* Fix import statement for import_utils in vllm_client.py by @qgallouedec in https://github.com/huggingface/trl/pull/4932\r\n* Set default top_k to 0 in VLLMClient by @albertvillanova in https://github.com/huggingface/trl/pull/4927\r\n* Minor fix docs style by @albertvillanova in https://github.com/huggingface/trl/pull/4953\r\n\r\n## What's Changed\r\n* ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/4835\r\n* Support triggering CI via push to ci-* branches by @albertvillanova in https://github.com/huggingface/trl/pull/4840\r\n* Revert CI hotfix pinning transformers 4.57.4 after tiny model regeneration by @albertvillanova in https://github.com/huggingface/trl/pull/4833\r\n* Use pytest-datadir in CI tests by @albertvillanova in https://github.com/huggingface/trl/pull/4836\r\n* Refactor KTO coordinated with DPO [c/N]: Remove ref_model_init_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/4837\r\n* Fix _patch_transformers_hybrid_cache for peft by 
@albertvillanova in https://github.com/huggingface/trl/pull/4844\r\n* Refactor KTO [4/N]: Remove unused padding_value by @albertvillanova in https://github.com/huggingface/trl/pull/4839\r\n* Remove unused padding_value from BCO by @albertvillanova in https://github.com/huggingface/trl/pull/4846\r\n* Fix CI with dev dependencies: Mark Qwen3-VL tests as xfail by @albertvillanova in https://github.com/huggingface/trl/pull/4851\r\n* Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in https://github.com/huggingface/trl/pull/4852\r\n* Remove deprecated parameters by @qgallouedec in https://github.com/huggingface/trl/pull/4847\r\n* Add Nash Learning from Human Feedback paper to paper index by @kansalaman in https://github.com/huggingface/trl/pull/4860\r\n* Use pytest-datadir for accelerate config files by @albertvillanova in https://github.com/huggingface/trl/pull/4861\r\n* Update OpenEnv dependency to new version for hf jobs scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4843\r\n* Update `CITATION.cff` by @qgallouedec in https://github.com/huggingface/trl/pull/4856\r\n* [GRPOTrainer]: Agent Training Supports Async Tool Calls by @pramodith in https://github.com/huggingface/trl/pull/4742\r\n* Enhance GRPO documentation with scaling notes by @javadtaghia in https://github.com/huggingface/trl/pull/4849\r\n* Add retry strategy to vLLM Client for increased robustness by @apalmas-saifh in https://github.com/huggingface/trl/pull/4845\r\n* Update generate_tiny_models.py: CohereForAI -> CohereLabs by @Michellehbn in https://github.com/huggingface/trl/pull/4877\r\n* Refactor KTO coordinated with DPO [e/N]: Remove label_pad_token_id by @albertvillanova in https://github.com/huggingface/trl/pull/4875\r\n* Refactor KTO coordinated with DPO [d/N]: Remove base_model_attribute_name by @albertvillanova in https://github.com/huggingface/trl/pull/4862\r\n* Fix type hint in `openenv/utils.py`: fallback for no vLLM installed case by @Datta0 in 
https://github.com/huggingface/trl/pull/4868\r\n* Update transformer version checks and documentation for lr_scheduler_kwargs workaround by @qgallouedec in https://github.com/huggingface/trl/pull/4876\r\n* fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in https://github.com/huggingface/trl/pull/4857\r\n* Remove label_pad_token_id from experimental trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4878\r\n* Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in https://github.com/huggingface/trl/pull/4880\r\n* Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in https://github.com/huggingface/trl/pull/4873\r\n* Enable vLLM sleep mode for generation in Online DPO by @winglian in https://github.com/huggingface/trl/pull/4882\r\n* Test distributed training for `RewardTrainer`, `RLOOTrainer` and `GRPOTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4823\r\n* Mark ZeRO 2 as xfail in distributed tests due to current failure by @qgallouedec in https://github.com/huggingface/trl/pull/4885\r\n* Fix import path for `get_open_port` based on vLLM version by @qgallouedec in https://github.com/huggingface/trl/pull/4883\r\n* Fix RewardTrainer's results not reproducible by @liyc-ai in https://github.com/huggingface/trl/pull/4887\r\n* GOLD training speed up by @141forever in https://github.com/huggingface/trl/pull/4888\r\n* Transformers v5 release: extend xfail condition for `TestGRPOTrainer.test_training_vlm_and_liger` and update version checks by @qgallouedec in https://github.com/huggingface/trl/pull/4898\r\n* Fix CI NotImplementedError for bfloat16 by @albertvillanova in https://github.com/huggingface/trl/pull/4902\r\n* Fix CI AssertionError: Parameter has not changed by @albertvillanova in https://github.com/huggingface/trl/pull/4904\r\n* Refactor vLLM generation [1/N]: Extract vLLM generation by @albertvillanova in 
https://github.com/huggingface/trl/pull/4700\r\n* Created new PTT integration docs as requested by @adityachallapally in https://github.com/huggingface/trl/pull/4907\r\n* Fix CI TypeError in llm-blender tests by @albertvillanova in https://github.com/huggingface/trl/pull/4919\r\n* Rearrange variable assignments in `DataCollatorForVisionLanguageModeling` by @qgallouedec in https://github.com/huggingface/trl/pull/4911\r\n* Fix help text formatting for `max_length` in `RewardConfig` and `SFTConfig` by @qgallouedec in https://github.com/huggingface/trl/pull/4910\r\n* `device_map` init consistency in GRPO/RLOO/KTO by @qgallouedec in https://github.com/huggingface/trl/pull/4909\r\n* Comment about overriding prediction_step in GRPOTrainer and RLOOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4913\r\n* Remove gradient checkpointing option from various training scripts  by @qgallouedec in https://github.com/huggingface/trl/pull/4905\r\n* docs: add DoRA (2402.09353) to Paper Index by @billycrapediem in https://github.com/huggingface/trl/pull/4892\r\n* Fix CI AssertionError: assert not True by @albertvillanova in https://github.com/huggingface/trl/pull/4921\r\n* Fix CI ValueError for 0 temperature by @albertvillanova in https://github.com/huggingface/trl/pull/4916\r\n* Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in https://github.com/huggingface/trl/pull/4908\r\n* Remove chat template setup in dpo_vlm.py by @qgallouedec in https://github.com/huggingface/trl/pull/4906\r\n* Update learning rate comments and add assertions for reference model parameters in GRPO and RLOO tests by @qgallouedec in https://github.com/huggingface/trl/pull/4914\r\n* Add validation for `sync_ref_model` in `GRPOTrainer` and `RLOOTrainer` when using PEFT models by @qgallouedec in https://github.com/huggingface/trl/pull/4912\r\n* Support tool call data in `is_conversational` by @qgallouedec in https://github.com/huggingface/trl/pull/4923\r\n* 
Set model dtype to float32 in tests of trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4924\r\n* Require transformers<5 with PairRMJudge by @albertvillanova in https://github.com/huggingface/trl/pull/4926\r\n* Move VLLMClient to generation module by @albertvillanova in https://github.com/huggingface/trl/pull/4928\r\n* Set model dtype to float32 in experimental tests of trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4925\r\n* Fix profiling of VLLMGeneration.sync_weights by @albertvillanova in https://github.com/huggingface/trl/pull/4931\r\n* Fix import statement for import_utils in vllm_client.py by @qgallouedec in https://github.com/huggingface/trl/pull/4932\r\n* Set default top_k to 0 in VLLMClient by @albertvillanova in https://github.com/huggingface/trl/pull/4927\r\n* [GRPO] Add parquet logging for completions with individual rewards by @qgallouedec in https://github.com/huggingface/trl/pull/4818\r\n* Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in https://github.com/huggingface/trl/pull/4942\r\n* Remove ref_model_init_kwargs from experimental BCO by @albertvillanova in https://github.com/huggingface/trl/pull/4946\r\n* Update wordle.py example with masking of env tokens by @sergiopaniego in https://github.com/huggingface/trl/pull/4895\r\n* Fix PPO run_name parameter not taking effect by @mel3c in https://github.com/huggingface/trl/pull/4945\r\n* Minor fix docs style by @albertvillanova in https://github.com/huggingface/trl/pull/4953\r\n* Add test for training with `compute_metrics` in `SFTTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4950\r\n* Remove access to `warnings_issued` by @qgallouedec in https://github.com/huggingface/trl/pull/4960\r\n* NeMo-Gym Integration by @cmunley1 in https://github.com/huggingface/trl/pull/4848\r\n* Add test for tool call data in `RewardTrainer` by @qgallouedec in 
https://github.com/huggingface/trl/pull/4959\r\n* Add test for training with `compute_metrics` in `RewardTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4958\r\n* Remove max_prompt_length from experimental PRM by @albertvillanova in https://github.com/huggingface/trl/pull/4963\r\n* Remove max_prompt_length from experimental BCO by @albertvillanova in https://github.com/huggingface/trl/pull/4964\r\n* Remove max_prompt_length from experimental CPO by @albertvillanova in https://github.com/huggingface/trl/pull/4965\r\n* Remove max_prompt_length from experimental ORPO by @albertvillanova in https://github.com/huggingface/trl/pull/4966\r\n* Revert change in GRPO from NeMo-Gym Integration by @qgallouedec in https://github.com/huggingface/trl/pull/4970\r\n* Fix test_train_with_chat_template_kwargs by @qgallouedec in https://github.com/huggingface/trl/pull/4971\r\n* Remove padding_value from experimental CPO and use pad_token_id by @albertvillanova in https://github.com/huggingface/trl/pull/4962\r\n* Remove truncation from tokenizer calls if no max_length by @albertvillanova in https://github.com/huggingface/trl/pull/4972\r\n* Set specific OpenEnv version when installed by @sergiopaniego in https://github.com/huggingface/trl/pull/4978\r\n* Fix add_column in test_train_with_chat_template_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/4979\r\n* Support truncated completions in GRPO multi-turn training by @albertvillanova in https://github.com/huggingface/trl/pull/4976\r\n* Replace `torch.allclose` with `torch.testing.assert_close` by @qgallouedec in https://github.com/huggingface/trl/pull/4977\r\n* Simplify instructions of installation of OpenEnv by @sergiopaniego in https://github.com/huggingface/trl/pull/4980\r\n* Deprecate parameters in `DPOConfig` by @qgallouedec in https://github.com/huggingface/trl/pull/4969\r\n* [CI] Disallow installation of transformers 5.1.0 due to compatibility issues with DeepSpeed by 
@qgallouedec in https://github.com/huggingface/trl/pull/4982\r\n* Replace `warmup_ratio` with `warmup_steps` by @qgallouedec in https://github.com/huggingface/trl/pull/4983\r\n* Pin transformers!=5.1.0 in deepspeed extra due to incompatibility by @albertvillanova in https://github.com/huggingface/trl/pull/4985\r\n* Fix passing tokenizer in test_train_with_chat_template_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/4987\r\n* Update dataset configuration name in toolcall dataset loading by @qgallouedec in https://github.com/huggingface/trl/pull/4984\r\n* Use local variable instead of attribute in collator tests by @qgallouedec in https://github.com/huggingface/trl/pull/4957\r\n* Fix import of AutoModelForCausalLMWithValueHead from experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4990\r\n* Assert chat_template is applied in test_train_with_chat_template_kwargs by @albertvillanova in https://github.com/huggingface/trl/pull/4991\r\n* Fix deprecation of DPOConfig.max_completion_length by @albertvillanova in https://github.com/huggingface/trl/pull/4992\r\n* Fix post_init warning stacklevel to 3 by @albertvillanova in https://github.com/huggingface/trl/pull/4993\r\n* Fix ZeRO-3 + PEFT + gradient checkpointing by @qgallouedec in https://github.com/huggingface/trl/pull/4951\r\n* Add GitHub Actions workflow for testing against Transformers branch by @qgallouedec in https://github.com/huggingface/trl/pull/4995\r\n* Add distributed smoke tests workflow for Transformers branch by @qgallouedec in https://github.com/huggingface/trl/pull/4996\r\n* Update NeMo-Gym to use `env_mask` by @cmunley1 in https://github.com/huggingface/trl/pull/4986\r\n* Update sampling mode to token level for safety by @sergiopaniego in https://github.com/huggingface/trl/pull/4989\r\n* perf: Qwen SAPO loss optimization by @casinca in https://github.com/huggingface/trl/pull/4956\r\n* Fix GRPO tool calling for corrupted tool calls by @akshayballal95 
in https://github.com/huggingface/trl/pull/4890\r\n* Add `sanitize_logprob` function for NaN handling in vLLM log probabilities by @qgallouedec in https://github.com/huggingface/trl/pull/5001\r\n* [tests] Remove xfail for transformers version >= 5.0.0 due to upstream bug resolution by @qgallouedec in https://github.com/huggingface/trl/pull/5000\r\n* docs: add CGPO/Mixture of Judges (2409.20370) to Paper Index + link ref to AllTrueJudge by @nabin2004 in https://github.com/huggingface/trl/pull/5002\r\n* Filter CI SWIG deprecation warnings by @albertvillanova in https://github.com/huggingface/trl/pull/5004\r\n* Fix CI TRLExperimentalWarning in regular tests by @albertvillanova in https://github.com/huggingface/trl/pull/5007\r\n* Add support for `nested_gather` in OnlineDPOTrainer for transformers v5.2.0 and above by @qgallouedec in https://github.com/huggingface/trl/pull/4981\r\n* Fix CI FutureWarning: ref_model_init_kwargs is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5009\r\n* Fix typo in DPO max_prompt_length deprecation warning message by @albertvillanova in https://github.com/huggingface/trl/pull/5020\r\n* Fix vision model prompt truncation bug in DPOTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/5023\r\n* Pin transformers < 5 in judges extra due to incompatibility by @albertvillanova in https://github.com/huggingface/trl/pull/5024\r\n* Fix CI FutureWarning: generate_during_eval is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5017\r\n* Fix typo in xfail test reason by @albertvillanova in https://github.com/huggingface/trl/pull/5028\r\n* Fix CI FutureWarning: rpo_alpha is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5011\r\n* Fix CI FutureWarning: use_logits_to_keep is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5013\r\n* Mark Qwen3VL tests as xfail for transformers 5.0.x by @albertvillanova in 
https://github.com/huggingface/trl/pull/5029\r\n* [CI] Silence PyTorch JIT and DataLoader deprecation warnings by @qgallouedec in https://github.com/huggingface/trl/pull/4999\r\n* Add length-unbiased GRPO loss (LUSPO) by @Haseebasif7 in https://github.com/huggingface/trl/pull/4988\r\n* Fix CI FutureWarning: tools is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5015\r\n* Filter max_prompt_length UserWarning in all test cases by @albertvillanova in https://github.com/huggingface/trl/pull/5035\r\n* Fix CI FutureWarning: max_prompt_length is deprecated by @albertvillanova in https://github.com/huggingface/trl/pull/5019\r\n* Allow testing with transformers 5.1.0 via xfail marks by @albertvillanova in https://github.com/huggingface/trl/pull/5034\r\n* Rename AOT loss type 'aot_pair' to 'aot_unpaired' in DPO by @qgallouedec in https://github.com/huggingface/trl/pull/5038\r\n* Deprecate string usage for `ref_model` in DPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/5040\r\n* Deprecate FDivergenceType in DPOConfig; update f_divergence_type to use string values by @qgallouedec in https://github.com/huggingface/trl/pull/5039\r\n* Fix multiprocessing start method to 'spawn' for test compatibility with Python 3.12+ by @qgallouedec in https://github.com/huggingface/trl/pull/5036\r\n* Add Online Direct Preference Optimization section to paper index by @qgallouedec in https://github.com/huggingface/trl/pull/5037\r\n* Release: 0.28 by @albertvillanova in https://github.com/huggingface/trl/pull/5043\r\n\r\n## New Contributors\r\n* @kansalaman made their first contribution in https://github.com/huggingface/trl/pull/4860\r\n* @javadtaghia made their first contribution in https://github.com/huggingface/trl/pull/4849\r\n* @Michellehbn made their first contribution in https://github.com/huggingface/trl/pull/4877\r\n* @Datta0 made their first contribution in https://github.com/huggingface/trl/pull/4868\r\n* @kdubovikov made their first 
contribution in https://github.com/huggingface/trl/pull/4873\r\n* @liyc-ai made their first contribution in https://github.com/huggingface/trl/pull/4887\r\n* @141forever made their first contribution in https://github.com/huggingface/trl/pull/4888\r\n* @adityachallapally made their first contribution in https://github.com/huggingface/trl/pull/4907\r\n* @billycrapediem made their first contribution in https://github.com/huggingface/trl/pull/4892\r\n* @mel3c made their first contribution in https://github.com/huggingface/trl/pull/4945\r\n* @cmunley1 made their first contribution in https://github.com/huggingface/trl/pull/4848\r\n* @akshayballal95 made their first contribution in https://github.com/huggingface/trl/pull/4890\r\n* @nabin2004 made their first contribution in https://github.com/huggingface/trl/pull/5002\r\n* @Haseebasif7 made their first contribution in https://github.com/huggingface/trl/pull/4988\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.27.0...v0.28.0","publishedAt":"2026-02-10T13:28:21.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.28.0","media":[]},{"id":"rel_XFESFhBPm2oPozZ-NVNhp","version":"v0.27.2","title":"v0.27.2","summary":"## What's Changed\r\n\r\n* Remove access to `warnings_issued` by @qgallouedec in #4960\r\n* Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_...","content":"## What's Changed\r\n\r\n* Remove access to `warnings_issued` by @qgallouedec in #4960\r\n* Fix SFTTrainer init logic: remove TrainingArguments.push_to_hub_token only for transformers < v5 by @albertvillanova in #4942\r\n* Fix extra EOS appended in DPO preprocessing for conversational data by @qgallouedec in https://github.com/huggingface/trl/pull/4908\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/trl/compare/v0.27.1...v0.27.2","publishedAt":"2026-02-03T18:10:01.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.27.2","media":[]},{"id":"rel_rYcthIea281SdvkQcUhJE","version":"v0.27.1","title":"v0.27.1","summary":"## What's Changed\r\n\r\n* Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in https://github.com/huggingface/trl/pull/4852\r\n* fix(Dee...","content":"## What's Changed\r\n\r\n* Fix: undefined `current_gradient_accumulation_steps` by @qgallouedec in https://github.com/huggingface/trl/pull/4852\r\n* fix(DeepSeek OPSM): passing correct (vLLM) logprobs by @casinca in https://github.com/huggingface/trl/pull/4857\r\n* Fix SFT training for prompt-completion type and transformers v5 by @qgallouedec in https://github.com/huggingface/trl/pull/4880\r\n* Bugfix: Logprob drift in vLLM serving mode (compared to colocate mode) by @kdubovikov in https://github.com/huggingface/trl/pull/4873\r\n* Fix RewardTrainer's results not reproducible by @liyc-ai in https://github.com/huggingface/trl/pull/4887\r\n\r\n## New Contributors\r\n\r\n* @kdubovikov made their first contribution in https://github.com/huggingface/trl/pull/4873\r\n* @liyc-ai made their first contribution in https://github.com/huggingface/trl/pull/4887\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.27.0...v0.27.1","publishedAt":"2026-01-24T03:42:17.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.27.1","media":[]},{"id":"rel_S-yTeQ59mlshRUtbUNwVe","version":"v0.27.0","title":"v0.27.0","summary":"## Features\r\n\r\n* Add `vllm_group_port` argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in https://github.com/huggingface/trl/pull...","content":"## Features\r\n\r\n* Add `vllm_group_port` argument to GRPO, RLOO and OnlineDPO configuration by @pointerhacker in https://github.com/huggingface/trl/pull/4545\r\n* Preserve truncated tokens in BFD packing by @qgallouedec in 
https://github.com/huggingface/trl/pull/4632\r\n* Support async reward functions and parallelize call to reward functions. by @pramodith in https://github.com/huggingface/trl/pull/4567\r\n* RLOO supports async rewards. by @pramodith in https://github.com/huggingface/trl/pull/4718\r\n* Support vLLM 0.12.0 by @jiqing-feng in https://github.com/huggingface/trl/pull/4117\r\n* feat: DeepSeek V3.2 Off-policy sequence masking by @casinca in https://github.com/huggingface/trl/pull/4689\r\n* 🎭 Up to 50% less VRAM during forward with `forward_masked_logits` function by @qgallouedec in https://github.com/huggingface/trl/pull/4729\r\n* [GRPO] Add a config to limit the number of tool calling iterations by @pramodith in https://github.com/huggingface/trl/pull/4761\r\n* Switch gradient checkpointing default to use_reentrant=False (PyTorch recommended) by @qgallouedec in https://github.com/huggingface/trl/pull/4811\r\n* Add support for GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization by @nbasyl in https://github.com/huggingface/trl/pull/4785\r\n\r\n## Experimental\r\n\r\n* Move `AutoModelForCausalLMWithValueHead` and `AutoModelForSeq2SeqLMWithValueHead` to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4654\r\n* Move DPODataCollatorWithPadding to `experimental.utils`  by @qgallouedec in https://github.com/huggingface/trl/pull/4667\r\n* Move `DataCollatorForChatML` to `experimental.utils` by @qgallouedec in https://github.com/huggingface/trl/pull/4668\r\n* Move `add_bos_token_if_needed` and `add_eos_token_if_needed` to `experimental.utils` by @qgallouedec in https://github.com/huggingface/trl/pull/4674\r\n* Move `truncate_right` and `SIMPLE_CHAT_TEMPLATE` to `experimental.utils` by @qgallouedec in https://github.com/huggingface/trl/pull/4677\r\n* Move `prepare_model_for_kbit_training`, `enable_gradient_checkpointing`, `prepare_peft_model` to `experimental.utils` by @qgallouedec in 
https://github.com/huggingface/trl/pull/4704\r\n* Move `get_reward` function to `experimental.utils` by @qgallouedec in https://github.com/huggingface/trl/pull/4683\r\n* Remove experimental imports from testing_utils by @albertvillanova in https://github.com/huggingface/trl/pull/4727\r\n* ORPO: Avoid catastrophic cancellation in loss function by @hartmans in https://github.com/huggingface/trl/pull/4763\r\n* Refactor KTO [1/N]: Modernize model initialization by @albertvillanova in https://github.com/huggingface/trl/pull/4783\r\n* [GOLD] add probability merging fix to implement chain rule by @kashif in https://github.com/huggingface/trl/pull/4765\r\n* Refactor KTO coordinated with DPO [a/N]: Remove encoder-decoder support by @albertvillanova in https://github.com/huggingface/trl/pull/4792\r\n* Refactor KTO coordinated with DPO [b/N]: Simplify truncation logic by @albertvillanova in https://github.com/huggingface/trl/pull/4808\r\n\r\n## Fixes\r\n\r\n* Accounting for case `num_generations_eval=1` in the calculation of the advantage  by @qgallouedec in https://github.com/huggingface/trl/pull/4662\r\n* Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in https://github.com/huggingface/trl/pull/4663\r\n* Fix GRPO config validation in case `num_generations_eval` is specified and different than `num_generations` by @apalmas-saifh in https://github.com/huggingface/trl/pull/4682\r\n* Fix top_k default value to 0 for disabling top-k filtering by @albertvillanova in https://github.com/huggingface/trl/pull/4695\r\n* Include `generation_config` for tiny model uploads by @qgallouedec in https://github.com/huggingface/trl/pull/4643\r\n* Fix KeyError with transformers 5.0.0+ where push_to_hub_token is removed by @Manodeepray in https://github.com/huggingface/trl/pull/4691\r\n* Overwrite model default generation config used by model.generate by @albertvillanova in https://github.com/huggingface/trl/pull/4647\r\n* Fix: handle multiple tool calls 
in `qwen3_schema` by @mattbui in https://github.com/huggingface/trl/pull/4709\r\n* Fix bugs when using multi-gpu: dataset streaming for offline trainers + dtype initialization by @kaixuanliu in https://github.com/huggingface/trl/pull/3950\r\n* Ensure llm-blender is importable with transformers >= v5 by @albertvillanova in https://github.com/huggingface/trl/pull/4781\r\n* Monkey patch for `HybridCache` in Liger-Kernel with transformers v5 by @qgallouedec in https://github.com/huggingface/trl/pull/4798\r\n* [fix] GRPOTrainer: proper access `args` by @carlyou in https://github.com/huggingface/trl/pull/4801\r\n* Fix vllm compat patches to be applied only to affected versions by @albertvillanova in https://github.com/huggingface/trl/pull/4815\r\n* fix bug when sft calc outputs.token_accuracy by @kaixuanliu in https://github.com/huggingface/trl/pull/4814\r\n* fix xpu vllm client server by @jiqing-feng in https://github.com/huggingface/trl/pull/4780\r\n\r\n## Documentation and Examples\r\n\r\n* docs: add RapidFire AI integration section to SFT Trainer by @kamran-rapidfireAI in https://github.com/huggingface/trl/pull/4661\r\n* Fix environment image name for BrowserGym example script by @sergiopaniego in https://github.com/huggingface/trl/pull/4680\r\n* Docs(`grpo_trainer.md`): Added Qwen SAPO details under `Loss Types` by @casinca in https://github.com/huggingface/trl/pull/4681\r\n* [docs] Adds GRPO, RSO and LoRA to Paper Index by @SSusantAchary in https://github.com/huggingface/trl/pull/4441\r\n* Enable zero3 init and 16-bit model saving for ds ulysses config by @edbeeching in https://github.com/huggingface/trl/pull/4701\r\n* Set version to packaged one in notebooks by @sergiopaniego in https://github.com/huggingface/trl/pull/4648\r\n* BrowserGym example for LLMs (no vision) by @sergiopaniego in https://github.com/huggingface/trl/pull/4696\r\n* docs: Add RapidFire AI cross-references to DPO and GRPO trainer docs by @kamran-rapidfireAI in 
https://github.com/huggingface/trl/pull/4705\r\n* [docs] Fix RapidFire AI position in documentation by @qgallouedec in https://github.com/huggingface/trl/pull/4715\r\n* Add inference example to GRPO agent training notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4710\r\n* Upload FunctionGemma notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4721\r\n* Update agents notebook dependencies by @sergiopaniego in https://github.com/huggingface/trl/pull/4724\r\n* Add uv/hf jobs support to OpenEnv scripts  by @sergiopaniego in https://github.com/huggingface/trl/pull/4720\r\n* Add GRPO QLoRA free notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4660\r\n* Hotfix for browsergym openenv notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4740\r\n* docs: fix \"Good Second Issue\" redirection link by @casinca in https://github.com/huggingface/trl/pull/4749\r\n* [Docs] Add SRL (Supervised Reinforcement Learning) to Community Tutorials by @s23deepak in https://github.com/huggingface/trl/pull/4758\r\n* Add LFM2.5 to GRPO notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4793\r\n* Sudoku GRPO example script using TextArena by @sergiopaniego in https://github.com/huggingface/trl/pull/4762\r\n* [EXAMPLES] Update wordle to new openenv release by @burtenshaw in https://github.com/huggingface/trl/pull/4791\r\n* Update the typos in docs/source/grpo_trainer.md by @Tianyi-Billy-Ma in https://github.com/huggingface/trl/pull/4804\r\n* Updat examples to new OpenEnv version by @sergiopaniego in https://github.com/huggingface/trl/pull/4796\r\n* Update GRPO example to use Qwen2.5 instead of Qwen2 by @BurnyCoder in https://github.com/huggingface/trl/pull/4803\r\n\r\n## Deprecations\r\n\r\n* Remove deprecated functions and parameters by @qgallouedec in https://github.com/huggingface/trl/pull/4651\r\n* Remove `MergeModelCallback` from import structure by @qgallouedec in 
https://github.com/huggingface/trl/pull/4664\r\n* Remove `ChatMlSpecialTokens` by @qgallouedec in https://github.com/huggingface/trl/pull/4666\r\n* Remove unused `_win_rate_completions_df` function from callbacks by @qgallouedec in https://github.com/huggingface/trl/pull/4672\r\n* Deprecate max_prompt_length in RLOOTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4703\r\n* Small fix on contributing docs by @murilo-cunha in https://github.com/huggingface/trl/pull/4753\r\n* Remove `DbrxForCausalLM` support by @qgallouedec in https://github.com/huggingface/trl/pull/4799\r\n\r\n## CI Improvements\r\n\r\n* Hotfix CI due to generation config by setting tests as xfail by @albertvillanova in https://github.com/huggingface/trl/pull/4657\r\n* Upgrade GitHub Actions to latest versions by @salmanmkc in https://github.com/huggingface/trl/pull/4734\r\n* Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in https://github.com/huggingface/trl/pull/4733\r\n* Include data type for tiny models and update tests by @qgallouedec in https://github.com/huggingface/trl/pull/4728\r\n* Change tiny model dtype from float16 to bfloat16 to fix CUDA error by @albertvillanova in https://github.com/huggingface/trl/pull/4745\r\n* Add revision override mechanism for testing tiny models by @albertvillanova in https://github.com/huggingface/trl/pull/4769\r\n* Hotfix: Set float32 as default dtype for testing tiny models by @albertvillanova in https://github.com/huggingface/trl/pull/4770\r\n* Hotfix CI with dev dependencies: xfail test_training_vlm_and_liger by @albertvillanova in https://github.com/huggingface/trl/pull/4777\r\n* Add initial multi-GPU CI tests for distributed training by @qgallouedec in https://github.com/huggingface/trl/pull/4784\r\n* Set dtype default to float32 by @albertvillanova in https://github.com/huggingface/trl/pull/4778\r\n* Test FSDP2 by @qgallouedec in https://github.com/huggingface/trl/pull/4813\r\n* Test ZeRO Stage 3 by @qgallouedec in 
https://github.com/huggingface/trl/pull/4821\r\n* Hotfix CI main tests: Pin transformers 4.57.4 by @albertvillanova in https://github.com/huggingface/trl/pull/4830\r\n* Hotfix CI distributed smoke tests: xfail test_sft_peft[zero3] by @albertvillanova in https://github.com/huggingface/trl/pull/4831\r\n* Test ZeRO Stage 2 by @qgallouedec in https://github.com/huggingface/trl/pull/4822\r\n\r\n## Miscellaneous\r\n\r\n* Move `compute_accuracy` to PRM Trainer file by @qgallouedec in https://github.com/huggingface/trl/pull/4656\r\n* Move `clone_chat_template` to `chat_template_utils` by @qgallouedec in https://github.com/huggingface/trl/pull/4653\r\n* Move `GeometricMixtureWrapper` to `nash_md_trainer.py` by @qgallouedec in https://github.com/huggingface/trl/pull/4670\r\n* Move `exact_div`, `print_rich_table`, `truncate_response`, `forward` to `ppo_trainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4676\r\n* Merge `OnPolicyConfig` and `PPOConfig` and move `OnlineTrainerState` by @qgallouedec in https://github.com/huggingface/trl/pull/4671\r\n* Move PEFT tests for `AutoModelForCausalLMWithValueHead` to `test_ppo_trainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4678\r\n* Move `generate` and `batch_generation` to `ppo_trainer.py` by @qgallouedec in https://github.com/huggingface/trl/pull/4675\r\n* Import `TrainerCallback` from top-level transformers by @qgallouedec in https://github.com/huggingface/trl/pull/4694\r\n* Fix typos by @qgallouedec in https://github.com/huggingface/trl/pull/4690\r\n* Align import utils with transformers by @qgallouedec in https://github.com/huggingface/trl/pull/4684\r\n* Align stable trainers by @qgallouedec in https://github.com/huggingface/trl/pull/4687\r\n* Align GRPO and RLOO initialization by @qgallouedec in https://github.com/huggingface/trl/pull/4685\r\n* Align use of vllm_max_model_length in RLOOTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4702\r\n* Align RLOO with GRPO by 
@qgallouedec in https://github.com/huggingface/trl/pull/4706\r\n* Fix test assertion for `top_k` parameter in `OnlineDPOTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4714\r\n* Disallow `PeftModel` + `peft_config` in trainers by @qgallouedec in https://github.com/huggingface/trl/pull/4713\r\n* Fix deprecation version for RLOO max_prompt_length by @albertvillanova in https://github.com/huggingface/trl/pull/4726\r\n* Refactor vLLM generation [3/N]: Decouple profiling from trainer by @albertvillanova in https://github.com/huggingface/trl/pull/4717\r\n* Avoid docstyle formatting for `TestParseResponse` by @qgallouedec in https://github.com/huggingface/trl/pull/4736\r\n* 🥂 Happy New Year by @qgallouedec in https://github.com/huggingface/trl/pull/4775\r\n* Update import structure by @qgallouedec in https://github.com/huggingface/trl/pull/4665\r\n* Improve PEFT integration by @qgallouedec in https://github.com/huggingface/trl/pull/4723\r\n* Replace `GuidedDecodingParams` with `StructuredOutputsParams` in sampling parameter configuration by @qgallouedec in https://github.com/huggingface/trl/pull/4797\r\n* Move compatibility shims to dedicated module _compat by @albertvillanova in https://github.com/huggingface/trl/pull/4807\r\n* Refactor _compat module by @albertvillanova in https://github.com/huggingface/trl/pull/4809\r\n* Revised comments explaining the higher learning rate choice given tiny gradients by @qgallouedec in https://github.com/huggingface/trl/pull/4810\r\n* Simplify version checks in compat patches by @albertvillanova in https://github.com/huggingface/trl/pull/4817\r\n* Set packaging as explicit dependency and standardize version comparison by @albertvillanova in https://github.com/huggingface/trl/pull/4819\r\n* Fix _patch_transformers_hybrid_cache also for peft by @albertvillanova in https://github.com/huggingface/trl/pull/4820\r\n* Fix _patch_vllm_cached_tokenizer to only apply if transformers >= v5 by @albertvillanova in 
https://github.com/huggingface/trl/pull/4827\r\n* Fix code quality in SFTTrainer file by @albertvillanova in https://github.com/huggingface/trl/pull/4832\r\n\r\n\r\n--\r\n\r\n## New Contributors\r\n* @pointerhacker made their first contribution in https://github.com/huggingface/trl/pull/4545\r\n* @apalmas-saifh made their first contribution in https://github.com/huggingface/trl/pull/4663\r\n* @Manodeepray made their first contribution in https://github.com/huggingface/trl/pull/4691\r\n* @salmanmkc made their first contribution in https://github.com/huggingface/trl/pull/4734\r\n* @mattbui made their first contribution in https://github.com/huggingface/trl/pull/4709\r\n* @murilo-cunha made their first contribution in https://github.com/huggingface/trl/pull/4753\r\n* @hartmans made their first contribution in https://github.com/huggingface/trl/pull/4763\r\n* @s23deepak made their first contribution in https://github.com/huggingface/trl/pull/4758\r\n* @Tianyi-Billy-Ma made their first contribution in https://github.com/huggingface/trl/pull/4804\r\n* @carlyou made their first contribution in https://github.com/huggingface/trl/pull/4801\r\n* @BurnyCoder made their first contribution in https://github.com/huggingface/trl/pull/4803\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.26.0...v0.27.0","publishedAt":"2026-01-16T02:34:32.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.27.0","media":[]},{"id":"rel_TZUcFqIsv6uXAM3nPiDt8","version":"v0.26.2","title":"v0.26.2","summary":"## What's Changed\r\n\r\n* Overwrite model default generation config used by model.generate by @albertvillanova in https://github.com/huggingface/trl/pull...","content":"## What's Changed\r\n\r\n* Overwrite model default generation config used by model.generate by @albertvillanova in https://github.com/huggingface/trl/pull/4647\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/trl/compare/v0.26.1...v0.26.2","publishedAt":"2025-12-18T15:55:24.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.26.2","media":[]},{"id":"rel_qzCrtfI5erUFR7ZSCJIHs","version":"v0.26.1","title":"v0.26.1","summary":"## What's Changed\r\n\r\n* Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in https://github.com/huggingface/trl...","content":"## What's Changed\r\n\r\n* Fix vLLM error for tools usage not supported when running GRPO training by @apalmas-saifh in https://github.com/huggingface/trl/pull/4663\r\n* Fix GRPO config validation in case `num_generations_eval` is specified and different than `num_generations` by @apalmas-saifh in https://github.com/huggingface/trl/pull/4682\r\n\r\n## New Contributors\r\n\r\n* @apalmas-saifh made their first contribution in https://github.com/huggingface/trl/pull/4663\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.26.0...v0.26.1","publishedAt":"2025-12-12T17:50:48.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.26.1","media":[]},{"id":"rel_Pid8rQyQ9-c3ttv41XXEr","version":"v0.26.0","title":"v0.26.0","summary":"## Features\r\n\r\n### 🕵️‍♂️ GRPO: Agent training\r\n\r\n`GRPOTrainer` now supports training agents using tools. This allows language models to interact with...","content":"## Features\r\n\r\n### 🕵️‍♂️ GRPO: Agent training\r\n\r\n`GRPOTrainer` now supports training agents using tools. 
This allows language models to interact with external functions or APIs during training.\r\n\r\n```python\r\nfrom datasets import Dataset\r\nfrom trl import GRPOTrainer\r\n\r\ndef multiply(a: int, b: int) -> int:\r\n    \"\"\"\r\n    Multiplies two integers.\r\n\r\n    Args:\r\n        a: The first integer.\r\n        b: The second integer.\r\n\r\n    Returns:\r\n        The product of the two integers.\r\n    \"\"\"\r\n    return a * b\r\n\r\n\r\ndataset = Dataset.from_list(\r\n    [\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"What is 3 multiplied by 4?\"}], \"answer\": 12},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"Calculate 7 times 8.\"}], \"answer\": 56},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"Find the product of 5 and 6.\"}], \"answer\": 30},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"What do you get when you multiply 9 by 9?\"}], \"answer\": 81},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"Compute 12 multiplied by 11.\"}], \"answer\": 132},\r\n        {\"prompt\": [{\"role\": \"user\", \"content\": \"What is 15 times 14?\"}], \"answer\": 210},\r\n    ]\r\n)\r\n\r\ndef accuracy(completions, answer, **kwargs):\r\n    predictions = [completion[-1][\"content\"] for completion in completions]\r\n    rewards = [float(str(ans) in pred) for pred, ans in zip(predictions, answer)]\r\n    return rewards\r\n\r\ntrainer = GRPOTrainer(\r\n    model=\"Qwen/Qwen3-0.6B\",\r\n    train_dataset=dataset,\r\n    tools=[multiply],\r\n    reward_funcs=accuracy,\r\n)\r\ntrainer.train()\r\n```\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/4300\r\n\r\n### ScaleRL: Add CISPO Loss\r\n\r\nCISPO Loss was first introduced in the [Minimax-M1 paper](https://huggingface.co/papers/2506.13585); the [ScaleRL paper](https://huggingface.co/papers/2510.13786) subsequently showed that CISPO loss scales the best in terms of performance and efficiency as models are trained for 
longer.\r\n\r\n`GRPOTrainer` now supports the CISPO loss using `loss_type=\"cispo\"` in the `GRPOConfig`.\r\n\r\nby @pramodith in https://github.com/huggingface/trl/pull/4495\r\n\r\n### Add vLLM quantization option for colocate\r\n\r\nWhen the input model is quantized using bitsandbytes, vLLM will now also use quantization when in colocate mode.\r\n\r\nby @sergiopaniego in https://github.com/huggingface/trl/pull/4496\r\n\r\n### Reasoning reward\r\n\r\nTRL now includes a reasoning reward function:\r\n\r\n```python\r\nfrom trl.rewards import reasoning_accuracy_reward\r\n\r\nsolutions = [r\"\\frac{1}{3}\", r\"\\frac{1}{3}\", r\"\\frac{1}{3}\"]\r\ncompletions = [\r\n    [\r\n        {\r\n            \"role\": \"assistant\",\r\n            \"content\": r\"<think> Reasoning content </think> The final answer is \\boxed{\\frac{1}{3}}\",\r\n        }\r\n    ],\r\n    [\r\n        {\r\n            \"role\": \"assistant\",\r\n            \"content\": r\"<think> Reasoning content </think> The final answer is \\boxed{\\frac{1}{2}}\",\r\n        }\r\n    ],\r\n    [\r\n        {\r\n            \"role\": \"assistant\",\r\n            \"content\": r\"<think> Reasoning content with partial answers \\boxed{\\frac{1}{3}} but no final answer\",\r\n        }\r\n    ],\r\n]\r\nreasoning_accuracy_reward(completions, solutions)  # [1.0, 0.0, 0.0] \r\n```\r\n\r\nLike any other reward function, it can be used in `GRPOTrainer` or `RLOOTrainer`.\r\n\r\n```python\r\nfrom trl import GRPOTrainer\r\nfrom trl.rewards import reasoning_accuracy_reward\r\n\r\ntrainer = GRPOTrainer(\r\n    ...,\r\n    reward_funcs=reasoning_accuracy_reward,\r\n)\r\n```\r\n\r\nby @lewtun in https://github.com/huggingface/trl/pull/4563\r\n\r\n### Add `shuffle_dataset` option to `SFTTrainer`\r\n\r\nYou can now shuffle the dataset in `SFTTrainer` by setting the `shuffle_dataset` argument to `True` in `SFTConfig`. 
This is useful when the dataset features high similarity between consecutive samples.\r\n\r\n```python\r\nfrom trl import SFTTrainer, SFTConfig\r\n\r\nSFTConfig(shuffle_dataset=True)\r\n```\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/4564\r\n\r\n### Add SAPO Loss in GRPO\r\n\r\nSoft Adaptive Policy Optimization (SAPO) replaces hard clipping with a smooth, temperature-controlled gate that adaptively attenuates off-policy updates while preserving useful learning signals. Compared with GSPO and GRPO, SAPO is both sequence-coherent and token-adaptive. Like GSPO, SAPO maintains sequence-level coherence, but its soft gating forms a continuous trust region that avoids the brittle hard clipping band used in GSPO.\r\n\r\nYou can now use SAPO loss in `GRPOTrainer` by setting `loss_type=\"sapo\"` in the `GRPOConfig`.\r\n\r\nby @pramodith in https://github.com/huggingface/trl/pull/4600\r\n\r\n### Other Features\r\n\r\n* Support completion bootstrap for VLM in GRPO/RLOO by @SolarWindRider in https://github.com/huggingface/trl/pull/4452\r\n* Add support for images inside tables with Trackio completions logging by @taha-yassine in https://github.com/huggingface/trl/pull/4505\r\n* Add step time metric to GRPO Trainer for performance tracking by @qgallouedec in https://github.com/huggingface/trl/pull/4516\r\n* Add target_parameters to LoraConfig by @jonnyli1125 in https://github.com/huggingface/trl/pull/4536\r\n* [SFT] Log mean token accuracy from Liger kernel by @kashif in https://github.com/huggingface/trl/pull/4302\r\n* Add `num_generations_eval` parameter for efficient evaluation by @mingxuetian in https://github.com/huggingface/trl/pull/4458\r\n* [GRPO] Sequence-level TIS & MIS by @LeonEricsson in https://github.com/huggingface/trl/pull/4530\r\n* TRL supports vLLM 0.11 by @qgallouedec in https://github.com/huggingface/trl/pull/4633\r\n* feat: implement DeepSeek unbiased KL estimator for GRPO by @jlcanta in 
https://github.com/huggingface/trl/pull/4638\r\n\r\n## Experimental\r\n\r\n* Move XPOTrainer to trl.experimental.xpo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4485\r\n* Move judges to experimental submodule by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4439\r\n* Add MiniLLM Trainer by @t1101675 in https://github.com/huggingface/trl/pull/4504\r\n* refactor: Move CPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4470\r\n* Move GKDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4474\r\n* Move NashMDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4477\r\n* Move PPOTrainer to trl.experimental.ppo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4482\r\n* [ORPO] Move ORPOTrainer to experimental by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4480\r\n* Move PRMTrainer to trl.experimental.prm by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4483\r\n* Move OnlineDPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4473\r\n* Move `WinRateCallback` to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4558\r\n* Move tests for GSPOTokenTrainer to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4572\r\n* Raise FutureWarning for classes moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4605\r\n* Move MergeModelCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4608\r\n* Raise FutureWarning for trainer moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4620\r\n* Remove no longer applicable warning once BCO was moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4628\r\n* Refactor suppression of warning at experimental 
import by @albertvillanova in https://github.com/huggingface/trl/pull/4629\r\n* 🚚 Move KTO to trl.experimental by @neha222222 in https://github.com/huggingface/trl/pull/4575\r\n\r\n## Fixes\r\n\r\n* Buffer samples based on group level stds. by @pramodith in https://github.com/huggingface/trl/pull/4492\r\n* Fix bugs in CISPO conditions by @pramodith in https://github.com/huggingface/trl/pull/4499\r\n* `device_map` and `dtype` to `\"auto\"` by default by @qgallouedec in https://github.com/huggingface/trl/pull/4509\r\n* MiniLLM: Fix arguments in config & add to documentation index by @t1101675 in https://github.com/huggingface/trl/pull/4518\r\n* [Bug Fix] OnlineDPOTrainer with vLLM Server Mode by @YangKai0616 in https://github.com/huggingface/trl/pull/4500\r\n* Rename `flash-attn` to `flash-attn2` by @qgallouedec in https://github.com/huggingface/trl/pull/4514\r\n* fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type by @fabio-sim in https://github.com/huggingface/trl/pull/4526\r\n* Fix bug with VLM processors in prompt-completion completion text-only training by @kschwethelm in https://github.com/huggingface/trl/pull/4553\r\n* fix+docs: `device_map=None` for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in https://github.com/huggingface/trl/pull/4551\r\n* Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process by @qgallouedec in https://github.com/huggingface/trl/pull/4571\r\n* fix: use shift_labels for metrics when using CP or SP by @jue-jue-zi in https://github.com/huggingface/trl/pull/4579\r\n* Fix 'generation_config' AttributeError by @albertvillanova in https://github.com/huggingface/trl/pull/4596\r\n* Fix FSDP2 model key miss match when sync LoRA model to vLLM server by @Xiao-Chenguang in https://github.com/huggingface/trl/pull/4603\r\n* Fix KTOTrainer CUDA error for large-vocab models via tensor indexing by @bhuvanprakash in 
https://github.com/huggingface/trl/pull/4635\r\n\r\n## Documentation and Examples\r\n\r\n* docs: Add PEFT subsection to reducing memory usage guide by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4430\r\n* [DOCS] update and fix openenv by @burtenshaw in https://github.com/huggingface/trl/pull/4490\r\n* Fix link to OpenEnv docs by @lukehinds in https://github.com/huggingface/trl/pull/4502\r\n* Tweak description for vLLM sleep mode by @lewtun in https://github.com/huggingface/trl/pull/4506\r\n* Paper Index: Change `num_completions` to `num_generations` by @pramodith in https://github.com/huggingface/trl/pull/4515\r\n* docs: Extend CLI basic usage examples to all supported CLIs by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4425\r\n* [OpenEnv] add vllm colocate mode to openenv scripts by @kashif in https://github.com/huggingface/trl/pull/4510\r\n* [Doc] Drop dummy reward and dataset for DeepMath-103K and accuracy reward by @qgallouedec in https://github.com/huggingface/trl/pull/4524\r\n* Add OpenEnv Script examples to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4533\r\n* Update OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4547\r\n* [OpenEnv] browsergym example script by @kashif in https://github.com/huggingface/trl/pull/4539\r\n* Update OpenEnv guide with latest details by @sergiopaniego in https://github.com/huggingface/trl/pull/4552\r\n* Add GRPO Wordle OpenEnv Colab by @sergiopaniego in https://github.com/huggingface/trl/pull/4542\r\n* Update OpenEnv guide with new notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4555\r\n* docs: add KTO (2402.01306) to Paper Index + link ref to KTOTrainer by @SSusantAchary in https://github.com/huggingface/trl/pull/4440\r\n* Add LFM2 to SFT notebook examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4455\r\n* docs: Rewrite PEFT integration guide with comprehensive examples by @behroozazarkhalili 
in https://github.com/huggingface/trl/pull/4421\r\n* Reorder documentation TOC to surface key trainer sections by @qgallouedec in https://github.com/huggingface/trl/pull/4565\r\n* Fix typo in GRPO description in README by @iliasmerigh in https://github.com/huggingface/trl/pull/4573\r\n* Fix Replay Buffer docs. by @pramodith in https://github.com/huggingface/trl/pull/4574\r\n* Fix PPO example by @qgallouedec in https://github.com/huggingface/trl/pull/4556\r\n* docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index by @xuanduy04 in https://github.com/huggingface/trl/pull/4580\r\n* docs: Expand training customization examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4427\r\n* docs: Expand speeding up training guide with acceleration methods by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4428\r\n* Update How-to guides by @qgallouedec in https://github.com/huggingface/trl/pull/4604\r\n* Fixed OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4610\r\n* Add ministral 3 free notebooks by @sergiopaniego in https://github.com/huggingface/trl/pull/4614\r\n* Replace arXiv paper links with HF links by @albertvillanova in https://github.com/huggingface/trl/pull/4613\r\n* Add experimental imports to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4616\r\n* Fix README style by @sergiopaniego in https://github.com/huggingface/trl/pull/4619\r\n* Fix link to OpenEnv blog in docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4625\r\n* Update ministral notebooks with official bf16 ckpt by @sergiopaniego in https://github.com/huggingface/trl/pull/4626\r\n* Add missing experimental autodoc classes to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4618\r\n* Add logos as assets by @qgallouedec in https://github.com/huggingface/trl/pull/4627\r\n* fix(PPO examples): passing model dict to models by @casinca in 
https://github.com/huggingface/trl/pull/4630\r\n* [ALST/Ulysses] Added ALST/Ulysses documentation by @kashif in https://github.com/huggingface/trl/pull/4420\r\n* Adding  EssentialAI/rnj-1-instruct GRPO example by @sergiopaniego in https://github.com/huggingface/trl/pull/4640\r\n* Update `rnj_1_instruct` notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4646\r\n* Add agent training notebook to examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4645\r\n\r\n## Deprecations\r\n\r\n* Replace `wandb_log_unique_prompts` with `log_unique_prompts` by @taha-yassine in https://github.com/huggingface/trl/pull/4508\r\n* Remove deprecations for 0.26 release by @albertvillanova in https://github.com/huggingface/trl/pull/4607\r\n* Remove deprecated batched formatting in GOLDTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4622\r\n\r\n## Miscellaneous\r\n\r\n* ⛴️ Add kernels to Docker images by @ishitab02 in https://github.com/huggingface/trl/pull/4445\r\n* Replace accelerate logging with stdlib in CLI by @lewtun in https://github.com/huggingface/trl/pull/4512\r\n* Replace flash attention2 with kernels-community/flash-attn2 by @tamoghnokandar in https://github.com/huggingface/trl/pull/4426\r\n* Fix Docker images for Liger by @lewtun in https://github.com/huggingface/trl/pull/4522\r\n* Remove test trainer args by @qgallouedec in https://github.com/huggingface/trl/pull/4517\r\n* Prevent upcasting norm layers in `prepare_model_for_kbit_training` by @sergiopaniego in https://github.com/huggingface/trl/pull/4457\r\n* Remove module-level imports of extra deps in experimental.judges by @albertvillanova in https://github.com/huggingface/trl/pull/4598\r\n* Clean up model preparation  by @qgallouedec in https://github.com/huggingface/trl/pull/4577\r\n* Remove deprecation warning from RLOOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4644\r\n* Disable gradient checkpointing during no-grad inference to avoid 
PyTorch warning by @qgallouedec in https://github.com/huggingface/trl/pull/4636\r\n\r\n## What's Changed\r\n\r\n* ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/4479\r\n* Add LFM2 to SFT notebook examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4455\r\n* Add tiny model Qwen3VLForConditionalGeneration to CI by @albertvillanova in https://github.com/huggingface/trl/pull/4494\r\n* Buffer samples based on group level stds. by @pramodith in https://github.com/huggingface/trl/pull/4492\r\n* Move XPOTrainer to trl.experimental.xpo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4485\r\n* ⛴️ Add kernels to Docker images by @ishitab02 in https://github.com/huggingface/trl/pull/4445\r\n* ScaleRL: Add CISPO Loss by @pramodith in https://github.com/huggingface/trl/pull/4495\r\n* Support completion bootstrap for VLM in GRPO/RLOO by @SolarWindRider in https://github.com/huggingface/trl/pull/4452\r\n* docs: Add PEFT subsection to reducing memory usage guide by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4430\r\n* Fix bugs in CISPO conditions by @pramodith in https://github.com/huggingface/trl/pull/4499\r\n* Move judges to experimental submodule by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4439\r\n* [DOCS] update and fix openenv by @burtenshaw in https://github.com/huggingface/trl/pull/4490\r\n* Consistency regarding relative imports by @qgallouedec in https://github.com/huggingface/trl/pull/4498\r\n* Fix link to OpenEnv docs by @lukehinds in https://github.com/huggingface/trl/pull/4502\r\n* Tweak description for vLLM sleep mode by @lewtun in https://github.com/huggingface/trl/pull/4506\r\n* Add support for images inside tables with Trackio completions logging by @taha-yassine in https://github.com/huggingface/trl/pull/4505\r\n* Add MiniLLM Trainer by @t1101675 in https://github.com/huggingface/trl/pull/4504\r\n* Replace accelerate logging with stdlib in CLI by @lewtun in 
https://github.com/huggingface/trl/pull/4512\r\n* Add temporary workaround for `lr_scheduler_kwargs` dtype issue in Transformers 4.57.0 by @qgallouedec in https://github.com/huggingface/trl/pull/4513\r\n* `device_map` and `dtype` to `\"auto\"` by default by @qgallouedec in https://github.com/huggingface/trl/pull/4509\r\n* Replace `wandb_log_unique_prompts` with `log_unique_prompts` by @taha-yassine in https://github.com/huggingface/trl/pull/4508\r\n* refactor: Move CPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4470\r\n* MiniLLM: Fix arguments in config & add to documentation index by @t1101675 in https://github.com/huggingface/trl/pull/4518\r\n* Replace flash attention2 with kernels-community/flash-attn2 by @tamoghnokandar in https://github.com/huggingface/trl/pull/4426\r\n* Move GKDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4474\r\n* Paper Index: Change `num_completions` to `num_generations` by @pramodith in https://github.com/huggingface/trl/pull/4515\r\n* Fix Docker images for Liger by @lewtun in https://github.com/huggingface/trl/pull/4522\r\n* [Bug Fix] OnlineDPOTrainer with vLLM Server Mode by @YangKai0616 in https://github.com/huggingface/trl/pull/4500\r\n* Move NashMDTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4477\r\n* Move PPOTrainer to trl.experimental.ppo by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4482\r\n* Add step time metric to GRPO Trainer for performance tracking by @qgallouedec in https://github.com/huggingface/trl/pull/4516\r\n* Rename `flash-attn` to `flash-attn2` by @qgallouedec in https://github.com/huggingface/trl/pull/4514\r\n* Remove test trainer args by @qgallouedec in https://github.com/huggingface/trl/pull/4517\r\n* docs: Extend CLI basic usage examples to all supported CLIs by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4425\r\n* Prevent 
upcasting norm layers in `prepare_model_for_kbit_training` by @sergiopaniego in https://github.com/huggingface/trl/pull/4457\r\n* Add vLLM quantization option for colocate by @sergiopaniego in https://github.com/huggingface/trl/pull/4496\r\n* fix(GOLDTrainer): Resolve incorrect attribute access and VLLMClient.generate() output type by @fabio-sim in https://github.com/huggingface/trl/pull/4526\r\n* [OpenEnv] add vllm colocate mode to openenv scripts by @kashif in https://github.com/huggingface/trl/pull/4510\r\n* [Doc] Drop dummy reward and dataset for DeepMath-103K and accuracy reward by @qgallouedec in https://github.com/huggingface/trl/pull/4524\r\n* Add OpenEnv Script examples to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4533\r\n* Update OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4547\r\n* [OpenEnv] browsergym example script by @kashif in https://github.com/huggingface/trl/pull/4539\r\n* Update OpenEnv guide with latest details by @sergiopaniego in https://github.com/huggingface/trl/pull/4552\r\n* Fix bug with VLM processors in prompt-completion completion text-only training by @kschwethelm in https://github.com/huggingface/trl/pull/4553\r\n* Add target_parameters to LoraConfig by @jonnyli1125 in https://github.com/huggingface/trl/pull/4536\r\n* fix+docs: `device_map=None` for DeepSpeed and add ZeRO paper (1910.02054) to Paper Index by @JenWei0312 in https://github.com/huggingface/trl/pull/4551\r\n* [ORPO] Move ORPOTrainer to experimental by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4480\r\n* Add GRPO Wordle OpenEnv Colab by @sergiopaniego in https://github.com/huggingface/trl/pull/4542\r\n* Update OpenEnv guide with new notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4555\r\n* Move PRMTrainer to trl.experimental.prm by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4483\r\n* docs: add KTO (2402.01306) to Paper Index + link ref to 
KTOTrainer by @SSusantAchary in https://github.com/huggingface/trl/pull/4440\r\n* [SFT] Log mean token accuracy from Liger kernel by @kashif in https://github.com/huggingface/trl/pull/4302\r\n* Move OnlineDPOTrainer to experimental module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4473\r\n* Add `num_generations_eval` parameter for efficient evaluation by @mingxuetian in https://github.com/huggingface/trl/pull/4458\r\n* docs: Rewrite PEFT integration guide with comprehensive examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4421\r\n* Reorder documentation TOC to surface key trainer sections by @qgallouedec in https://github.com/huggingface/trl/pull/4565\r\n* Reasoning reward by @lewtun in https://github.com/huggingface/trl/pull/4563\r\n* Fix vLLM sleep mode: add collective RPC call to reload weights in vLLM wake-up process by @qgallouedec in https://github.com/huggingface/trl/pull/4571\r\n* Fix typo in GRPO description in README by @iliasmerigh in https://github.com/huggingface/trl/pull/4573\r\n* Add `shuffle_dataset` option to `SFTTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4564\r\n* Fix Replay Buffer docs. 
by @pramodith in https://github.com/huggingface/trl/pull/4574\r\n* Fix PPO example by @qgallouedec in https://github.com/huggingface/trl/pull/4556\r\n* Move `WinRateCallback` to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4558\r\n* Move tests for GSPOTokenTrainer to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4572\r\n* Revert hotfix Fall back to config.text_config._name_or_path by @albertvillanova in https://github.com/huggingface/trl/pull/4581\r\n* fix: use shift_labels for metrics when using CP or SP by @jue-jue-zi in https://github.com/huggingface/trl/pull/4579\r\n* Add missing require_bitsandbytes marker to CI tests by @albertvillanova in https://github.com/huggingface/trl/pull/4586\r\n* Remove module-level imports of extra deps in experimental.judges by @albertvillanova in https://github.com/huggingface/trl/pull/4598\r\n* Fix 'generation_config' AttributeError by @albertvillanova in https://github.com/huggingface/trl/pull/4596\r\n* Revert \"Hotfix CI with dev dependencies: xfail test_prepare_inputs_for_generation\" by @albertvillanova in https://github.com/huggingface/trl/pull/4587\r\n* docs: Add Beyond the 80/20 Rule (2506.01939) to Paper Index by @xuanduy04 in https://github.com/huggingface/trl/pull/4580\r\n* [GRPO] Sequence-level TIS & MIS by @LeonEricsson in https://github.com/huggingface/trl/pull/4530\r\n* docs: Expand training customization examples by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4427\r\n* docs: Expand speeding up training guide with acceleration methods by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4428\r\n* Raise FutureWarning for classes moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4605\r\n* Update How-to guides by @qgallouedec in https://github.com/huggingface/trl/pull/4604\r\n* Silence experimental warnings when imported in the stable by @qgallouedec in 
https://github.com/huggingface/trl/pull/4606\r\n* Remove deprecations for 0.26 release by @albertvillanova in https://github.com/huggingface/trl/pull/4607\r\n* Fixed OpenEnv example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4610\r\n* Move MergeModelCallback to experimental by @qgallouedec in https://github.com/huggingface/trl/pull/4608\r\n* [GRPOTrainer]: Add SAPO Loss by @pramodith in https://github.com/huggingface/trl/pull/4600\r\n* Add ministral 3 free notebooks by @sergiopaniego in https://github.com/huggingface/trl/pull/4614\r\n* Replace arXiv paper links with HF links by @albertvillanova in https://github.com/huggingface/trl/pull/4613\r\n* Add experimental imports to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4616\r\n* Fix README style by @sergiopaniego in https://github.com/huggingface/trl/pull/4619\r\n* Fix link to OpenEnv blog in docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4625\r\n* Update ministral notebooks with official bf16 ckpt by @sergiopaniego in https://github.com/huggingface/trl/pull/4626\r\n* Remove deprecated batched formatting in GOLDTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4622\r\n* Clean up model preparation  by @qgallouedec in https://github.com/huggingface/trl/pull/4577\r\n* Silence experimental warning during docs build by @albertvillanova in https://github.com/huggingface/trl/pull/4623\r\n* Raise warnings at 2nd stack level by @albertvillanova in https://github.com/huggingface/trl/pull/4621\r\n* Raise FutureWarning for trainer moved to experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4620\r\n* Add missing experimental autodoc classes to docs by @albertvillanova in https://github.com/huggingface/trl/pull/4618\r\n* Add logos as assets by @qgallouedec in https://github.com/huggingface/trl/pull/4627\r\n* Remove no longer applicable warning once BCO was moved to experimental by @albertvillanova in 
https://github.com/huggingface/trl/pull/4628\r\n* Refactor suppression of warning at experimental import by @albertvillanova in https://github.com/huggingface/trl/pull/4629\r\n* fix(PPO examples): passing model dict to models by @casinca in https://github.com/huggingface/trl/pull/4630\r\n* Fix FSDP2 model key miss match when sync LoRA model to vLLM server by @Xiao-Chenguang in https://github.com/huggingface/trl/pull/4603\r\n* TRL supports vLLM 0.11 by @qgallouedec in https://github.com/huggingface/trl/pull/4633\r\n* [ALST/Ulysses] Added ALST/Ulysses documentation by @kashif in https://github.com/huggingface/trl/pull/4420\r\n* Adding  EssentialAI/rnj-1-instruct GRPO example by @sergiopaniego in https://github.com/huggingface/trl/pull/4640\r\n* 🚚 Move KTO to trl.experimental by @neha222222 in https://github.com/huggingface/trl/pull/4575\r\n* 🕵️‍♂️ GRPO: Agent training by @qgallouedec in https://github.com/huggingface/trl/pull/4300\r\n* feat: implement DeepSeek unbiased KL estimator for GRPO by @jlcanta in https://github.com/huggingface/trl/pull/4638\r\n* Update `rnj_1_instruct` notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4646\r\n* Remove deprecation warning from RLOOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/4644\r\n* Add agent training notebook to examples by @sergiopaniego in https://github.com/huggingface/trl/pull/4645\r\n* Fix KTOTrainer CUDA error for large-vocab models via tensor indexing by @bhuvanprakash in https://github.com/huggingface/trl/pull/4635\r\n* Disable gradient checkpointing during no-grad inference to avoid PyTorch warning by @qgallouedec in https://github.com/huggingface/trl/pull/4636\r\n* Release: v0.26 by @qgallouedec in https://github.com/huggingface/trl/pull/4649\r\n\r\n## New Contributors\r\n\r\n* @lukehinds made their first contribution in https://github.com/huggingface/trl/pull/4502\r\n* @t1101675 made their first contribution in https://github.com/huggingface/trl/pull/4504\r\n* 
@tamoghnokandar made their first contribution in https://github.com/huggingface/trl/pull/4426\r\n* @fabio-sim made their first contribution in https://github.com/huggingface/trl/pull/4526\r\n* @kschwethelm made their first contribution in https://github.com/huggingface/trl/pull/4553\r\n* @jonnyli1125 made their first contribution in https://github.com/huggingface/trl/pull/4536\r\n* @JenWei0312 made their first contribution in https://github.com/huggingface/trl/pull/4551\r\n* @SSusantAchary made their first contribution in https://github.com/huggingface/trl/pull/4440\r\n* @mingxuetian made their first contribution in https://github.com/huggingface/trl/pull/4458\r\n* @iliasmerigh made their first contribution in https://github.com/huggingface/trl/pull/4573\r\n* @xuanduy04 made their first contribution in https://github.com/huggingface/trl/pull/4580\r\n* @casinca made their first contribution in https://github.com/huggingface/trl/pull/4630\r\n* @Xiao-Chenguang made their first contribution in https://github.com/huggingface/trl/pull/4603\r\n* @neha222222 made their first contribution in https://github.com/huggingface/trl/pull/4575\r\n* @jlcanta made their first contribution in https://github.com/huggingface/trl/pull/4638\r\n* @bhuvanprakash made their first contribution in https://github.com/huggingface/trl/pull/4635\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.25.0...v0.26.0","publishedAt":"2025-12-09T20:51:12.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.26.0","media":[]},{"id":"rel_j7WZsvA9tx_Mc3LoTg-AA","version":"v0.25.1","title":"v0.25.1","summary":"## What's Changed\r\n\r\n* Replace accelerate logging with stdlib in CLI by @lewtun in https://github.com/huggingface/trl/pull/4512\r\n* Add temporary worka...","content":"## What's Changed\r\n\r\n* Replace accelerate logging with stdlib in CLI by @lewtun in https://github.com/huggingface/trl/pull/4512\r\n* Add temporary workaround for `lr_scheduler_kwargs` dtype issue in 
Transformers 4.57.0 by @qgallouedec in https://github.com/huggingface/trl/pull/4513\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.25.0...0.25.1","publishedAt":"2025-11-12T16:51:21.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.25.1","media":[]},{"id":"rel_tJ-_VLRNaef8-oFyunDdj","version":"v0.25.0","title":"v0.25.0","summary":"## Features\r\n\r\n* 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in https://github.com/huggingface/trl/pull/4296\r\n*...","content":"## Features\r\n\r\n* 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in https://github.com/huggingface/trl/pull/4296\r\n* Added custom `prepare_model_for_kbit_training` to save VRAM by @sergiopaniego in https://github.com/huggingface/trl/pull/4335\r\n* Add `add_generation_prompt` to processor_kwargs in GRPO and RLOO trainer by @qgallouedec in https://github.com/huggingface/trl/pull/4361\r\n* Add support for Trackio completions logging in GRPOTrainer by @taha-yassine in https://github.com/huggingface/trl/pull/4359\r\n* Support chat_template_kwargs by @pramodith in https://github.com/huggingface/trl/pull/4350\r\n* GRPO: ScaleRL -> Support casting LM Head to FP32 by @pramodith in https://github.com/huggingface/trl/pull/4303\r\n* Support casting to fp32 when word embeddings are tied to lm_head by @pramodith in https://github.com/huggingface/trl/pull/4446\r\n* 💬 Add chat to vLLM client and server, update trainer calls by @qgallouedec in https://github.com/huggingface/trl/pull/4450\r\n\r\n## Experimental\r\n\r\n* 🚚 Move BCO to `trl.experimental` by @qgallouedec in https://github.com/huggingface/trl/pull/4312\r\n* 👑 [experimental] GOLD Trainer by @kashif in https://github.com/huggingface/trl/pull/4349\r\n* Add PAPOTrainer for preference-based optimization by @SolarWindRider in https://github.com/huggingface/trl/pull/4334\r\n* [GFPO] fix the GFPO loss calculation error caused by unmodified 
old_per_token_logps by @Peter-Chou in https://github.com/huggingface/trl/pull/4454\r\n* 🕹️ Add rollout function for OpenEnv integration by @lewtun in https://github.com/huggingface/trl/pull/4310\r\n\r\n## Fixes\r\n\r\n* [Activation-checkpointing] add tensor dedup and param offloading by @kashif in https://github.com/huggingface/trl/pull/4247\r\n* Fix attn_implementation name in OnlineDPO for transformers v5 by @albertvillanova in https://github.com/huggingface/trl/pull/4322\r\n* Hotfix: Fall back to config.text_config._name_or_path if missing config._name_or_path by @albertvillanova in https://github.com/huggingface/trl/pull/4324\r\n* Fix GRPO and RLOO trainers for continuous batching by @albertvillanova in https://github.com/huggingface/trl/pull/4348\r\n* Fix: `add_generation_prompt=True` for conversational only by @qgallouedec in https://github.com/huggingface/trl/pull/4362\r\n* Remove ignored max_length parameter from PRMTrainer data collator by @albertvillanova in https://github.com/huggingface/trl/pull/4355\r\n* Fix add_generation_prompt arg for paged transformers in GRPO and RLOO trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4370\r\n* Fix GKD Liger memory spike by @qgallouedec in https://github.com/huggingface/trl/pull/4140\r\n* Fix GRPO with replay buffer by inserting images in the prompt by @albertvillanova in https://github.com/huggingface/trl/pull/4391\r\n* fix: Remove chat template setting from non-SFT trainer scripts by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4437\r\n* 🖼️ Fix reporting images with vLLM by @qgallouedec in https://github.com/huggingface/trl/pull/4476\r\n\r\n## Documentation and Examples\r\n\r\n* Added SFT LoRA notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4244\r\n* Update notebooks README with latest additions by @sergiopaniego in https://github.com/huggingface/trl/pull/4316\r\n* Add notebooks to Examples docs and restructure by @sergiopaniego in 
https://github.com/huggingface/trl/pull/4317\r\n* Highlight OpenEnv in landing docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4327\r\n* Update OpenEnv docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4328\r\n* Add OpenEnv blog to landing by @sergiopaniego in https://github.com/huggingface/trl/pull/4333\r\n* 🗞️ Update \"What's New\" by @qgallouedec in https://github.com/huggingface/trl/pull/4338\r\n* Update Reducing Memory Consumption guide with more details by @sergiopaniego in https://github.com/huggingface/trl/pull/4332\r\n* Fixed links inside Tips in docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4360\r\n* 🔥 docs: Add RapidFire AI integration guide by @kamran-rapidfireAI in https://github.com/huggingface/trl/pull/4340\r\n* Fix paper link for \"Towards Efficient and Exact Optimization of Language Model Alignment\" by @qgallouedec in https://github.com/huggingface/trl/pull/4409\r\n* Migrate experimental trl feature docs  by @ethanknights in https://github.com/huggingface/trl/pull/4411\r\n* Update SFT QLoRA notebook with **14B** model on free Colab by @sergiopaniego in https://github.com/huggingface/trl/pull/4336\r\n* Create \"Talks\" subsection by @sergiopaniego in https://github.com/huggingface/trl/pull/4414\r\n* Openenv wordle example by @burtenshaw in https://github.com/huggingface/trl/pull/4357\r\n* docs: Remove outdated conversational dataset conversion guidance by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4422\r\n* docs: List all trainers that support Liger Kernel by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4432\r\n* Add On-Policy Distillation from thinking labs to paper index. 
by @pramodith in https://github.com/huggingface/trl/pull/4410\r\n* Upload notebook with T4 selected by @sergiopaniego in https://github.com/huggingface/trl/pull/4449\r\n* Removed outdated warning about batch contamination by @Harras3 in https://github.com/huggingface/trl/pull/4423\r\n* Removed Sentiment Tuning Examples by @Harras3 in https://github.com/huggingface/trl/pull/4424\r\n* docs: Remove outdated notebooks by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4435\r\n* docs: Move Multi-Adapter RL section to PEFT integration by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4436\r\n* Update `max_length` explanation for VLM in online trainers by @sergiopaniego in https://github.com/huggingface/trl/pull/4220\r\n* Updated OpenEnv docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4418\r\n* add llasa-tutorial by @Deep-unlearning in https://github.com/huggingface/trl/pull/4456\r\n\r\n## Deprecations\r\n\r\n* Replace deprecated AutoModelForVision2Seq with AutoModelForImageTextToText by @albertvillanova in https://github.com/huggingface/trl/pull/4353\r\n* Replace deprecated list with tuple indexing in PPOTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4356\r\n* Remove liger loss in favor of liger kernel by @sergiopaniego in https://github.com/huggingface/trl/pull/4364\r\n* 🐍 Drop Python 3.9 by @qgallouedec in https://github.com/huggingface/trl/pull/4183\r\n\r\n## What's Changed\r\n\r\n* ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/4293\r\n* Update links to docs in README to latest packaged version by @sergiopaniego in https://github.com/huggingface/trl/pull/4084\r\n* 🧺 [4/N] Refactor `_generate` in GRPO/RLOO: Move `forward_kwargs` outside generation method by @qgallouedec in https://github.com/huggingface/trl/pull/4154\r\n* Fix missing CI slow tests: ImportError: vLLM is not installed by @albertvillanova in https://github.com/huggingface/trl/pull/4304\r\n* Added SFT 
LoRA notebook by @sergiopaniego in https://github.com/huggingface/trl/pull/4244\r\n* ⚰️ Remove deprecated by @qgallouedec in https://github.com/huggingface/trl/pull/4301\r\n* Silence TRL experimental warnings in CI by @albertvillanova in https://github.com/huggingface/trl/pull/4307\r\n* Filter expected setup_chat_format deprecation warning in CI by @albertvillanova in https://github.com/huggingface/trl/pull/4306\r\n* [Activation-checkpointing] add tensor dedup and param offloading by @kashif in https://github.com/huggingface/trl/pull/4247\r\n* Remove parameterized as test extra dependency by @albertvillanova in https://github.com/huggingface/trl/pull/4315\r\n* Update notebooks README with latest additions by @sergiopaniego in https://github.com/huggingface/trl/pull/4316\r\n* 🚚 Move BCO to `trl.experimental` by @qgallouedec in https://github.com/huggingface/trl/pull/4312\r\n* 🧺 [5/N] Refactor `_generate` in GRPO/RLOO: Insert images in the prompt by @qgallouedec in https://github.com/huggingface/trl/pull/4155\r\n* 💤 Switch to sleep level=2 and split wake-ups in GRPO and RLOO trainers by @xxrjun in https://github.com/huggingface/trl/pull/4296\r\n* Replace unittest skipTest from transformers with pytest.skip by @albertvillanova in https://github.com/huggingface/trl/pull/4297\r\n* Add notebooks to Examples docs and restructure by @sergiopaniego in https://github.com/huggingface/trl/pull/4317\r\n* Fix attn_implementation name in OnlineDPO for transformers v5 by @albertvillanova in https://github.com/huggingface/trl/pull/4322\r\n* 🕹️ Add rollout function for OpenEnv integration by @lewtun in https://github.com/huggingface/trl/pull/4310\r\n* Highlight OpenEnv in landing docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4327\r\n* Update OpenEnv docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4328\r\n* Move BCO tests to tests/experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4326\r\n* Hotfix: Fall back to 
config.text_config._name_or_path if missing config._name_or_path by @albertvillanova in https://github.com/huggingface/trl/pull/4324\r\n* Add OpenEnv blog to landing by @sergiopaniego in https://github.com/huggingface/trl/pull/4333\r\n* 🗞️ Update \"What's New\" by @qgallouedec in https://github.com/huggingface/trl/pull/4338\r\n* Update Reducing Memory Consumption guide with more details by @sergiopaniego in https://github.com/huggingface/trl/pull/4332\r\n* Added custom `prepare_model_for_kbit_training` to save VRAM by @sergiopaniego in https://github.com/huggingface/trl/pull/4335\r\n* [vllm] update comment about communication group host ip by @kashif in https://github.com/huggingface/trl/pull/4337\r\n* Fix GRPO and RLOO trainers for continuous batching by @albertvillanova in https://github.com/huggingface/trl/pull/4348\r\n* Fixed links inside Tips in docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4360\r\n* Fix CI issue for vlm_gemma_3n model by @kaixuanliu in https://github.com/huggingface/trl/pull/4278\r\n* Add `add_generation_prompt` to processor_kwargs in GRPO and RLOO trainer by @qgallouedec in https://github.com/huggingface/trl/pull/4361\r\n* Fix: `add_generation_prompt=True` for conversational only by @qgallouedec in https://github.com/huggingface/trl/pull/4362\r\n* Use explicit tiny-Qwen2_5_VL model_id parameter in CI tests by @albertvillanova in https://github.com/huggingface/trl/pull/4325\r\n* Move tests of experimental GRPO with replay buffer to tests/experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4329\r\n* Implement CI test workflow for experimental module by @albertvillanova in https://github.com/huggingface/trl/pull/4330\r\n* Replace deprecated AutoModelForVision2Seq with AutoModelForImageTextToText by @albertvillanova in https://github.com/huggingface/trl/pull/4353\r\n* Move tests of BCO trainer args to tests/experimental by @albertvillanova in https://github.com/huggingface/trl/pull/4354\r\n* Remove 
ignored max_length parameter from PRMTrainer data collator by @albertvillanova in https://github.com/huggingface/trl/pull/4355\r\n* Replace deprecated list with tuple indexing in PPOTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4356\r\n* Add support for Trackio completions logging in GRPOTrainer by @taha-yassine in https://github.com/huggingface/trl/pull/4359\r\n* Fix add_generation_prompt arg for paged transformers in GRPO and RLOO trainers by @albertvillanova in https://github.com/huggingface/trl/pull/4370\r\n* Align make test_experimental with make test by @albertvillanova in https://github.com/huggingface/trl/pull/4371\r\n* 🔥 docs: Add RapidFire AI integration guide by @kamran-rapidfireAI in https://github.com/huggingface/trl/pull/4340\r\n* 👑 [experimental] GOLD Trainer by @kashif in https://github.com/huggingface/trl/pull/4349\r\n* Support chat_template_kwargs by @pramodith in https://github.com/huggingface/trl/pull/4350\r\n* [GOLD] Set teacher tokenizer name if using ULD loss by @kashif in https://github.com/huggingface/trl/pull/4389\r\n* Fix typo in GOLD docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4394\r\n* Hotfix CI for Python 3.9 by setting test as xfail until transformers release by @albertvillanova in https://github.com/huggingface/trl/pull/4388\r\n* [tests] Update rope_scaling configuration for tiny qwen-vl models by @kashif in https://github.com/huggingface/trl/pull/4405\r\n* [GOLD] Update code example for GOLD Trainer by @cmpatino in https://github.com/huggingface/trl/pull/4406\r\n* Hotfix CI with dev dependencies: xfail test_prepare_inputs_for_generation by @albertvillanova in https://github.com/huggingface/trl/pull/4372\r\n* Fix paper link for \"Towards Efficient and Exact Optimization of Language Model Alignment\" by @qgallouedec in https://github.com/huggingface/trl/pull/4409\r\n* Migrate experimental trl feature docs  by @ethanknights in https://github.com/huggingface/trl/pull/4411\r\n* Update SFT 
QLoRA notebook with **14B** model on free Colab by @sergiopaniego in https://github.com/huggingface/trl/pull/4336\r\n* Add PAPOTrainer for preference-based optimization by @SolarWindRider in https://github.com/huggingface/trl/pull/4334\r\n* Fix GKD Liger memory spike by @qgallouedec in https://github.com/huggingface/trl/pull/4140\r\n* Remove liger loss in favor of liger kernel by @sergiopaniego in https://github.com/huggingface/trl/pull/4364\r\n* Add license to test file and disable docstyle in GOLD script by @qgallouedec in https://github.com/huggingface/trl/pull/4412\r\n* Replace duplicate test with model_id parametrized test by @albertvillanova in https://github.com/huggingface/trl/pull/4415\r\n* Fix raising of deprecation warning for liger_loss by @albertvillanova in https://github.com/huggingface/trl/pull/4417\r\n* Consolidate slow tests into main test files by @ishitab02 in https://github.com/huggingface/trl/pull/4408\r\n* Fix CI experimental tests TypeError for GRPOWithReplayBufferTrainer.update_with_replay_buffer by @albertvillanova in https://github.com/huggingface/trl/pull/4366\r\n* Fix GRPO with replay buffer by inserting images in the prompt by @albertvillanova in https://github.com/huggingface/trl/pull/4391\r\n* GRPO: ScaleRL -> Support casting LM Head to FP32 by @pramodith in https://github.com/huggingface/trl/pull/4303\r\n* Create \"Talks\" subsection by @sergiopaniego in https://github.com/huggingface/trl/pull/4414\r\n* Openenv wordle example by @burtenshaw in https://github.com/huggingface/trl/pull/4357\r\n* docs: Remove outdated conversational dataset conversion guidance by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4422\r\n* docs: List all trainers that support Liger Kernel by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4432\r\n* fix: Remove chat template setting from non-SFT trainer scripts by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4437\r\n* Add On-Policy Distillation from thinking 
labs to paper index. by @pramodith in https://github.com/huggingface/trl/pull/4410\r\n* Upload notebook with T4 selected by @sergiopaniego in https://github.com/huggingface/trl/pull/4449\r\n* Support casting to fp32 when word embeddings are tied to lm_head by @pramodith in https://github.com/huggingface/trl/pull/4446\r\n* Update tokenizer apply_chat_template with return_dict=True default by @albertvillanova in https://github.com/huggingface/trl/pull/4448\r\n* Removed outdated warning about batch contamination by @Harras3 in https://github.com/huggingface/trl/pull/4423\r\n* 🐍 Drop Python 3.9 by @qgallouedec in https://github.com/huggingface/trl/pull/4183\r\n* Removed Sentiment Tuning Examples by @Harras3 in https://github.com/huggingface/trl/pull/4424\r\n* docs: Remove outdated notebooks by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4435\r\n* docs: Move Multi-Adapter RL section to PEFT integration by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4436\r\n* Moved masked_mean, masked_var and masked_whiten to ppo_trainer.py by @Harras3 in https://github.com/huggingface/trl/pull/4444\r\n* Update `max_length` explanation for VLM in online trainers by @sergiopaniego in https://github.com/huggingface/trl/pull/4220\r\n* [fix] wordle model_id updates by @burtenshaw in https://github.com/huggingface/trl/pull/4453\r\n* Updated OpenEnv docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4418\r\n* add llasa-tutorial by @Deep-unlearning in https://github.com/huggingface/trl/pull/4456\r\n* 💬 Add chat to vLLM client and server, update trainer calls by @qgallouedec in https://github.com/huggingface/trl/pull/4450\r\n* [GFPO] fix the GFPO loss calculation error caused by unmodified old_per_token_logps by @Peter-Chou in https://github.com/huggingface/trl/pull/4454\r\n* 🖼️ Fix reporting images with vLLM by @qgallouedec in https://github.com/huggingface/trl/pull/4476\r\n* Release: v0.25 by @qgallouedec in 
https://github.com/huggingface/trl/pull/4478\r\n\r\n## New Contributors\r\n* @xxrjun made their first contribution in https://github.com/huggingface/trl/pull/4296\r\n* @taha-yassine made their first contribution in https://github.com/huggingface/trl/pull/4359\r\n* @kamran-rapidfireAI made their first contribution in https://github.com/huggingface/trl/pull/4340\r\n* @ethanknights made their first contribution in https://github.com/huggingface/trl/pull/4411\r\n* @SolarWindRider made their first contribution in https://github.com/huggingface/trl/pull/4334\r\n* @ishitab02 made their first contribution in https://github.com/huggingface/trl/pull/4408\r\n* @Harras3 made their first contribution in https://github.com/huggingface/trl/pull/4423\r\n* @Deep-unlearning made their first contribution in https://github.com/huggingface/trl/pull/4456\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.24.0...v0.25.0","publishedAt":"2025-11-06T00:18:30.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.25.0","media":[]},{"id":"rel_rAKYE4eyzsULuEJIYg6n9","version":"v0.24.0","title":"v0.24.0","summary":"## Features\r\n\r\n* Add accuracy reward by @pramodith in https://github.com/huggingface/trl/pull/4270\r\n* Add support for `token_type_ids` in `DPOTrainer`...","content":"## Features\r\n\r\n* Add accuracy reward by @pramodith in https://github.com/huggingface/trl/pull/4270\r\n* Add support for `token_type_ids` in `DPOTrainer` by @aweers in https://github.com/huggingface/trl/pull/4285\r\n* 💰 `RichProgressCallback` enhancement by @qgallouedec in https://github.com/huggingface/trl/pull/4245\r\n* Include `chat_template_kwargs` in `apply_chat_template` by @cmpatino in https://github.com/huggingface/trl/pull/4233\r\n* 🏷️ Account for `token_type_ids` in `DataCollatorForVisionLanguageModeling` by @qgallouedec in https://github.com/huggingface/trl/pull/4190\r\n* 🎨 Support mixing image+text and text-only examples by @qgallouedec in 
https://github.com/huggingface/trl/pull/4203\r\n* 🎁 `RewardTrainer` refactor by @qgallouedec in https://github.com/huggingface/trl/pull/4093\r\n* 🎞️ Support sequence classification models in `clone_chat_template` by @qgallouedec in https://github.com/huggingface/trl/pull/4097\r\n* ✨ Add logging for training completion and model saving in training scripts by @qgallouedec in https://github.com/huggingface/trl/pull/4048\r\n* 🖨️ Print rich table for messages by @qgallouedec in https://github.com/huggingface/trl/pull/4160\r\n* 😴 Add `vllm_enable_sleep_mode` to RLOO Trainer by @sergiopaniego in https://github.com/huggingface/trl/pull/4107\r\n* 📽 Multi image support for GRPO/RLOO by @qgallouedec in https://github.com/huggingface/trl/pull/4113\r\n* 👁️ Add VLM support to RLOO trainer by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4067\r\n* ℹ️ Enable XPU for vLLM client by @jiqing-feng in https://github.com/huggingface/trl/pull/4031\r\n* 🧶 feat: Add WeaveCallback for W&B Weave integration by @parambharat in https://github.com/huggingface/trl/pull/4089\r\n\r\n## Fixes\r\n\r\n* [Online-DPO] fix the completion_len == max_new_tokens crash by @kashif in https://github.com/huggingface/trl/pull/4193\r\n* Fix entropy and accuracy calculation for prompt_tuning techniques. 
by @pramodith in https://github.com/huggingface/trl/pull/4196\r\n* Fix prompt-completion labeling with add_generation_prompt and warning by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4201\r\n* 🌡️ Have vLLM return processed (temperature scaled) log probs by @YonatanGideoni in https://github.com/huggingface/trl/pull/4163\r\n* Fix handling of f_divergence_type in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/4171\r\n* ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in https://github.com/huggingface/trl/pull/4170\r\n* Pass required token_type_ids by @albertvillanova in https://github.com/huggingface/trl/pull/4148\r\n* 👩‍🦯 Fix usage of VLM using text only by @SamuelBarryCS in https://github.com/huggingface/trl/pull/4080\r\n* ⚓ [vllm] ensure MASTER_ADDR/MASTER_PORT are set safely by @kashif in https://github.com/huggingface/trl/pull/4057\r\n* 📤 Fix a dataset loading bug in scripts by @singing-cat in https://github.com/huggingface/trl/pull/4124\r\n* 🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in https://github.com/huggingface/trl/pull/4087\r\n* [GKD] Fix `batchmean` reduce op in GKDTrainer's loss by @cmpatino in https://github.com/huggingface/trl/pull/4105\r\n* Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in https://github.com/huggingface/trl/pull/4081\r\n* Aux loss is already included in the loss returned by Transformers by @pramodith in https://github.com/huggingface/trl/pull/4078\r\n* ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in https://github.com/huggingface/trl/pull/4041\r\n\r\n## Documentation\r\n\r\n* Remove logging.md: trainer-specific metrics documentation by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4269\r\n* Remove using_llama_models.md: outdated Llama2-specific documentation by @behroozazarkhalili in 
https://github.com/huggingface/trl/pull/4268\r\n* Remove how_to_train.md: outdated training FAQ by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4267\r\n* Add Qwen3-VL notebooks (SFT, GRPO) by @sergiopaniego in https://github.com/huggingface/trl/pull/4275\r\n* Remove obsolete research_projects directory by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4243\r\n* Add Efficient Online Training with GRPO and vLLM in TRL to community tutorials by @sergiopaniego in https://github.com/huggingface/trl/pull/4219\r\n* Add trainers taxonomy to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4195\r\n* Updated vLLM integration guide by @sergiopaniego in https://github.com/huggingface/trl/pull/4162\r\n* [DOCS] Lora without regret by @burtenshaw in https://github.com/huggingface/trl/pull/4181\r\n* Add docstring for OnlineTrainerState by @albertvillanova in https://github.com/huggingface/trl/pull/4166\r\n* ⚖️ Align SFT and DPO for model creation and deprecate `DPOConfig.padding_value` in favour or `pad_token_id` by @qgallouedec in https://github.com/huggingface/trl/pull/4006\r\n* 🏞️ Context Parallelism benchmark guide by @sergiopaniego in https://github.com/huggingface/trl/pull/4075\r\n* ▶️ Add video to community tutorials by @qgallouedec in https://github.com/huggingface/trl/pull/4090\r\n* Reviewed HF jobs updated docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4088\r\n\r\n## Deprecations\r\n\r\n* Deprecate `BestOfNSampler` by @qgallouedec in https://github.com/huggingface/trl/pull/4291\r\n* Raise deprecation warning for Python 3.9 by @albertvillanova in https://github.com/huggingface/trl/pull/4226\r\n* Deprecate unused dataset_formatting module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4242\r\n* Warnings pointing to RFC by @qgallouedec in https://github.com/huggingface/trl/pull/4224\r\n* 🅰️ Remove apex by @qgallouedec in https://github.com/huggingface/trl/pull/4139\r\n* 🗑️ Remove 
deprecated `AlignPropTrainer`, `DDPOTrainer` and `IterativeSFTTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4068\r\n\r\n## Experimental\r\n\r\n* 🧪 Add `trl.experimental` Submodule by @August-murr in https://github.com/huggingface/trl/pull/4073\r\n* [GRPO]: Sample from a Replay Buffer To Substitute Groups with 0 std. by @pramodith in https://github.com/huggingface/trl/pull/4060\r\n* 🪙 [Experimental] Support GSPO-token by @hjh0119 in https://github.com/huggingface/trl/pull/3820\r\n* 🌪️ [GFPO]: implement GFPO in GRPOTrainer by @Peter-Chou in https://github.com/huggingface/trl/pull/3989\r\n* 🌾 [Experimental] BEMA for ref model by @qgallouedec in https://github.com/huggingface/trl/pull/3898\r\n\r\n## What's Changed\r\n\r\n* ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/4054\r\n* Remove redundant 'None' from docstrings by @albertvillanova in https://github.com/huggingface/trl/pull/4058\r\n* Hotfix: Add ParallelismConfig fallback for transformers with old accelerate by @albertvillanova in https://github.com/huggingface/trl/pull/4063\r\n* Fix CI failure in slow GRPO test due to missing pillow dependency by @albertvillanova in https://github.com/huggingface/trl/pull/4064\r\n* 💡 Fix type hint to `make_parser` function in multiple scripts by @qgallouedec in https://github.com/huggingface/trl/pull/4050\r\n* Improve docstring of AlignPropTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4059\r\n* ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in https://github.com/huggingface/trl/pull/4041\r\n* Set Ruff src for first-party imports by @albertvillanova in https://github.com/huggingface/trl/pull/4074\r\n* 🧪 Add `trl.experimental` Submodule by @August-murr in https://github.com/huggingface/trl/pull/4073\r\n* 🌾 [Experimental] BEMA for ref model by @qgallouedec in https://github.com/huggingface/trl/pull/3898\r\n* ✂️ [GRPO VLM] Update split sizes to generalize by @zucchini-nlp in 
https://github.com/huggingface/trl/pull/4032\r\n* 🛠️ Fix CI by @qgallouedec in https://github.com/huggingface/trl/pull/4076\r\n* 🐳 Docker update + Simplify Jobs doc by @qgallouedec in https://github.com/huggingface/trl/pull/3931\r\n* Aux loss is already included in the loss returned by Transformers by @pramodith in https://github.com/huggingface/trl/pull/4078\r\n* Reviewed HF jobs updated docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4088\r\n* 🗑️ Remove deprecated `AlignPropTrainer`, `DDPOTrainer` and `IterativeSFTTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/4068\r\n* ▶️ Add video to community tutorials by @qgallouedec in https://github.com/huggingface/trl/pull/4090\r\n* Align slow tests with regular tests by @albertvillanova in https://github.com/huggingface/trl/pull/4085\r\n* Add support for testing experimental features by @albertvillanova in https://github.com/huggingface/trl/pull/4082\r\n* Community Tutorials design adaptation for videos by @sergiopaniego in https://github.com/huggingface/trl/pull/4095\r\n* 🏞️ Context Parallelism benchmark guide by @sergiopaniego in https://github.com/huggingface/trl/pull/4075\r\n* ⌨️ Pin num2words by @lewtun in https://github.com/huggingface/trl/pull/4094\r\n* Add deprecation warnings to docstrings by @albertvillanova in https://github.com/huggingface/trl/pull/4083\r\n* 📜 Convert `set` to `list` of tags by @qgallouedec in https://github.com/huggingface/trl/pull/4092\r\n* 🧶 feat: Add WeaveCallback for W&B Weave integration by @parambharat in https://github.com/huggingface/trl/pull/4089\r\n* ⚖️ Align SFT and DPO for model creation and deprecate `DPOConfig.padding_value` in favour or `pad_token_id` by @qgallouedec in https://github.com/huggingface/trl/pull/4006\r\n* 🌪️ [GFPO]: implement GFPO in GRPOTrainer by @Peter-Chou in https://github.com/huggingface/trl/pull/3989\r\n* ℹ️ feat: Add NPU and XPU support for activation offloading by @zilongzheng in 
https://github.com/huggingface/trl/pull/4056\r\n* ℹ️ Enable XPU for vLLM client by @jiqing-feng in https://github.com/huggingface/trl/pull/4031\r\n* Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in https://github.com/huggingface/trl/pull/4081\r\n* [GKD] Fix `batchmean` reduce op in GKDTrainer's loss by @cmpatino in https://github.com/huggingface/trl/pull/4105\r\n* 👁️ Add VLM support to RLOO trainer by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4067\r\n* Some nits GRPO and RLOO trainer docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4108\r\n* Fix typos by @cyyever in https://github.com/huggingface/trl/pull/4106\r\n* Fix typos by @qgallouedec in https://github.com/huggingface/trl/pull/4109\r\n* Fix VLM configs in generate_tiny_models by @albertvillanova in https://github.com/huggingface/trl/pull/4101\r\n* docs: correct option name to enable vllm sleep mode by @muupan in https://github.com/huggingface/trl/pull/4102\r\n* CI hotfix: xfail test_training_with_transformers_paged for transformers<4.57.0 by @albertvillanova in https://github.com/huggingface/trl/pull/4120\r\n* Fix code style with make precommit by @albertvillanova in https://github.com/huggingface/trl/pull/4119\r\n* 🟩 Drop `image_split_sizes` in favour of `image_grid_thw` by @qgallouedec in https://github.com/huggingface/trl/pull/4111\r\n* 🔭 Align param passing to VLM configs in generate_tiny_models by @albertvillanova in https://github.com/huggingface/trl/pull/4118\r\n* 📽 Multi image support for GRPO/RLOO by @qgallouedec in https://github.com/huggingface/trl/pull/4113\r\n* 😴 Add `vllm_enable_sleep_mode` to RLOO Trainer by @sergiopaniego in https://github.com/huggingface/trl/pull/4107\r\n* 🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in https://github.com/huggingface/trl/pull/4087\r\n* 📤 Fix a dataset loading bug in scripts by @singing-cat in 
https://github.com/huggingface/trl/pull/4124\r\n* ⚓ [vllm] ensure MASTER_ADDR/MASTER_PORT are set safely by @kashif in https://github.com/huggingface/trl/pull/4057\r\n* 📌 Pin vLLM version by @qgallouedec in https://github.com/huggingface/trl/pull/4122\r\n* 👋 Remove `backend` parameter from `GuidedDecodingParams` by @qgallouedec in https://github.com/huggingface/trl/pull/4123\r\n* 🧹 Remove `max_batch_tokens`, `num_blocks` and `block_size` from generation kwargs by @qgallouedec in https://github.com/huggingface/trl/pull/4065\r\n* Remove Python version < 3.13 constraint from vllm extra dependencies by @albertvillanova in https://github.com/huggingface/trl/pull/4125\r\n* 👩‍🦯 Fix usage of VLM using text only by @SamuelBarryCS in https://github.com/huggingface/trl/pull/4080\r\n* [SFTrainer]: Fix DFT Loss by @pramodith in https://github.com/huggingface/trl/pull/4112\r\n* Improve typing of SFT trainer by @cyyever in https://github.com/huggingface/trl/pull/4007\r\n* 🌺 Fix GPT-OSS test by @qgallouedec in https://github.com/huggingface/trl/pull/4134\r\n* 🪙 [Experimental] Support GSPO-token by @hjh0119 in https://github.com/huggingface/trl/pull/3820\r\n* Fix CI: torch.AcceleratorError: CUDA error: device-side assert triggered by @albertvillanova in https://github.com/huggingface/trl/pull/4138\r\n* 🤸‍♀️ Fix DFT test by @qgallouedec in https://github.com/huggingface/trl/pull/4135\r\n* 🌵 Mark GKD trainer test as expected failure due to OOM issue by @qgallouedec in https://github.com/huggingface/trl/pull/4126\r\n* [GRPO]: Sample from a Replay Buffer To Substitute Groups with 0 std. 
by @pramodith in https://github.com/huggingface/trl/pull/4060\r\n* Fix import statement and GRPO test case by @qgallouedec in https://github.com/huggingface/trl/pull/4141\r\n* Refactor trainers classes to use BaseTrainer with shared functionality by @albertvillanova in https://github.com/huggingface/trl/pull/4128\r\n* Fixed some <Tip> rendering issues by @sergiopaniego in https://github.com/huggingface/trl/pull/4143\r\n* 😷 Refactor GRPO/RLOO to isolate `_generate` by @qgallouedec in https://github.com/huggingface/trl/pull/4114\r\n* 🟩 Drop `image_split_sizes` in favour of `image_grid_thw` by @qgallouedec in https://github.com/huggingface/trl/pull/4156\r\n* 📽 Multi image support for GRPO replay buffer by @qgallouedec in https://github.com/huggingface/trl/pull/4157\r\n* 😷 Refactor GRPO/RLOO to isolate `_generate` for GRPO with replay buffer by @qgallouedec in https://github.com/huggingface/trl/pull/4158\r\n* Add docstring for OnlineTrainerState by @albertvillanova in https://github.com/huggingface/trl/pull/4166\r\n* Pass required token_type_ids by @albertvillanova in https://github.com/huggingface/trl/pull/4148\r\n* 💡 Replace `<Tip>` with new markdown syntax by @qgallouedec in https://github.com/huggingface/trl/pull/4161\r\n* Remove unnecessary list comprehensions by @albertvillanova in https://github.com/huggingface/trl/pull/4164\r\n* Add missing FDivergenceType docstring by @albertvillanova in https://github.com/huggingface/trl/pull/4165\r\n* Fix docstrings with 'deprecated' Sphinx directive by @albertvillanova in https://github.com/huggingface/trl/pull/4174\r\n* Fix docstring interlink to parent class for NashMDTrainer and XPOTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4179\r\n* Fix link in docstring of RLOOTrainer by @albertvillanova in https://github.com/huggingface/trl/pull/4180\r\n* 🖨️ Print rich table for messages by @qgallouedec in https://github.com/huggingface/trl/pull/4160\r\n* 🅰️ Remove apex by @qgallouedec in 
https://github.com/huggingface/trl/pull/4139\r\n* Fix CI ValueError: Unknown loss type: dapo by @albertvillanova in https://github.com/huggingface/trl/pull/4173\r\n* Fix PEFT interlinks in docstrings by @albertvillanova in https://github.com/huggingface/trl/pull/4178\r\n* ✨ Add logging for training completion and model saving in training scripts by @qgallouedec in https://github.com/huggingface/trl/pull/4048\r\n* 👾 Use our own `require_bitsandbytes` by @qgallouedec in https://github.com/huggingface/trl/pull/4137\r\n* 🎞️ Support sequence classification models in `clone_chat_template` by @qgallouedec in https://github.com/huggingface/trl/pull/4097\r\n* ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in https://github.com/huggingface/trl/pull/4170\r\n* 🎁 `RewardTrainer` refactor by @qgallouedec in https://github.com/huggingface/trl/pull/4093\r\n* 🧺 [1/N] Refactor `_generate` in GRPO/RLOO: list of ints instead of tensors by @qgallouedec in https://github.com/huggingface/trl/pull/4146\r\n* Fix handling of f_divergence_type in DPO by @albertvillanova in https://github.com/huggingface/trl/pull/4171\r\n* 🔣 Fix test: replace `trainer.tokenizer` by `trainer.processing_class` by @qgallouedec in https://github.com/huggingface/trl/pull/4185\r\n* Fix CI ImportError: FlashAttention2 and decorator order for all parameterized tests by @albertvillanova in https://github.com/huggingface/trl/pull/4176\r\n* Hotfix wrong formatting of docstrings with blockquote tips by @albertvillanova in https://github.com/huggingface/trl/pull/4187\r\n* 🌡️ Have vLLM return processed (temperature scaled) log probs by @YonatanGideoni in https://github.com/huggingface/trl/pull/4163\r\n* Replace remaining trainer.tokenizer with trainer.processing_class in GRPO test by @albertvillanova in https://github.com/huggingface/trl/pull/4192\r\n* [DOCS] Lora without regret by @burtenshaw in https://github.com/huggingface/trl/pull/4181\r\n* [DOCS/FIX] lora without regrets - fix lr by @burtenshaw in 
https://github.com/huggingface/trl/pull/4207\r\n* Remove custome_container for building the docs by @albertvillanova in https://github.com/huggingface/trl/pull/4198\r\n* Remove tokenizer creation from `sft` example script by @sergiopaniego in https://github.com/huggingface/trl/pull/4197\r\n* Hotfix: Exclude transformers 4.57.0 for Python 3.9 by @albertvillanova in https://github.com/huggingface/trl/pull/4209\r\n* Replace unittest with pytest by @albertvillanova in https://github.com/huggingface/trl/pull/4188\r\n* Updated vLLM integration guide by @sergiopaniego in https://github.com/huggingface/trl/pull/4162\r\n* Remove `Optional` from `processing_class` in `PPOTrainer` by @sergiopaniego in https://github.com/huggingface/trl/pull/4212\r\n* Replace setup with pyproject and fix packaging unintended modules by @albertvillanova in https://github.com/huggingface/trl/pull/4194\r\n* Removed tokenizer/processor creation from example scripts by @sergiopaniego in https://github.com/huggingface/trl/pull/4211\r\n* Apply style and revert change in `sft_video_llm` example by @qgallouedec in https://github.com/huggingface/trl/pull/4214\r\n* Fix `trl-internal-testing/tiny-DbrxForCausalLM` by @qgallouedec in https://github.com/huggingface/trl/pull/4213\r\n* Fix prompt-completion labeling with add_generation_prompt and warning by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4201\r\n* Fix LoRA params in Python in LoRA without regret by @sergiopaniego in https://github.com/huggingface/trl/pull/4215\r\n* [DOCS] fix prose in lora guide by @burtenshaw in https://github.com/huggingface/trl/pull/4217\r\n* Add trainers taxonomy to docs by @sergiopaniego in https://github.com/huggingface/trl/pull/4195\r\n* 🎨 Support mixing image+text and text-only examples by @qgallouedec in https://github.com/huggingface/trl/pull/4203\r\n* 🧺 [2/N] Refactor `_generate` in GRPO/RLOO: Use `prompt_ids` from generation by @qgallouedec in https://github.com/huggingface/trl/pull/4152\r\n* Fix 
entropy and accuracy calculation for prompt_tuning techniques. by @pramodith in https://github.com/huggingface/trl/pull/4196\r\n* Add Efficient Online Training with GRPO and vLLM in TRL to community tutorials by @sergiopaniego in https://github.com/huggingface/trl/pull/4219\r\n* 🏷️ Account for `token_type_ids` in `DataCollatorForVisionLanguageModeling` by @qgallouedec in https://github.com/huggingface/trl/pull/4190\r\n* Exclude vllm dependencies from dev extra by @albertvillanova in https://github.com/huggingface/trl/pull/4229\r\n* Fix CI unittest asserts by @albertvillanova in https://github.com/huggingface/trl/pull/4234\r\n* Fix callable annotations by @albertvillanova in https://github.com/huggingface/trl/pull/4216\r\n* Remove unused Path import in __init__.py by @albertvillanova in https://github.com/huggingface/trl/pull/4227\r\n* Update CI Docker image to pytorch/pytorch:2.8.0 by @albertvillanova in https://github.com/huggingface/trl/pull/4232\r\n* Replace setup with pyproject in CI tests paths by @albertvillanova in https://github.com/huggingface/trl/pull/4230\r\n* Fix CI IndentationError for Python 3.13.8 by @albertvillanova in https://github.com/huggingface/trl/pull/4240\r\n* Remove unused log_example_reports.py script by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4241\r\n* 🧘 Enhance markdown style by @qgallouedec in https://github.com/huggingface/trl/pull/4235\r\n* Warnings pointing to RFC by @qgallouedec in https://github.com/huggingface/trl/pull/4224\r\n* Fix CI slow test ValueError: Backward pass should have cleared tracker of all tensors by @sywangyi in https://github.com/huggingface/trl/pull/4236\r\n* Fix CI CUDA out of memory errors by improving GPU memory management by @albertvillanova in https://github.com/huggingface/trl/pull/4238\r\n* Install peft from main for CI tests with dev dependencies by @albertvillanova in https://github.com/huggingface/trl/pull/4250\r\n* Fix CI ImportError for 
'require_torch_gpu_if_bnb_not_multi_backend_enabled' by @albertvillanova in https://github.com/huggingface/trl/pull/4253\r\n* Fix CI slow test ValueError: Unknown loss type: dapo by @albertvillanova in https://github.com/huggingface/trl/pull/4254\r\n* 🧺 [3/N] Refactor `_generate` in GRPO/RLOO: Rely on generator for prompt truncation by @qgallouedec in https://github.com/huggingface/trl/pull/4153\r\n* Remove obsolete research_projects directory by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4243\r\n* Deprecate unused dataset_formatting module by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4242\r\n* Fix CI slow test AttributeError: 'TestSFTTrainerSlow' object has no attribute 'addCleanup' by @albertvillanova in https://github.com/huggingface/trl/pull/4255\r\n* [Online-DPO] fix the completion_len == max_new_tokens crash by @kashif in https://github.com/huggingface/trl/pull/4193\r\n* Include `chat_template_kwargs` in `apply_chat_template` by @cmpatino in https://github.com/huggingface/trl/pull/4233\r\n* Fix Python version check for skipping tests on Python 3.13.8 by @albertvillanova in https://github.com/huggingface/trl/pull/4246\r\n* Raise deprecation warning for Python 3.9 by @albertvillanova in https://github.com/huggingface/trl/pull/4226\r\n* Fix docstring interlinks by @albertvillanova in https://github.com/huggingface/trl/pull/4221\r\n* Use FutureWarning instead of DeprecationWarning by @albertvillanova in https://github.com/huggingface/trl/pull/4266\r\n* Fix style with make precommit by @albertvillanova in https://github.com/huggingface/trl/pull/4265\r\n* Add Qwen3-VL notebooks (SFT, GRPO) by @sergiopaniego in https://github.com/huggingface/trl/pull/4275\r\n* Fix typo in Colab link by @sergiopaniego in https://github.com/huggingface/trl/pull/4276\r\n* Fix docstrings with Sphinx 'deprecated' directive by @albertvillanova in https://github.com/huggingface/trl/pull/4279\r\n* Fix CI slow test OSError: You are trying to access a 
gated repo by @albertvillanova in https://github.com/huggingface/trl/pull/4283\r\n* 💰 `RichProgressCallback` enhancement by @qgallouedec in https://github.com/huggingface/trl/pull/4245\r\n* Fix CI dev test TypeError: unexpected keyword argument 'load_in_4bit' by @albertvillanova in https://github.com/huggingface/trl/pull/4262\r\n* Replace unittest skipTest with pytest.skip by @albertvillanova in https://github.com/huggingface/trl/pull/4263\r\n* Fix CI slow tests: ImportError: vLLM is not installed by @albertvillanova in https://github.com/huggingface/trl/pull/4287\r\n* Remove logging.md: trainer-specific metrics documentation by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4269\r\n* Remove using_llama_models.md: outdated Llama2-specific documentation by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4268\r\n* Add support for `token_type_ids` in `DPOTrainer` by @aweers in https://github.com/huggingface/trl/pull/4285\r\n* Remove how_to_train.md: outdated training FAQ by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4267\r\n* Add accuracy reward by @pramodith in https://github.com/huggingface/trl/pull/4270\r\n* Remove unused commands directory by @behroozazarkhalili in https://github.com/huggingface/trl/pull/4258\r\n* Deprecate `BestOfNSampler` by @qgallouedec in https://github.com/huggingface/trl/pull/4291\r\n* Release: v0.24 by @qgallouedec in https://github.com/huggingface/trl/pull/4292\r\n\r\n## New Contributors\r\n\r\n* @zucchini-nlp made their first contribution in https://github.com/huggingface/trl/pull/4032\r\n* @parambharat made their first contribution in https://github.com/huggingface/trl/pull/4089\r\n* @zilongzheng made their first contribution in https://github.com/huggingface/trl/pull/4056\r\n* @jiqing-feng made their first contribution in https://github.com/huggingface/trl/pull/4031\r\n* @Hoesu made their first contribution in https://github.com/huggingface/trl/pull/4081\r\n* @cmpatino made their first 
contribution in https://github.com/huggingface/trl/pull/4105\r\n* @singing-cat made their first contribution in https://github.com/huggingface/trl/pull/4124\r\n* @SamuelBarryCS made their first contribution in https://github.com/huggingface/trl/pull/4080\r\n* @YonatanGideoni made their first contribution in https://github.com/huggingface/trl/pull/4163\r\n* @aweers made their first contribution in https://github.com/huggingface/trl/pull/4285\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.23.0...v0.24.0","publishedAt":"2025-10-16T00:29:40.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.24.0","media":[]},{"id":"rel_JLKG1_jCTwTHSmFSH8WZC","version":"v0.23.1","title":"v0.23.1","summary":"## What's Changed\r\n\r\n* ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in https://github.com/huggingface/trl/pull/4041\r\n* Aux...","content":"## What's Changed\r\n\r\n* ♨️ [GRPO] Fix potential hang in `get_high_entropy_mask` by @akakakakakaa in https://github.com/huggingface/trl/pull/4041\r\n* Aux loss is already included in the loss returned by Transformers by @pramodith in https://github.com/huggingface/trl/pull/4078\r\n* Fix get_peft_model() so that prepare_model_for_kbit_training does not reapply to an instance of PeftModel, thus freezing all the layers by @Hoesu in https://github.com/huggingface/trl/pull/4081\r\n* 🐯 fix: use_liger_kernel with IterableDataset by @jue-jue-zi in https://github.com/huggingface/trl/pull/4087\r\n* [SFTrainer]: Fix DFT Loss by @pramodith in https://github.com/huggingface/trl/pull/4112\r\n* ⚡ Fix Flash Attention x Padding-Free loss by @qgallouedec in https://github.com/huggingface/trl/pull/4170\r\n\r\n## New Contributors\r\n\r\n* @Hoesu made their first contribution in https://github.com/huggingface/trl/pull/4081\r\n\r\n**Full Changelog**: 
https://github.com/huggingface/trl/compare/v0.23.0...v0.23.1","publishedAt":"2025-10-02T05:20:49.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.23.1","media":[]},{"id":"rel_wv5_RmKqdgm1UewPZvSJN","version":"v0.23.0","title":"v0.23.0","summary":"## Major\r\n\r\n### 🥓 Context Parallelism\r\n\r\nSFT now supports Context Parallelism (CP) for training large language models on very large sequences. You ca...","content":"## Major\r\n\r\n### 🥓 Context Parallelism\r\n\r\nSFT now supports Context Parallelism (CP) for training large language models on very large sequences. You can now train with an arbitrarily long sequence length.\r\n\r\n<img width=\"844\" height=\"336\" alt=\"Screenshot 2025-09-09 at 10 39 30 PM\" src=\"https://github.com/user-attachments/assets/f1dfc349-440a-4e05-aac9-439a3c286f08\" />\r\n\r\nby @kashif in https://github.com/huggingface/trl/pull/3994\r\n\r\n### 🧨 Dynamic Fine-Tuning\r\n\r\nDynamic Fine-Tuning (DFT) is now supported in TRL.\r\n\r\n```python\r\nfrom trl import SFTConfig\r\n\r\ntraining_args = SFTConfig(\r\n    loss_type=\"dft\",\r\n    ...\r\n)\r\n```\r\n\r\n<img width=\"692\" height=\"472\" alt=\"Screenshot 2025-09-09 at 10 37 36 PM\" src=\"https://github.com/user-attachments/assets/4ee2b4ab-7cc6-4578-bfac-c38124891510\" />\r\n\r\nby @qgallouedec in https://github.com/huggingface/trl/pull/4042\r\n\r\n### 🪵 Truncated Importance Sampling (TIS) to address rollout-training mismatch\r\n\r\nDifferent implementations are used for rollout generation (vLLM) and model training. This implementation gap implicitly turns on-policy RL into off-policy RL. Truncated Importance Sampling (TIS) is a simple yet effective importance sampling technique for handling this discrepancy. 
This is now implemented in GRPO.\r\n\r\n```python\r\nfrom trl import GRPOConfig\r\n\r\ntraining_args = GRPOConfig(\r\n    ...\r\n    use_vllm=True,\r\n    vllm_importance_sampling_correction=True, # default True\r\n    vllm_importance_sampling_cap=2.0, # hyper-parameter C\r\n)\r\n```\r\n\r\nby @LeonEricsson in https://github.com/huggingface/trl/pull/3867\r\n\r\n### 🥣 [SFTTrainer]: Add Aux Loss for MoE models\r\n\r\nMixture of Experts (MoE) models require an auxiliary loss to ensure that the different experts are used evenly. This auxiliary loss is now supported in SFTTrainer.\r\n\r\n```python\r\ntraining_args = SFTConfig(\r\n    model_init_kwargs={\"output_router_logits\": True},\r\n    ...\r\n)\r\n```\r\n\r\nby @pramodith in https://github.com/huggingface/trl/pull/4012\r\n\r\n### 💤 [GRPO/RLOO] Adds an option to sleep vllm when running in colocated mode\r\n\r\nWhen running GRPO (or RLOO) with vLLM in colocated mode, the vLLM server consumes VRAM during optimization while not being used. We now have an option to put the vLLM server to sleep during optimization to free up VRAM.\r\n\r\n```python\r\nfrom trl import GRPOConfig\r\n\r\ntraining_args = GRPOConfig(..., vllm_sleep_enabled=True)\r\n```\r\n\r\nby @edbeeching in https://github.com/huggingface/trl/pull/3968\r\n\r\n### ⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer\r\n\r\nYou can now use vLLM server mode with OnlineDPOTrainer. 
Additionally, VLM models are now supported.\r\n\r\nby @vaelev in https://github.com/huggingface/trl/pull/3783\r\n\r\n\r\n### Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations\r\n\r\nThe paper index has been significantly enhanced with the addition of 9+ new algorithm implementations, providing a more comprehensive resource for users.\r\n\r\nby @behroozazarkhalili in https://github.com/huggingface/trl/pull/3990\r\n\r\n### Other Notable Changes\r\n\r\n* 👷 Added Kernels on the Hub x TRL guide by @sergiopaniego in https://github.com/huggingface/trl/pull/3969\r\n* 🌵 Refactor entropy_from_logits for memory efficiency by @qgallouedec in https://github.com/huggingface/trl/pull/4013\r\n\r\n## What's Changed\r\n\r\n* ⬆️ Bump dev version by @qgallouedec in https://github.com/huggingface/trl/pull/3978\r\n* 👮 Fix GRPO CLI by setting parameters for `get_soft_overlong_punishment` by @qgallouedec in https://github.com/huggingface/trl/pull/3972\r\n* 🪃 `args.gradient_checkpointing = False` instead of `args = dataclasses.replace(args, gradient_checkpointing=False)` by @qgallouedec in https://github.com/huggingface/trl/pull/3981\r\n* [GRPO] Adds an option to sleep vllm when running in colocated mode by @edbeeching in https://github.com/huggingface/trl/pull/3968\r\n* 🎯 Add Trackio integration documentation and update TOC by @qgallouedec in https://github.com/huggingface/trl/pull/3971\r\n* ⚖️ Fix scale_rewards issue in GRPO by @Peter-Chou in https://github.com/huggingface/trl/pull/3992\r\n* ⏰ fix: add return to shift_tokens_right by @ginkyenglee in https://github.com/huggingface/trl/pull/3987\r\n* Add pre-commit and hf-doc-builder as dev dependencies by @albertvillanova in https://github.com/huggingface/trl/pull/3993\r\n* [GRPO] Truncated Importance Sampling to address rollout-training mismatch by @LeonEricsson in https://github.com/huggingface/trl/pull/3867\r\n* Fixed tags shown problem in memory usage docs by @sergiopaniego in 
https://github.com/huggingface/trl/pull/3999\r\n* ✖️ Support pad-to-multiple-of and padding-free by @qgallouedec in https://github.com/huggingface/trl/pull/3996\r\n* 💾 [bugfix] fix PPO save_checkpoint by @hjh0119 in https://github.com/huggingface/trl/pull/3998\r\n* [GRPO]: Fix Multi-GPU training for Entropy based masking of tokens. by @pramodith in https://github.com/huggingface/trl/pull/3964\r\n* 📏 `torch_dype` to `dtype` everywhere by @sergiopaniego in https://github.com/huggingface/trl/pull/4000\r\n* Comprehensive Paper Index Enhancement with 9 New Algorithm Implementations by @behroozazarkhalili in https://github.com/huggingface/trl/pull/3990\r\n* [SFT] fix: collator docstring by @LeonEricsson in https://github.com/huggingface/trl/pull/4011\r\n* 👷 Added Kernels on the Hub x TRL guide by @sergiopaniego in https://github.com/huggingface/trl/pull/3969\r\n* 🌵 Refactor entropy_from_logits for memory efficiency by @qgallouedec in https://github.com/huggingface/trl/pull/4013\r\n* [SFTTrainer]: Add Aux Loss for MoE models. 
by @pramodith in https://github.com/huggingface/trl/pull/4012\r\n* Add missing doc strings in SFTrainer by @pramodith in https://github.com/huggingface/trl/pull/4003\r\n* ⚖️ Add vLLM server mode and VLM support to OnlineDPOTrainer by @vaelev in https://github.com/huggingface/trl/pull/3783\r\n* Fix typo in GRPO quickstart by @dwisdom0 in https://github.com/huggingface/trl/pull/4020\r\n* Align docstring parameters with function definitions by @albertvillanova in https://github.com/huggingface/trl/pull/4017\r\n* Fix formatting errors in docstrings by @albertvillanova in https://github.com/huggingface/trl/pull/4025\r\n* [doc] Paper index for Truncated Importance Sampling by @LeonEricsson in https://github.com/huggingface/trl/pull/4026\r\n* [doc] Group paper index by trainer by @LeonEricsson in https://github.com/huggingface/trl/pull/4027\r\n* Add missing trainer docstrings by @albertvillanova in https://github.com/huggingface/trl/pull/4030\r\n* Add autodoc for AlignPropTrainer and AlignPropConfig by @albertvillanova in https://github.com/huggingface/trl/pull/4033\r\n* 🥓 [docs] add CP docs by @kashif in https://github.com/huggingface/trl/pull/3994\r\n* ⚖️ Remove `average_tokens_across_devices` default replacement by @qgallouedec in https://github.com/huggingface/trl/pull/4039\r\n* CI hotfix: xfail test_training_with_transformers_paged by @albertvillanova in https://github.com/huggingface/trl/pull/4046\r\n* Update transformers minimum version to 4.56.1 by @albertvillanova in https://github.com/huggingface/trl/pull/4047\r\n* 🧨 DFT by @qgallouedec in https://github.com/huggingface/trl/pull/4042\r\n* Update VLM arch check to `AutoModelForImageTextToText` for DPO and Online DPO by @sergiopaniego in https://github.com/huggingface/trl/pull/4049\r\n* 🏂 Fix label shifting logic in `SFTTrainer` for compatibility with CP by @qgallouedec in https://github.com/huggingface/trl/pull/4038\r\n* Add autodoc for BestOfNSampler and improve docstrings by @albertvillanova in 
https://github.com/huggingface/trl/pull/4034\r\n* ✨ Improve SFT doc  by @qgallouedec in https://github.com/huggingface/trl/pull/4005\r\n* 💬 Remove setting chat template in sft script by @qgallouedec in https://github.com/huggingface/trl/pull/4037\r\n* 🪪 Update SFTTrainer to handle labels correctly and add configuration example in paper index by @qgallouedec in https://github.com/huggingface/trl/pull/4051\r\n* 🗜 Hotfix: avoid passing `quantization_config=None` by @qgallouedec in https://github.com/huggingface/trl/pull/4019\r\n* Release: 0.23 by @qgallouedec in https://github.com/huggingface/trl/pull/4053\r\n\r\n## New Contributors\r\n\r\n* @Peter-Chou made their first contribution in https://github.com/huggingface/trl/pull/3992\r\n* @ginkyenglee made their first contribution in https://github.com/huggingface/trl/pull/3987\r\n* @albertvillanova made their first contribution in https://github.com/huggingface/trl/pull/3993\r\n* @hjh0119 made their first contribution in https://github.com/huggingface/trl/pull/3998\r\n* @vaelev made their first contribution in https://github.com/huggingface/trl/pull/3783\r\n* @dwisdom0 made their first contribution in https://github.com/huggingface/trl/pull/4020\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.22.0...v0.23.0","publishedAt":"2025-09-10T04:39:53.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.23.0","media":[]},{"id":"rel_sTkyTHODbrZbXaEA4sWej","version":"v0.22.2","title":"v0.22.2","summary":"## What's Changed\r\n\r\n* ⚖️ Fix scale_rewards issue in GRPO by @Peter-Chou in https://github.com/huggingface/trl/pull/3992\r\n* ⏰ fix: add return to shift...","content":"## What's Changed\r\n\r\n* ⚖️ Fix scale_rewards issue in GRPO by @Peter-Chou in https://github.com/huggingface/trl/pull/3992\r\n* ⏰ fix: add return to shift_tokens_right by @ginkyenglee in https://github.com/huggingface/trl/pull/3987\r\n* ✖️ Support pad-to-multiple-of and padding-free by @qgallouedec in 
https://github.com/huggingface/trl/pull/3996\r\n\r\n## New Contributors\r\n* @Peter-Chou made their first contribution in https://github.com/huggingface/trl/pull/3992\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.22.1...v0.22.2","publishedAt":"2025-09-03T14:44:47.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.22.2","media":[]},{"id":"rel_Ysk002VDwOGH-Khuv_E-O","version":"v0.22.1","title":"v0.22.1","summary":"## What changed\r\n- Refactor version retrieval to use `importlib.metadata` by @qgallouedec\r\n- Release: 0.22.1 by @qgallouedec\r\n\r\n**Full Changelog**: ht...","content":"## What changed\r\n- Refactor version retrieval to use `importlib.metadata` by @qgallouedec\r\n- Release: 0.22.1 by @qgallouedec\r\n\r\n**Full Changelog**: https://github.com/huggingface/trl/compare/v0.22.0...v0.22.1","publishedAt":"2025-08-29T22:11:44.000Z","url":"https://github.com/huggingface/trl/releases/tag/v0.22.1","media":[]}],"pagination":{"page":1,"pageSize":20,"totalPages":5,"totalItems":81},"summaries":{"rolling":{"windowDays":90,"summary":"TRL moved toward production-grade reinforcement learning with v1.0.0, marking a transition from prototype frameworks to deployable training systems. Asynchronous GRPO decoupled generation from gradient updates by offloading rollouts to external vLLM servers, eliminating idle GPU time during training. VESPO (Variational Sequence-Level Soft Policy Optimization) replaced heuristic token-level clipping with a principled variational framework that derives smooth importance weighting, addressing training instability from policy staleness and asynchronous updates. 
Earlier releases hardened the foundation with async reward functions parallelized across GRPO and RLOO, vLLM 0.12.0 compatibility, tool-calling support for agent training, and memory optimizations like forward-masked logits that cut VRAM usage by up to 50 percent during forward passes.","releaseCount":8,"generatedAt":"2026-04-07T17:28:27.620Z"},"monthly":[{"year":2026,"month":3,"summary":"TRL hit v1.0 by shipping asynchronous GRPO, which offloads generation to external vLLM servers to parallelize rollouts and training while eliminating GPU idle time. The release also introduced VESPO, a variational framework that replaces heuristic token-level clipping with a principled Gamma weighting function to stabilize off-policy training. Earlier in the month, v0.29.1 fixed multimodal token handling across SFT/GRPO/RLOO and decoupled rollout dispatch from the vLLM backend to improve compatibility across versions.","releaseCount":3,"generatedAt":"2026-04-07T17:28:30.516Z"}]}}