We are excited to introduce the new v0.11.0 release, with many new features and post-training algorithms. The highlights are as follows:
Generalized Knowledge Distillation (GKD) is a post-training method from Google DeepMind that extends standard knowledge distillation by allowing the student to generate outputs during training and receive online feedback from the teacher. It consistently outperforms SFT and in some cases enables the student model to match the performance of the teacher, but with far fewer parameters.
To train models with this method, check out the GKDTrainer.
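The distillation objective GKD minimizes is a generalized Jensen–Shannon divergence between the teacher's and student's next-token distributions, with a coefficient β that controls how each distribution is penalized for deviating from their mixture. A minimal pure-Python sketch of that divergence (function names are illustrative, not TRL's API):

```python
import math

def kl_div(p, q):
    """KL(p || q) for discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p_teacher, p_student, beta):
    """Generalized Jensen-Shannon divergence, the per-token GKD loss:
    D_beta(P, Q) = beta * KL(P || M) + (1 - beta) * KL(Q || M),
    where M = beta * P + (1 - beta) * Q is the interpolated mixture.
    It is zero iff the two distributions match, and satisfies the
    symmetry D_beta(P, Q) == D_{1-beta}(Q, P)."""
    m = [beta * pt + (1 - beta) * ps for pt, ps in zip(p_teacher, p_student)]
    return beta * kl_div(p_teacher, m) + (1 - beta) * kl_div(p_student, m)
```

In training, this divergence is applied per token over the vocabulary, on sequences that are partly generated by the student itself rather than drawn only from a fixed dataset, which is what provides the online teacher feedback described above.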
Exploratory Preference Optimization is an online post-training method from researchers at Microsoft, MIT, and Wisconsin that extends DPO to incorporate online feedback from reward models or LLM judges. It is similar to online DPO, but has a slightly different theoretical basis concerning sample efficiency.
To train models with this method, check out the XPOTrainer.
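At its core, XPO keeps the online DPO preference loss, a sigmoid loss on the reward margin implied by the policy and reference log-probabilities, and adds an optimism (exploration) bonus on top. A pure-Python sketch of the shared DPO part, with illustrative names; the α-weighted exploration term is only indicated in a comment, not modeled exactly:

```python
import math

def dpo_sigmoid_loss(policy_logp_chosen, policy_logp_rejected,
                     ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Preference loss shared by DPO, online DPO, and XPO:
    -log(sigmoid(beta * margin)), where the margin measures how much more
    the policy (relative to the reference model) prefers the chosen
    response over the rejected one. XPO additionally adds an alpha-weighted
    optimism term on sampled responses (omitted in this sketch)."""
    margin = beta * ((policy_logp_chosen - ref_logp_chosen)
                     - (policy_logp_rejected - ref_logp_rejected))
    return math.log(1.0 + math.exp(-margin))
```

When the policy and reference agree, the margin is zero and the loss sits at log 2; it shrinks as the policy learns to rank the chosen response above the rejected one.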
Nash Learning with Human Feedback is a novel post-training method from Google DeepMind that uses pairwise preference models, which are conditioned on two inputs instead of the single one used in reward models. These preference models are then used to train a policy that consistently produces responses that are preferred over those from competing policies, thus approximating a Nash equilibrium (i.e. a two-player game where actions are responses and payoffs are given by the preference model).
To train models with this method, check out the NashMDTrainer.
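The game-theoretic framing can be made concrete with a toy example. Below, `pref[j][i]` is the probability that response `j` is preferred over response `i`, and a policy is evaluated by the best preferred-rate any single opposing response achieves against it. With cyclic ("rock-paper-scissors") preferences, the uniform policy approximates the symmetric Nash equilibrium: no opposing response is preferred over it more than half the time. All names and numbers here are illustrative, not TRL's API:

```python
def best_response_value(policy, pref):
    """Best single opposing response against a mixed policy:
    max over j of the probability that response j is preferred
    over a response sampled from the policy."""
    n = len(pref)
    return max(sum(pref[j][i] * policy[i] for i in range(n)) for j in range(n))

# Cyclic preferences: each response beats the next with probability 0.9,
# so no deterministic response dominates (rock-paper-scissors structure).
pref = [
    [0.5, 0.9, 0.1],
    [0.1, 0.5, 0.9],
    [0.9, 0.1, 0.5],
]
uniform = [1 / 3, 1 / 3, 1 / 3]
```

Against the uniform mixture every opposing response wins exactly half the time (value 0.5), whereas any deterministic policy can be beaten 90% of the time; the Nash policy is the one that cannot be exploited this way.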
- `OrpoTrainer` has better integration with PyTorch XLA for faster step time on TPUs ⚡. By @wenxindongwork in https://github.com/huggingface/trl/pull/2001
- `PPOTrainer` is marked as deprecated in favour of `PPOv2Trainer` to provide a consistent API across TRL's trainers. It will be removed in v0.12.0. By @qgallouedec in https://github.com/huggingface/trl/pull/2016
- `RichProgressCallback` has been removed from the example scripts as it caused a variety of problems with logging in distributed environments. You can still use it by adding it manually to the trainer callbacks. By @lewtun in https://github.com/huggingface/trl/pull/2053
- `prompts` arg from `WinrateCallback` by @qgallouedec in https://github.com/huggingface/trl/pull/2010
- `WinRateCallback` to be used without reference model by @qgallouedec in https://github.com/huggingface/trl/pull/2013
- `core.py` by @northern-64bit in https://github.com/huggingface/trl/pull/2017
- `packing` doc in `SFTConfig` and fix error when neither `dataset_text_field` nor `formatting_func` is provided. By @qgallouedec in https://github.com/huggingface/trl/pull/2035
- `aux_loss_enabled` is set to `True`. By @Jonathanjordan21 in https://github.com/huggingface/trl/pull/2039
- `non_eos_penalty` to be consistent across OnPolicy trainers by @RylanSchaeffer in https://github.com/huggingface/trl/pull/2033
- `debug` and `sanity_check` args by @qgallouedec in https://github.com/huggingface/trl/pull/2055
- `SFTTrainer.evaluate()` and `SFTTrainer.predict()` with null `train_dataset` by @Sohaib9920 in https://github.com/huggingface/trl/pull/2004
- `ConstantLengthDataset` (or `packing=True`) shuffle examples before they are packed by @muupan in https://github.com/huggingface/trl/pull/2037
- `WinRateCallback` and `LogCompletionsCallback` by @lewtun in https://github.com/huggingface/trl/pull/2061
- transformers utilities when possible by @qgallouedec in https://github.com/huggingface/trl/pull/2064
- `ref_policy` and `policy` have different identities by @RylanSchaeffer in https://github.com/huggingface/trl/pull/2057
- `processor(prompt, images=image)` to `processor(images=image, text=prompt)` by @qgallouedec in https://github.com/huggingface/trl/pull/2076
- `WinRateCallback` and set default freq to `eval_steps` in `LogCompletionsCallback` by @lewtun in https://github.com/huggingface/trl/pull/2074
- `logits/chosen` and `logits/rejected` metrics in `kto_trainer`. By @PhilipMay in https://github.com/huggingface/trl/pull/2077
- `PPOv2Trainer` by @qgallouedec in https://github.com/huggingface/trl/pull/2080

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.11.0