This release includes several important bug fixes for SFTTrainer and PPOTrainer. It also extends DataCollatorForCompletionOnlyLM to support chat-like training.
The DPO (Direct Preference Optimization) algorithm was introduced by Rafailov et al. and offers a way of aligning a model with preference data without having to rely on a separate reward model. The DPOTrainer is now part of the TRL library for anyone who wants to use it, thanks to the amazing contributors!
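To make the idea concrete, here is a minimal sketch of the per-example DPO loss in plain Python. It is an illustration of the published objective, not the DPOTrainer implementation: the function names, the scalar log-probability inputs, and the default `beta=0.1` are assumptions chosen for this example.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value, for illustration only
) -> float:
    """Per-example DPO loss from summed sequence log-probabilities.

    Hypothetical helper: each argument is the log-probability of a full
    completion under the policy or the frozen reference model.
    """
    # How much more (or less) likely each completion is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the chosen and rejected completions.
    logits = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(logits)) == softplus(-logits)
    return softplus(-logits)
```

No reward model appears anywhere in the computation: the only inputs are log-probabilities from the policy and from a frozen reference copy. The loss shrinks as the policy raises the chosen completion's likelihood relative to the rejected one.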
[DPO] Resolve logging for DPOTrainer by @tomaarsen in https://github.com/lvwerra/trl/pull/570
_get_current_device() by @lewtun in https://github.com/lvwerra/trl/pull/515

DataCollatorForCompletionOnlyLM

You can now mask out the user prompts in the DataCollatorForCompletionOnlyLM data collator and train only on chat completions. Check out the PR below or the relevant section of the documentation to learn more!
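The masking idea can be sketched in a few lines of plain Python. This is not the collator's actual code: `mask_user_turns`, `find_all`, and the toy token ids are hypothetical, and the only detail taken from common practice is that labels set to -100 are ignored by the cross-entropy loss.

```python
IGNORE_INDEX = -100  # label value skipped by the loss


def find_all(seq, pattern):
    """Return every start index at which `pattern` occurs in `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def mask_user_turns(input_ids, instruction_ids, response_ids):
    """Build labels that keep only assistant completions.

    Hypothetical helper: `instruction_ids` and `response_ids` are the
    token ids of the markers that open a user turn and an assistant turn.
    """
    # Start with everything masked, then unmask each completion span.
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)

    for response_start in find_all(input_ids, response_ids):
        start = response_start + len(response_ids)
        # A completion runs until the next user turn, or to the end.
        end = min(
            (i for i in instruction_starts if i >= start),
            default=len(input_ids),
        )
        labels[start:end] = input_ids[start:end]
    return labels


# Toy two-turn conversation: 1 marks a user turn, 2 marks an assistant turn.
ids = [1, 10, 11, 2, 20, 21, 22, 1, 12, 2, 23]
print(mask_user_turns(ids, instruction_ids=[1], response_ids=[2]))
# [-100, -100, -100, -100, 20, 21, 22, -100, -100, -100, 23]
```

Only the tokens that follow an assistant marker keep their ids as labels, so the gradient comes from the chat completions alone, across every turn of the conversation.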
The community reported multiple bugs in the supported trainers; they are fixed in the PRs below:
[core] Fix offline case by @younesbelkada in https://github.com/lvwerra/trl/pull/538
[SFTTrainer] Add warning for wrong padding_side by @younesbelkada in https://github.com/lvwerra/trl/pull/550
[SFTTrainer] Add epochs and num steps on CLI by @younesbelkada in https://github.com/lvwerra/trl/pull/562
DataCollatorForCompletionOnlyLM in the docs by @younesbelkada in https://github.com/lvwerra/trl/pull/565
[PPO] fix corner cases with PPO batch size and forward_batch_size by @younesbelkada in https://github.com/lvwerra/trl/pull/563

The examples and documentation have been refactored; check the PRs below for more details:
[examples] Big refactor of examples and documentation by @younesbelkada in https://github.com/lvwerra/trl/pull/509
[examples] Fix sentiment nit by @younesbelkada in https://github.com/lvwerra/trl/pull/517
[examples] make the sft script more modulable by @younesbelkada in https://github.com/lvwerra/trl/pull/543
use_auth_token arg to sft_trainer example by @corey-lambda in https://github.com/lvwerra/trl/pull/544

Full Changelog: https://github.com/lvwerra/trl/compare/v0.4.7...v0.5.0
Fetched April 7, 2026