
v0.5.0

August 2, 2023 · TRL

v0.5.0 DPOTrainer and multiple bug fixes on PPOTrainer and SFTTrainer

This release includes several important bug fixes for SFTTrainer and PPOTrainer. It also extends the current DataCollatorForCompletionOnlyLM to support chat-like training.

DPO Trainer

The DPO (Direct Preference Optimization) algorithm was introduced by Rafailov et al. in this paper. It provides a way to train a model directly on preference data without having to rely on a separate reward model. The DPOTrainer is now part of the TRL library for anyone who wants to use it, thanks to the amazing contributors!
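To make the idea concrete, here is a minimal sketch of the per-example DPO objective in plain Python. It is not the DPOTrainer implementation; the function name `dpo_loss`, its argument names, and the default `beta=0.1` are assumptions chosen for illustration. Each argument is the summed log-probability of a full completion under either the policy being trained or the frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-example DPO loss: -log(sigmoid(beta * margin)).

    Hypothetical helper for illustration only, not part of TRL.
    """
    # How much more (in log space) the policy likes each completion
    # compared with the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # The implicit reward margin between the chosen and rejected completion.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(x)) == log(1 + exp(-x)), computed in a numerically stable way.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# When the policy equals the reference, the margin is 0 and the loss is log(2).
baseline = dpo_loss(-1.0, -1.0, -1.0, -1.0)

# Raising the chosen completion and lowering the rejected one reduces the loss.
improved = dpo_loss(-1.0, -3.0, -2.0, -2.0)
```

The loss only depends on log-probabilities from the policy and the reference model, which is why no separately trained reward model is needed.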

What's Changed

Extending the DataCollatorForCompletionOnlyLM

You can now mask out the user prompts in the DataCollatorForCompletionOnlyLM data collator and train only on chat completions. Check out the PR below or the relevant section of the documentation to learn more about it!
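The masking idea can be sketched without any library code. The example below is an illustration under stated assumptions, not the collator's actual implementation: `mask_user_turns`, `find_all`, and the toy token ids are hypothetical. It sets the label of every token outside an assistant completion to -100, the value that PyTorch's cross-entropy loss ignores by default, so only completion tokens contribute to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def find_all(sequence, pattern):
    """Return every start index at which `pattern` occurs in `sequence`."""
    n = len(pattern)
    return [i for i in range(len(sequence) - n + 1) if sequence[i:i + n] == pattern]


def mask_user_turns(input_ids, instruction_ids, response_ids):
    """Build labels that keep only the tokens of assistant completions.

    Hypothetical helper: `instruction_ids` and `response_ids` are the token
    ids of the markers that open a user turn and an assistant turn.
    """
    # Start with everything masked, then unmask each completion span.
    labels = [IGNORE_INDEX] * len(input_ids)
    instruction_starts = find_all(input_ids, instruction_ids)

    for response_start in find_all(input_ids, response_ids):
        completion_start = response_start + len(response_ids)
        # The completion runs until the next user turn, or to the end.
        later = [i for i in instruction_starts if i >= completion_start]
        completion_end = later[0] if later else len(input_ids)
        labels[completion_start:completion_end] = input_ids[completion_start:completion_end]

    return labels


# Toy ids (assumed): 1 marks a user turn, 2 marks an assistant turn.
input_ids = [1, 10, 11, 2, 20, 21, 1, 12, 2, 22]
labels = mask_user_turns(input_ids, instruction_ids=[1], response_ids=[2])
# labels == [-100, -100, -100, -100, 20, 21, -100, -100, -100, 22]
```

In this sketch the turn markers themselves are also masked, so the model is trained only on the content of each assistant reply across a multi-turn conversation.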

Important bug fixes

Multiple bugs in the supported trainers were raised by the community and fixed in the PRs below.

Big refactor of examples and documentation

The examples and documentation have been refactored. Check the PRs below for more details.

New Contributors

Full Changelog: https://github.com/lvwerra/trl/compare/v0.4.7...v0.5.0
