v0.5.0 DPOTrainer and multiple bug fixes on PPOTrainer and SFTTrainer

This release includes several important bug fixes for SFTTrainer and PPOTrainer. It also extends the existing DataCollatorForCompletionOnlyLM to support chat-like training.

DPO Trainer

The DPO algorithm (Direct Preference Optimization) was introduced by Rafailov et al. in this paper. It offers a way to fine-tune a model on human preferences without having to train a separate reward model. Thanks to our amazing contributors, the DPOTrainer is now part of the TRL library for anyone who wants to use it!
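
For readers new to the method, the per-pair DPO objective from the paper can be sketched in a few lines of plain Python. This is only an illustration of the loss and not the DPOTrainer code: the function name `dpo_loss`, its argument names, and the default `beta` value are assumptions made for this example. The inputs are the summed log-probabilities of a chosen and a rejected response under the policy being trained and under a frozen reference model.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Per-pair DPO loss from summed sequence log-probabilities.

    Hypothetical helper for illustration; not part of the TRL API.
    """
    # How strongly each model prefers the chosen response over the rejected one.
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # The policy is rewarded for preferring the chosen response more than
    # the reference model does; beta scales that difference.
    logits = beta * (policy_margin - ref_margin)
    # -log(sigmoid(logits)), written as a numerically stable softplus(-logits).
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))
```

When the policy and the reference model agree, the loss equals log(2); it drops below that as the policy shifts probability toward the chosen response, so no explicit reward model is needed.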

DPO Trainer by @kashif in https://github.com/lvwerra/trl/pull/416
[DPO] make sure all the concated batches are on same device by @kashif in https://github.com/lvwerra/trl/pull/528
[DPO] remove response/pairs from the DPO side by @kashif in https://github.com/lvwerra/trl/pull/540
[DPO] remove unnecessary batch size arg to Collator by @kashif in https://github.com/lvwerra/trl/pull/554
[DPO] Resolve logging for DPOTrainer by @tomaarsen in https://github.com/lvwerra/trl/pull/570

What's Changed

Reward trainer multi-gpu eval bug by @rlindskog in https://github.com/lvwerra/trl/pull/513
Use local process index for _get_current_device() by @lewtun in https://github.com/lvwerra/trl/pull/515

Extending the `DataCollatorForCompletionOnlyLM`

You can now mask out the user prompts in the DataCollatorForCompletionOnlyLM data collator and train only on the chat completions. Check out the PR below or the corresponding section of the documentation to learn more!
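
The idea behind this masking can be illustrated with a small, library-free sketch. It is not the collator's actual implementation: the helper names `find_all` and `mask_user_turns`, and the token ids used as turn markers, are hypothetical. The sketch assumes each user turn and each assistant turn begins with a known marker token sequence, and that labels set to -100 are skipped by the loss (the default ignore index of PyTorch's cross-entropy loss).

```python
IGNORE_INDEX = -100  # label value ignored by PyTorch's cross-entropy loss by default


def find_all(seq, pattern):
    """Start indices of every occurrence of `pattern` in `seq`."""
    n = len(pattern)
    return [i for i in range(len(seq) - n + 1) if seq[i:i + n] == pattern]


def mask_user_turns(input_ids, instruction_ids, response_ids):
    """Return labels in which only assistant completions keep their token ids.

    Hypothetical helper for illustration; not part of the TRL API.
    """
    labels = [IGNORE_INDEX] * len(input_ids)
    turn_starts = find_all(input_ids, instruction_ids)
    for start in find_all(input_ids, response_ids):
        begin = start + len(response_ids)  # first token after the response marker
        # The completion runs until the next user turn or the end of the sequence.
        end = min((t for t in turn_starts if t > start), default=len(input_ids))
        labels[begin:end] = input_ids[begin:end]
    return labels


# Two-turn toy conversation: [1, 2] marks a user turn, [3, 4] an assistant turn.
conversation = [1, 2, 10, 11, 3, 4, 20, 21, 1, 2, 12, 3, 4, 22]
labels = mask_user_turns(conversation, instruction_ids=[1, 2], response_ids=[3, 4])
```

In this toy conversation only the assistant tokens 20, 21 and 22 keep their labels; every prompt and marker token is set to -100 and therefore contributes nothing to the loss.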

Introducing DataCollatorForChatCompletionOnlyLM by @gaetanlop in https://github.com/lvwerra/trl/pull/456

Important bug fixes

The community reported multiple bugs in the supported trainers; they are fixed in the PRs below:

[core] Fix offline case by @younesbelkada in https://github.com/lvwerra/trl/pull/538
Relax reward trainer constraint by @younesbelkada in https://github.com/lvwerra/trl/pull/539
ADD: num_proc to SFTTrainer by @BramVanroy in https://github.com/lvwerra/trl/pull/547
[SFTTrainer] Add warning for wrong padding_side by @younesbelkada in https://github.com/lvwerra/trl/pull/550
Minor typo and whitespace fixes by @tmm1 in https://github.com/lvwerra/trl/pull/559
[SFTTrainer] Add epochs and num steps on CLI by @younesbelkada in https://github.com/lvwerra/trl/pull/562
Add DataCollatorForCompletionOnlyLM in the docs by @younesbelkada in https://github.com/lvwerra/trl/pull/565
Add comment to explain how the sentiment pipeline is used to run the … by @jvhoffbauer in https://github.com/lvwerra/trl/pull/555
Fix model output dim in reward trainer example by @liutianlin0121 in https://github.com/lvwerra/trl/pull/566
Computes the KL penalty using the entire distribution by @edbeeching in https://github.com/lvwerra/trl/pull/541
Add missing max_seq_length arg to example sft_trainer.py by @SharkWipf in https://github.com/lvwerra/trl/pull/585
[PPO] fix corner cases with PPO batch size and forward_batch_size by @younesbelkada in https://github.com/lvwerra/trl/pull/563
Update the example sft_trainer.py by @ZeusFSX in https://github.com/lvwerra/trl/pull/587
docs: Replace SFTTrainer with RewardTrainer in comment by @tomaarsen in https://github.com/lvwerra/trl/pull/589
Fix comparison in DataCollatorForCompletionOnlyLM (#588) by @RyujiTamaki in https://github.com/lvwerra/trl/pull/594
refactor grad accum by @vwxyzjn in https://github.com/lvwerra/trl/pull/546

Big refactor of examples and documentation

The examples and documentation have been refactored; check the PRs below for more details.

[examples] Big refactor of examples and documentation by @younesbelkada in https://github.com/lvwerra/trl/pull/509
[examples] Fix sentiment nit by @younesbelkada in https://github.com/lvwerra/trl/pull/517
[examples] make the sft script more modulable by @younesbelkada in https://github.com/lvwerra/trl/pull/543
Add use_auth_token arg to sft_trainer example by @corey-lambda in https://github.com/lvwerra/trl/pull/544

New Contributors

@rlindskog made their first contribution in https://github.com/lvwerra/trl/pull/513
@corey-lambda made their first contribution in https://github.com/lvwerra/trl/pull/544
@tmm1 made their first contribution in https://github.com/lvwerra/trl/pull/559
@jvhoffbauer made their first contribution in https://github.com/lvwerra/trl/pull/555
@liutianlin0121 made their first contribution in https://github.com/lvwerra/trl/pull/566
@SharkWipf made their first contribution in https://github.com/lvwerra/trl/pull/585
@ZeusFSX made their first contribution in https://github.com/lvwerra/trl/pull/587
@gaetanlop made their first contribution in https://github.com/lvwerra/trl/pull/456
@RyujiTamaki made their first contribution in https://github.com/lvwerra/trl/pull/594

Full Changelog: https://github.com/lvwerra/trl/compare/v0.4.7...v0.5.0
