QLoRA RLHF, SFT Trainer and RewardTrainer

A new version of TRL that includes training larger models using QLoRA (4 bit quantization through bitsandbytes), brand new classes RewardTrainer and SFTTrainer to easily conduct your RLHF projects end-to-end!

Introducing `SFTTrainer` and `RewardTrainer`

Use the brand new trainer to easily train your reward model and supervised fine-tuned (SFT) model with few lines of code!

[core] officially support SFT (Supervised Finetuning) by @younesbelkada in https://github.com/lvwerra/trl/pull/323
[SFT] Fix sft issues by @younesbelkada in https://github.com/lvwerra/trl/pull/336
[docs] fix SFT doc by @younesbelkada in https://github.com/lvwerra/trl/pull/367
[core] Officially Support Reward Modeling by @younesbelkada in https://github.com/lvwerra/trl/pull/303
Resolve broken evaluation/prediction for RewardTrainer by @tomaarsen in https://github.com/lvwerra/trl/pull/404

QLoRA integration

Pass 4bit models directly into PPOTrainer for more memory efficient training

[core] Add 4bit QLora by @younesbelkada in https://github.com/lvwerra/trl/pull/383
[bnb] fix 4 bit SFT by @younesbelkada in https://github.com/lvwerra/trl/pull/396

Updated StackLlama example

Great work by @mnoukhov that managed to fix the issues related with StackLlama and the new versions of accelerate, peft and transformers. The completely reproducible examples below:

StackLLaMA: correctly merge peft model by @mnoukhov in https://github.com/lvwerra/trl/pull/398
StackLlama: fixed RL training and added args by @mnoukhov in https://github.com/lvwerra/trl/pull/400
Fixed some type annotations of trl.trainer.PPoTrainer by @JulesGM in https://github.com/lvwerra/trl/pull/392
StackLLaMA: fix supervised finetuning and reward model training by @mnoukhov in https://github.com/lvwerra/trl/pull/399

Bug fixes and improvements

[core] refactor peft API by @younesbelkada in https://github.com/lvwerra/trl/pull/231
Batched generation by @lvwerra in https://github.com/lvwerra/trl/pull/228
Reduce memory consumption in batched_forward_pass by @ohashi56225 in https://github.com/lvwerra/trl/pull/234
[core] Add warning when negative KL by @younesbelkada in https://github.com/lvwerra/trl/pull/239
adds early stopping by @edbeeching in https://github.com/lvwerra/trl/pull/238
PPO config init is bloated by @GauravVirmani in https://github.com/lvwerra/trl/pull/241
feat(ci): enable pip cache by @SauravMaheshkar in https://github.com/lvwerra/trl/pull/198
Improve logging for PPO + Docs page by @natolambert in https://github.com/lvwerra/trl/pull/243
Fix typo by @heya5 in https://github.com/lvwerra/trl/pull/253
Using batched generate in sentiment scripts by @GauravVirmani in https://github.com/lvwerra/trl/pull/249
[core] Fix DeepSpeed zero-3 issue by @younesbelkada in https://github.com/lvwerra/trl/pull/182
[distributed] Fix early stopping and DP by @younesbelkada in https://github.com/lvwerra/trl/pull/254
[core] Fix ds issue by @younesbelkada in https://github.com/lvwerra/trl/pull/260
Add LlaMa in tests + create_reference_model by @younesbelkada in https://github.com/lvwerra/trl/pull/261
Use active model to generate response in example on README (#269) by @rmill040 in https://github.com/lvwerra/trl/pull/271
stack-llama by @edbeeching in https://github.com/lvwerra/trl/pull/273
Adding pointer back to Meta's LLaMA. by @meg-huggingface in https://github.com/lvwerra/trl/pull/277
fix doc string problem in ppo trainer loss function by @thuwyh in https://github.com/lvwerra/trl/pull/279
Add LLaMA tutorial to docs by @natolambert in https://github.com/lvwerra/trl/pull/278
Fix swapped helper texts by @philipp-classen in https://github.com/lvwerra/trl/pull/284
fix typo in gpt2-sentiment.ipynb by @eltociear in https://github.com/lvwerra/trl/pull/293
add functionality to push best models to the hub during training by @Bearnardd in https://github.com/lvwerra/trl/pull/275
Small improvements / fixes to toxicity example by @natolambert in https://github.com/lvwerra/trl/pull/266
Fix arguments description by @lvzii in https://github.com/lvwerra/trl/pull/298
[t5] Fix negative kl issue by @younesbelkada in https://github.com/lvwerra/trl/pull/262
Log Token distribution of Query / Response by @natolambert in https://github.com/lvwerra/trl/pull/295
clean examples folder by @natolambert in https://github.com/lvwerra/trl/pull/294
fixed typo in error message by @soerenarlt in https://github.com/lvwerra/trl/pull/312
fix DS for peft ref_model in ppo trainer by @halfrot in https://github.com/lvwerra/trl/pull/309
[CI] Fix broken tests by @younesbelkada in https://github.com/lvwerra/trl/pull/318
[Docs] Add details on multi-GPU / multi-node by @younesbelkada in https://github.com/lvwerra/trl/pull/320
Give a key to the wandb PPOConfig config entry by @JulesGM in https://github.com/lvwerra/trl/pull/315
added doc for using torch.distributed.launch/run by @oroojlooy in https://github.com/lvwerra/trl/pull/324
Fix argument's description by @vinhkhuc in https://github.com/lvwerra/trl/pull/339
stack_llama: update instructions in README, fix broken _get_submodules and save tokenizer by @teticio in https://github.com/lvwerra/trl/pull/358
stack_llama: add parameter to control max_length (to mitigate OOM errors) by @teticio in https://github.com/lvwerra/trl/pull/359
[PPO] Relax negative KL constraint by @younesbelkada in https://github.com/lvwerra/trl/pull/352
[PPOTrainer] Fix tensorboard issue by @younesbelkada in https://github.com/lvwerra/trl/pull/330
140/best n sampling by @metric-space in https://github.com/lvwerra/trl/pull/326
Fix bug when loading local peft model by @Opdoop in https://github.com/lvwerra/trl/pull/342
add is_trainable in kwargs by @Opdoop in https://github.com/lvwerra/trl/pull/363
Remove obsolete layer_norm_names parameter and add peft>=0.3.0 to requirements by @teticio in https://github.com/lvwerra/trl/pull/366
Delete test_training.py by @younesbelkada in https://github.com/lvwerra/trl/pull/371
[core] Fix warning issue by @younesbelkada in https://github.com/lvwerra/trl/pull/377
Update customization.mdx by @binganao in https://github.com/lvwerra/trl/pull/390
fix dataloader typo in ppo_trainer.py by @LZY-the-boys in https://github.com/lvwerra/trl/pull/389
from_pretrain with peft adapter on the hub (# 379) by @glerzing in https://github.com/lvwerra/trl/pull/380
keep state_dict kwargs instead of popping it in save_pretrained by @rizar in https://github.com/lvwerra/trl/pull/393
Remove unused imports in docs. by @vwxyzjn in https://github.com/lvwerra/trl/pull/406

New Contributors

@ohashi56225 made their first contribution in https://github.com/lvwerra/trl/pull/234
@GauravVirmani made their first contribution in https://github.com/lvwerra/trl/pull/241
@SauravMaheshkar made their first contribution in https://github.com/lvwerra/trl/pull/198
@heya5 made their first contribution in https://github.com/lvwerra/trl/pull/253
@rmill040 made their first contribution in https://github.com/lvwerra/trl/pull/271
@thuwyh made their first contribution in https://github.com/lvwerra/trl/pull/279
@philipp-classen made their first contribution in https://github.com/lvwerra/trl/pull/284
@Bearnardd made their first contribution in https://github.com/lvwerra/trl/pull/275
@lvzii made their first contribution in https://github.com/lvwerra/trl/pull/298
@soerenarlt made their first contribution in https://github.com/lvwerra/trl/pull/312
@halfrot made their first contribution in https://github.com/lvwerra/trl/pull/309
@oroojlooy made their first contribution in https://github.com/lvwerra/trl/pull/324
@vinhkhuc made their first contribution in https://github.com/lvwerra/trl/pull/339
@teticio made their first contribution in https://github.com/lvwerra/trl/pull/358
@metric-space made their first contribution in https://github.com/lvwerra/trl/pull/326
@Opdoop made their first contribution in https://github.com/lvwerra/trl/pull/342
@binganao made their first contribution in https://github.com/lvwerra/trl/pull/390
@LZY-the-boys made their first contribution in https://github.com/lvwerra/trl/pull/389
@glerzing made their first contribution in https://github.com/lvwerra/trl/pull/380
@rizar made their first contribution in https://github.com/lvwerra/trl/pull/393
@mnoukhov made their first contribution in https://github.com/lvwerra/trl/pull/398
@tomaarsen made their first contribution in https://github.com/lvwerra/trl/pull/404
@vwxyzjn made their first contribution in https://github.com/lvwerra/trl/pull/406

Full Changelog: https://github.com/lvwerra/trl/compare/v0.4.1...v0.4.2

v0.4.2