- decoder_input_ids in DPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2208

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.2...v0.11.3
Full Changelog: https://github.com/huggingface/trl/compare/v0.11.1...v0.11.2
Full Changelog: https://github.com/huggingface/trl/compare/v0.11.0...v0.11.1
We are excited to introduce the new v0.11.0 release, with many new features and post-training algorithms. The highlights are as follows:
Generalized Knowledge Distillation (GKD) is a post-training method from Google DeepMind that extends standard knowledge distillation by allowing the student to generate outputs during training and receive online feedback from the teacher. It consistently outperforms SFT and in some cases enables the student model to match the performance of the teacher, but with far fewer parameters.
To train models with this method, check out the GKDTrainer.
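Conceptually, GKD trains the student to match the teacher under a generalized Jensen-Shannon divergence, with an interpolation coefficient that moves between forward and reverse KL. Below is a minimal sketch of that divergence on toy discrete distributions; this is one common parameterization, and the exact weighting used by GKDTrainer may differ:

```python
import math

def kl(p, q):
    # KL divergence between two discrete distributions given as probability lists
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p_teacher, p_student, beta=0.5):
    # Mixture distribution, then beta-weighted KL terms toward the mixture.
    # beta=0.5 recovers the standard (symmetric) Jensen-Shannon divergence.
    m = [beta * pt + (1 - beta) * ps for pt, ps in zip(p_teacher, p_student)]
    return beta * kl(p_teacher, m) + (1 - beta) * kl(p_student, m)

teacher = [0.7, 0.2, 0.1]
student = [0.5, 0.3, 0.2]
jsd = generalized_jsd(teacher, student, beta=0.5)
print(jsd)
```

In GKD, the student's own on-policy samples are scored by the teacher, and a divergence of this kind is minimized over the sampled tokens.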
Exploratory Preference Optimization is an online post-training method from researchers at Microsoft, MIT, and Wisconsin that extends DPO to incorporate online feedback from reward models or LLM judges. It is similar to online DPO, but has a slightly different theoretical basis concerning sample efficiency.
To train models with this method, check out the XPOTrainer.
Nash Learning with Human Feedback is a novel post-training method from Google DeepMind that uses pairwise preference models which are conditioned on two inputs, instead of the single one used in reward models. These preference models are then used to train a policy that consistently produces responses that are preferred over those from competing policies, thus approximating a Nash equilibrium (i.e. a two player game where actions are responses and payoffs are given by the preference model).
To train models with this method, check out the NashMDTrainer.
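To make the "conditioned on two inputs" point concrete, here is a toy pairwise preference model with an entirely hypothetical scoring heuristic (a real preference model is a network conditioned on both responses jointly): it takes both candidate responses and returns P(a preferred over b), and by construction satisfies P(a ≻ b) + P(b ≻ a) = 1:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def preference(prompt, response_a, response_b):
    # Hypothetical toy heuristic: count prompt words shared with each response,
    # then squash the score difference into a preference probability.
    score = lambda r: len(set(prompt.split()) & set(r.split()))
    return sigmoid(score(response_a) - score(response_b))

p = preference("what is the capital of France",
               "the capital of France is Paris",
               "I like turtles")
print(p)  # > 0.5: the first response is preferred
```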
- ORPOTrainer has better integration with PyTorch/XLA for faster step time on TPUs ⚡. By @wenxindongwork in https://github.com/huggingface/trl/pull/2001
- PPOTrainer is marked as deprecated in favour of PPOv2Trainer to provide a consistent API across TRL's trainers. It will be removed in v0.12.0. By @qgallouedec in https://github.com/huggingface/trl/pull/2016
- RichProgressCallback has been removed from the example scripts as it caused a variety of problems with logging in distributed environments. You can still use it by adding it manually to the trainer callbacks. By @lewtun in https://github.com/huggingface/trl/pull/2053
- prompts arg from WinrateCallback by @qgallouedec in https://github.com/huggingface/trl/pull/2010
- WinRateCallback to be used without reference model by @qgallouedec in https://github.com/huggingface/trl/pull/2013
- core.py by @northern-64bit in https://github.com/huggingface/trl/pull/2017
- packing doc in SFTConfig and fix error when neither dataset_text_field nor formatting_func is provided. by @qgallouedec in https://github.com/huggingface/trl/pull/2035
- aux_loss_enabled is set to True. by @Jonathanjordan21 in https://github.com/huggingface/trl/pull/2039
- non_eos_penalty to be consistent across OnPolicy trainers by @RylanSchaeffer in https://github.com/huggingface/trl/pull/2033
- debug and sanity_check args by @qgallouedec in https://github.com/huggingface/trl/pull/2055
- SFTTrainer.evaluate() and SFTTrainer.predict() with null train_dataset by @Sohaib9920 in https://github.com/huggingface/trl/pull/2004
- ConstantLengthDataset (or packing=True) shuffle examples before they are packed by @muupan in https://github.com/huggingface/trl/pull/2037
- WinRateCallback and LogCompletionsCallback by @lewtun in https://github.com/huggingface/trl/pull/2061
- transformers utilities when possible by @qgallouedec in https://github.com/huggingface/trl/pull/2064
- ref_policy and policy have different identities by @RylanSchaeffer in https://github.com/huggingface/trl/pull/2057
- processor(prompt, images=image) to processor(images=image, text=prompt) by @qgallouedec in https://github.com/huggingface/trl/pull/2076
- WinRateCallback and set default freq to eval_steps in LogCompletionsCallback by @lewtun in https://github.com/huggingface/trl/pull/2074
- logits/chosen and logits/rejected metrics in kto_trainer. by @PhilipMay in https://github.com/huggingface/trl/pull/2077
- PPOv2Trainer by @qgallouedec in https://github.com/huggingface/trl/pull/2080

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.11.0
We are excited to introduce the new v0.10.1 release, with many new exciting features and post-training algorithms. The highlights are as follows:
Online DPO is a new alignment method from DeepMind to boost the performance of LLMs. With Online DPO, data is generated on the fly by the trained model (instead of pre-collected). For each prompt, two completions are generated, with a reward model selecting the preferred one. This approach:
To train models with this method, use the OnlineDPOTrainer
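The online loop can be sketched in a few lines: sample two completions per prompt, let the reward model pick the winner, and apply the standard DPO loss to the freshly created pair. The numbers below are hypothetical stand-ins for real model calls:

```python
import math

def dpo_loss(beta, logp_w, logp_l, ref_logp_w, ref_logp_l):
    # -log sigmoid of the beta-scaled difference of policy-vs-reference log-ratios
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1 / (1 + math.exp(-margin)))

# In Online DPO the preference pair is created on the fly: the policy samples
# two completions per prompt and a reward model selects the preferred one.
logp_a, logp_b = -12.0, -15.0   # policy log-probs of the two sampled completions
ref_a, ref_b = -13.0, -14.0     # reference-model log-probs
reward_a, reward_b = 0.8, 0.2   # reward-model scores
if reward_a >= reward_b:
    loss = dpo_loss(0.1, logp_a, logp_b, ref_a, ref_b)
else:
    loss = dpo_loss(0.1, logp_b, logp_a, ref_b, ref_a)
print(round(loss, 4))
```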
Liger Kernel support has been added to SFTTrainer for faster throughput and lower memory usage. To use it, set use_liger_kernel in SFTConfig.

DPO for vision-language models is supported via the dpo_visual.py script, which can be run as follows:

```shell
accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --trust_remote_code \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16
```
WinRateCallback can be attached to a trainer as follows:

```python
trainer = DPOTrainer(...)
win_rate_callback = WinRateCallback(..., trainer=trainer)
trainer.add_callback(win_rate_callback)
```
Two new loss variants are available for DPO: apo_zero and apo_down. The apo_zero loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, apo_down decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. To use these losses, set loss_type="apo_zero" or loss_type="apo_down" in the DPOConfig.

- model_ref naming by @qgallouedec in https://github.com/huggingface/trl/pull/1835
- CI_HUB_USER_TOKEN by @qgallouedec in https://github.com/huggingface/trl/pull/1852
- setup_chat_format by @Rishav-hub in https://github.com/huggingface/trl/pull/1862
- evaluation_strategy -> eval_strategy by @qgallouedec in https://github.com/huggingface/trl/pull/1894
- setUpClass in reward tester by @qgallouedec in https://github.com/huggingface/trl/pull/1895
- IterableDataset for SFTTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/1899
- AlignPropTrainer import by @qgallouedec in https://github.com/huggingface/trl/pull/1908
- lr_scheduler.step() after optimizer.step() by @qgallouedec in https://github.com/huggingface/trl/pull/1918
- torch.cuda.amp.autocast() -> torch.amp.autocast("cuda") by @qgallouedec in https://github.com/huggingface/trl/pull/1921
- dataset_num_proc usage by @qgallouedec in https://github.com/huggingface/trl/pull/1925
- PartialState().local_main_process_first() when map in examples by @qgallouedec in https://github.com/huggingface/trl/pull/1926
- push_to_hub by @qgallouedec in https://github.com/huggingface/trl/pull/1945
- padding_free to DataCollatorForCompletionOnlyLM by @RhuiDih in https://github.com/huggingface/trl/pull/1887
- PairRMJudge to top-level import by @qgallouedec in https://github.com/huggingface/trl/pull/1985
- torch.load with weights_only=True by @qgallouedec in https://github.com/huggingface/trl/pull/1988

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.10
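As a sketch of how the two APO variants differ, here is a pure-Python rendering based on the formulation in TRL's DPO loss code; the log-ratios below are assumed to be policy-minus-reference log-probabilities for the winning and losing outputs, and the exact details should be checked against the source:

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

def apo_zero(beta, chosen_logratio, rejected_logratio):
    # Rewards the chosen likelihood going up AND the rejected likelihood going down.
    return (1 - sigmoid(beta * chosen_logratio)) + sigmoid(beta * rejected_logratio)

def apo_down(beta, chosen_logratio, rejected_logratio):
    # Pushes both likelihoods down, with extra pressure on the rejected output
    # via the chosen-minus-rejected margin term.
    return sigmoid(beta * chosen_logratio) + (1 - sigmoid(beta * (chosen_logratio - rejected_logratio)))

# apo_zero falls as the chosen output becomes more likely than under the reference
print(apo_zero(0.1, 2.0, -2.0), apo_zero(0.1, 0.0, 0.0))
```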
We are excited to introduce the new v0.9.6 release, with many new exciting features and algorithms. The highlights are as follows:
Support for SimPO by @fe1ixxu, a reference-free method that also regularizes output length. To use this loss, set loss_type="simpo" and cpo_alpha=0 in the CPOConfig and use it with the CPOTrainer.
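The SimPO objective itself is simple enough to sketch in a few lines: implicit rewards are length-normalized average log-probabilities scaled by beta, compared against a target margin gamma, with no reference model involved. The values below are hypothetical:

```python
import math

def simpo_loss(beta, gamma, logp_w, len_w, logp_l, len_l):
    # Length-normalized implicit rewards (average log-prob scaled by beta),
    # then -log sigmoid of the reward gap minus the target margin gamma.
    reward_w = beta * logp_w / len_w
    reward_l = beta * logp_l / len_l
    margin = reward_w - reward_l - gamma
    return -math.log(1 / (1 + math.exp(-margin)))

loss = simpo_loss(beta=2.0, gamma=0.5, logp_w=-20.0, len_w=10, logp_l=-40.0, len_l=12)
print(round(loss, 4))
```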
Added AlignProp by @mihirp1998, a method for fine-tuning Stable Diffusion models using reward gradients.
Added Efficient Exact Optimization (EXO) by @haozheji
We also included many important fixes and improvements such as fixing prints in the CLI with GCP containers by @alvarobartt. Enjoy the release!
- numpy to !=2.0.0 for CI and to users by @younesbelkada in https://github.com/huggingface/trl/pull/1747
- TrlParser: Add ignore extra args option by @younesbelkada in https://github.com/huggingface/trl/pull/1748
- KTOTrainer: Remove old tests by @younesbelkada in https://github.com/huggingface/trl/pull/1750
- process function in the example of DPO by @AIR-hl in https://github.com/huggingface/trl/pull/1753
- evaluation_strategy to eval_strategy by @qgallouedec in https://github.com/huggingface/trl/pull/1771
- torch_dtype handling in {DPO,SFT}Trainer when provided via CLI by @alvarobartt in https://github.com/huggingface/trl/pull/1807
- TRL_USE_RICH environment variable handling by @alvarobartt in https://github.com/huggingface/trl/pull/1808

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.4...v0.9.6
Mainly backward compatibility fixes with SFTTrainer.
- TrainingArguments by @younesbelkada in https://github.com/huggingface/trl/pull/1707

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.3...v0.9.4
We are excited to introduce the new v0.9.3 release, with many new exciting features and algorithms. The highlights are as follows:
https://github.com/huggingface/trl/assets/5555347/6575a879-cb2f-4e2e-bb84-a76707f9de84
- [SFTTrainer] Add warning in SFTTrainer when dataset already processed by @younesbelkada in https://github.com/huggingface/trl/pull/1577
- SftArgumentParser by @younesbelkada in https://github.com/huggingface/trl/pull/1602
- [KTOTrainer] add BCO (reward shift and underlying distribution matching) by @seanexp in https://github.com/huggingface/trl/pull/1599
- tests/ from package data by @jamesbraza in https://github.com/huggingface/trl/pull/1607
- evaluation_strategy by @muellerzr in https://github.com/huggingface/trl/pull/1559
- enable_input_require_grads issues with PPO models by @younesbelkada in https://github.com/huggingface/trl/pull/1664
- args=None by @younesbelkada in https://github.com/huggingface/trl/pull/1678

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.6...v0.9.2
Full Changelog: https://github.com/huggingface/trl/compare/v0.8.5...v0.8.6
Full Changelog: https://github.com/huggingface/trl/compare/v0.8.4...v0.8.5
This patch release includes important fixes for the CLI and the KTO & CPO trainers.
- dataset_text_field to None to allow ChatML automatic template by @younesbelkada in https://github.com/huggingface/trl/pull/1545

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.3...v0.8.4
This is a patch release that includes an import fix for CLIs
Full Changelog: https://github.com/huggingface/trl/compare/v0.8.2...v0.8.3
This release includes two new trainers: ORPO from KAIST and CPO
The release also includes support for vision LLMs such as LLaVA in SFTTrainer; please see https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py for more details.
- use_cache=False in {ORPO,CPO}Trainer.concatenated_forward by @alvarobartt in https://github.com/huggingface/trl/pull/1478
- input_ids instead by @alvarobartt in https://github.com/huggingface/trl/pull/1516

You can now use SFTTrainer to fine-tune VLMs such as LLaVA!
See: https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py for more details
Many fixes were introduced for the KTOTrainer:
- RichProgressCallback by @eggry in https://github.com/huggingface/trl/pull/1496

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.1...v0.8.2
This patch release includes some important fixes for CLIs
Full Changelog: https://github.com/huggingface/trl/compare/v0.8.0...v0.8.1
We recently introduced the KTOTrainer to run the KTO algorithm on LLMs!
Run SFT, DPO and chat with your aligned model directly from the terminal.

SFT:

```shell
trl sft --model_name_or_path facebook/opt-125m --dataset_name imdb --output_dir opt-sft-imdb
```

DPO:

```shell
trl dpo --model_name_or_path facebook/opt-125m --dataset_name trl-internal-testing/Anthropic-hh-rlhf-processed --output_dir opt-sft-hh-rlhf
```

Chat:

```shell
trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat
```
Read more about CLI in the relevant documentation section or use --help for more details.
- model --> model_name_or_path by @lvwerra in https://github.com/huggingface/trl/pull/1452

SFTTrainer now supports FSDP + QLoRA
- [SFTTrainer] Add eval_packing by @younesbelkada in https://github.com/huggingface/trl/pull/1369
- force_use_ref_model for power users by @younesbelkada in https://github.com/huggingface/trl/pull/1367
- [RewardModeling] Fix RM script for PEFT by @younesbelkada in https://github.com/huggingface/trl/pull/1393

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.11...v0.8.0
We fixed issues with the IPO loss, leading to consistent results according to the newest experiments:
We also fixed important bugs with respect to DPO / PEFT and Flash Attention
- [DPOTrainer] Fix DPO trainer + mistral + FA2 by @younesbelkada in https://github.com/huggingface/trl/pull/1290

Data processing is now faster for multi-GPU environments.
- [DPOTrainer] Load data only on main process + fix dpo example test by @younesbelkada in https://github.com/huggingface/trl/pull/1291

Other DPO bugfixes:
- [PEFT + DPO] Raise value error if one passes a ref_model and a peft_config by @younesbelkada in https://github.com/huggingface/trl/pull/1289

Models now get tagged correctly even if users do not call trainer.push_to_hub()
- [core / xxxTrainer] Automatic tagging by @younesbelkada in https://github.com/huggingface/trl/pull/1329
- DPOTrainer docstrings by @alvarobartt in https://github.com/huggingface/trl/pull/1298
- [core / DDPO] Fix diffusers import issue by @younesbelkada in https://github.com/huggingface/trl/pull/1314
- [CI] Add tests on transformers peft main on push main by @younesbelkada in https://github.com/huggingface/trl/pull/1328

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.10...v0.7.11
setup_chat_format API, stronger tests

This patch release adds a new feature in TRL for dealing with chat datasets: you can now load an already formatted dataset directly, without needing to format it beforehand.
Read more about it here: https://huggingface.co/docs/trl/sft_trainer#dataset-format-support
The release also introduces a new API, setup_chat_format, to correctly resize the model embeddings to the target size when adding new tokens to comply with the chat format. Currently only the ChatML format is supported; more formats may be added in the future.
Read more about it here: https://huggingface.co/docs/trl/sft_trainer#add-special-tokens-for-chat-format
We also extensively test SFTTrainer, DPOTrainer, and the example scripts; dpo.py and sft.py should be well battle-tested. If you see any issue with the scripts, please let us know on GitHub.
- [core / Docker] Add workflow to build TRL docker images by @younesbelkada in https://github.com/huggingface/trl/pull/1215
- pad_token_id is not configured by @yumemio in https://github.com/huggingface/trl/pull/1152
- [core / tests] v1 slow tests by @younesbelkada in https://github.com/huggingface/trl/pull/1218
- [core / SFTTrainer] Fix breaking change by @younesbelkada in https://github.com/huggingface/trl/pull/1229
- setup_chat_format for adding new special tokens to model for training chat models by @philschmid in https://github.com/huggingface/trl/pull/1242

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.9...v0.7.10
This is a patch release that fixes critical issues with SFTTrainer & DPOTrainer, together with minor fixes for PPOTrainer and DataCollatorForCompletionOnlyLM
- [DPOTrainer] Fix peft + DPO + bf16 if one uses generate_during_eval or pre-computed logits by @younesbelkada in https://github.com/huggingface/trl/pull/1203

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.8...v0.7.9
xxxTrainer

If users use the Unsloth library, the unsloth tag gets automatically pushed to the Hub.
- [xxxTrainer] Add unsloth tag by @younesbelkada in https://github.com/huggingface/trl/pull/1130

Some important fixes for DPO have been introduced to address https://twitter.com/jon_durbin/status/1743575483365699809 and to make DPO faster.
Now DDPO supports PEFT
- peft in ddpo. by @sayakpaul in https://github.com/huggingface/trl/pull/1165

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.7...v0.7.8
A fix has been introduced for a breaking change with PPOTrainer.push_to_hub() and DDPOTrainer.push_to_hub()
- [PPOTrainer / DDPOTrainer] Fix ppo & ddpo push to Hub by @younesbelkada in https://github.com/huggingface/trl/pull/1141

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.6...v0.7.7