Online DPO initially only supported a reward model that had the same tokenizer and chat template as the trained model. Now, you can use any reward model.
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoModelForSequenceClassification, AutoTokenizer
from trl import OnlineDPOConfig, OnlineDPOTrainer
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, padding_side="left")
reward_model = AutoModelForSequenceClassification.from_pretrained(reward_model_name, num_labels=1)
reward_tokenizer = AutoTokenizer.from_pretrained(reward_model_name, truncation=True, truncation_side="left")
dataset = load_dataset(dataset_name, split="train")
training_args = OnlineDPOConfig(output_dir="...")
trainer = OnlineDPOTrainer(
model=model,
reward_model=reward_model,
args=training_args,
train_dataset=dataset,
processing_class=tokenizer,
reward_processing_class=reward_tokenizer,
)
trainer.train()
by @qgallouedec in https://github.com/huggingface/trl/pull/2276
PPOv2 -> PPO
The PPOv2 trainer has been renamed to PPO, and the old PPO trainer has been removed. The PPOv2 name still works but is deprecated and will be removed in the next release.
- trainer = PPOv2Trainer(...)
+ trainer = PPOTrainer(...)
by @qgallouedec in https://github.com/huggingface/trl/pull/2174
ScriptArguments
We had ScriptArguments, SFTScriptArguments, DPOScriptArguments, and RewardScriptArguments. Since they mostly share the same fields, we've merged them into a single ScriptArguments class.
SFTScriptArguments, DPOScriptArguments and RewardScriptArguments still exist but are deprecated and will be removed in the next release.
- script_args = DPOScriptArguments(...)
+ script_args = ScriptArguments(...)
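Conceptually, the merge looks like this. The sketch below uses a plain dataclass with illustrative field names (`dataset_name`, `dataset_train_split`, `dataset_test_split`, `config` are assumptions for the example, not TRL's exact definition):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScriptArguments:
    """One shared argument class for all example scripts.

    Field names are illustrative; see trl.ScriptArguments for the
    actual definition.
    """

    dataset_name: str = "trl-lib/ultrafeedback_binarized"
    dataset_train_split: str = "train"
    dataset_test_split: str = "test"
    config: Optional[str] = None


# The same class can now serve SFT, DPO, and reward-model scripts alike.
args = ScriptArguments(dataset_name="my-org/my-dataset")
```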
by @qgallouedec in https://github.com/huggingface/trl/pull/2145
When called via its judge method, the PairRMJudge now accepts a return_scores flag. When set, the judge returns the probability score of the first completion in each pair instead of the rank of the preferred completion. The logits used to compute the probability score can be scaled by an optional temperature parameter.
from trl import PairRMJudge
pairrm_judge = PairRMJudge()
prompts = ["Translate 'hello' to French", "What's the capital of Japan?"]
completions = [["Bonjour", "Salut"], ["Kyoto", "Tokyo"]]
results = pairrm_judge.judge(prompts, completions, return_scores=True)
print(results) # [0.7492601275444031, 0.0005497377132996917]
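To illustrate what the temperature does, here is a minimal stdlib sketch of temperature-scaled pairwise probabilities (a generic two-way softmax, not PairRM's actual implementation; the function name is hypothetical):

```python
import math


def pair_probability(logit_first: float, logit_second: float, temperature: float = 1.0) -> float:
    """Probability that the first completion wins, from a pair of logits.

    Dividing the logits by a temperature before the softmax flattens
    (T > 1) or sharpens (T < 1) the resulting probability.
    """
    a = logit_first / temperature
    b = logit_second / temperature
    # Two-way softmax, written as a numerically stable sigmoid.
    return 1.0 / (1.0 + math.exp(b - a))


p = pair_probability(2.0, -1.0)                         # sharp preference for the first completion
p_soft = pair_probability(2.0, -1.0, temperature=5.0)   # same logits, softened toward 0.5
```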
by @kashif in https://github.com/huggingface/trl/pull/2221
The OnlineDPOTrainer and any trainers that inherit from it (NashMDTrainer and XPOTrainer) can now accept an initialized PairwiseJudge instead of a reward model.
from datasets import load_dataset
from trl import OnlineDPOConfig, OnlineDPOTrainer, PairRMJudge
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2-0.5B-Instruct")
judge = PairRMJudge()
train_dataset = load_dataset("trl-lib/ultrafeedback-prompt", split="train")
training_args = OnlineDPOConfig(output_dir="Qwen2-0.5B-OnlineDPO", logging_steps=10)
trainer = OnlineDPOTrainer(
model=model, judge=judge, args=training_args, processing_class=tokenizer, train_dataset=train_dataset
)
trainer.train()
by @kashif in https://github.com/huggingface/trl/pull/2243
tokenizer to processing_class
The tokenizer argument in the trainers has been renamed to processing_class to better reflect the fact that it can be not only a tokenizer but also a processor.
- trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, tokenizer=tokenizer)
+ trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
tokenizer is still supported for SFTTrainer and DPOTrainer but deprecated and will be removed in the next release.
by @qgallouedec in https://github.com/huggingface/trl/pull/2162
The WPO paper adapts off-policy data to resemble on-policy data more closely by reweighting preference pairs according to their probability under the current policy. To use this method, set the use_weighting flag to True in the DPOConfig.
DPOConfig(..., use_weighting=True)
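The reweighting idea can be sketched as follows: each preference pair is weighted by how likely the current policy is to produce its completions, so on-policy-looking pairs contribute more to the loss. This is a stdlib sketch of the idea under the assumption of length-normalized log-probabilities, not TRL's exact implementation:

```python
import math


def wpo_pair_weight(chosen_token_logps, rejected_token_logps):
    """Weight a preference pair by how likely its completions are
    under the current policy (higher weight = more on-policy).

    Uses the mean per-token log-probability so weights stay
    comparable across completions of different lengths.
    """
    w_chosen = math.exp(sum(chosen_token_logps) / len(chosen_token_logps))
    w_rejected = math.exp(sum(rejected_token_logps) / len(rejected_token_logps))
    return w_chosen * w_rejected


# An on-policy-looking pair (high log-probs) outweighs an off-policy one.
on_policy = wpo_pair_weight([-0.1, -0.2], [-0.3, -0.1])
off_policy = wpo_pair_weight([-3.0, -4.0], [-2.5, -3.5])
```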
<img width="1112" alt="Screenshot 2024-11-04 at 10 59 38" src="https://github.com/user-attachments/assets/544ddc02-bd09-4f21-b8a4-b81c21561a9b">
<img width="539" alt="Screenshot 2024-11-04 at 10 59 22" src="https://github.com/user-attachments/assets/8d5afe9e-89bd-4d00-8483-dd7ba98997e7">
by @gaetanlop in https://github.com/huggingface/trl/pull/2141
Using trainer.push_to_hub() now automatically creates a model card. All links are properly formatted to allow cross-referencing, enabling traceability back to sources (e.g., the model appears linked on the paper’s page).
https://github.com/user-attachments/assets/b903964e-9087-45cc-8fb0-2418fdd87b72
by @qgallouedec in https://github.com/huggingface/trl/pull/2123
You can now use conversational datasets directly, without needing to apply a chat template beforehand, for the following trainers:
- BCOTrainer (by @qgallouedec in PR #2107)
- CPOTrainer (by @qgallouedec in PR #2144)
- DPOTrainer (by @qgallouedec in PR #2131)
- KTOTrainer (by @qgallouedec in PR #2248)
- ORPOTrainer (by @qgallouedec in PR #2184)
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForCausalLM
from trl import DPOConfig, DPOTrainer
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)
dataset = load_dataset(dataset_name, split="train")
# Not needed anymore:
#
# def process(example):
# prompt = tokenizer.apply_chat_template(example["prompt"], tokenize=False, add_generation_prompt=True)
# prompt_chosen = tokenizer.apply_chat_template(example["prompt"] + example["chosen"], tokenize=False)
# chosen = prompt_chosen[len(prompt) :]
# prompt_rejected = tokenizer.apply_chat_template(example["prompt"] + example["rejected"], tokenize=False)
# rejected = prompt_rejected[len(prompt) :]
# return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
#
# dataset = dataset.map(process)
training_args = DPOConfig(output_dir="...")
trainer = DPOTrainer(model, args=training_args, train_dataset=dataset, processing_class=tokenizer)
trainer.train()
For more information, see PR #2209.
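For reference, a conversational preference example is just lists of role/content messages, with no chat template applied. The content strings below are made up for illustration:

```python
# A conversational preference example as the trainers now accept it directly:
# each field is a list of {"role": ..., "content": ...} messages.
example = {
    "prompt": [{"role": "user", "content": "What color is the sky?"}],
    "chosen": [{"role": "assistant", "content": "It is blue."}],
    "rejected": [{"role": "assistant", "content": "It is green."}],
}
```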
trl env for printing system info
You can now use trl env to print system information, including the platform, Python version, PyTorch version, CUDA device(s), and versions of various libraries.
$ trl env
Copy-paste the following information when reporting an issue:
- Platform: Linux-5.15.0-1048-aws-x86_64-with-glibc2.31
- Python version: 3.11.9
- PyTorch version: 2.4.0
- CUDA device(s): NVIDIA H100 80GB HBM3
- Transformers version: 4.47.0.dev0
- Accelerate version: 0.19.0
- Accelerate config: not found
- Datasets version: 3.0.2
- HF Hub version: 0.26.1
- TRL version: 0.12.0+14ef1ab
- bitsandbytes version: 0.44.1
- DeepSpeed version: 0.15.3
- Diffusers version: 0.30.3
- Liger-Kernel version: 0.3.0
- LLM-Blender version: 0.0.2
- OpenAI version: 1.46.0
- PEFT version: 0.13.2
by @qgallouedec in https://github.com/huggingface/trl/pull/2104
From GKD paper:
Sequence-Level KD (Kim & Rush, 2016). SeqKD maximizes the likelihood of high probability sequences generated by the teacher, and can be viewed as supervised FT on teacher-generated outputs.
SeqKD is taken as a baseline in the paper. It is now possible to use Sequence-Level KD in the GKDTrainer by setting seq_kd=True in the GKDConfig.
training_args = GKDConfig(..., seq_kd=True)
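As the quoted description says, SeqKD amounts to supervised fine-tuning on teacher-generated outputs: a plain negative log-likelihood of the student on the teacher's tokens. A minimal stdlib sketch of that loss (the function name and per-token probability inputs are illustrative, not the GKDTrainer internals):

```python
import math


def seq_kd_loss(student_token_probs):
    """Sequence-level KD as supervised fine-tuning on a teacher-generated
    sequence: the student's negative log-likelihood of the teacher's
    tokens, averaged over the sequence length.
    """
    return -sum(math.log(p) for p in student_token_probs) / len(student_token_probs)


# A student that tracks the teacher closely incurs a lower loss.
close = seq_kd_loss([0.9, 0.8, 0.95])
far = seq_kd_loss([0.2, 0.1, 0.3])
```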
by @mst272 in https://github.com/huggingface/trl/pull/2220
dataset_text_field to "text"
Since many users use "text" as the column name for textual data in datasets, we've made it the default (previously a required argument) in SFTConfig. Now, specifying dataset_text_field="text" is no longer necessary.
SFTConfig(
...,
- dataset_text_field="text",
)
by @qgallouedec in https://github.com/huggingface/trl/pull/2078
- training_args by @qgallouedec in https://github.com/huggingface/trl/pull/2082
- trl env for printing system info by @qgallouedec in https://github.com/huggingface/trl/pull/2104
- BCOTrainer conversational dataset support by @qgallouedec in https://github.com/huggingface/trl/pull/2107
- max_length from RewardDataCollatorWithPadding by @qgallouedec in https://github.com/huggingface/trl/pull/2119
- training_step by @qgallouedec in https://github.com/huggingface/trl/pull/2117
- script_args by @qgallouedec in https://github.com/huggingface/trl/pull/2130
- WinRateCallback table by @lewtun in https://github.com/huggingface/trl/pull/2134
- dpo_visual.py example to dpo_vlm.py by @qgallouedec in https://github.com/huggingface/trl/pull/2139
- eval_strategy="steps" when no eval dataset by @qgallouedec in https://github.com/huggingface/trl/pull/2152
- DPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2131
- dataset_text_field to "text" by @qgallouedec in https://github.com/huggingface/trl/pull/2078
- CPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2144
- tokenizer to processing_class by @qgallouedec in https://github.com/huggingface/trl/pull/2162
- "unsloth" tag by @qgallouedec in https://github.com/huggingface/trl/pull/2173
- skip_prompt=True in TextIteratorStreamer by @qgallouedec in https://github.com/huggingface/trl/pull/2193
- decoder_input_ids in DPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2208
- "none" in GKD test by @qgallouedec in https://github.com/huggingface/trl/pull/2214
- trl env report all cuda devices by @qgallouedec in https://github.com/huggingface/trl/pull/2216
- ORPOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2184
- PPOv2 -> PPO by @qgallouedec in https://github.com/huggingface/trl/pull/2174
- ScriptArguments by @qgallouedec in https://github.com/huggingface/trl/pull/2145
- ScriptArguments warning messages by @sergiopaniego in https://github.com/huggingface/trl/pull/2230
- remove_unused_columns by @qgallouedec in https://github.com/huggingface/trl/pull/2233
- get_batch_sample and add num_items_in_batch to compute_loss by @qgallouedec in https://github.com/huggingface/trl/pull/2246
- processing_class instead of tokenizer in LogCompletionsCallback by @qgallouedec in https://github.com/huggingface/trl/pull/2261
- KTOTrainer by @qgallouedec in https://github.com/huggingface/trl/pull/2248
- max_new_tokens by @qgallouedec in https://github.com/huggingface/trl/pull/2272
- log_reports.py for Improved Logging, File Processing, and Slack Payload Handling by @Mefisto04 in https://github.com/huggingface/trl/pull/2249
- eval_dataset in to trainers when no eval strategy by @qgallouedec in https://github.com/huggingface/trl/pull/2270
- _save_checkpoint for online methods by @qgallouedec in https://github.com/huggingface/trl/pull/2288
- optimizer_cls_and_kwargs attribute to PPO and RLOO by @qgallouedec in https://github.com/huggingface/trl/pull/2302

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.0...v0.12.0