We are excited to introduce the v0.10.1 release, which ships many exciting new features and post-training algorithms. The highlights are as follows:
Online DPO is a new alignment method from DeepMind to boost the performance of LLMs. With Online DPO, data is generated on the fly by the trained model instead of being pre-collected. For each prompt, two completions are generated, and a reward model selects the preferred one. To train models with this method, use the `OnlineDPOTrainer`.
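The online data-generation step described above can be sketched in plain Python. The `generate` and `reward` functions below are toy stand-ins for the real language model and reward model (the actual `OnlineDPOTrainer` handles all of this internally):

```python
import random

def generate(model, prompt):
    # Stand-in for sampling a completion from the trained model.
    return f"{prompt} -> completion-{random.randint(0, 9)}"

def reward(completion):
    # Toy reward model: here, longer completions score higher.
    return len(completion)

def make_preference_pair(model, prompt):
    """Generate two completions on the fly and let the reward model
    pick the preferred (chosen) and dispreferred (rejected) one."""
    a, b = generate(model, prompt), generate(model, prompt)
    chosen, rejected = (a, b) if reward(a) >= reward(b) else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}

pair = make_preference_pair(model=None, prompt="Explain DPO")
print(sorted(pair))  # keys: chosen, prompt, rejected
```

Because the pairs come from the current model rather than a static dataset, the preference data tracks the model as it improves during training.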
The Liger Kernel (a collection of Triton kernels for LLM training) is now integrated with the `SFTTrainer` for faster throughput and lower memory usage. To use it, set `use_liger_kernel` in `SFTConfig`.

DPO now supports vision-language models. To get started, run the `dpo_visual.py` script as follows:

```shell
accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --trust_remote_code \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16
```
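For the `use_liger_kernel` option mentioned above, a minimal configuration sketch (assuming `trl` and the `liger-kernel` package are installed; the surrounding field names are illustrative and may differ by version):

```python
from trl import SFTConfig

# use_liger_kernel=True patches the model with Liger's fused Triton
# kernels for faster throughput and lower memory usage.
config = SFTConfig(
    output_dir="sft-liger",
    use_liger_kernel=True,
)
```

Pass this config as `args` to the `SFTTrainer` as usual; no other code changes are needed.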
A new `WinRateCallback` lets you track your model's win rate against a reference model during training:

```python
trainer = DPOTrainer(...)
win_rate_callback = WinRateCallback(..., trainer=trainer)
trainer.add_callback(win_rate_callback)
```
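Conceptually, a win rate is just the fraction of prompts on which a judge prefers the model's completion over the reference model's. A toy sketch of that computation (the `judge` callable here is a stand-in, not the trl API):

```python
def win_rate(model_completions, ref_completions, judge):
    """Fraction of pairs where the judge prefers the model's completion."""
    wins = sum(judge(m, r) for m, r in zip(model_completions, ref_completions))
    return wins / len(model_completions)

# Toy judge: prefers the longer completion (True means the model wins).
longer_is_better = lambda m, r: len(m) > len(r)

rate = win_rate(
    ["a detailed answer", "short"],
    ["ok", "a very long reference answer"],
    longer_is_better,
)
print(rate)  # -> 0.5
```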

The DPO trainer now supports two new loss types: `apo_zero` and `apo_down`. The `apo_zero` loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. In contrast, `apo_down` decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs; this variant is more effective when the model is better than the winning outputs. To use these losses, set `loss_type="apo_zero"` or `loss_type="apo_down"` in the `DPOConfig`.

What's Changed
- `model_ref` naming by @qgallouedec in https://github.com/huggingface/trl/pull/1835
- `CI_HUB_USER_TOKEN` by @qgallouedec in https://github.com/huggingface/trl/pull/1852
- `setup_chat_format` by @Rishav-hub in https://github.com/huggingface/trl/pull/1862
- `evaluation_strategy` -> `eval_strategy` by @qgallouedec in https://github.com/huggingface/trl/pull/1894
- `setUpClass` in reward tester by @qgallouedec in https://github.com/huggingface/trl/pull/1895
- `IterableDataset` for `SFTTrainer` by @qgallouedec in https://github.com/huggingface/trl/pull/1899
- `AlignPropTrainer` import by @qgallouedec in https://github.com/huggingface/trl/pull/1908
- `lr_scheduler.step()` after `optimizer.step()` by @qgallouedec in https://github.com/huggingface/trl/pull/1918
- `torch.cuda.amp.autocast()` -> `torch.amp.autocast("cuda")` by @qgallouedec in https://github.com/huggingface/trl/pull/1921
- `dataset_num_proc` usage by @qgallouedec in https://github.com/huggingface/trl/pull/1925
- `PartialState().local_main_process_first()` when map in examples by @qgallouedec in https://github.com/huggingface/trl/pull/1926
- `push_to_hub` by @qgallouedec in https://github.com/huggingface/trl/pull/1945
- `padding_free` to `DataCollatorForCompletionOnlyLM` by @RhuiDih in https://github.com/huggingface/trl/pull/1887
- `PairRMJudge` to top-level import by @qgallouedec in https://github.com/huggingface/trl/pull/1985
- `torch.load` with `weights_only=True` by @qgallouedec in https://github.com/huggingface/trl/pull/1988

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.10
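The two APO loss variants described above can be sketched numerically. This toy implementation in plain Python reflects the commonly stated Anchored Preference Optimization formulas; treat the exact expressions as an assumption rather than a copy of the trl internals:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apo_zero_loss(chosen_logratio, rejected_logratio, beta=0.1):
    # Push the chosen likelihood up and the rejected likelihood down.
    return (1 - sigmoid(beta * chosen_logratio)) + sigmoid(beta * rejected_logratio)

def apo_down_loss(chosen_logratio, rejected_logratio, beta=0.1):
    # Push both down, but penalize the rejected output more via the margin term.
    return sigmoid(beta * chosen_logratio) + (
        1 - sigmoid(beta * (chosen_logratio - rejected_logratio))
    )

# Widening the chosen/rejected margin lowers the apo_zero loss.
print(apo_zero_loss(2.0, -2.0) < apo_zero_loss(0.0, 0.0))  # -> True
```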