releases.sh preview

TRL

$ npx -y @buildinternet/releases show trl
Releases: 10 · Avg: 3/mo · Versions: v0.27.0 → v1.2.0
Oct 10, 2024

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.2...v0.11.3

Oct 7, 2024

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.1...v0.11.2

Sep 24, 2024

Bug fix

  • allow parse-args as list of floats for Online DPO, XPO and Nash-MD configs by @kashif in #2108

Full Changelog: https://github.com/huggingface/trl/compare/v0.11.0...v0.11.1

Sep 19, 2024

We are excited to introduce the new v0.11.0 release, with many new features and post-training algorithms. The highlights are as follows:

New post-training methods

Generalized Knowledge Distillation

<img width="992" alt="Screenshot 2024-09-19 at 10 01 02" src="https://github.com/user-attachments/assets/97afd65d-1a2c-484b-b6dd-b02a2cbe6430">

Generalized Knowledge Distillation (GKD) is a post-training method from Google DeepMind that extends standard knowledge distillation by allowing the student to generate outputs during training and receive online feedback from the teacher. It consistently outperforms SFT and in some cases enables the student model to match the performance of the teacher, but with far fewer parameters.

To train models with this method, check out the GKDTrainer.
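The objective GKD optimizes can be sketched on toy next-token distributions. This is a didactic rendering of the generalized Jensen-Shannon divergence from the GKD paper, not the library code (the GKDTrainer itself works on model logits):

```python
import math

def kl(p, q):
    """KL divergence KL(p || q) between two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def generalized_jsd(p_teacher, p_student, beta=0.5):
    """Generalized Jensen-Shannon divergence used as the GKD objective:
    beta * KL(P || M) + (1 - beta) * KL(Q || M), with M = beta*P + (1-beta)*Q.
    beta = 0.5 recovers the standard symmetric JSD; other values trade off
    mode-covering vs. mode-seeking behaviour between teacher and student."""
    m = [beta * pt + (1 - beta) * ps for pt, ps in zip(p_teacher, p_student)]
    return beta * kl(p_teacher, m) + (1 - beta) * kl(p_student, m)
```

Because the student also generates its own samples during training, this divergence is evaluated on on-policy student outputs, which is what distinguishes GKD from standard distillation on a fixed dataset.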

Exploratory Preference Optimization

<img width="1224" alt="Screenshot 2024-09-19 at 10 13 27" src="https://github.com/user-attachments/assets/36decb24-ef01-41f1-84e8-53b491eb6c86">

Exploratory Preference Optimization is an online post-training method from researchers at Microsoft, MIT, and Wisconsin that extends DPO to incorporate online feedback from reward models or LLM judges. It is similar to online DPO, but has a slightly different theoretical basis concerning sample efficiency.

To train models with this method, check out the XPOTrainer.

Nash Learning with Human Feedback

<img width="476" alt="Screenshot 2024-09-19 at 10 32 04" src="https://github.com/user-attachments/assets/8e68263f-bf5a-4f68-b451-110c78e27bb6">

Nash Learning with Human Feedback is a novel post-training method from Google DeepMind that uses pairwise preference models which are conditioned on two inputs, instead of the single one used in reward models. These preference models are then used to train a policy that consistently produces responses that are preferred over those from competing policies, thus approximating a Nash equilibrium (i.e. a two player game where actions are responses and payoffs are given by the preference model).

To train models with this method, check out the NashMDTrainer.
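The two-player game in parentheses above can be made concrete with a toy pairwise preference model. This dictionary-based sketch is purely illustrative (a real preference model is a learned network conditioned on both responses):

```python
def expected_preference(policy_a, policy_b, pref):
    """Payoff of the two-player game: the probability that a response sampled
    from policy_a is preferred to one sampled from policy_b, under a pairwise
    preference model pref[(y1, y2)] = P(y1 beats y2)."""
    return sum(
        pa * pb * pref[(ya, yb)]
        for ya, pa in policy_a.items()
        for yb, pb in policy_b.items()
    )
```

At a Nash equilibrium of this symmetric game, no alternative policy achieves an expected preference above 1/2 against the equilibrium policy.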

New trainer features

Deprecations 🚨

  • The PPOTrainer is deprecated in favour of PPOv2Trainer to provide a consistent API across TRL's trainers. It will be removed in v0.12.0. By @qgallouedec in https://github.com/huggingface/trl/pull/2016
  • The RichProgressCallback has been removed from the example scripts as it caused a variety of problems with logging in distributed environments. You can still use it by adding it manually to the trainer callbacks. By @lewtun in https://github.com/huggingface/trl/pull/2053

Bugfixes and improvements

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.11.0

Aug 29, 2024

We are excited to introduce the new v0.10.1 release, with many new exciting features and post-training algorithms. The highlights are as follows:

Online DPO

<img width="1210" alt="Screenshot 2024-08-29 at 15 53 29" src="https://github.com/user-attachments/assets/c11863ca-434c-47d7-8436-dc096683075a">

Online DPO is a new alignment method from DeepMind to boost the performance of LLMs. With Online DPO, data is generated on the fly by the trained model (instead of pre-collected). For each prompt, two completions are generated, with a reward model selecting the preferred one. This approach:

  • Eliminates the need for a pre-collected preference dataset (it's generated online)
  • Enables continuous model improvement
  • Yields better results than traditional DPO

To train models with this method, use the OnlineDPOTrainer.
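The data flow described above can be sketched with toy stand-ins for the policy and the reward model (toy_generate and toy_reward are hypothetical placeholders for illustration, not TRL APIs):

```python
import random

CANDIDATES = ["good answer", "ok answer", "bad answer"]

def toy_generate(prompt, rng):
    # stand-in for sampling a completion from the policy being trained
    return prompt + " " + rng.choice(CANDIDATES)

def toy_reward(completion):
    # stand-in for a reward model score; higher is better
    scores = {"good answer": 2.0, "ok answer": 1.0, "bad answer": 0.0}
    return next(v for k, v in scores.items() if completion.endswith(k))

def online_dpo_pair(prompt, rng):
    """Online DPO data flow: generate two completions on the fly for each
    prompt, then let the reward model pick the preferred one."""
    a, b = toy_generate(prompt, rng), toy_generate(prompt, rng)
    chosen, rejected = (a, b) if toy_reward(a) >= toy_reward(b) else (b, a)
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```

Each resulting (chosen, rejected) pair feeds the DPO loss, which is why no pre-collected preference dataset is needed.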

Liger Triton kernels for supercharged SFT

  • We've integrated LinkedIn's Liger Triton kernels into the SFTTrainer for faster throughput and lower memory usage. To use them, set use_liger_kernel in SFTConfig.

DPO for VLMs

  • We've added support for aligning vision-language models with DPO, covering the LLaVA-1.5, PaliGemma, and Idefics2 architectures. To train VLMs with DPO, use the dpo_visual.py script as follows:
accelerate launch examples/scripts/dpo_visual.py \
    --dataset_name HuggingFaceH4/rlaif-v_formatted \
    --model_name_or_path google/paligemma-3b-pt-224 \
    --trust_remote_code \
    --per_device_train_batch_size 1 \
    --gradient_accumulation_steps 8 \
    --output_dir dpo_paligemma_rlaif-v \
    --bf16 \
    --torch_dtype bfloat16

WinRate callback for LLM as a judge

  • We've added support for computing win rates over the reference model for methods like DPO. To do so, configure the callback to point to the LLM-as-judge API (OpenAI or Hugging Face Inference API) and then add:
trainer = DPOTrainer(...)
win_rate_callback = WinRateCallback(..., trainer=trainer)
trainer.add_callback(win_rate_callback)
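After training, the judge's verdicts reduce to a single statistic. A minimal sketch of the computation (counting ties as half a win is an assumed convention here, not necessarily the callback's):

```python
def win_rate(outcomes):
    """Reduce LLM-judge verdicts to a win rate. `outcomes` holds one
    "win"/"loss"/"tie" verdict per prompt, comparing the policy's response
    against the reference model's; ties count as half a win."""
    if not outcomes:
        return 0.0
    score = sum(1.0 if o == "win" else 0.5 if o == "tie" else 0.0
                for o in outcomes)
    return score / len(outcomes)
```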

Anchored Preference Optimisation (APO) for fine-grained human/AI feedback

  • Added the APO method, which is an "anchored" version of the alignment objective. There are two variants: apo_zero and apo_down. The apo_zero loss increases the likelihood of winning outputs while decreasing the likelihood of losing outputs, making it suitable when the model is less performant than the winning outputs. On the other hand, apo_down decreases the likelihood of both winning and losing outputs, but with a stronger emphasis on reducing the likelihood of losing outputs. This variant is more effective when the model is better than the winning outputs. To use these losses, set loss_type="apo_zero" or loss_type="apo_down" in the DPOConfig
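Following the description above, the two variants can be written as a plain-Python sketch. This is a didactic rendering of the losses, not the library code, and the beta value is illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def apo_zero_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """apo_zero: push the winning output's policy/reference log-ratio up
    (first term -> 0) and the losing output's log-ratio down (second term -> 0)."""
    return (1 - sigmoid(beta * chosen_logratio)) + sigmoid(beta * rejected_logratio)

def apo_down_loss(chosen_logratio, rejected_logratio, beta=0.1):
    """apo_down: decrease the likelihood of both outputs while keeping a large
    margin between the winning and losing outputs."""
    return sigmoid(beta * chosen_logratio) + (
        1 - sigmoid(beta * (chosen_logratio - rejected_logratio))
    )
```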

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.6...v0.10

Jul 8, 2024
v0.9.6 release

We are excited to introduce the new v0.9.6 release, with many new features and algorithms. The highlights are as follows:

  • Support for SimPO by @fe1ixxu, a reference-free method that also regularizes output length. To use this loss, set loss_type="simpo" and cpo_alpha=0 in the CPOConfig and train with the CPOTrainer.

    <img width="880" alt="image" src="https://github.com/huggingface/trl/assets/5555347/87551147-3f58-4c6a-9a78-70b513dea76e">
  • Added AlignProp by @mihirp1998, a method for fine-tuning Stable Diffusion models using reward gradients.

  • Added Efficient Exact Optimization (EXO) by @haozheji
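The SimPO objective in the first bullet is reference-free and normalizes rewards by output length. A minimal sketch of the published objective, with illustrative beta and gamma values:

```python
import math

def simpo_loss(logp_chosen, len_chosen, logp_rejected, len_rejected,
               beta=2.0, gamma=0.5):
    """SimPO: reference-free preference loss with length-normalized rewards
    r(y) = (beta / |y|) * log pi(y | x) and a target reward margin gamma:
        loss = -log sigmoid(r_w - r_l - gamma)
    No reference model is needed, unlike DPO."""
    r_w = beta * logp_chosen / len_chosen
    r_l = beta * logp_rejected / len_rejected
    return -math.log(1.0 / (1.0 + math.exp(-(r_w - r_l - gamma))))
```

Dividing by length keeps the implicit reward comparable across short and long completions, which is the regularization the bullet refers to.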

We also included many important fixes and improvements such as fixing prints in the CLI with GCP containers by @alvarobartt. Enjoy the release!

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.4...v0.9.6

Jun 6, 2024

Mainly backward compatibility fixes with SFTTrainer.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.9.3...v0.9.4

Jun 5, 2024
v0.9.3 RLOO / PPOv2 Trainer, RM Visualization

We are excited to introduce the new v0.9.3 release, with many new features and algorithms. The highlights are as follows:

  1. RLOO Trainer: RLOO (REINFORCE Leave-One-Out) is a new online RL algorithm for RLHF, proposed by Ahmadian et al. from Cohere. Check out our docs here to get started.
  2. PPOv2 Trainer: We are introducing a new experimental PPOv2 trainer which is more closely aligned with OpenAI's PPO implementation, based on https://arxiv.org/abs/2403.17031. Check out our docs here to get started.
  3. Reward model visualization: reward model training now includes visualization on the eval dataset, as shown below.

https://github.com/huggingface/trl/assets/5555347/6575a879-cb2f-4e2e-bb84-a76707f9de84

  4. New losses in the DPO Trainer: DPOTrainer now includes losses / support for Self-play Preference Optimization, Robust DPO, TR-DPO, Iterative Reasoning Preference Optimization, and Pairwise Noise Contrastive Alignment
  5. New losses in the KTO Trainer: KTOTrainer now includes the loss for Binary Classifier Optimization (BCO)
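The leave-one-out baseline at the heart of RLOO (item 1 above) can be sketched in a few lines: each of the k completions sampled per prompt is baselined against the mean reward of the other k-1, giving a variance-reduced REINFORCE advantage without a learned value function:

```python
def rloo_advantages(rewards):
    """RLOO advantages for k completions sampled from one prompt: each
    completion's baseline is the mean reward of the remaining k-1 samples,
    so the advantages always sum to zero."""
    k = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (k - 1) for r in rewards]
```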

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.6...v0.9.2

Apr 22, 2024
v0.8.6: Fixes for CLI

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.5...v0.8.6

Apr 18, 2024
v0.8.5: Important fixes for CLIs

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.4...v0.8.5

Apr 17, 2024
v0.8.4: CLI / CPO / KTO important fixes

This patch release includes important fixes for the CLI and the KTO & CPO trainers.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.3...v0.8.4

Apr 12, 2024
v0.8.3: Patch release for CLI

What's Changed

This is a patch release that includes an import fix for CLIs.

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.2...v0.8.3

Apr 11, 2024
v0.8.2: ORPO & CPO Trainer / Vision LLMs support for `SFTTrainer`, KTO fixes


This release includes two new trainers: ORPO from KAIST and CPO.
The release also adds support for vision LLMs such as LLaVA in SFTTrainer; see https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py for more details.

ORPO Trainer

CPO Trainer

VLLMs support for SFTTrainer

You can now use SFTTrainer to fine-tune vision LLMs such as LLaVA! See https://github.com/huggingface/trl/blob/main/examples/scripts/vsft_llava.py for more details.

KTO Fixes

Many fixes were introduced for the KTOTrainer:

10x PPO!

Other fixes

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.1...v0.8.2

Mar 20, 2024
v0.8.1: Patch release for CLIs

This patch release includes some important fixes for CLIs

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.8.0...v0.8.1

Mar 19, 2024
v0.8.0: KTOTrainer, TRL CLIs, QLoRA + FSDP!

New Trainer: KTOTrainer:

We recently introduced the KTOTrainer to run the KTO algorithm on LLMs!

TRL Command Line Interfaces (CLIs):

Run SFT, DPO and chat with your aligned model directly from the terminal:

SFT:

trl sft --model_name_or_path facebook/opt-125m --dataset_name imdb --output_dir opt-sft-imdb

DPO:

trl dpo --model_name_or_path facebook/opt-125m --dataset_name trl-internal-testing/Anthropic-hh-rlhf-processed --output_dir opt-sft-hh-rlhf 

Chat:

trl chat --model_name_or_path Qwen/Qwen1.5-0.5B-Chat

Read more about CLI in the relevant documentation section or use --help for more details.

FSDP + QLoRA:

SFTTrainer now supports FSDP + QLoRA

Other fixes

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.11...v0.8.0

Feb 16, 2024
v0.7.11: IPO & DPO fixes, faster data processing for multi-GPU, Automatic tagging for all models

DPO important fixes

We fixed issues with the IPO loss, leading to consistent results according to the latest experiments:

We also fixed important bugs affecting DPO with PEFT and Flash Attention.

Data processing is now faster for multi-GPU environments.

Other DPO bugfixes:

Faster data processing and other enhancements:

Automatic tagging for all models

Models now get tagged correctly even if users do not call trainer.push_to_hub().

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.10...v0.7.11

Jan 19, 2024
v0.7.10: Automatic templating, `setup_chat_format` API, stronger tests


This patch release adds a new feature in TRL for dealing with chat datasets: you can load a pre-formatted dataset directly, without needing to format it beforehand.

Read more about it here: https://huggingface.co/docs/trl/sft_trainer#dataset-format-support

The release also introduces a new API, setup_chat_format, to correctly resize the model embeddings to the target size when adding new tokens to comply with the chat format. Currently only the chatml format is supported; more formats may be added in the future.

Read more about it here: https://huggingface.co/docs/trl/sft_trainer#add-special-tokens-for-chat-format

We also extensively tested SFTTrainer and DPOTrainer, and the example scripts dpo.py and sft.py should be well battle-tested. If you see any issue with the scripts, please let us know on GitHub.

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.9...v0.7.10

Jan 9, 2024
v0.7.9: Patch release for DPO & SFTTrainer

This is a patch release that fixes critical issues with SFTTrainer & DPOTrainer, together with minor fixes for PPOTrainer and DataCollatorForCompletionOnlyLM.

What's Changed

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.8...v0.7.9

v0.7.8: Unsloth tag, DPO fixes, PEFT support for DDPO


Unsloth tag for xxxTrainer

If you use the Unsloth library, the unsloth tag is automatically pushed to the Hub.

DPO fixes

Some important fixes for DPO have been introduced to address https://twitter.com/jon_durbin/status/1743575483365699809 and to make DPO faster.

DDPO + PEFT

Now DDPO supports PEFT

Other fixes

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.7...v0.7.8

Dec 26, 2023

v0.7.7: Patch release PPO & DDPO tags

A fix has been introduced for a breaking change with PPOTrainer.push_to_hub() and DDPOTrainer.push_to_hub().

What's Changed

New Contributors

Full Changelog: https://github.com/huggingface/trl/compare/v0.7.6...v0.7.7

Latest: v1.2.0 · Tracking since: Jan 25, 2023 · Last fetched: Apr 19, 2026