SFT now supports Context Parallelism (CP) for training large language models on very long sequences, making arbitrarily long sequence lengths feasible.
<img width="844" height="336" alt="Screenshot 2025-09-09 at 10 39 30 PM" src="https://github.com/user-attachments/assets/f1dfc349-440a-4e05-aac9-439a3c286f08" />

by @kashif in https://github.com/huggingface/trl/pull/3994
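A minimal sketch of what this enables (the config values below are hypothetical, and the CP degree itself is configured on the Accelerate launcher side, not in `SFTConfig`):

```python
from trl import SFTConfig

# Hypothetical illustration: with CP, each sequence is sharded across GPUs,
# so context lengths far beyond a single device's memory become trainable.
training_args = SFTConfig(
    max_length=131072,  # example long-context value, not a recommendation
    ...
)
```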
Dynamic Fine-Tuning (DFT) is now supported in TRL.
```python
from trl import SFTConfig

training_args = SFTConfig(
    loss_type="dft",
    ...
)
```
<img width="692" height="472" alt="Screenshot 2025-09-09 at 10 37 36โฏPM" src="https://github.com/user-attachments/assets/4ee2b4ab-7cc6-4578-bfac-c38124891510" />
by @qgallouedec in https://github.com/huggingface/trl/pull/4042
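To build intuition for what the DFT loss does, here is a minimal NumPy sketch, under the assumption that it follows the paper's formulation: each token's cross-entropy is scaled by the model's own (detached) probability of the target token.

```python
import numpy as np

def dft_token_loss(logits, target):
    """Sketch of a DFT-style per-token loss: cross-entropy scaled by the
    model's probability of the target token. In a real implementation the
    scaling factor is detached so no gradient flows through it."""
    logits = logits - logits.max()               # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    ce = -np.log(probs[target])                  # standard cross-entropy
    return probs[target] * ce                    # DFT: p(y) acts as a weight

# Relative to plain CE, very unlikely targets contribute almost nothing,
# since their tiny probability suppresses their (large) CE term.
loss_confident = dft_token_loss(np.array([5.0, 0.0, 0.0]), 0)
loss_uncertain = dft_token_loss(np.array([0.1, 0.0, 0.0]), 0)
```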
Different implementations are used for rollout generation (vLLM) and model training. This implementation gap implicitly turns on-policy RL into off-policy RL. Truncated Importance Sampling (TIS) is a simple yet effective importance-sampling technique for handling this discrepancy, and it is now implemented in GRPO.
```python
from trl import GRPOConfig

training_args = GRPOConfig(
    ...,
    use_vllm=True,
    vllm_importance_sampling_correction=True,  # default True
    vllm_importance_sampling_cap=2.0,  # hyper-parameter C
)
```
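The core idea can be sketched in a few lines of NumPy (this illustrates the technique itself, not TRL's internal implementation):

```python
import numpy as np

def truncated_importance_weight(logp_train, logp_rollout, cap=2.0):
    """Sketch of TIS: per-token importance ratio between the training
    policy and the (vLLM) rollout policy, truncated at a cap C to keep
    the variance of the correction bounded."""
    ratio = np.exp(logp_train - logp_rollout)
    return np.minimum(ratio, cap)

# Tokens where the training policy is much more likely than the rollout
# policy are clipped at C; otherwise the raw ratio is used.
weights = truncated_importance_weight(
    logp_train=np.array([-1.0, -0.5, -2.0]),
    logp_rollout=np.array([-1.0, -2.0, -1.0]),
    cap=2.0,
)
```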
by @LeonEricsson in https://github.com/huggingface/trl/pull/3867
Mixture of Experts (MoE) models require an auxiliary loss to ensure that the different experts are used evenly. This auxiliary loss is now supported in SFTTrainer.
```python
from trl import SFTConfig

training_args = SFTConfig(
    model_init_kwargs={"output_router_logits": True},
    ...
)
```
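The exact auxiliary loss depends on the model architecture, but a common Switch-Transformer-style load-balancing loss can be sketched as follows (illustrative only, not TRL's implementation):

```python
import numpy as np

def load_balancing_loss(router_probs, expert_assignment, num_experts):
    """Sketch of a Switch-Transformer-style auxiliary loss.
    router_probs: (tokens, experts) softmax outputs of the router.
    expert_assignment: (tokens,) index of the expert each token was routed to.
    The loss is minimized when tokens are spread evenly across experts."""
    # fraction of tokens dispatched to each expert
    f = np.bincount(expert_assignment, minlength=num_experts) / len(expert_assignment)
    # mean router probability assigned to each expert
    p = router_probs.mean(axis=0)
    return num_experts * np.sum(f * p)
```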
by @pramodith in https://github.com/huggingface/trl/pull/4012
When running GRPO (or RLOO) with vLLM in colocated mode, the vLLM server consumes VRAM during optimization while not being used. There is now an option to put the vLLM server to sleep during optimization to free up that VRAM.
```python
from trl import GRPOConfig

training_args = GRPOConfig(..., vllm_sleep_enabled=True)
```
by @edbeeching in https://github.com/huggingface/trl/pull/3968
You can now use vLLM server mode with OnlineDPOTrainer. Additionally, vision-language models (VLMs) are now supported.
by @vaelev in https://github.com/huggingface/trl/pull/3783
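A hypothetical sketch of how this fits together, assuming the parameter names mirror the GRPO equivalents (`use_vllm` / `vllm_mode`); check the OnlineDPOConfig documentation for the exact API:

```python
# Start a vLLM server in a separate process first, e.g.:
#   trl vllm-serve --model <model_name>
from trl import OnlineDPOConfig

# Hypothetical: parameter names assumed to match the GRPO equivalents.
training_args = OnlineDPOConfig(
    ...,
    use_vllm=True,
    vllm_mode="server",
)
```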
The paper index has been significantly enhanced with the addition of 9+ new algorithm implementations, providing a more comprehensive resource for users.
by @behroozazarkhalili in https://github.com/huggingface/trl/pull/3990
- `get_soft_overlong_punishment` by @qgallouedec in https://github.com/huggingface/trl/pull/3972
- `args.gradient_checkpointing = False` instead of `args = dataclasses.replace(args, gradient_checkpointing=False)` by @qgallouedec in https://github.com/huggingface/trl/pull/3981
- `torch_dtype` to `dtype` everywhere by @sergiopaniego in https://github.com/huggingface/trl/pull/4000
- `average_tokens_across_devices` default replacement by @qgallouedec in https://github.com/huggingface/trl/pull/4039
- `AutoModelForImageTextToText` for DPO and Online DPO by @sergiopaniego in https://github.com/huggingface/trl/pull/4049
- `SFTTrainer` for compatibility with CP by @qgallouedec in https://github.com/huggingface/trl/pull/4038
- `quantization_config=None` by @qgallouedec in https://github.com/huggingface/trl/pull/4019

Full Changelog: https://github.com/huggingface/trl/compare/v0.22.0...v0.23.0