Previously, vLLM could only be used on a single dedicated GPU, which ruled out both the scalability benefits of vLLM and multi-node training. This limitation has now been removed!
GRPO can now scale efficiently to models exceeding 70B parameters, supporting multi-node training with super-fast performance.
To take advantage of this, simply launch a vLLM server using the following command:

```
trl vllm-serve --model <model_name> --tensor_parallel_size <tp_size>
```
Then, start GRPO training with `use_vllm=True`.
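A minimal sketch of the training side, assuming the server launched above is reachable (the `output_dir` value is just a placeholder):

```python
from trl import GRPOConfig

# Generation is offloaded to the vLLM server started with `trl vllm-serve`,
# instead of running an in-process engine on a dedicated GPU.
training_args = GRPOConfig(output_dir="grpo-output", use_vllm=True)
```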
Below is a comparison of GRPO throughput with and without vLLM, across different TP values and model sizes.
@binary-husky and @qgallouedec in https://github.com/huggingface/trl/pull/3094
This release introduces the multi-step trick, which allows for the reuse of generated data across multiple steps, speeding up the training process.
To support this, we've implemented importance sampling and clipping logic. This enhancement should lead to significant improvements in training speed.
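The core idea, as a minimal sketch (variable names are assumptions, not TRL's internal code):

```python
import torch

def clipped_policy_loss(per_token_logps, old_per_token_logps, advantages, epsilon=0.2):
    """PPO-style clipped importance-sampling loss (illustrative only)."""
    # Ratio between the current policy and the policy that generated the data.
    ratio = torch.exp(per_token_logps - old_per_token_logps)
    # Clipping keeps updates from reused (now slightly stale) data in check.
    clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon)
    return -torch.min(ratio * advantages, clipped_ratio * advantages)
```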
<img width="1097" alt="Screenshot 2025-03-23 at 14 52 28" src="https://github.com/user-attachments/assets/8f1ee339-63c5-43cf-9b0f-5395432513ae" />

To use it, simply set `num_iterations` to a value greater than 1:

```python
training_args = GRPOConfig(..., num_iterations=4)
```
by @qgallouedec in https://github.com/huggingface/trl/pull/2899
As demonstrated in Dr. GRPO, sequence-level normalization can introduce a response-level length bias: when each sequence's loss is averaged over its own length, every token of a short completion contributes far more to the gradient than a token of a long one.
To address this, we now normalize the loss by the total number of tokens in the batch, ensuring more consistent and unbiased training.
```diff
- loss = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
+ loss = (per_token_loss * completion_mask).sum() / completion_mask.sum()
```
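A toy illustration of the bias, with made-up numbers:

```python
import torch

# Toy batch: a 2-token completion with per-token loss 1.0 and an
# 8-token completion with per-token loss 0.0.
per_token_loss = torch.tensor([[1.0, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                               [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]])
completion_mask = torch.tensor([[1, 1, 0, 0, 0, 0, 0, 0],
                                [1, 1, 1, 1, 1, 1, 1, 1]], dtype=torch.float32)

# Old sequence-level normalization: the short completion dominates.
seq_norm = ((per_token_loss * completion_mask).sum(dim=1) / completion_mask.sum(dim=1)).mean()
print(seq_norm)  # tensor(0.5000)

# New token-level normalization: every token carries equal weight.
tok_norm = (per_token_loss * completion_mask).sum() / completion_mask.sum()
print(tok_norm)  # tensor(0.2000)
```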
by @edbeeching in https://github.com/huggingface/trl/pull/2881
As demonstrated in Dr. GRPO, scaling rewards by their per-question standard deviation can introduce a question-level difficulty bias: questions with low reward variance (the very easy and the very hard ones) get disproportionately amplified advantages. To address this, we have added an option to disable reward scaling in GRPO.
```python
training_args = GRPOConfig(..., scale_rewards=False)
```
```diff
  advantages = rewards - mean_grouped_rewards
- advantages = advantages / std_grouped_rewards
+ if self.args.scale_rewards:
+     advantages = advantages / std_grouped_rewards
```
It's likely that `scale_rewards=False` will become the default behavior in a future release.
by @qgallouedec in https://github.com/huggingface/trl/pull/3135
When optimizing across multiple domains, not all reward functions are relevant for every sample. For example, a math verifier's reward does not apply to grammar samples, and a grammar verifier's reward does not apply to math samples.
It is now possible to return `None` for rewards that do not make sense for a given sample. For instance, when the domain is specified in a dataset column named `domain`, you can implement the reward as follows:
```python
def math_reward(completions, domain, **kwargs):
    rewards = []
    for completion, dom in zip(completions, domain):
        if dom == "math":
            rewards.append(verify(completion))  # your math verifier
        else:
            rewards.append(None)  # reward not applicable to this sample
    return rewards
```
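The reward functions can then be passed to the trainer together. A sketch, where `grammar_reward`, the model name, and the dataset are placeholders:

```python
from trl import GRPOTrainer

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",          # placeholder model
    reward_funcs=[math_reward, grammar_reward],  # grammar_reward defined analogously
    args=training_args,
    train_dataset=dataset,                       # must contain a "domain" column
)
```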
This allows for more domain-specific reward handling, ensuring that irrelevant rewards are ignored and don't interfere with optimization.
by @shirinyamani in https://github.com/huggingface/trl/pull/3079
It has been observed that not minimizing the KL divergence between the trained model and the reference model can still yield good results, while significantly reducing memory usage and compute: there is no need to keep the reference model in memory or to run a forward pass for it.
When `beta` is set to `0.0`, the reference model is not loaded and the KL divergence is not computed, saving both time and memory.
```python
training_args = GRPOConfig(..., beta=0.0)
```
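For reference, the per-token KL penalty that gets skipped is the k3 estimator from the GRPO paper. A sketch, under assumed variable names:

```python
import torch

def per_token_kl(ref_logps: torch.Tensor, logps: torch.Tensor) -> torch.Tensor:
    """k3 KL estimator; with beta=0.0 this (and the ref forward pass) is skipped."""
    log_ratio = ref_logps - logps
    return torch.exp(log_ratio) - log_ratio - 1
```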
by @ingambe in https://github.com/huggingface/trl/pull/2806
Padding-free batching is an alternative approach to packing for reducing memory usage. In this method, a batch is first sampled and then flattened into a single sequence, avoiding padding. Unlike packing, which can result in incomplete sequences by combining parts of different samples, padding-free batching ensures that all sequences remain complete and intact.

To enable padding-free batching in SFT, simply set `padding_free=True` in the `SFTConfig`, and make sure to use `flash_attention_2` as the attention implementation:

```python
training_args = SFTConfig(..., padding_free=True, model_init_kwargs={"attn_implementation": "flash_attention_2"})
```
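An illustration of the difference, with made-up token ids:

```python
# Two samples of lengths 3 and 2.
batch = [[5, 6, 7], [8, 9]]

# Regular padding (PAD = 0):
#   input_ids    = [[5, 6, 7],
#                   [8, 9, 0]]
#
# Padding-free: the batch is flattened into a single row, and
# position_ids restart at 0 for each sample so flash attention
# can keep the sequences separate without any padding tokens:
#   input_ids    = [[5, 6, 7, 8, 9]]
#   position_ids = [[0, 1, 2, 0, 1]]
```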
by @qgallouedec in https://github.com/huggingface/trl/pull/3076
As outlined in the DAPO paper, increasing the upper clipping bound leads to higher entropy during generation, promoting better exploration. To enable this, we've added support for setting the upper bound `epsilon_high` directly in the default GRPO trainer:

```python
training_args = GRPOConfig(epsilon_high=0.28)
```
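In terms of the clipped objective sketched earlier, this makes the clip range asymmetric (variable names are again assumptions):

```python
# The lower bound stays at 1 - epsilon, while the upper bound is raised to
# 1 + epsilon_high, letting token probabilities grow more before clipping.
clipped_ratio = torch.clamp(ratio, 1 - epsilon, 1 + epsilon_high)
```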
by @shirinyamani in https://github.com/huggingface/trl/pull/3118
Other changes and fixes in this release:

- `unwrap_model_for_generation` for DeepSpeed Stage-3 compatibility by @kiddj in https://github.com/huggingface/trl/pull/2871
- `inference_mode` to `no_grad` when computing `old_per_token_logps` by @qgallouedec in https://github.com/huggingface/trl/pull/2987
- `maybe_convert_to_chatml` map for conversational datasets in SFT by @kashif in https://github.com/huggingface/trl/pull/2862
- `max_seq_length` to `max_length` by @qgallouedec in https://github.com/huggingface/trl/pull/2895
- `max_seq_length` to `max_length` in `SFTConfig` by @qgallouedec in https://github.com/huggingface/trl/pull/2947
- `num_tokens` and some logging fixes by @qgallouedec in https://github.com/huggingface/trl/pull/3006
- `max_tool_response` by @shenxiangzhuang in https://github.com/huggingface/trl/pull/2921
- `enable_prefix_caching` by @ji-huazhong in https://github.com/huggingface/trl/pull/2900
- `num_iterations` by @qgallouedec in https://github.com/huggingface/trl/pull/2966
- `KTOConfig` by @sileod in https://github.com/huggingface/trl/pull/2912
- `apply_chat_template` by @jamesbraza in https://github.com/huggingface/trl/pull/3005
- `deepspeed>=0.16.4`'s rename by @jamesbraza in https://github.com/huggingface/trl/pull/2963
- `GPROTrainer.generation_config` by @jamesbraza in https://github.com/huggingface/trl/pull/3046
- `SFTTrainer.compute_loss` crash with accelerate by @jamesbraza in https://github.com/huggingface/trl/pull/3048
- `SFTTrainer.compute_loss` hang from #3048's PR comments by @jamesbraza in https://github.com/huggingface/trl/pull/3056

Full Changelog: https://github.com/huggingface/trl/compare/v0.15.0...v0.16.0