TRL moved toward production-grade reinforcement learning with v1.0.0, marking the transition from a prototyping framework to a deployable training system. The headline feature, asynchronous GRPO, decouples generation from gradient updates by offloading rollouts to external vLLM servers, so generation and training run in parallel and GPUs no longer sit idle between steps. The release also introduced VESPO (Variational Sequence-Level Soft Policy Optimization), which replaces heuristic token-level clipping with a principled variational framework: a smooth, sequence-level Gamma weighting function that stabilizes off-policy training against policy staleness and asynchronous updates. Sketches of both pieces follow.
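A rough sketch of the server-mode setup, assuming the `trl vllm-serve` CLI and the `use_vllm`/`vllm_mode` fields on `GRPOConfig` from recent TRL releases; the exact knobs for the new async rollout scheduling may differ:

```python
# Terminal 1: launch a standalone vLLM server to handle rollout generation.
#   trl vllm-serve --model Qwen/Qwen2.5-1.5B-Instruct

# Terminal 2: the trainer connects to that server instead of generating
# locally, so gradient updates can proceed while rollouts are produced.
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

dataset = load_dataset("trl-lib/tldr", split="train")

def reward_len(completions, **kwargs):
    # Toy reward: prefer completions close to 50 characters.
    return [-abs(50 - len(c)) for c in completions]

config = GRPOConfig(
    output_dir="qwen-grpo-async",
    use_vllm=True,       # offload generation to vLLM
    vllm_mode="server",  # talk to the external server rather than colocating
)
trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-1.5B-Instruct",
    reward_funcs=reward_len,
    args=config,
    train_dataset=dataset,
)
trainer.train()
```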
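And a toy contrast between token-level clipping and sequence-level soft weighting. The `weight` below is a placeholder for illustration only; VESPO derives its actual Gamma weighting from a variational objective, not this ad hoc decay:

```python
import torch

# Per-token log-probs under the current policy and a stale behavior policy.
logp_new = torch.randn(4, 16)                  # [batch, seq_len]
logp_old = logp_new + 0.1 * torch.randn(4, 16)
advantages = torch.randn(4)                    # one advantage per sequence

# Token-level PPO-style clipping: hard-clips each token's importance ratio,
# zeroing gradients wherever the ratio leaves the trust region.
ratio = (logp_new - logp_old).exp()
loss_clip = -torch.min(
    ratio * advantages[:, None],
    ratio.clamp(0.8, 1.2) * advantages[:, None],
).mean()

# Sequence-level soft weighting: a single smooth, bounded weight per
# sequence that decays as the sample drifts off-policy, instead of a
# hard per-token clip. (Placeholder weighting, not the VESPO formula.)
log_ratio_seq = (logp_new - logp_old).sum(dim=-1)  # [batch]
weight = torch.exp(-log_ratio_seq.abs()).detach()
loss_soft = -(weight * advantages * logp_new.sum(dim=-1)).mean()
```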
Earlier releases hardened the foundation: async reward functions parallelized across GRPO and RLOO, vLLM 0.12.0 compatibility, tool-calling support for agent training, and memory optimizations such as forward-masked logits, which cut VRAM usage by up to 50 percent during forward passes. v0.29.1 also fixed multimodal token handling across SFT, GRPO, and RLOO, and decoupled rollout dispatch from the vLLM backend to improve compatibility across vLLM versions.
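For the async reward path, the point is that slow rewards (for example, a remote judge) can be scored concurrently instead of one completion at a time. A minimal sketch, assuming reward functions can be coroutines that the trainer awaits; the endpoint and helper names here are hypothetical:

```python
import asyncio
import aiohttp  # any awaitable scorer works; an HTTP judge is just an example

# Assumption: with async reward support, a coroutine reward function is
# awaited by the trainer, so network round-trips overlap across completions.
async def remote_judge_reward(completions, **kwargs):
    async with aiohttp.ClientSession() as session:
        async def score(text):
            # Hypothetical judge service; replace with a real scorer.
            async with session.post(
                "http://localhost:8500/score", json={"text": text}
            ) as resp:
                return (await resp.json())["score"]
        # Fire all requests at once and await the whole batch.
        return list(await asyncio.gather(*(score(c) for c in completions)))
```

Passed as `reward_funcs=remote_judge_reward`, a group of G completions then costs roughly one round-trip of latency rather than G sequential ones.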