Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
- URL: http://arxiv.org/abs/2510.04072v2
- Date: Wed, 08 Oct 2025 04:24:36 GMT
- Title: Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning
- Authors: Ziyan Wang, Zheng Wang, Jie Fu, Xingwei Qu, Qi Cheng, Shengpu Tang, Minjia Zhang, Xiaoming Huo
- Abstract summary: Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages. SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training.
- Score: 45.51804571136028
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs). Yet on-policy algorithms such as Group Relative Policy Optimization (GRPO) often suffer in early training: noisy gradients from low-quality rollouts lead to unstable updates and inefficient exploration. We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework that addresses these limitations by decomposing each step into three stages: a short fast trajectory of inner steps on the same batch, a reposition mechanism to control off-policy drift, and a final slow correction. This reposition-before-update design leaves the objective and rollout process unchanged, making SFPO plug-compatible with existing policy-gradient pipelines. Extensive experiments demonstrate that SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training. Specifically, it outperforms GRPO by up to 2.80 points on average on math reasoning benchmarks, and it needs up to 4.93× fewer rollouts and up to 4.19× less wall-clock time to match GRPO's best accuracy.
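The three-stage step described in the abstract can be sketched as a small change to a standard GRPO-style training loop. The code below is a minimal illustration only, not the paper's exact formulation: the reposition rule (interpolating the fast weights back toward the pre-update "slow" weights with a coefficient reposition_alpha), the number of inner steps, and the names policy, grpo_loss_fn, and optimizer are assumptions made for exposition.

# Minimal sketch of a slow-fast update, assuming an existing GRPO-style loss.
# The reposition rule and slow-correction form are illustrative assumptions.
import torch

def sfpo_step(policy, batch, grpo_loss_fn, optimizer,
              num_fast_steps=3, reposition_alpha=0.5):
    """One slow-fast step: fast inner steps -> reposition -> slow correction."""
    # Snapshot the "slow" weights before the fast trajectory.
    slow_params = [p.detach().clone() for p in policy.parameters()]

    # Stage 1: short fast trajectory of inner steps on the same rollout batch.
    for _ in range(num_fast_steps):
        loss = grpo_loss_fn(policy, batch)   # objective and rollouts reused unchanged
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    # Stage 2: reposition to control off-policy drift
    # (assumed rule: interpolate between fast and slow weights).
    with torch.no_grad():
        for p, p_slow in zip(policy.parameters(), slow_params):
            p.mul_(reposition_alpha).add_(p_slow, alpha=1.0 - reposition_alpha)

    # Stage 3: final slow correction from the repositioned point.
    loss = grpo_loss_fn(policy, batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

Because the rollout batch and the loss are reused unchanged across the inner steps, a sketch of this form stays plug-compatible with an existing policy-gradient pipeline; only the parameter-update schedule differs.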
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
- Clipping-Free Policy Optimization for Large Language Models [30.663054788473598]
Reinforcement learning has become central to post-training large language models. Dominant algorithms rely on clipping mechanisms that introduce optimization issues at scale. We propose Clipping-Free Policy Optimization, which replaces clipping with a convex penalty derived from Total Variation divergence constraints.
arXiv Detail & Related papers (2026-01-30T10:32:37Z)
- A Step Back: Prefix Importance Ratio Stabilizes Policy Optimization [58.116300485427764]
Reinforcement learning post-training can elicit reasoning behaviors in large language models. Token-level correction often leads to unstable training dynamics when the degree of off-policyness is large. We propose a simple yet effective objective, Minimum Prefix Ratio (MinPRO).
arXiv Detail & Related papers (2026-01-30T08:47:19Z)
- M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization [9.358876832727239]
Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs). We find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades. We introduce M-GRPO, a framework that leverages a slowly evolving momentum model to provide a stable training target. We also propose an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories.
arXiv Detail & Related papers (2025-12-15T08:07:23Z)
- On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral [59.14787085809595]
We identify Lazy Likelihood Displacement (LLD) as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses. We propose a lightweight likelihood-preserving regularization, LLDS, for GRPO that activates only when a trajectory's likelihood decreases.
arXiv Detail & Related papers (2025-12-03T19:41:15Z)
- GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping [63.33669214116784]
GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks. It restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates. It substantially mitigates implicit over-optimization without relying on heavy KL regularization.
arXiv Detail & Related papers (2025-10-25T14:51:17Z)
- BAPO: Stabilizing Off-Policy Reinforcement Learning for LLMs via Balanced Policy Optimization with Adaptive Clipping [69.74252624161652]
We propose BAlanced Policy Optimization with Adaptive Clipping (BAPO). BAPO dynamically adjusts clipping bounds to adaptively re-balance positive and negative contributions, preserve entropy, and stabilize RL optimization. On the AIME 2024 and AIME 2025 benchmarks, our 7B BAPO model surpasses open-source counterparts such as SkyWork-OR1-7B.
arXiv Detail & Related papers (2025-10-21T12:55:04Z)
- Asymmetric Proximal Policy Optimization: mini-critics boost LLM reasoning [49.57517969069136]
We introduce Asymmetric Proximal Policy Optimization (AsyPPO), a simple and scalable framework that restores the critic's role while remaining efficient in large-model settings. AsyPPO employs a set of lightweight mini-critics, each trained on disjoint prompt shards. It consistently improves learning stability and performance over strong baselines across multiple benchmarks.
arXiv Detail & Related papers (2025-10-02T04:24:27Z)
- TempSamp-R1: Effective Temporal Sampling with Reinforcement Fine-Tuning for Video LLMs [67.55973229034319]
This paper introduces TempSamp-R1, a new reinforcement fine-tuning framework designed to improve the effectiveness of adapting multimodal large language models (MLLMs) to video temporal grounding tasks. We show that TempSamp-R1 outperforms GRPO-based baselines, establishing new state-of-the-art performance on benchmark datasets.
arXiv Detail & Related papers (2025-09-22T17:30:15Z)
- BranchGRPO: Stable and Efficient GRPO with Structured Branching in Diffusion Models [57.304411396229035]
We present BranchGRPO, a method that restructures the rollout process into a branching tree. On HPDv2.1 image alignment, BranchGRPO improves alignment scores by up to 16% over DanceGRPO. A hybrid variant, BranchGRPO-Mix, further accelerates training to 4.7× faster than DanceGRPO without degrading alignment.
arXiv Detail & Related papers (2025-09-07T12:53:06Z)
- VerIPO: Cultivating Long Reasoning in Video-LLMs via Verifier-Guided Iterative Policy Optimization [59.39976343879587]
VerIPO aims to gradually improve video LLMs' capacity for generating deep, long-term reasoning chains. The training loop benefits from GRPO's expansive search and DPO's targeted optimization. Our trained models exceed the direct inference of large-scale instruction-tuned Video-LLMs.
arXiv Detail & Related papers (2025-05-25T06:41:28Z)
- PPO-BR: Dual-Signal Entropy-Reward Adaptation for Trust Region Policy Optimization [0.0]
PPO-BR establishes a paradigm for adaptive RL by fusing exploration and convergence signals into a single trust region. This work bridges a critical gap in phase-aware learning, enabling real-world deployment in safety-critical systems like robotic surgery.
arXiv Detail & Related papers (2025-05-23T10:30:58Z)
- Learn Your Reference Model for Real Good Alignment [3.091688550418396]
Offline methods for Large Language Model (LLM) alignment are susceptible to overoptimization. We propose a new paradigm of offline alignment methods, called Trust Region, which dynamically updates the reference policy throughout the training process. Our results show that TR alignment methods effectively mitigate overoptimization, enabling models to maintain strong performance even when substantially deviating from the initial reference policy.
arXiv Detail & Related papers (2024-04-15T10:44:31Z)