WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning
- URL: http://arxiv.org/abs/2602.17025v1
- Date: Thu, 19 Feb 2026 02:43:35 GMT
- Title: WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning
- Authors: Gagan Mundada, Zihan Huang, Rohan Surana, Sheldon Yu, Jennifer Yuntong Zhang, Xintong Li, Tong Yu, Lina Yao, Jingbo Shang, Julian McAuley, Junda Wu
- Abstract summary: Group Relative Policy Optimization is effective for training language models on complex reasoning. We propose Weakly Supervised GRPO, which improves rollout efficiency by converting terminal rewards into correctness-aware guidance.
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Group Relative Policy Optimization (GRPO) is effective for training language models on complex reasoning. However, since the objective is defined relative to a group of sampled trajectories, extended deliberation creates more chances to realize relative gains, leading to inefficient reasoning and overthinking, and complicating the trade-off between correctness and rollout efficiency. Controlling this behavior is difficult in practice for two reasons: (i) length penalties are hard to calibrate, because longer rollouts may reflect harder problems that require longer reasoning, so penalizing tokens risks truncating useful reasoning along with redundant continuation; and (ii) supervision that directly indicates when to continue or stop is typically unavailable beyond final-answer correctness. We propose Weakly Supervised GRPO (WS-GRPO), which improves rollout efficiency by converting terminal rewards into correctness-aware guidance over partial trajectories. Unlike global length penalties, which are hard to calibrate, WS-GRPO trains a preference model from outcome-only correctness to produce prefix-level signals that indicate when additional continuation is beneficial. WS-GRPO thus supplies outcome-derived continue/stop guidance, reducing redundant deliberation while maintaining accuracy. We provide theoretical results and show empirically on reasoning benchmarks that WS-GRPO substantially reduces rollout length while remaining competitive with GRPO baselines.
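To make the mechanics concrete, the following is a minimal Python sketch of the two ingredients the abstract describes: GRPO's group-relative advantage over a group of sampled rollouts, and a weakly supervised prefix-level continue/stop signal derived from outcome-only correctness. This is a sketch under stated assumptions, not the authors' implementation: the function names (`grpo_advantages`, `prefix_dataset_from_outcomes`, `should_continue`), the all-prefixes labeling heuristic, and the fixed decision threshold are all illustrative.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    """GRPO-style group-relative advantage: each rollout's terminal
    reward is standardized against the other rollouts sampled for the
    same prompt, so a rollout gains only by beating its own group."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def prefix_dataset_from_outcomes(rollouts):
    """Weak supervision (illustrative heuristic): turn outcome-only
    correctness into prefix-level labels by treating every prefix of a
    correct rollout as a case where continuing was beneficial, and
    every prefix of an incorrect rollout as one where it was not."""
    examples = []
    for tokens, is_correct in rollouts:
        for t in range(1, len(tokens) + 1):
            examples.append((tokens[:t], 1.0 if is_correct else 0.0))
    return examples

def should_continue(preference_model, prefix, threshold=0.5):
    """Continue/stop gate (threshold is an assumed hyperparameter):
    keep generating only while the learned preference model predicts
    that further deliberation is likely to improve correctness."""
    return preference_model(prefix) >= threshold

# Toy usage: four rollouts for one prompt with terminal rewards in {0, 1}.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[ 1., -1., -1.,  1.]
rollouts = [(["step1", "step2", "answer"], True),
            (["step1", "ramble", "ramble", "answer"], False)]
print(len(prefix_dataset_from_outcomes(rollouts)))  # 3 + 4 = 7 prefixes
```

The point of the second function is the contrast with a global length penalty: rather than taxing every token, a preference model trained on such outcome-derived prefix labels can estimate, per prefix, whether continuing is still likely to pay off, which is the continue/stop guidance the abstract refers to.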
Related papers
- Unbiased Dynamic Pruning for Efficient Group-Based Policy Optimization [60.87651283510059]
Group Relative Policy Optimization (GRPO) effectively scales LLM reasoning but incurs prohibitive computational costs. We propose Dynamic Pruning Policy Optimization (DPPO), a framework that enables dynamic pruning while preserving unbiased gradient estimation. To mitigate the data sparsity induced by pruning, we introduce Dense Prompt Packing, a window-based greedy strategy.
arXiv Detail & Related papers (2026-03-04T14:48:53Z)
- Difficulty-Estimated Policy Optimization [38.86673795561421]
We propose Difficulty-Estimated Policy Optimization (DEPO), a novel framework designed to optimize the efficiency and robustness of reasoning alignment. Our approach significantly lowers the computational barrier for training high-performance reasoning models, offering a more sustainable path for reasoning scaling.
arXiv Detail & Related papers (2026-02-06T04:12:23Z)
- AMIR-GRPO: Inducing Implicit Preference Signals into GRPO [15.759757442328388]
Reinforcement learning has become the primary paradigm for aligning large language models on complex reasoning tasks. GRPO is widely used in large-scale post-training but faces structural limitations in reasoning-heavy settings. AMIR-GRPO augments GRPO with an implicit DPO-style contrastive regularizer constructed directly from intra-group reward rankings.
arXiv Detail & Related papers (2026-01-07T07:22:58Z)
- DaGRPO: Rectifying Gradient Conflict in Reasoning via Distinctiveness-Aware Group Relative Policy Optimization [20.66452395111739]
We propose Distinctiveness-aware Group Relative Policy Optimization (DaGRPO). DaGRPO incorporates two core mechanisms: (1) Sequence-level Gradient Rectification, which utilizes fine-grained scoring to dynamically mask sample pairs with low distinctiveness; and (2) Off-policy Data Augmentation, which introduces high-quality anchors to recover training signals for challenging tasks. In-depth analysis confirms that DaGRPO effectively mitigates gradient explosion and accelerates the emergence of long-chain reasoning capabilities.
arXiv Detail & Related papers (2025-12-06T07:51:36Z)
- Repurposing Synthetic Data for Fine-grained Search Agent Supervision [81.95597592711688]
LLM-based search agents are increasingly trained on entity-centric synthetic data. Prevailing training methods discard this rich entity information, relying instead on sparse, outcome-based rewards. We introduce Entity-aware Group Relative Policy Optimization (E-GRPO), a novel framework that formulates a dense entity-aware reward function.
arXiv Detail & Related papers (2025-10-28T17:50:40Z)
- DRPO: Efficient Reasoning via Decoupled Reward Policy Optimization [37.96314154235252]
We propose a novel framework that decouples the length-based learning signal of correct rollouts from incorrect ones. With a 1.5B model, our method achieves 77% length reduction with only 1.1% performance loss on simple questions.
arXiv Detail & Related papers (2025-10-06T04:18:13Z)
- HiPO: Hybrid Policy Optimization for Dynamic Reasoning in LLMs [54.16300997612526]
Large Language Models (LLMs) increasingly rely on Chain-of-Thought (CoT) reasoning to improve accuracy on complex tasks. This paper introduces Hybrid Policy Optimization (HiPO), a framework for adaptive reasoning control. Experiments across mathematics and coding benchmarks demonstrate that HiPO can substantially reduce token length while maintaining or improving accuracy.
arXiv Detail & Related papers (2025-09-28T16:46:12Z)
- On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization [52.76330545825083]
Reinforcement learning (RL) has become popular for enhancing the reasoning capabilities of large language models (LLMs). We identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses only marginally increases or even decreases during training. We develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens.
arXiv Detail & Related papers (2025-05-24T18:58:51Z)
- DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization [50.91849555841057]
Group Relative Policy Optimization is a reinforcement learning method for large reasoning models (LRMs). We introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO.
arXiv Detail & Related papers (2025-05-18T11:08:32Z)