On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
- URL: http://arxiv.org/abs/2512.04220v1
- Date: Wed, 03 Dec 2025 19:41:15 GMT
- Title: On GRPO Collapse in Search-R1: The Lazy Likelihood-Displacement Death Spiral
- Authors: Wenlong Deng, Yushu Li, Boying Gong, Yi Ren, Christos Thrampoulidis, Xiaoxiao Li,
- Abstract summary: We identify Lazy Likelihood Displacement (LLD) as the core mechanism driving this failure.<n>LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses.<n>We propose a lightweight likelihood-preserving regularization LLDS for GRPO that activates only when a trajectory's likelihood decreases.
- Score: 59.14787085809595
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Tool-integrated (TI) reinforcement learning (RL) enables large language models (LLMs) to perform multi-step reasoning by interacting with external tools such as search engines and retrievers. Group Relative Policy Optimization (GRPO), exemplified by the recent Search-R1, offers fast convergence and a value-free formulation that makes it appealing for this setting, yet consistently suffers from training collapse. We identify Lazy Likelihood Displacement (LLD), a systematic reduction or stagnation in the likelihood of both correct and incorrect responses, as the core mechanism driving this failure. LLD emerges early and triggers a self-reinforcing LLD Death Spiral, where declining likelihood leads to low-confidence responses, inflating gradients, and ultimately causing collapse. We empirically characterize this process across models on a Search-R1-style, search-integrated question answering task, revealing a consistent three-phase trajectory: early stagnation, steady decay, and accelerated collapse. To address this, we propose a lightweight likelihood-preserving regularization LLDS for GRPO that activates only when a trajectory's likelihood decreases, and regularizes only the tokens responsible. This fine-grained structure mitigates LLD with minimal interference to optimization. Across seven open-domain and multi-hop QA benchmarks, our method stabilizes training, prevents gradient explosion, and yields substantial performance improvements, including +37.8% gains on Qwen2.5-3B and +32.0% gains on Qwen2.5-7B. Our results establish LLD as a fundamental bottleneck in GRPO-based TIRL and provide a practical path toward stable, scalable training of tool-integrated LLM.
Related papers
- Learning from the Irrecoverable: Error-Localized Policy Optimization for Tool-Integrated LLM Reasoning [36.23268783033404]
In long-horizon tool-integrated reasoning trajectories, an early irrecoverable mistake can determine success or failure.<n>We propose Error-Localized Policy Optimization (ELPO) to localize the first irrecoverable step and leverage it for fine-grained credit assignment.<n>Our code will be publicly released soon.
arXiv Detail & Related papers (2026-02-10T09:50:24Z) - Trajectory Guard -- A Lightweight, Sequence-Aware Model for Real-Time Anomaly Detection in Agentic AI [0.0]
Trajectory Guard is a Siamese Recurrent Autoencoder with a hybrid loss function that jointly learns task-trajectory alignment via contrastive learning and sequential validity via reconstruction.<n>At 32 ms latency, our approach runs 17$-27times$ faster than LLM Judge baselines, enabling real-time safety verification in production deployments.
arXiv Detail & Related papers (2026-01-02T00:27:11Z) - M-GRPO: Stabilizing Self-Supervised Reinforcement Learning for Large Language Models with Momentum-Anchored Policy Optimization [9.358876832727239]
Self-supervised reinforcement learning (RL) presents a promising approach for enhancing the reasoning capabilities of Large Language Models (LLMs)<n>We find that existing methods suffer from a critical failure mode under long-horizon training: a "policy collapse" where performance precipitously degrades.<n>We introduce M-GRPO, a framework that leverages a slowly evolving momentum model to provide a stable training target.<n>We also propose an adaptive filtering method based on the interquartile range (IQR) that dynamically prunes low-entropy trajectories.
arXiv Detail & Related papers (2025-12-15T08:07:23Z) - Alignment Tipping Process: How Self-Evolution Pushes LLM Agents Off the Rails [103.05296856071931]
We identify the Alignment Tipping Process (ATP), a critical post-deployment risk unique to self-evolving Large Language Model (LLM) agents.<n>ATP arises when continual interaction drives agents to abandon alignment constraints established during training in favor of reinforced, self-interested strategies.<n>Our experiments show that alignment benefits erode rapidly under self-evolution, with initially aligned models converging toward unaligned states.
arXiv Detail & Related papers (2025-10-06T14:48:39Z) - Slow-Fast Policy Optimization: Reposition-Before-Update for LLM Reasoning [45.51804571136028]
Reinforcement learning (RL) has become central to enhancing reasoning in large language models (LLMs)<n>We introduce Slow-Fast Policy Optimization (SFPO), a simple yet efficient framework to address these limitations via decomposing each step into three stages.<n>SFPO consistently improves stability, reduces rollouts, and accelerates convergence of reasoning RL training.
arXiv Detail & Related papers (2025-10-05T07:22:54Z) - Agentic Reinforced Policy Optimization [66.96989268893932]
Large-scale reinforcement learning with verifiable rewards (RLVR) has demonstrated its effectiveness in harnessing the potential of large language models (LLMs) for single-turn reasoning tasks.<n>Current RL algorithms inadequately balance the models' intrinsic long-horizon reasoning capabilities and their proficiency in multi-turn tool interactions.<n>We propose Agentic Reinforced Policy Optimization (ARPO), a novel agentic RL algorithm tailored for training multi-turn LLM-based agents.
arXiv Detail & Related papers (2025-07-26T07:53:11Z) - On the Effect of Negative Gradient in Group Relative Deep Reinforcement Optimization [52.76330545825083]
Reinforcement learning (RL) has become popular in enhancing the reasoning capabilities of large language models (LLMs)<n>We identify a previously unrecognized phenomenon we term Lazy Likelihood Displacement (LLD), wherein the likelihood of correct responses marginally increases or even decreases during training.<n>We develop a method called NTHR, which downweights penalties on tokens contributing to the LLD. Unlike prior DPO-based approaches, NTHR takes advantage of GRPO's group-based structure, using correct responses as anchors to identify influential tokens.
arXiv Detail & Related papers (2025-05-24T18:58:51Z) - Stable Reinforcement Learning for Efficient Reasoning [2.838966689544288]
GRPO-$lambda$ is an efficient and stabilized variant of GRPO.<n>It dynamically adjusts the reward strategy by monitoring the correctness ratio.<n>It improves average accuracy by 1.48% while reducing CoT sequence length by 47.3%.
arXiv Detail & Related papers (2025-05-23T16:43:03Z) - A Minimalist Approach to LLM Reasoning: from Rejection Sampling to Reinforce [68.99924691391048]
We revisit GRPO from a reinforce-like algorithm perspective and analyze its core components.<n>We find that a simple rejection sampling baseline, RAFT, yields competitive performance than GRPO and PPO.<n>Motivated by this insight, we propose Reinforce-Rej, a minimal extension of policy gradient that filters both entirely incorrect and entirely correct samples.
arXiv Detail & Related papers (2025-04-15T16:15:02Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.