Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification
- URL: http://arxiv.org/abs/2602.05717v1
- Date: Thu, 05 Feb 2026 14:41:57 GMT
- Title: Anchored Policy Optimization: Mitigating Exploration Collapse Via Support-Constrained Rectification
- Authors: Tianyi Wang, Long Li, Hongcan Guo, Yibiao Chen, Yixia Li, Yong Wang, Yun Chen, Guanhua Chen
- Abstract summary: We propose Anchored Policy Optimization (APO) to shift the paradigm from global Shape Matching to Support Coverage. APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
- Score: 14.911955979675772
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement Learning with Verifiable Rewards (RLVR) is increasingly viewed as a tree pruning mechanism. However, we identify a systemic pathology termed Recursive Space Contraction (RSC), an irreversible collapse driven by the combined dynamics of positive sharpening and negative squeezing, where the sampling probability of valid alternatives vanishes. While Kullback-Leibler (KL) regularization aims to mitigate this, it imposes a rigid Shape Matching constraint that forces the policy to mimic the reference model's full density, creating a gradient conflict with the sharpening required for correctness. We propose Anchored Policy Optimization (APO), shifting the paradigm from global Shape Matching to Support Coverage. By defining a Safe Manifold based on the reference model's high-confidence support, APO permits aggressive sharpening for efficiency while selectively invoking a restorative force during error correction to prevent collapse. We theoretically derive that APO serves as a gradient-aligned mechanism to maximize support coverage, enabling an Elastic Recovery that re-inflates valid branches. Empirical evaluations on mathematical benchmarks demonstrate that APO breaks the accuracy-diversity trade-off, significantly improving Pass@1 while restoring the Pass@K diversity typically lost by standard policy gradient methods.
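The abstract's Safe Manifold idea can be illustrated with a minimal sketch. Assumptions not taken from the paper: a log-probability threshold `tau` defines the reference model's high-confidence support, and the restorative force is a one-sided deficit penalty applied only on incorrect rollouts; the function name and exact form are illustrative, not APO's actual implementation.

```python
import numpy as np

def apo_regularizer(logp_policy, logp_ref, reward, tau=-4.0):
    """Illustrative sketch of a support-coverage penalty in APO's spirit.

    Instead of a full KL (shape matching), penalize only tokens that
    (a) lie in the reference model's high-confidence support
    (logp_ref > tau, the 'Safe Manifold') and (b) occur on incorrect
    rollouts (reward == 0), where a restorative force is invoked.
    """
    safe = logp_ref > tau                 # reference high-confidence support
    need_restore = (reward == 0) & safe   # restore only during error correction
    # one-sided deficit: pull policy mass back up on safe tokens it squeezed
    deficit = np.clip(logp_ref - logp_policy, 0.0, None)
    return np.where(need_restore, deficit, 0.0).mean()
```

On correct rollouts the penalty vanishes, so aggressive sharpening proceeds unhindered; only when the policy is wrong and has squeezed safe-manifold tokens does the restorative term activate.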
Related papers
- BandPO: Bridging Trust Regions and Ratio Clipping via Probability-Aware Bounds for LLM Reinforcement Learning [49.25750348525603]
BandPO replaces canonical clipping with Band, a unified theoretical operator that projects trust regions into dynamic, probability-aware clipping intervals. We show that BandPO consistently outperforms canonical clipping and Clip-Higher, while robustly mitigating entropy collapse.
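The idea of a probability-aware clipping interval can be sketched as follows. This is an illustrative guess at the shape of such an operator, not the paper's Band operator: here the band half-width simply scales with the inverse square root of the old probability, so rare tokens get a wider ratio interval than dominant ones.

```python
import numpy as np

def band_clip_interval(p_old, eps=0.2, p_floor=1e-3):
    """Illustrative probability-aware clipping band (assumed form):
    low-probability tokens get a wider ratio interval, so rare but valid
    tokens can grow faster, while high-probability tokens stay tightly
    constrained."""
    scale = 1.0 / np.sqrt(np.maximum(p_old, p_floor))  # wider band for rare tokens
    lo = np.maximum(1.0 - eps * scale, 0.0)
    hi = 1.0 + eps * scale
    return lo, hi

def band_clipped_objective(ratio, advantage, p_old, eps=0.2):
    """PPO-style surrogate with the probability-aware band."""
    lo, hi = band_clip_interval(p_old, eps)
    return np.minimum(ratio * advantage, np.clip(ratio, lo, hi) * advantage).mean()
```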
arXiv Detail & Related papers (2026-03-05T08:03:05Z)
- Amortized Reasoning Tree Search: Decoupling Proposal and Decision in Large Language Models [2.5170433424424874]
Reinforcement Learning with Verifiable Rewards has established itself as the dominant paradigm for instilling rigorous reasoning capabilities in Large Language Models. We identify a critical pathology in this alignment process: the systematic suppression of valid but rare (low-likelihood under the base model distribution) reasoning paths. We propose Amortized Reasoning Tree Search (ARTS) to counteract this collapse without discarding the base model's latent diversity.
arXiv Detail & Related papers (2026-02-13T11:52:50Z)
- Mitigating Mismatch within Reference-based Preference Optimization [55.07698254211876]
Direct Preference Optimization (DPO) has become the de facto standard for offline preference alignment of large language models. DPO weighs each update relative to a reference, which stabilizes training by regularizing the updates within a trusted region. This reliance becomes problematic for pessimistic pairs, where the reference model prefers the rejected response. We modify DPO to treat the reference as neutral when it is pessimistic by replacing the reference margin $\Delta_{\mathrm{ref}}$ with $\max\{0, \Delta_{\mathrm{ref}}\}$.
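Written in margin form, the modification described above amounts to rectifying the reference margin inside the standard DPO loss. A minimal sketch (the function name is illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dpo_loss_pessimistic_neutral(m_policy, m_ref, beta=0.1):
    """DPO loss with a rectified reference margin.

    m_policy = log pi(chosen) - log pi(rejected) under the policy;
    m_ref is the same margin under the reference model. Replacing m_ref
    with max(0, m_ref) treats a pessimistic reference (m_ref < 0, i.e.
    it prefers the rejected response) as neutral rather than letting it
    pull the update in the wrong direction.
    """
    m_ref_rect = np.maximum(0.0, m_ref)
    return -np.log(sigmoid(beta * (m_policy - m_ref_rect))).mean()
```

For pessimistic pairs the loss is identical to using a perfectly neutral reference, while optimistic pairs are handled exactly as in standard DPO.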
arXiv Detail & Related papers (2026-02-12T12:55:51Z)
- Unifying Stable Optimization and Reference Regularization in RLHF [64.16830602324345]
This paper introduces a unified regularization approach that balances the objectives of preventing reward hacking and maintaining stable policy updates. Our simple yet principled alignment objective yields a weighted supervised fine-tuning loss with a superior trade-off, demonstrably improving alignment results while reducing implementation complexity.
arXiv Detail & Related papers (2026-02-12T03:31:19Z)
- Back to Basics: Revisiting Exploration in Reinforcement Learning for LLM Reasoning via Generative Probabilities [10.235183326885794]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as an indispensable paradigm for enhancing reasoning in Large Language Models (LLMs). We analyze this issue from the perspective of sampling probability dynamics, identifying that the standard objective disproportionately reinforces the highest-likelihood paths. We propose a novel Advantage Re-weighting Mechanism (ARM) designed to equilibrate the confidence levels across all correct responses.
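One plausible form of such a re-weighting is sketched below. The weighting scheme (inverse sequence probability with a temperature `alpha`, applied only to correct responses) is a hypothetical reading of the abstract, not ARM's published formula.

```python
import numpy as np

def arm_advantages(logp_seq, reward, alpha=1.0):
    """Hypothetical advantage re-weighting in ARM's spirit: among correct
    responses, down-weight already-dominant high-likelihood paths and
    up-weight rare correct ones, so reinforcement does not concentrate
    on the single most probable path."""
    base = reward - reward.mean()             # group-relative advantage
    p = np.exp(logp_seq)                      # sequence probability
    w = np.where(reward > 0, (1.0 / np.maximum(p, 1e-8)) ** alpha, 1.0)
    w = w / w.mean()                          # keep the overall scale
    return base * w
```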
arXiv Detail & Related papers (2026-02-05T04:06:55Z)
- Rethinking the Trust Region in LLM Reinforcement Learning [72.25890308541334]
Proximal Policy Optimization (PPO) serves as the de facto standard algorithm for Large Language Models (LLMs). We propose Divergence Proximal Policy Optimization (DPPO), which substitutes clipping with a more principled constraint. DPPO achieves superior training stability and efficiency compared to existing methods, offering a more robust foundation for RL-based fine-tuning.
arXiv Detail & Related papers (2026-02-04T18:59:04Z)
- Stochastic Decision Horizons for Constrained Reinforcement Learning [22.755234154139174]
Constrained Markov decision processes (CMDPs) provide a principled model for handling constraints, such as safety and other auxiliary objectives, in reinforcement learning. We propose a Control-as-Inference formulation based on state-action-dependent decision horizons, where constraint violations attenuate reward contributions and shorten the effective planning horizon via state-action-dependent continuation. We propose two violation semantics, absorbing and virtual termination, that share the same survival-weighted return but result in distinct structures that lead to SAC/MPO-style policy improvement.
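The survival-weighted return mentioned above can be sketched as follows, assuming (this is our reading, not the paper's exact definition) that each step's per-step violation probability multiplicatively reduces the chance that later rewards still count.

```python
import numpy as np

def survival_weighted_return(rewards, violation_prob, gamma=0.99):
    """Sketch of a survival-weighted return: each step's reward is
    discounted by the probability that the episode has 'survived' all
    prior constraint checks, so violations attenuate later rewards and
    shorten the effective planning horizon."""
    # survival prob before step t = product of (1 - violation) over steps < t
    survive = np.cumprod(np.concatenate(([1.0], 1.0 - violation_prob[:-1])))
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * survive * rewards))
```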
arXiv Detail & Related papers (2026-02-04T14:27:16Z)
- Expected Return Causes Outcome-Level Mode Collapse in Reinforcement Learning and How to Fix It with Inverse Probability Scaling [2.8825582215429186]
We show that outcome-level mode collapse is a structural consequence of the expected-return objective itself. We propose a minimal correction: inverse probability scaling, which removes outcome frequency from the learning signal.
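The correction is simple enough to state in a few lines. A minimal sketch (function name and the probability floor are ours): dividing the learning signal by the outcome's sampling probability cancels the frequency factor, so each distinct correct outcome contributes equally in expectation regardless of how often it is sampled.

```python
import numpy as np

def ips_gradient_weight(logp_outcome, reward):
    """Inverse probability scaling: divide the policy-gradient weight by
    the outcome's current sampling probability, cancelling the frequency
    term that makes the plain expected return reinforce already-dominant
    modes."""
    p = np.exp(logp_outcome)
    return reward / np.maximum(p, 1e-8)  # floor avoids division by ~0
```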
arXiv Detail & Related papers (2026-01-29T13:03:33Z)
- MAESTRO: Meta-learning Adaptive Estimation of Scalarization Trade-offs for Reward Optimization [56.074760766965085]
Group-Relative Policy Optimization has emerged as an efficient paradigm for aligning Large Language Models (LLMs). We propose MAESTRO, which treats reward scalarization as a dynamic latent policy, leveraging the model's terminal hidden states as a semantic bottleneck. We formulate this as a contextual bandit problem within a bi-level optimization framework, where a lightweight Conductor network co-evolves with the policy by utilizing group-relative advantages as a meta-reward signal.
arXiv Detail & Related papers (2026-01-12T05:02:48Z)
- Entropy Ratio Clipping as a Soft Global Constraint for Stable Reinforcement Learning [49.92803982100042]
We propose using the entropy ratio between the current and previous policies as a new global metric. We introduce an Entropy Ratio Clipping (ERC) mechanism that imposes bidirectional constraints on the entropy ratio. This stabilizes policy updates at the global distribution level and compensates for the inability of PPO-clip to regulate probability shifts of un-sampled actions.
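A bidirectional entropy-ratio constraint can be sketched as a hinge penalty on H(new)/H(old). The band `[low, high]` and the penalty form are assumptions for illustration, not ERC's published mechanism:

```python
import numpy as np

def entropy_ratio_penalty(probs_new, probs_old, low=0.9, high=1.1):
    """Sketch of a bidirectional entropy-ratio constraint: compute the
    ratio of full-distribution entropies and penalize updates whose
    ratio leaves the band [low, high], discouraging both entropy
    collapse and entropy explosion at the global distribution level."""
    def entropy(p):
        p = np.clip(p, 1e-12, 1.0)
        return -np.sum(p * np.log(p), axis=-1)
    ratio = entropy(probs_new) / np.maximum(entropy(probs_old), 1e-12)
    # two-sided hinge: zero inside the band, linear outside it
    return np.maximum(0.0, ratio - high) + np.maximum(0.0, low - ratio)
```

Because the ratio is computed over the full distribution, the penalty reacts to probability shifts on un-sampled actions, which per-sample PPO clipping cannot see.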
arXiv Detail & Related papers (2025-12-05T10:26:32Z)
- GRPO-Guard: Mitigating Implicit Over-Optimization in Flow Matching via Regulated Clipping [63.33669214116784]
GRPO-Guard is a simple yet effective enhancement to existing GRPO frameworks. It restores a balanced and step-consistent importance ratio, ensuring that PPO clipping properly constrains harmful updates. It substantially mitigates implicit over-optimization without relying on heavy KL regularization.
arXiv Detail & Related papers (2025-10-25T14:51:17Z)
- Convergence and Generalization of Anti-Regularization for Parametric Models [0.0]
Anti-regularization introduces a reward term with a reversed sign into the loss function. We formalize spectral safety conditions and trust-region constraints. We design a lightweight safeguard that combines a projection operator with gradient clipping to guarantee stable intervention.
arXiv Detail & Related papers (2025-08-24T15:34:17Z)
- Reparameterization Proximal Policy Optimization [35.59197802340267]
Reparameterization policy gradient (RPG) is promising for improving sample efficiency by leveraging differentiable dynamics. We draw inspiration from Proximal Policy Optimization (PPO), which uses a surrogate objective to enable stable sample reuse. We propose Reparameterization Proximal Policy Optimization (RPO), a stable and sample-efficient RPG-based method. RPO enables stable sample reuse over multiple epochs by employing a policy gradient clipping mechanism tailored for RPG.
arXiv Detail & Related papers (2025-08-08T10:50:55Z)
- Improper Learning with Gradient-based Policy Optimization [62.50997487685586]
We consider an improper reinforcement learning setting where the learner is given M base controllers for an unknown Markov Decision Process.
We propose a gradient-based approach that operates over a class of improper mixtures of the controllers.
arXiv Detail & Related papers (2021-02-16T14:53:55Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.