Related papers: APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards

APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards

URL: http://arxiv.org/abs/2602.00760v2
Date: Mon, 09 Feb 2026 06:21:10 GMT
Title: APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards
Authors: Kaiyan Chang, Chenwei Zhu, Yingfeng Luo, Yifu Huo, Chenglong Wang, Xiaoqian Liu, Qiaozhi He, Tong Xiao, Zhengtao Yu, Jingbo Zhu,
Abstract summary: Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs)<n>We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process.<n>We propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST.
Score: 61.52322047892064
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs) but introduces a critical side-effect known as Overthinking. We conduct a preliminary study to rethink this phenomenon from a fine-grained perspective. We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process. We formally define this specific position where the answer first stabilizes as the Reasoning Anchor. By analyzing pre- and post-anchor reasoning behaviors, we uncover the structural redundancy fixed in LRMs: the meaningless repetitive verification after deriving the first complete answer, which we term the Answer-Stable Tail (AST). Motivated by this observation, we propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST. Leveraging the policy optimization algorithm suitable for length penalties, our APR models achieved the performance-efficiency Pareto frontier at 1.5B and 7B scales averaged across five mathematical reasoning datasets while requiring substantially fewer computational resources for RL training.

Related papers

Stepwise Penalization for Length-Efficient Chain-of-Thought Reasoning [66.22060690012512]
Large reasoning models improve with more test-time computation, but often overthink, producing unnecessarily long chains-of-thought that raise cost without improving accuracy.<n>We propose Step-wise Adaptive Penalization (SWAP), a fine-grained framework that allocates length reduction across steps based on intrinsic contribution.
arXiv Detail & Related papers (2026-02-27T20:23:59Z)
Self-Compression of Chain-of-Thought via Multi-Agent Reinforcement Learning [34.10133693878611]
We propose a multi-agent RL framework that selectively penalizes redundant chunks, while preserving essential reasoning logic.<n>Our framework, Self-Compression via MARL (SCMA), instantiates redundancy detection and evaluation through two specialized agents.<n> Empirical evaluations across model scales demonstrate that SCMA reduces response length by 11.1% to 39.0% while boosting accuracy by 4.33% to 10.02%.
arXiv Detail & Related papers (2026-01-29T16:13:10Z)
Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning [52.144281362465996]
We propose EAPO (Evidence-Augmented Policy Optimization) to apply Reinforcement Learning to long-context scenarios.<n>We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling.<n>We then introduce a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward.<n>To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism.
arXiv Detail & Related papers (2026-01-15T11:40:57Z)
Structured Reasoning for Large Language Models [59.215789462977206]
We propose Structured Reasoning (SCR), a framework that decouples reasoning trajectories into explicit, evaluable, and trainable components.<n>SCR substantially improves reasoning efficiency and self-verification.<n>Compared with existing reasoning paradigms, it reduces output token length by up to 50%.
arXiv Detail & Related papers (2026-01-12T04:04:01Z)
ENTRA: Entropy-Based Redundancy Avoidance in Large Language Model Reasoning [30.786062954495403]
Large Reasoning Models (LRMs) often suffer from overthinking, generating unnecessarily long reasoning chains even for simple tasks.<n>We propose ENTRA, an entropy-based training framework that suppresses redundant reasoning while preserving performance.
arXiv Detail & Related papers (2026-01-12T01:26:30Z)
Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit [114.83867400179354]
Overthinking can degrade overall performance of large language models.<n>We categorize reasoning into three stages: insufficient exploration stage, compensatory reasoning stage, and reasoning convergence stage.<n>We develop a lightweight thresholding strategy based on rules to improve reasoning accuracy.
arXiv Detail & Related papers (2025-08-25T03:17:17Z)
Lost at the Beginning of Reasoning [85.17612793300238]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction.<n>We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps.
arXiv Detail & Related papers (2025-06-27T09:53:57Z)
Retrieval is Not Enough: Enhancing RAG Reasoning through Test-Time Critique and Optimization [58.390885294401066]
Retrieval-augmented generation (RAG) has become a widely adopted paradigm for enabling knowledge-grounded large language models (LLMs)<n>RAG pipelines often fail to ensure that model reasoning remains consistent with the evidence retrieved, leading to factual inconsistencies or unsupported conclusions.<n>We propose AlignRAG, a novel iterative framework grounded in Critique-Driven Alignment (CDA)<n>We introduce AlignRAG-auto, an autonomous variant that dynamically terminates refinement, removing the need to pre-specify the number of critique iterations.
arXiv Detail & Related papers (2025-04-21T04:56:47Z)
Concise Reasoning via Reinforcement Learning [13.657506042120167]
We revisit the core principles of reinforcement learning (RL)<n>We uncover a natural correlation between conciseness and accuracy that has been largely overlooked.<n>We show that introducing a secondary phase of RL training, using a very small set of problems, can significantly reduce chains of thought.
arXiv Detail & Related papers (2025-04-07T15:35:54Z)
ReaRAG: Knowledge-guided Reasoning Enhances Factuality of Large Reasoning Models with Iterative Retrieval Augmented Generation [38.64751082999587]
Large Reasoning Models (LRMs) exhibit remarkable reasoning abilities but rely primarily on parametric knowledge, limiting factual accuracy.<n>We propose ReaRAG, a factuality-enhanced reasoning model that explores diverse queries without excessive iterations.<n>Our study enhances LRMs' factuality while effectively integrating robust reasoning for Retrieval-Augmented Generation (RAG)
arXiv Detail & Related papers (2025-03-27T17:44:18Z)

This list is automatically generated from the titles and abstracts of the papers in this site.