Related papers: Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning

Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning

URL: http://arxiv.org/abs/2511.07368v1
Date: Mon, 10 Nov 2025 18:25:26 GMT
Title: Consistency Is Not Always Correct: Towards Understanding the Role of Exploration in Post-Training Reasoning
Authors: Dake Bu, Wei Huang, Andi Han, Atsushi Nitanda, Bo Xue, Qingfu Zhang, Hau-San Wong, Taiji Suzuki,
Abstract summary: Foundation models exhibit broad knowledge but limited task-specific reasoning.<n> RLVR and inference scaling motivate post-training strategies such as RLVR and inference scaling.<n>We show that RLVR induces a squeezing effect, reducing reasoning entropy and forgetting some correct paths.
Score: 75.79451512757844
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Foundation models exhibit broad knowledge but limited task-specific reasoning, motivating post-training strategies such as RLVR and inference scaling with outcome or process reward models (ORM/PRM). While recent work highlights the role of exploration and entropy stability in improving pass@K, empirical evidence points to a paradox: RLVR and ORM/PRM typically reinforce existing tree-like reasoning paths rather than expanding the reasoning scope, raising the question of why exploration helps at all if no new patterns emerge. To reconcile this paradox, we adopt the perspective of Kim et al. (2025), viewing easy (e.g., simplifying a fraction) versus hard (e.g., discovering a symmetry) reasoning steps as low- versus high-probability Markov transitions, and formalize post-training dynamics through Multi-task Tree-structured Markov Chains (TMC). In this tractable model, pretraining corresponds to tree expansion, while post-training corresponds to chain-of-thought reweighting. We show that several phenomena recently observed in empirical studies arise naturally in this setting: (1) RLVR induces a squeezing effect, reducing reasoning entropy and forgetting some correct paths; (2) population rewards of ORM/PRM encourage consistency rather than accuracy, thereby favoring common patterns; and (3) certain rare, high-uncertainty reasoning paths by the base model are responsible for solving hard problem instances. Together, these explain why exploration -- even when confined to the base model's reasoning scope -- remains essential: it preserves access to rare but crucial reasoning traces needed for difficult cases, which are squeezed out by RLVR or unfavored by inference scaling. Building on this, we further show that exploration strategies such as rejecting easy instances and KL regularization help preserve rare reasoning traces. Empirical simulations corroborate our theoretical results.

Related papers

How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors.<n>We introduce a novel benchmark that decomposes reasoning into atomic core skills.<n>We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z)
RAVR: Reference-Answer-guided Variational Reasoning for Large Language Models [21.671577399379885]
We introduce RAVR, an end-to-end framework that uses answer-conditioned reasoning as a variational surrogate for question-only reasoning.<n> RAVR reduces hesitation, strengthens conclusion consolidation, and promotes problem-specific strategies in reasoning.
arXiv Detail & Related papers (2025-10-29T06:18:37Z)
VAR: Visual Attention Reasoning via Structured Search and Backtracking [49.427842994857635]
We introduce Visual Attention Reasoning, a framework that recasts grounded reasoning as a structured search.<n> VAR decomposes the reasoning process into two key stages: traceable evidence grounding and search-based chain-of-thought.<n>We show that our 7B model, VAR-7B, sets a new state-of-the-art on a comprehensive suite of hallucination and safety benchmarks.
arXiv Detail & Related papers (2025-10-21T13:18:44Z)
Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models [15.797612515648412]
Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning.<n>Recent studies reveal that their final answers often contradict their own reasoning traces.<n>We hypothesize that this inconsistency stems from two competing mechanisms for generating answers: CoT reasoning and memory retrieval.<n>We introduce FARL, a novel fine-tuning framework that integrates memory unlearning with reinforcement learning.
arXiv Detail & Related papers (2025-09-29T01:13:33Z)
Test-time Prompt Intervention [22.35022545068874]
We propose PI, a novel framework for Test-time Prompt Intervention.<n> PI provides an interface to dynamically guide and regulate reasoning paths during inference.<n>This allows human problem-solving expertise and cognitive science principles to be seamlessly integrated into LLMs' reasoning processes.
arXiv Detail & Related papers (2025-08-04T15:17:13Z)
CTRLS: Chain-of-Thought Reasoning via Latent State-Transition [57.51370433303236]
Chain-of-thought (CoT) reasoning enables large language models to break down complex problems into interpretable intermediate steps.<n>We introduce groundingS, a framework that formulates CoT reasoning as a Markov decision process (MDP) with latent state transitions.<n>We show improvements in reasoning accuracy, diversity, and exploration efficiency across benchmark reasoning tasks.
arXiv Detail & Related papers (2025-07-10T21:32:18Z)
Lost at the Beginning of Reasoning [85.17612793300238]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction.<n>We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps.
arXiv Detail & Related papers (2025-06-27T09:53:57Z)
Reasoning with Exploration: An Entropy Perspective on Reinforcement Learning for LLMs [112.40801692473723]
Balancing exploration and exploitation is a central goal in reinforcement learning (RL)<n>We introduce a minimal modification to standard RL with only one line of code: augmenting the advantage function with an entropy-based term.<n>Our method achieves significant gains on the Pass@K metric, even when evaluated with extremely large K values.
arXiv Detail & Related papers (2025-06-17T17:54:03Z)
A Survey of Efficient Reasoning for Large Reasoning Models: Language, Multimodality, and Beyond [88.5807076505261]
Large Reasoning Models (LRMs) have demonstrated strong performance gains by scaling up the length of Chain-of-Thought (CoT) reasoning during inference.<n>A growing concern lies in their tendency to produce excessively long reasoning traces.<n>This inefficiency introduces significant challenges for training, inference, and real-world deployment.
arXiv Detail & Related papers (2025-03-27T15:36:30Z)
Causal Representation Learning Made Identifiable by Grouping of Observational Variables [8.157856010838382]
Causal Representation Learning aims to learn a causal model for hidden features in a data-driven manner. Here, we show identifiability based on novel, weak constraints. We also propose a novel self-supervised estimation framework consistent with the model.
arXiv Detail & Related papers (2023-10-24T10:38:02Z)

This list is automatically generated from the titles and abstracts of the papers in this site.

This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.