The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs
- URL: http://arxiv.org/abs/2601.01580v1
- Date: Sun, 04 Jan 2026 15:59:15 GMT
- Title: The Two-Stage Decision-Sampling Hypothesis: Understanding the Emergence of Self-Reflection in RL-Trained LLMs
- Authors: Zibo Zhao, Yuanting Zha, Haipeng Zhang, Xingcheng Xu,
- Abstract summary: We introduce the Gradient Attribution Property to characterize how reward gradients distribute across policy components.<n>We prove that surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties exhibit Unbalanced Gradient Attribution.<n>We also empirically validate our theoretical predictions on arithmetic reasoning.
- Score: 8.563321259359244
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Self-reflection capabilities emerge in Large Language Models after RL post-training, with multi-turn RL achieving substantial gains over SFT counterparts. Yet the mechanism of how a unified optimization objective gives rise to functionally distinct capabilities of generating solutions and evaluating when to revise them remains opaque. To address this question, we introduce the Gradient Attribution Property to characterize how reward gradients distribute across policy components, formalized through the Two-Stage Decision-Sampling (DS) Hypothesis, which decomposes the policy into sampling ($π_{sample}$) for generation and decision ($π_{d}$) for verification. We prove that surrogate rewards exhibit Balanced Gradient Attribution, while SFT and KL penalties exhibit Unbalanced Gradient Attribution, with length-weighting creating asymmetric regularization that constrains $π_{sample}$ while leaving $π_{d}$ under-optimized, providing an theoretical explanation of why RL succeeds where SFT fails. We also empirically validate our theoretical predictions on arithmetic reasoning demonstrates that RL's superior generalization stems primarily from improved decision-making ($π_{d}$) rather than sampling capabilities, providing a first-principles mechanistic explanation for self-correction in thinking models.
Related papers
- Evidence-Augmented Policy Optimization with Reward Co-Evolution for Long-Context Reasoning [52.144281362465996]
We propose EAPO (Evidence-Augmented Policy Optimization) to apply Reinforcement Learning to long-context scenarios.<n>We first establish the Evidence-Augmented Reasoning paradigm, validating via Tree-Structured Evidence Sampling.<n>We then introduce a specialized RL algorithm where a reward model computes a Group-Relative Evidence Reward.<n>To sustain accurate supervision throughout training, we further incorporate an Adaptive Reward-Policy Co-Evolution mechanism.
arXiv Detail & Related papers (2026-01-15T11:40:57Z) - Learning to Reason in LLMs by Expectation Maximization [55.721496945401846]
We formalize reasoning as a latent variable model and derive an expectation-maximization objective for learning to reason.<n>This view connects EM and modern reward-based optimization, and shows that the main challenge lies in designing a sampling distribution that generates rationales that justify correct answers.
arXiv Detail & Related papers (2025-12-23T08:56:49Z) - A Theoretical Lens for RL-Tuned Language Models via Energy-Based Models [9.324642081509756]
Large language models (LLMs) trained via KL-regularized reinforcement learning demonstrate strong instruction following, self-correction, and reasoning abilities.<n>We exploit the closed-form energy-based model (EBM) structure of the optimal KL-regularized policy to provide a unified variational analysis of LLMs.
arXiv Detail & Related papers (2025-12-21T13:28:58Z) - Multimodal Reinforcement Learning with Agentic Verifier for AI Agents [131.46008226323423]
Argos is a principled multimodal reward agent to train reasoning models for agentic tasks.<n>By leveraging our agentic verifier across both SFT data and RL training, our model achieves state-of-the-art results.
arXiv Detail & Related papers (2025-12-03T04:42:47Z) - Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking [78.69179041551014]
We propose an information-theoretic reward modeling framework based on the Information Bottleneck principle.<n>We show that InfoRM filters out preference-irrelevant information to alleviate reward misgeneralization.<n>We also introduce IBL, a distribution-level regularization that penalizes such deviations, effectively expanding the optimization landscape.
arXiv Detail & Related papers (2025-10-15T15:51:59Z) - Uncalibrated Reasoning: GRPO Induces Overconfidence for Stochastic Outcomes [55.2480439325792]
Reinforcement learning (RL) has proven remarkably effective at improving the accuracy of language models in verifiable and deterministic domains like mathematics.<n>Here, we examine if current RL methods are also effective at optimizing language models in verifiable domains with outcomes, like scientific experiments.
arXiv Detail & Related papers (2025-08-15T20:50:53Z) - Accelerating LLM Reasoning via Early Rejection with Partial Reward Modeling [12.835376812101323]
We introduce the hypothesis that PRMs are also Partial Reward Models.<n>This allows for principled early rejection based on intermediate token-level signals.<n>On math reasoning benchmarks, our method achieves up to 1.4$times$-9$times$ reduction in inference FLOPs without degrading final performance.
arXiv Detail & Related papers (2025-08-04T00:58:56Z) - Supervised Optimism Correction: Be Confident When LLMs Are Sure [91.7459076316849]
We establish a novel theoretical connection between supervised fine-tuning and offline reinforcement learning.<n>We show that the widely used beam search method suffers from unacceptable over-optimism.<n>We propose Supervised Optimism Correction, which introduces a simple yet effective auxiliary loss for token-level $Q$-value estimations.
arXiv Detail & Related papers (2025-04-10T07:50:03Z) - Reinforcement Learning from Human Feedback without Reward Inference: Model-Free Algorithm and Instance-Dependent Analysis [16.288866201806382]
We develop a model-free RLHF best policy identification algorithm, called $mathsfBSAD$, without explicit reward model inference.<n>The algorithm identifies the optimal policy directly from human preference information in a backward manner.
arXiv Detail & Related papers (2024-06-11T17:01:41Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.