Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning
- URL: http://arxiv.org/abs/2602.22751v1
- Date: Thu, 26 Feb 2026 08:40:06 GMT
- Title: Know What You Know: Metacognitive Entropy Calibration for Verifiable RL Reasoning
- Authors: Qiannian Zhao, Chen Yang, Jinhao Jing, Yunke Zhang, Xuhui Ren, Lu Yu, Shijie Zhang, Hongzhi Yin
- Abstract summary: Large reasoning models (LRMs) have emerged as a powerful paradigm for solving complex real-world tasks. Most existing outcome-only RLVR pipelines rely almost exclusively on a binary correctness signal and largely ignore the model's intrinsic uncertainty. We propose EGPO, a metacognitive entropy calibration framework that explicitly integrates intrinsic uncertainty into RLVR for enhancing LRMs.
- Score: 31.629261193485053
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large reasoning models (LRMs) have emerged as a powerful paradigm for solving complex real-world tasks. In practice, these models are predominantly trained via Reinforcement Learning with Verifiable Rewards (RLVR), yet most existing outcome-only RLVR pipelines rely almost exclusively on a binary correctness signal and largely ignore the model's intrinsic uncertainty. We term this discrepancy the uncertainty-reward mismatch: high- and low-uncertainty solutions are treated equivalently, preventing the policy from learning to "know what it knows" and impeding the shift from optimizing for correct answers to optimizing for effective reasoning paths. This limitation is especially critical in reasoning-centric tasks such as mathematics and question answering, where performance hinges on the quality of the model's internal reasoning process rather than mere memorization of final answers. To address this, we propose EGPO, a metacognitive entropy calibration framework that explicitly integrates intrinsic uncertainty into RLVR to enhance LRMs. EGPO estimates per-sample uncertainty with a zero-overhead entropy proxy derived from token-level likelihoods and aligns it with extrinsic correctness through an asymmetric calibration mechanism that preserves correct reasoning while selectively regulating overconfident failures, enabling stable, uncertainty-aware policy optimization. Moreover, EGPO recovers informative learning signals from otherwise degenerate group-based rollouts without modifying the verifier or the reward definition. Extensive experiments across multiple benchmarks demonstrate that EGPO yields substantial and consistent improvements in reasoning performance, establishing a principled path for advancing LRMs through metacognitive entropy calibration.
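The abstract does not spell out EGPO's formulas, so the following is only a minimal sketch of the two ingredients it names: a zero-overhead entropy proxy computed from token log-probabilities already produced during rollout, and an asymmetric calibration of GRPO-style group-normalized advantages. The specific weighting rule, the degenerate-group fallback, and all function names below are illustrative assumptions, not the paper's method.

```python
import numpy as np

def entropy_proxy(token_logprobs):
    """Mean negative log-likelihood of the generated tokens.

    Uses only log-probs already produced during rollout, so it adds no
    extra forward passes; higher values indicate a more uncertain
    generation.
    """
    return -float(np.mean(token_logprobs))

def calibrated_advantages(rewards, entropies, beta=0.5):
    """Group-normalized advantages with an asymmetric entropy term.

    `rewards` are binary verifier outcomes for one group of rollouts of
    the same prompt (GRPO-style); `entropies` are the matching entropy
    proxies. The calibration rule here is illustrative only.
    """
    r = np.asarray(rewards, dtype=float)
    h = np.asarray(entropies, dtype=float)
    if r.std() < 1e-8:
        # Degenerate group (all correct or all wrong): the usual
        # normalized advantage vanishes. Recover a relative signal by
        # ranking on the entropy proxy instead: among failures, punish
        # confident (low-entropy) ones hardest; among successes, prefer
        # confident ones.
        z = (h - h.mean()) / (h.std() + 1e-8)
        return beta * z if r.mean() == 0.0 else -beta * z
    adv = (r - r.mean()) / (r.std() + 1e-8)
    # Asymmetry: correct samples keep their advantage untouched; wrong
    # samples are penalized in proportion to their confidence.
    wrong = r == 0.0
    confidence = 1.0 / (1.0 + h)      # maps entropy >= 0 into (0, 1]
    adv[wrong] *= 1.0 + beta * confidence[wrong]
    return adv
```

In a GRPO-style trainer, advantages of this shape would simply replace the standard group-normalized ones in the policy-gradient loss, leaving the verifier and reward definition unchanged.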
Related papers
- VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction [55.04308051033549]
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Model (LLM) reasoning. We introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent of external verifiers.
arXiv Detail & Related papers (2026-02-13T03:40:52Z)
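VI-CuRL's actual curriculum construction is not described in this snippet; the sketch below only illustrates the general confidence-guided idea: score each prompt by the model's average sequence confidence over a few samples, then order training from high- to low-confidence prompts. The `sample_fn` interface, the easy-to-hard ordering, and the confidence measure are assumptions.

```python
import math
from statistics import mean

def prompt_confidence(sample_fn, prompt, n_samples=4):
    """Average sequence confidence over a few sampled completions.

    `sample_fn(prompt)` is an assumed interface returning the list of
    token log-probs for one sampled completion; no external verifier
    is consulted at any point.
    """
    confs = [math.exp(mean(sample_fn(prompt))) for _ in range(n_samples)]
    return mean(confs)

def build_curriculum(sample_fn, prompts):
    """Order prompts from most to least confident (easy to hard)."""
    return sorted(prompts,
                  key=lambda p: prompt_confidence(sample_fn, p),
                  reverse=True)
```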
- Uncertainty-aware Generative Recommendation [52.0751022792023]
Uncertainty-aware Generative Recommendation (UGR) is a unified framework that leverages uncertainty as a critical signal for adaptive optimization. UGR not only yields superior recommendation performance but also fundamentally stabilizes training, preventing the performance degradation often observed in standard methods.
arXiv Detail & Related papers (2026-02-12T08:48:51Z)
- R-Align: Enhancing Generative Reward Models through Rationale-Centric Meta-Judging [69.96389360650072]
We show that reasoning fidelity is highly predictive of downstream RLHF outcomes, beyond standard label accuracy. We propose Rationale-Centric Alignment (R-Align), which augments training with gold judgments and explicitly supervises rationale alignment.
arXiv Detail & Related papers (2026-02-06T15:17:11Z)
- UCPO: Uncertainty-Aware Policy Optimization [12.847800921274617]
Existing Large Language Models (LLMs) often suffer from Advantage Bias due to binary decision spaces and static uncertainty rewards, inducing excessive conservatism or overconfidence. This paper unveils the root causes of reward hacking and overconfidence in current RL paradigms incorporating uncertainty-based rewards, based on which we propose the UnCertainty-Aware Policy Optimization (UCPO) framework.
arXiv Detail & Related papers (2026-01-30T07:07:42Z)
- Confidence-Based Response Abstinence: Improving LLM Trustworthiness via Activation-Based Uncertainty Estimation [7.3923284353934875]
We propose a method for confidence estimation in retrieval-augmented generation (RAG) systems that aligns closely with the correctness of large language model (LLM) outputs. Our approach extends prior uncertainty quantification methods by leveraging raw feed-forward network (FFN) activations as auto-regressive signals. Our results demonstrate that activation-based confidence modeling offers a scalable, architecture-aware path toward trustworthy RAG deployment.
arXiv Detail & Related papers (2025-10-15T16:55:56Z)
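The probe architecture used in that paper is not given here; below is a minimal, generic sketch of the recipe the abstract describes: capture feed-forward-network activations with a forward hook, map them to a correctness probability with a small head, and abstain below a confidence threshold. The mean pooling, probe shape, and threshold are assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceProbe(nn.Module):
    """Small head mapping pooled FFN activations to P(answer correct)."""
    def __init__(self, hidden_dim):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(hidden_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, ffn_acts):                 # (batch, seq, hidden)
        pooled = ffn_acts.mean(dim=1)            # mean-pool over tokens
        return torch.sigmoid(self.head(pooled)).squeeze(-1)

def attach_ffn_hook(ffn_module, store):
    """Capture an FFN block's output activations during generation."""
    def hook(_module, _inputs, output):
        store["acts"] = output.detach()
    return ffn_module.register_forward_hook(hook)

def respond_or_abstain(probe, store, threshold=0.6):
    """Abstain whenever predicted confidence falls below the threshold."""
    confidence = probe(store["acts"])
    return confidence >= threshold               # False means abstain
```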
- Uncertainty Quantification for Regression using Proper Scoring Rules [76.24649098854219]
We introduce a unified UQ framework for regression based on proper scoring rules, such as CRPS, logarithmic, squared error, and quadratic scores. We derive closed-form expressions for the uncertainty measures under practical parametric assumptions and show how to estimate them using ensembles of models. Our broad evaluation on synthetic and real-world regression datasets provides guidance for selecting reliable UQ measures.
arXiv Detail & Related papers (2025-09-30T17:52:12Z)
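As background for the CRPS score mentioned above, the helper below implements the standard closed-form CRPS of a Gaussian predictive distribution N(mu, sigma^2) at an observation y; this is the textbook formula (lower is better), not code from the paper.

```python
import math

def crps_gaussian(y, mu, sigma):
    """Closed-form CRPS for a Gaussian predictive distribution.

    CRPS(N(mu, sigma^2), y)
        = sigma * ( z*(2*Phi(z) - 1) + 2*phi(z) - 1/sqrt(pi) ),
    with z = (y - mu) / sigma, phi/Phi the standard normal pdf/cdf.
    Lower is better; it rewards both sharpness and calibration.
    """
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf
                    - 1.0 / math.sqrt(math.pi))
```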
- CLUE: Neural Networks Calibration via Learning Uncertainty-Error Alignment [7.702016079410588]
We introduce CLUE (Calibration via Learning Uncertainty-Error Alignment), a novel approach that aligns predicted uncertainty with observed error during training. We show that CLUE achieves superior calibration quality and competitive predictive performance with respect to state-of-the-art approaches.
arXiv Detail & Related papers (2025-05-28T19:23:47Z)
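CLUE's precise objective is not quoted here; a common way to "align predicted uncertainty with observed error during training" is to penalize the gap between the two, as in the hedged sketch below. The MSE task loss, the absolute-gap alignment term, and the `lam` weight are assumptions.

```python
import torch
import torch.nn.functional as F

def clue_style_loss(pred, target, log_sigma, lam=0.1):
    """Task loss plus a penalty on the uncertainty-error gap.

    `log_sigma` is a per-sample predicted log-uncertainty head output.
    The observed error is detached so the alignment term shapes the
    uncertainty head rather than the predictions themselves.
    """
    observed_err = (pred - target).abs().detach()
    sigma = log_sigma.exp()                      # keep uncertainty positive
    task = F.mse_loss(pred, target)
    align = (sigma - observed_err).abs().mean()  # uncertainty-error gap
    return task + lam * align
```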
- Enhancing LLM Reliability via Explicit Knowledge Boundary Modeling [41.19330514054401]
Large language models (LLMs) are prone to hallucination stemming from misaligned self-awareness. We propose the Explicit Knowledge Boundary Modeling framework, which integrates fast and slow reasoning systems to harmonize reliability and usability.
arXiv Detail & Related papers (2025-03-04T03:16:02Z)
- Assessing Correctness in LLM-Based Code Generation via Uncertainty Estimation [0.0]
We explore uncertainty estimation as a proxy for correctness in LLM-generated code. We adapt two state-of-the-art techniques from natural language generation to the domain of code generation. Our findings indicate a strong correlation between the uncertainty computed through these techniques and correctness.
arXiv Detail & Related papers (2025-02-17T10:03:01Z)
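The two adapted techniques are not named in this snippet; one simple member of this family is the entropy of the empirical distribution over several sampled programs, where high disagreement suggests a lower likelihood of correctness. The `sample_fn` interface and the whitespace-based canonicalization below are assumptions.

```python
import math
from collections import Counter

def generation_uncertainty(sample_fn, prompt, n_samples=10):
    """Entropy of the empirical distribution over sampled programs.

    `sample_fn(prompt)` is an assumed interface returning one sampled
    program as a string. Programs are bucketed by a crude canonical
    form (whitespace stripped); clustering by test-case behavior would
    be a stronger equivalence, at extra cost.
    """
    buckets = Counter("".join(sample_fn(prompt).split())
                      for _ in range(n_samples))
    probs = [count / n_samples for count in buckets.values()]
    return -sum(p * math.log(p) for p in probs)  # high = uncertain
```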
- Model-Based Epistemic Variance of Values for Risk-Aware Policy Optimization [59.758009422067]
We consider the problem of quantifying uncertainty over expected cumulative rewards in model-based reinforcement learning.
We propose a new uncertainty Bellman equation (UBE) whose solution converges to the true posterior variance over values.
We introduce a general-purpose policy optimization algorithm, Q-Uncertainty Soft Actor-Critic (QU-SAC), that can be applied to either risk-seeking or risk-averse policy optimization.
arXiv Detail & Related papers (2023-12-07T15:55:58Z)
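The paper derives a specific uncertainty Bellman equation (UBE) with convergence guarantees; the tabular sketch below shows only the generic shape of such a backup, in which local epistemic variance is propagated with a squared discount factor. The local-variance estimator and the tabular einsum form are assumptions, not the paper's derivation.

```python
import numpy as np

def uncertainty_bellman_backup(U, local_var, P, pi, gamma=0.99):
    """One sweep of a tabular uncertainty Bellman backup.

    U:         (S, A) current uncertainty-over-values estimates
    local_var: (S, A) local epistemic variance, e.g. the posterior
               variance of a learned model's predictions
    P:         (S, A, S) transition probabilities
    pi:        (S, A) policy (rows sum to 1)

    Uncertainty propagates like a value function but with gamma**2,
    since variances of discounted sums scale quadratically in gamma.
    """
    expected_next_U = np.einsum("sap,pb,pb->sa", P, pi, U)
    return local_var + gamma**2 * expected_next_U
```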
- Understanding, Predicting and Better Resolving Q-Value Divergence in Offline-RL [86.0987896274354]
We first identify a fundamental pattern, self-excitation, as the primary cause of Q-value estimation divergence in offline RL.
We then propose a novel Self-Excite Eigenvalue Measure (SEEM) metric to track the evolving properties of the Q-network during training.
For the first time, our theory can reliably predict, at an early stage of training, whether training will diverge.
arXiv Detail & Related papers (2023-10-06T17:57:44Z)