Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning
- URL: http://arxiv.org/abs/2511.07483v1
- Date: Wed, 12 Nov 2025 01:01:23 GMT
- Title: Beyond Correctness: Confidence-Aware Reward Modeling for Enhancing Large Language Model Reasoning
- Authors: Qianxi He, Qingyu Ren, Shanzhe Lei, Xuhong Wang, Yingchun Wang,
- Abstract summary: We propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training.
- Score: 13.228177497050567
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in large language models (LLMs) have shifted the post-training paradigm from traditional instruction tuning and human preference alignment toward reinforcement learning (RL) focused on reasoning capabilities. However, numerous technical reports indicate that purely rule-based reward RL frequently results in poor-quality reasoning chains or inconsistencies between reasoning processes and final answers, particularly when the base model is of smaller scale. During RL exploration, models may employ low-quality reasoning chains due to a lack of knowledge, occasionally producing correct answers by chance and receiving rewards from established rule-based judges. This constrains the potential for resource-limited organizations to conduct direct reinforcement learning training on smaller-scale models. We propose a novel confidence-based reward model tailored for enhancing STEM reasoning capabilities. Unlike conventional approaches, our model penalizes not only incorrect answers but also low-confidence correct responses, thereby promoting more robust and logically consistent reasoning. We validate the effectiveness of our approach through static evaluations, Best-of-N inference tests, and PPO-based RL training. Our method outperforms several state-of-the-art open-source reward models across diverse STEM benchmarks. We release our code and model at https://github.com/qianxiHe147/C2RM.
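The abstract does not spell out the exact scoring rule, but the core idea can be illustrated with a minimal sketch: reward correctness, yet discount or penalize correct answers that the model itself is not confident in. The threshold, penalty values, and the `confidence_aware_reward` helper below are illustrative assumptions, not the released C2RM implementation.

```python
# A minimal sketch, NOT the released C2RM implementation: reward correctness,
# but penalize correct answers the model is not confident in.
# The threshold and penalty values below are illustrative assumptions.

def confidence_aware_reward(is_correct: bool, confidence: float,
                            low_conf_threshold: float = 0.5) -> float:
    """Score a response from its correctness and a confidence estimate in [0, 1]."""
    if not is_correct:
        return -1.0           # incorrect answers are penalized outright
    if confidence < low_conf_threshold:
        return -0.5           # correct but low-confidence: still penalized
    return confidence         # confident, correct answers earn the most reward


if __name__ == "__main__":
    print(confidence_aware_reward(True, 0.92))   # confident and correct    -> 0.92
    print(confidence_aware_reward(True, 0.30))   # lucky low-confidence hit -> -0.5
    print(confidence_aware_reward(False, 0.90))  # confidently wrong        -> -1.0
```

In a Best-of-N or PPO setup, a score of this shape would replace the binary rule-based reward, so that lucky guesses backed by weak reasoning no longer receive full credit.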
Related papers
- Native Reasoning Models: Training Language Models to Reason on Unverifiable Data [16.065264121785294]
We introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning. NRT reframes the training problem by treating the reasoning process as a latent variable. NRT achieves state-of-the-art performance among verifier-free methods.
arXiv Detail & Related papers (2026-02-12T04:15:46Z)
- You Need Reasoning to Learn Reasoning: The Limitations of Label-Free RL in Weak Base Models [12.14455026524814]
We investigate the generalizability of label-free RL approaches to base models with limited reasoning capabilities. We find that label-free RL is highly dependent on the base model's pre-existing reasoning capability. We propose a simple yet effective method for label-free RL that utilizes curriculum learning to progressively introduce harder problems.
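A minimal sketch of the curriculum idea in this entry, assuming each problem carries a hypothetical difficulty estimate (a `base_model_solve_rate` field); the field name, number of stages, and schedule are illustrative, not the paper's setup.

```python
# A minimal sketch of curriculum scheduling: sort problems from easiest to
# hardest by an assumed solve-rate estimate and release progressively harder
# subsets to the RL loop. Field name and schedule are illustrative assumptions.

def curriculum_stages(problems: list[dict], num_stages: int = 3):
    """Yield progressively harder training subsets, easiest problems first."""
    ordered = sorted(problems, key=lambda p: -p["base_model_solve_rate"])
    stage_size = max(1, len(ordered) // num_stages)
    for stage in range(1, num_stages + 1):
        # Each stage adds the next harder slice on top of everything seen so far.
        end = len(ordered) if stage == num_stages else stage * stage_size
        yield ordered[:end]
```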
arXiv Detail & Related papers (2025-11-07T01:05:11Z)
- Confidence as a Reward: Transforming LLMs into Reward Models [54.98336080630691]
Confidence-as-a-Reward (CRew) is a training-free method that utilizes token-level confidence in the model's final answers as a proxy for reward. We show that CRew outperforms existing training-free reward approaches on the MATH500 and RewardMATH benchmarks. We propose CRew-DPO, a training strategy that constructs preference data from confidence scores combined with correctness signals.
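One simple way to realize such a training-free confidence signal is to aggregate the log-probabilities the model assigned to its final-answer tokens; the geometric-mean aggregation and the helper names below are assumptions for illustration, not necessarily CRew's exact scoring rule.

```python
# A hedged sketch of "confidence as reward": turn the per-token log-probs of
# the final answer into one confidence score, then use it for Best-of-N
# selection. Aggregation choice and helper names are assumptions.
import math

def answer_confidence(answer_token_logprobs: list[float]) -> float:
    """Geometric-mean probability over the answer tokens."""
    if not answer_token_logprobs:
        return 0.0
    mean_logprob = sum(answer_token_logprobs) / len(answer_token_logprobs)
    return math.exp(mean_logprob)

def pick_best_of_n(candidates: list[tuple[str, list[float]]]) -> str:
    """Best-of-N selection: keep the answer the model is most confident in."""
    return max(candidates, key=lambda c: answer_confidence(c[1]))[0]
```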
arXiv Detail & Related papers (2025-10-15T12:51:47Z)
- Generalist Reward Models: Found Inside Large Language Models [50.7432354447554]
We show that a powerful reward model is already latently present within any Large Language Model (LLM) trained via standard next-token prediction. We prove that this endogenous reward is not a heuristic, but is equivalent to a reward function learned through offline inverse reinforcement learning. We also prove that subsequent reinforcement learning using this endogenous reward leads to a policy with a provably superior error bound compared to the base model.
arXiv Detail & Related papers (2025-06-29T13:45:54Z)
- Beyond Accuracy: Dissecting Mathematical Reasoning for LLMs Under Reinforcement Learning [93.00629872970364]
Reinforcement learning (RL) has become the dominant paradigm for improving the performance of language models on complex reasoning tasks. We introduce SPARKLE, a fine-grained analytic framework to dissect the effects of RL across three key dimensions. We study whether difficult problems -- those yielding no RL signals and mixed-quality reasoning traces -- can still be effectively used for training.
arXiv Detail & Related papers (2025-06-05T07:53:59Z)
- ProRL: Prolonged Reinforcement Learning Expands Reasoning Boundaries in Large Language Models [89.37819814048288]
We introduce ProRL, a novel training methodology that incorporates KL divergence control, reference policy resetting, and a diverse suite of tasks. Our empirical analysis reveals that RL-trained models consistently outperform base models across a wide range of pass@k evaluations. These findings offer new insights into the conditions under which RL meaningfully expands reasoning boundaries in language models.
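A hedged sketch, not ProRL's implementation, of the standard KL-regularized reward shaping that such methods build on: each token's task reward is offset by a penalty proportional to the policy's drift from a frozen reference policy. The coefficient value is an illustrative assumption.

```python
# KL-regularized reward shaping sketch: per-token reward minus a penalty on
# the log-probability gap between the current policy and a frozen reference.
# The kl_coef value is illustrative, not ProRL's setting.

def kl_shaped_rewards(rewards: list[float],
                      logprobs_policy: list[float],
                      logprobs_reference: list[float],
                      kl_coef: float = 0.05) -> list[float]:
    """Per-token shaped reward: task reward minus a KL penalty to the reference."""
    return [r - kl_coef * (lp - lr)
            for r, lp, lr in zip(rewards, logprobs_policy, logprobs_reference)]
```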
arXiv Detail & Related papers (2025-05-30T17:59:01Z)
- From Accuracy to Robustness: A Study of Rule- and Model-based Verifiers in Mathematical Reasoning [41.02508512078575]
We take mathematical reasoning as a case study and conduct a comprehensive analysis of various verifiers in both static evaluation and RL training scenarios. First, we find that current open-source rule-based verifiers often fail to recognize equivalent answers presented in different formats, resulting in non-negligible false negative rates. We investigate model-based verifiers as a potential solution to address these limitations.
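A toy illustration (not the paper's verifiers) of the false-negative failure mode: an exact string match rejects "1/2" versus "0.5" even though they are equivalent, while a normalizing check accepts them.

```python
# Exact string matching vs. a normalizing numeric check: the former produces
# false negatives on equivalent answers written in different formats.
from fractions import Fraction

def exact_match(pred: str, gold: str) -> bool:
    return pred.strip() == gold.strip()

def normalized_match(pred: str, gold: str) -> bool:
    try:
        return Fraction(pred.strip()) == Fraction(gold.strip())
    except (ValueError, ZeroDivisionError):
        return exact_match(pred, gold)

print(exact_match("1/2", "0.5"))       # False: a rule-based false negative
print(normalized_match("1/2", "0.5"))  # True: equivalent answers recognized
```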
arXiv Detail & Related papers (2025-05-28T10:28:41Z)
- Think-RM: Enabling Long-Horizon Reasoning in Generative Reward Models [50.4652276723694]
Think-RM generates flexible, self-guided reasoning traces that support advanced capabilities. Think-RM achieves state-of-the-art results on RM-Bench, outperforming both BT RM and vertically scaled GenRM by 8%.
arXiv Detail & Related papers (2025-05-22T05:56:11Z)
- Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model? [66.61292196146016]
Reinforcement Learning with Verifiable Rewards (RLVR) has recently demonstrated notable success in enhancing the reasoning performance of large language models (LLMs). This study critically examines the current state of RLVR. We find that the current training setup does not elicit fundamentally new reasoning patterns.
arXiv Detail & Related papers (2025-04-18T17:59:56Z)
- RewardBench: Evaluating Reward Models for Language Modeling [100.28366840977966]
We present RewardBench, a benchmark dataset and codebase for the evaluation of reward models.
The dataset is a collection of prompt-chosen-rejected trios spanning chat, reasoning, and safety.
On the RewardBench leaderboard, we evaluate reward models trained with a variety of methods.
arXiv Detail & Related papers (2024-03-20T17:49:54Z)
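An illustrative sketch, assuming a simplified record layout, of how a prompt-chosen-rejected trio scores a reward model: the model "wins" a trio when it rates the chosen response above the rejected one, and accuracy over many trios is the reported metric. Field names and the scorer signature are assumptions, not RewardBench's actual code.

```python
# Pairwise-accuracy sketch over prompt-chosen-rejected trios. The dict fields
# and the score() signature are illustrative assumptions.
from typing import Callable

def trio_accuracy(trios: list[dict], score: Callable[[str, str], float]) -> float:
    """Fraction of trios where score(prompt, chosen) > score(prompt, rejected)."""
    if not trios:
        return 0.0
    wins = sum(
        score(t["prompt"], t["chosen"]) > score(t["prompt"], t["rejected"])
        for t in trios
    )
    return wins / len(trios)
```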
This list is automatically generated from the titles and abstracts of the papers on this site.