Related papers: How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains

URL: http://arxiv.org/abs/2601.08134v2
Date: Wed, 21 Jan 2026 18:03:19 GMT
Title: How Reliable are Confidence Estimators for Large Reasoning Models? A Systematic Benchmark on High-Stakes Domains
Authors: Reza Khanmohammadi, Erfan Miahi, Simerjot Kaur, Ivan Brugere, Charese H. Smiley, Kundan Thind, Mohammad M. Ghassemi
Abstract summary: Miscalibration of Large Reasoning Models undermines their reliability in high-stakes domains. We introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs.
Score: 7.845652284569666
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The miscalibration of Large Reasoning Models (LRMs) undermines their reliability in high-stakes domains, necessitating methods to accurately estimate the confidence of their long-form, multi-step outputs. To address this gap, we introduce the Reasoning Model Confidence estimation Benchmark (RMCB), a public resource of 347,496 reasoning traces from six popular LRMs across different architectural families. The benchmark is constructed from a diverse suite of datasets spanning high-stakes domains, including clinical, financial, legal, and mathematical reasoning, alongside complex general reasoning benchmarks, with correctness annotations provided for all samples. Using RMCB, we conduct a large-scale empirical evaluation of over ten distinct representation-based methods, spanning sequential, graph-based, and text-based architectures. Our central finding is a persistent trade-off between discrimination (AUROC) and calibration (ECE): text-based encoders achieve the best AUROC (0.672), while structurally-aware models yield the best ECE (0.148), with no single method dominating both. Furthermore, we find that increased architectural complexity does not reliably outperform simpler sequential baselines, suggesting a performance ceiling for methods relying solely on chunk-level hidden states. This work provides the most comprehensive benchmark for this task to date, establishing rigorous baselines and demonstrating the limitations of current representation-based paradigms.
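
The abstract's distinction between discrimination (AUROC) and calibration (ECE) can be illustrated with a small sketch. This is not the benchmark's evaluation code: the function names `auroc` and `ece`, the equal-width binning, and the choice of 10 bins are assumptions made here for illustration, and the toy scores are invented. The sketch shows that two sets of confidence scores can rank samples identically (same AUROC) while differing sharply in calibration (different ECE), which is why a single method need not dominate both metrics.

```python
def auroc(scores, labels):
    """Pairwise AUROC: the probability that a randomly chosen correct
    sample (label 1) receives a higher confidence than a randomly chosen
    incorrect sample (label 0). Ties count as half a win."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ece(scores, labels, n_bins=10):
    """Expected calibration error with equal-width bins on [0, 1]
    (an assumed binning scheme): the sample-weighted average of
    |accuracy - mean confidence| over the non-empty bins."""
    bins = [[] for _ in range(n_bins)]
    for s, y in zip(scores, labels):
        idx = min(int(s * n_bins), n_bins - 1)  # a score of 1.0 goes in the last bin
        bins[idx].append((s, y))
    total = len(scores)
    error = 0.0
    for b in bins:
        if not b:
            continue
        mean_conf = sum(s for s, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        error += (len(b) / total) * abs(accuracy - mean_conf)
    return error


# Invented toy data: 1 = the answer was correct, 0 = it was incorrect.
labels = [1, 1, 0, 0]
spread = [0.9, 0.8, 0.3, 0.2]        # confident scores
squashed = [0.55, 0.54, 0.46, 0.45]  # same ordering, compressed toward 0.5

for name, scores in [("spread", spread), ("squashed", squashed)]:
    print(name, "AUROC:", auroc(scores, labels), "ECE:", round(ece(scores, labels), 3))
```

Both score sets order the samples the same way, so their AUROC is identical, but the compressed scores sit far from the observed accuracy in each bin and so have a much larger ECE.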

Related papers

STAR : Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction [78.0692157478247]
We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. We show that STAR consistently outperforms all baselines on both score-based and rank-based metrics.
arXiv Detail & Related papers (2026-02-12T16:30:07Z)
Trust in One Round: Confidence Estimation for Large Language Models via Structural Signals [13.89434979851652]
Large language models (LLMs) are increasingly deployed in domains where errors carry high social, scientific, or safety costs. We present Structural Confidence, a single-pass, model-agnostic framework that enhances output correctness prediction.
arXiv Detail & Related papers (2026-02-01T02:35:59Z)
Learning More from Less: Unlocking Internal Representations for Benchmark Compression [37.69575776639016]
We introduce REPCORE, which aligns heterogeneous hidden states into a unified latent space to construct representative coresets. Experiments on five benchmarks and over 200 models show consistent gains over output-based baselines in ranking correlation and estimation accuracy.
arXiv Detail & Related papers (2026-01-31T13:11:39Z)
ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning [2.1461777157838724]
We introduce ReasonBENCH, the first benchmark designed to quantify the underlying instability in the reasoning of large language models (LLMs). Across tasks from different domains, we find that the vast majority of reasoning strategies and models exhibit high instability. We further analyze the impact of prompts, model families, and scale on the trade-off between solve rate and stability.
arXiv Detail & Related papers (2025-12-08T18:26:58Z)
An Empirical Study of SOTA RCA Models: From Oversimplified Benchmarks to Realistic Failures [16.06503310632004]
We show that simple rule-based methods can match or even outperform state-of-the-art (SOTA) models on four widely used benchmarks. Our analysis highlights three common failure patterns: scalability issues, observability blind spots, and modeling bottlenecks.
arXiv Detail & Related papers (2025-10-06T11:30:03Z)
Rethinking Reward Models for Multi-Domain Test-Time Scaling [91.76069784586149]
Prior work generally assumes that process reward models (PRMs) outperform outcome reward models (ORMs) that assess only the final answer. We present the first unified evaluation of four reward model variants across 14 diverse domains. We attribute this to PRM-style stepwise scoring, which inherits label noise from LLM auto-labeling and has difficulty evaluating long reasoning trajectories.
arXiv Detail & Related papers (2025-10-01T04:21:14Z)
The Flaw of Averages: Quantifying Uniformity of Performance on Benchmarks [32.00464870277127]
We study benchmark reliability from a distributional perspective and introduce benchmark harmony. High harmony is a desirable benchmark property, indicating that the aggregate metric reflects uniform competence across models. By recommending that harmony should be reported alongside accuracy, we reframe evaluation from simple performance averages to a more robust, distributionally reliable measurement of performance.
arXiv Detail & Related papers (2025-09-30T02:14:30Z)
Discrete Markov Bridge [93.64996843697278]
We propose a novel framework specifically designed for discrete representation learning, called Discrete Markov Bridge. Our approach is built upon two key components: Matrix Learning and Score Learning.
arXiv Detail & Related papers (2025-05-26T09:32:12Z)
A NotSo Simple Way to Beat Simple Bench [0.0]
This paper presents a novel framework for enhancing reasoning capabilities in large language models (LLMs). We propose a multi-step prompting strategy coupled with global consistency checks to improve model accuracy and robustness. Our results reveal model-specific strengths: Claude excels in maintaining logical consistency, while GPT-4o exhibits exploratory creativity but struggles with ambiguous prompts.
arXiv Detail & Related papers (2024-12-12T16:04:31Z)
Revisiting BPR: A Replicability Study of a Common Recommender System Baseline [78.00363373925758]
We study the features of the BPR model, indicating their impact on its performance, and investigate open-source BPR implementations. Our analysis reveals inconsistencies between these implementations and the original BPR paper, leading to a significant decrease in performance of up to 50% for specific implementations. We show that the BPR model can achieve performance levels close to state-of-the-art methods on the top-n recommendation tasks and even outperform them on specific datasets.
arXiv Detail & Related papers (2024-09-21T18:39:53Z)
Cycles of Thought: Measuring LLM Confidence through Stable Explanations [53.15438489398938]
Large language models (LLMs) can reach and even surpass human-level accuracy on a variety of benchmarks, but their overconfidence in incorrect responses is still a well-documented failure mode. We propose a framework for measuring an LLM's uncertainty with respect to the distribution of generated explanations for an answer.
arXiv Detail & Related papers (2024-06-05T16:35:30Z)
Preserving Knowledge Invariance: Rethinking Robustness Evaluation of Open Information Extraction [49.15931834209624]
We present the first benchmark that simulates the evaluation of open information extraction models in the real world. We design and annotate a large-scale testbed in which each example is a knowledge-invariant clique. By further elaborating the robustness metric, a model is judged to be robust if its performance is consistently accurate on the overall cliques.
arXiv Detail & Related papers (2023-05-23T12:05:09Z)
A Confidence-based Partial Label Learning Model for Crowd-Annotated Named Entity Recognition [74.79785063365289]
Existing models for named entity recognition (NER) are mainly based on large-scale labeled datasets. We propose a Confidence-based Partial Label Learning (CPLL) method to integrate the prior confidence (given by annotators) and posterior confidences (learned by models) for crowd-annotated NER.
arXiv Detail & Related papers (2023-05-21T15:31:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.