OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
- URL: http://arxiv.org/abs/2510.15096v1
- Date: Thu, 16 Oct 2025 19:35:22 GMT
- Title: OpenEstimate: Evaluating LLMs on Reasoning Under Uncertainty with Real-World Data
- Authors: Alana Renda, Jillian Ross, Michael Cafarella, Jacob Andreas,
- Abstract summary: Real-world settings require language models to grapple with incomplete information and reason under uncertainty.<n>OpenEstimate is a benchmark for evaluating LMs on numerical estimation.<n>We find that LM-elicited priors are often inaccurate and overconfident.
- Score: 42.23843583401247
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Real-world settings where language models (LMs) are deployed -- in domains spanning healthcare, finance, and other forms of knowledge work -- require models to grapple with incomplete information and reason under uncertainty. Yet most LM evaluations focus on problems with well-defined answers and success criteria. This gap exists in part because natural problems involving uncertainty are difficult to construct: given that LMs have access to most of the same knowledge as humans, it is non-trivial to design questions for which LMs will struggle to produce correct answers, but which humans can answer reliably. As a result, LM performance on reasoning under uncertainty remains poorly characterized. To address this gap, we introduce OpenEstimate, an extensible, multi-domain benchmark for evaluating LMs on numerical estimation tasks that require models to synthesize significant amounts of background information and express predictions as probabilistic priors. We assess these priors for accuracy and calibration, quantifying their usefulness relative to samples from the true distribution of interest. Across six frontier LMs, we find that LM-elicited priors are often inaccurate and overconfident. Performance improves modestly depending on how uncertainty is elicited from the model, but is largely unaffected by changes in sampling strategy, reasoning effort, or prompt design. The OpenEstimate benchmark thus offers a challenging evaluation for frontier LMs and a platform for developing models that are better at probabilistic estimation and reasoning under uncertainty.
Related papers
- AdversaRiskQA: An Adversarial Factuality Benchmark for High-Risk Domains [3.721111684544962]
Hallucination in large language models (LLMs) contributes to spread of misinformation and diminished public trust.<n>We introduce AdversaRiskQA, the first verified and reliable benchmark systematically evaluating adversarial factuality.<n>We evaluate six open- and closed-source LLMs from the Qwen, GPT-OSS, and GPT families, measuring misinformation detection rates.
arXiv Detail & Related papers (2026-01-21T22:47:59Z) - Reasoning with Confidence: Efficient Verification of LLM Reasoning Steps via Uncertainty Heads [104.9566359759396]
We propose a lightweight alternative for step-level reasoning verification based on data-driven uncertainty scores.<n>Our findings suggest that the internal states of LLMs encode their uncertainty and can serve as reliable signals for reasoning verification.
arXiv Detail & Related papers (2025-11-09T03:38:29Z) - The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity [48.899855816199484]
We introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions.<n>We show that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity.
arXiv Detail & Related papers (2025-11-06T14:46:35Z) - The LLM Already Knows: Estimating LLM-Perceived Question Difficulty via Hidden Representations [33.65540900920885]
Estimating the difficulty of input questions as perceived by large language models (LLMs) is essential for accurate performance evaluation and adaptive inference.<n>We propose a novel approach for difficulty estimation that leverages only the hidden representations produced by the target LLM.
arXiv Detail & Related papers (2025-09-16T09:38:41Z) - Can LLMs Detect Their Confabulations? Estimating Reliability in Uncertainty-Aware Language Models [24.72990207218907]
Large Language Models (LLMs) are prone to generating fluent but incorrect content, known as confabulation.<n>We investigate how in-context information influences model behavior and whether LLMs can identify their unreliable responses.
arXiv Detail & Related papers (2025-08-11T16:12:36Z) - TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning [27.449948943467163]
We propose a Token-level Uncertainty estimation framework for Reasoning (TokUR)<n>TokUR enables Large Language Models to self-assess and self-improve their responses in mathematical reasoning.<n> Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness.
arXiv Detail & Related papers (2025-05-16T22:47:32Z) - UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models [41.67393607081513]
Large Language Models (LLMs) often struggle to accurately express the factual knowledge they possess.<n>We propose the UAlign framework, which leverages Uncertainty estimations to represent knowledge boundaries.<n>We show that the proposed UAlign can significantly enhance the LLMs' capacities to confidently answer known questions.
arXiv Detail & Related papers (2024-12-16T14:14:27Z) - Large Language Models Must Be Taught to Know What They Don't Know [97.90008709512921]
We show that fine-tuning on a small dataset of correct and incorrect answers can create an uncertainty estimate with good generalization and small computational overhead.<n>We also investigate the mechanisms that enable reliable uncertainty estimation, finding that many models can be used as general-purpose uncertainty estimators.
arXiv Detail & Related papers (2024-06-12T16:41:31Z) - Language Model Cascades: Token-level uncertainty and beyond [65.38515344964647]
Recent advances in language models (LMs) have led to significant improvements in quality on complex NLP tasks.
Cascading offers a simple strategy to achieve more favorable cost-quality tradeoffs.
We show that incorporating token-level uncertainty through learned post-hoc deferral rules can significantly outperform simple aggregation strategies.
arXiv Detail & Related papers (2024-04-15T21:02:48Z) - Relying on the Unreliable: The Impact of Language Models' Reluctance to Express Uncertainty [53.336235704123915]
We investigate how LMs incorporate confidence in responses via natural language and how downstream users behave in response to LM-articulated uncertainties.
We find that LMs are reluctant to express uncertainties when answering questions even when they produce incorrect responses.
We test the risks of LM overconfidence by conducting human experiments and show that users rely heavily on LM generations.
Lastly, we investigate the preference-annotated datasets used in post training alignment and find that humans are biased against texts with uncertainty.
arXiv Detail & Related papers (2024-01-12T18:03:30Z) - Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.