Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs
- URL: http://arxiv.org/abs/2509.10739v2
- Date: Fri, 26 Sep 2025 15:28:16 GMT
- Title: Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs
- Authors: Mobina Pournemat, Keivan Rezaei, Gaurang Sriramanan, Arman Zarei, Jiaxiang Fu, Yang Wang, Hamid Eghbalzadeh, Soheil Feizi
- Abstract summary: We present the first comprehensive study of the reasoning capabilities of large language models (LLMs) over explicit discrete probability distributions. We evaluate models on three carefully designed tasks: mode identification, maximum likelihood estimation, and sample generation. Through empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models.
- Score: 47.20307724127832
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Despite widespread success in language understanding and generation, large language models (LLMs) exhibit unclear and often inconsistent behavior when faced with tasks that require probabilistic reasoning. In this work, we present the first comprehensive study of the reasoning capabilities of LLMs over explicit discrete probability distributions. Given observations from a probability distribution, we evaluate models on three carefully designed tasks (mode identification, maximum likelihood estimation, and sample generation) by prompting them to provide responses to queries about either the joint distribution or its conditionals. These tasks thus probe a range of probabilistic skills, including frequency analysis, marginalization, and generative behavior. Through comprehensive empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models, with the latter demonstrating stronger inference and surprising capabilities in sample generation. Furthermore, our investigations reveal notable limitations, including sensitivity to variations in the notation utilized to represent probabilistic outcomes and performance degradation of over 60% as context length increases. Together, our results provide a detailed understanding of the probabilistic reasoning abilities of LLMs and identify key directions for future improvement.
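To make the task definitions concrete, here is a minimal sketch (not taken from the paper; the variables, outcomes, and counts are hypothetical) of the ground-truth answers the three tasks ask a model to recover from observations of an explicit discrete joint distribution, including a conditional query:

```python
from collections import Counter
import random

# Hypothetical observations from a discrete joint distribution over (X, Y);
# the outcomes and counts are illustrative, not drawn from the paper.
observations = [("a", 0), ("a", 1), ("b", 1), ("a", 1), ("b", 0), ("a", 1)]

counts = Counter(observations)
n = len(observations)

# Mode identification: the most frequent joint outcome.
mode = counts.most_common(1)[0][0]                    # ("a", 1)

# Maximum likelihood estimation: empirical probability of a queried outcome.
mle_joint = counts[("a", 1)] / n                      # 3/6 = 0.5

# Conditional query via marginalization: P(Y = 1 | X = "a").
y_given_a = [y for (x, y) in observations if x == "a"]
mle_conditional = sum(y == 1 for y in y_given_a) / len(y_given_a)  # 3/4 = 0.75

# Sample generation: draw new outcomes with frequencies matching the data.
samples = random.choices(list(counts), weights=list(counts.values()), k=5)
```

The paper evaluates whether models produce these quantities when the observations are presented in a prompt, which is also where the reported sensitivity to outcome notation and the degradation with longer contexts come into play.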
Related papers
- Humans and LLMs Diverge on Probabilistic Inferences [25.525228660836024]
We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25-30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions.
arXiv Detail & Related papers (2026-02-26T23:00:41Z)
- The Illusion of Certainty: Uncertainty quantification for LLMs fails under ambiguity [48.899855816199484]
We introduce MAQA* and AmbigQA*, the first ambiguous question-answering (QA) datasets equipped with ground-truth answer distributions. We show that predictive-distribution and ensemble-based estimators are fundamentally limited under ambiguity.
arXiv Detail & Related papers (2025-11-06T14:46:35Z)
- Extracting Probabilistic Knowledge from Large Language Models for Bayesian Network Parameterization [22.286144400569007]
We evaluate the potential of Large Language Models (LLMs) in building Bayesian Networks (BNs) by approximating domain expert priors. Our experiments on eighty publicly available Bayesian Networks, from healthcare to finance, demonstrate that querying LLMs about the conditional probabilities of events provides meaningful results.
arXiv Detail & Related papers (2025-05-21T18:15:05Z)
- TokUR: Token-Level Uncertainty Estimation for Large Language Model Reasoning [27.449948943467163]
We propose a Token-level Uncertainty estimation framework for Reasoning (TokUR). TokUR enables Large Language Models to self-assess and self-improve their responses in mathematical reasoning. Experiments on mathematical reasoning datasets of varying difficulty demonstrate that TokUR exhibits a strong correlation with answer correctness and model robustness.
arXiv Detail & Related papers (2025-05-16T22:47:32Z)
- Always Tell Me The Odds: Fine-grained Conditional Probability Estimation [37.950889606305836]
We present a state-of-the-art model for fine-grained probability estimation of propositions conditioned on context. We show that our approach consistently outperforms existing fine-tuned and prompting-based methods by a large margin.
arXiv Detail & Related papers (2025-05-02T21:33:18Z)
- Exploring the Potential for Large Language Models to Demonstrate Rational Probabilistic Beliefs [12.489784979345654]
We show that current versions of large language models (LLMs) lack the ability to provide rational and coherent representations of probabilistic beliefs. We apply well-established techniques for uncertainty quantification to measure the ability of LLMs to adhere to fundamental properties of probabilistic reasoning.
arXiv Detail & Related papers (2025-04-18T11:50:30Z)
- Do Large Language Models Exhibit Cognitive Dissonance? Studying the Difference Between Revealed Beliefs and Stated Answers [13.644277507363036]
We introduce Revealed Belief, an evaluation framework that assesses Large Language Models (LLMs) on tasks requiring reasoning under uncertainty. Our findings suggest that while LLMs frequently state the correct answer, their Revealed Belief shows that they often allocate probability mass inconsistently, exhibit systematic biases, and fail to update their beliefs appropriately when presented with new evidence.
arXiv Detail & Related papers (2024-06-21T08:56:35Z)
- Chain-of-Thought Prompting for Demographic Inference with Large Multimodal Models [58.58594658683919]
Large multimodal models (LMMs) have shown transformative potential across various research tasks.
Our findings indicate LMMs possess advantages in zero-shot learning, interpretability, and handling uncurated 'in-the-wild' inputs.
We propose a Chain-of-Thought augmented prompting approach, which effectively mitigates the off-target prediction issue.
arXiv Detail & Related papers (2024-05-24T16:26:56Z)
- Do LLMs Play Dice? Exploring Probability Distribution Sampling in Large Language Models for Behavioral Simulation [73.58618024960968]
An increasing number of studies are employing large language models (LLMs) as agents to emulate the sequential decision-making processes of humans. This raises the question of how well LLM agents can comprehend probability distributions. Our analysis indicates that LLM agents can understand probabilities, but they struggle with probability sampling.
arXiv Detail & Related papers (2024-04-13T16:59:28Z)
- Reasoning over Uncertain Text by Generative Large Language Models [18.983753573277596]
This paper considers the challenges Large Language Models (LLMs) face when reasoning over text that includes information involving uncertainty explicitly quantified via probability values. We introduce the Bayesian Linguistic Inference dataset (BLInD), a new dataset designed to test the probabilistic reasoning capabilities of LLMs. We present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming (a small illustrative sketch of the code-mapping idea appears after this list).
arXiv Detail & Related papers (2024-02-14T23:05:44Z)
- Decomposing Uncertainty for Large Language Models through Input Clarification Ensembling [69.83976050879318]
In large language models (LLMs), identifying sources of uncertainty is an important step toward improving reliability, trustworthiness, and interpretability.
In this paper, we introduce an uncertainty decomposition framework for LLMs, called input clarification ensembling.
Our approach generates a set of clarifications for the input, feeds them into an LLM, and ensembles the corresponding predictions.
arXiv Detail & Related papers (2023-11-15T05:58:35Z)
- Counterfactual Maximum Likelihood Estimation for Training Deep Networks [83.44219640437657]
Deep learning models are prone to learning spurious correlations that should not be learned as predictive clues.
We propose a causality-based training framework to reduce the spurious correlations caused by observable confounders.
We conduct experiments on two real-world tasks: Natural Language Inference (NLI) and Image Captioning.
arXiv Detail & Related papers (2021-06-07T17:47:16Z)
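The "map the problem to Python code" strategy mentioned in the BLInD entry above can be illustrated with a small hypothetical example: a textual problem that states probabilities explicitly is translated into code that computes the queried quantity. The scenario and numbers below are invented for illustration and are not drawn from the BLInD dataset.

```python
# Hypothetical text: "A condition occurs with probability 0.01. A test is
# positive with probability 0.9 if the condition is present and 0.05 otherwise.
# Given a positive test, how likely is the condition?"

p_condition = 0.01
p_pos_given_condition = 0.9
p_pos_given_no_condition = 0.05

# Bayes' rule written out as ordinary arithmetic, the kind of program an LLM
# could be prompted to emit instead of answering in free text.
p_pos = (p_pos_given_condition * p_condition
         + p_pos_given_no_condition * (1 - p_condition))
p_condition_given_pos = p_pos_given_condition * p_condition / p_pos

print(round(p_condition_given_pos, 4))  # ~0.1538
```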
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.