Humans and LLMs Diverge on Probabilistic Inferences
- URL: http://arxiv.org/abs/2602.23546v1
- Date: Thu, 26 Feb 2026 23:00:41 GMT
- Title: Humans and LLMs Diverge on Probabilistic Inferences
- Authors: Gaurav Kamath, Sreenath Madathil, Sebastian Schuster, Marie-Catherine de Marneffe, Siva Reddy
- Abstract summary: We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25--30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions.
- Score: 25.525228660836024
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Human reasoning often involves working over limited information to arrive at probabilistic conclusions. In its simplest form, this involves making an inference that is not strictly entailed by a premise, but rather only likely given the premise. While reasoning LLMs have demonstrated strong performance on logical and mathematical tasks, their behavior on such open-ended, non-deterministic inferences remains largely unexplored. We introduce ProbCOPA, a dataset of 210 handcrafted probabilistic inferences in English, each annotated for inference likelihood by 25--30 human participants. We find that human responses are graded and varied, revealing probabilistic judgments of the inferences in our dataset. Comparing these judgments with responses from eight state-of-the-art reasoning LLMs, we show that models consistently fail to produce human-like distributions. Finally, analyzing LLM reasoning chains, we find evidence of a common reasoning pattern used to evaluate such inferences. Our findings reveal persistent differences between humans and LLMs, and underscore the need to evaluate reasoning beyond deterministic settings.
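The abstract does not say how human and model response distributions are compared. As a minimal sketch of one natural approach, the snippet below computes the Jensen-Shannon divergence between a human annotation distribution and an LLM response distribution over a discrete likelihood scale; the 5-point scale and all counts are hypothetical, not taken from ProbCOPA.

```python
import numpy as np

def js_divergence(p, q, eps=1e-12):
    """Jensen-Shannon divergence (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log2(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical item: 27 annotators rate an inference on a 5-point likelihood
# scale; the "LLM" distribution comes from 27 sampled model responses.
human_counts = [1, 3, 8, 10, 5]   # graded, varied human judgments
llm_counts   = [0, 0, 2, 3, 22]   # model mass concentrated on one option
print(f"JSD = {js_divergence(human_counts, llm_counts):.3f}")
```

A high value on an item like this one flags exactly the failure mode the paper describes: graded, varied human judgments against model probability mass piled onto a single option.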
Related papers
- CausalFlip: A Benchmark for LLM Causal Judgment Beyond Semantic Matching [50.65932158912512]
We propose a new causal reasoning benchmark, CausalFlip, to encourage the development of new large language models. CausalFlip consists of causal judgment questions built over event triples that could form different confounder, chain, and collider relations. We evaluate LLMs under multiple training paradigms, including answer-only training, explicit Chain-of-Thought supervision, and a proposed internalized causal reasoning approach.
arXiv Detail & Related papers (2026-02-23T18:06:15Z)
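The CausalFlip summary above mentions items built over event triples that can form confounder, chain, and collider relations. As a small illustration of what those three structures look like for a single triple (the events and encoding here are invented for illustration, not drawn from the benchmark):

```python
# Hypothetical event triple; CausalFlip's actual items are not shown in the abstract.
A, B, C = "icy roads", "traffic accidents", "ambulance dispatches"

# The same triple can instantiate three distinct causal structures:
structures = {
    "chain":      [(A, B), (B, C)],  # A -> B -> C
    "confounder": [(B, A), (B, C)],  # A <- B -> C: B is a common cause
    "collider":   [(A, B), (C, B)],  # A -> B <- C: B is a common effect
}

for name, edges in structures.items():
    print(f"{name:10s}: " + ", ".join(f"{x} -> {y}" for x, y in edges))
```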
- Towards Generalizable Reasoning: Group Causal Counterfactual Policy Optimization for LLM Reasoning [50.352417879912515]
Large language models (LLMs) excel at complex tasks with advances in reasoning capabilities. We propose Group Causal Counterfactual Policy Optimization to explicitly train LLMs to learn generalizable reasoning patterns. We construct token-level advantages from the resulting reward and optimize the policy, encouraging LLMs to favor reasoning patterns that are process-valid and counterfactually robust.
arXiv Detail & Related papers (2026-02-06T08:03:11Z)
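The summary above mentions token-level advantages built from a reward, but the counterfactual reward itself is not described in the snippet. The sketch below therefore shows only the generic group-relative, token-level advantage construction used by GRPO-style methods, with hypothetical rewards; it is not the paper's algorithm.

```python
import numpy as np

# Hypothetical rewards for four reasoning chains sampled for one prompt.
group_rewards = np.array([1.0, 0.2, 0.6, 0.0])
token_counts = [57, 43, 61, 38]  # tokens in each sampled chain

# Group-relative advantage: normalize each chain's reward within its group.
advantages = (group_rewards - group_rewards.mean()) / (group_rewards.std() + 1e-8)

# Broadcast each chain's advantage to all of its tokens; a policy-gradient
# step then weights each token's log-probability by this value.
token_advantages = [np.full(n, a) for n, a in zip(token_counts, advantages)]
for i, adv in enumerate(token_advantages):
    print(f"chain {i}: reward={group_rewards[i]:.1f}, token advantage={adv[0]:+.3f}")
```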
- Reasoning Under Uncertainty: Exploring Probabilistic Reasoning Capabilities of LLMs [47.20307724127832]
We present the first comprehensive study of the probabilistic reasoning capabilities of large language models (LLMs). We evaluate models on three carefully designed tasks: mode identification, maximum likelihood estimation, and sample generation. Through empirical evaluations, we demonstrate that there exists a clear performance gap between smaller and larger models.
arXiv Detail & Related papers (2025-09-12T22:58:05Z)
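All three tasks in the study above have exact ground-truth answers that a model's free-text reply can be scored against. A minimal sketch for two of them, mode identification and maximum likelihood estimation over a categorical distribution (samples hypothetical):

```python
from collections import Counter

# Hypothetical samples shown to the model in the prompt.
samples = ["red", "blue", "red", "green", "red", "blue", "red", "green", "red"]
counts = Counter(samples)

# Mode identification: the most frequent outcome.
mode = counts.most_common(1)[0][0]

# MLE for a categorical distribution: each outcome's relative frequency.
mle = {outcome: c / len(samples) for outcome, c in counts.items()}

print(f"mode: {mode}")  # red
print({k: round(v, 3) for k, v in mle.items()})  # {'red': 0.556, ...}
```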
- Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models [76.6028674686018]
We introduce thought-tracing, an inference-time reasoning algorithm to trace the mental states of agents. Our algorithm is modeled after the Bayesian theory-of-mind framework. We evaluate thought-tracing on diverse theory-of-mind benchmarks, demonstrating significant performance improvements.
arXiv Detail & Related papers (2025-02-17T15:08:50Z)
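The thought-tracing summary says the algorithm is modeled after the Bayesian theory-of-mind framework, without giving its mechanics. The toy update below illustrates only the underlying idea, reweighting hypotheses about an agent's mental state by how well they explain an observed action; the hypotheses and likelihoods are invented.

```python
# Hypotheses about what an observed agent wants, with a uniform prior.
prior = {"wants coffee": 1 / 3, "wants tea": 1 / 3, "wants water": 1 / 3}

# Hypothetical likelihoods P(action | hypothesis) for the observed action
# "agent walks toward the espresso machine".
likelihood = {"wants coffee": 0.80, "wants tea": 0.15, "wants water": 0.05}

# Bayes update: posterior is proportional to prior * likelihood.
unnormalized = {h: p * likelihood[h] for h, p in prior.items()}
z = sum(unnormalized.values())
posterior = {h: v / z for h, v in unnormalized.items()}

for h, p in sorted(posterior.items(), key=lambda kv: -kv[1]):
    print(f"P({h} | action) = {p:.2f}")
```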
- Logical forms complement probability in understanding language model (and human) performance [14.694876851134273]
This work conducts a systematic investigation of large language models' ability to perform logical reasoning in natural language. We introduce a controlled dataset of hypothetical and disjunctive syllogisms in propositional and modal logic. We show similarities and discrepancies between the logical reasoning performances of humans and LLMs by collecting and comparing behavioral data from both.
arXiv Detail & Related papers (2025-02-13T18:46:44Z)
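Syllogisms like those in the dataset above can be verified mechanically, which is what makes them useful as controlled stimuli. A sketch (our code, not the paper's) that confirms by truth-table enumeration that the disjunctive syllogism "A or B; not A; therefore B" is valid:

```python
from itertools import product

# Premises and conclusion of a disjunctive syllogism: A or B, not A, therefore B.
premises = [lambda a, b: a or b, lambda a, b: not a]
conclusion = lambda a, b: b

# Valid iff the conclusion holds in every model satisfying all premises.
valid = all(
    conclusion(a, b)
    for a, b in product([True, False], repeat=2)
    if all(p(a, b) for p in premises)
)
print("valid" if valid else "invalid")  # valid
```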
- P-FOLIO: Evaluating and Improving Logical Reasoning with Abundant Human-Written Reasoning Chains [97.25943550933829]
We present P-FOLIO, a human-annotated dataset consisting of diverse and complex reasoning chains.
We use P-FOLIO to evaluate and improve large-language-model (LLM) reasoning capabilities.
arXiv Detail & Related papers (2024-10-11T19:22:57Z)
- Evaluating the Correctness of Inference Patterns Used by LLMs for Judgment [53.17596274334017]
We evaluate the correctness of the detailed inference patterns of an LLM behind its seemingly correct outputs. Experiments show that even when the language generation results appear correct, a significant portion of the inference patterns used by the LLM for legal judgment may represent misleading or irrelevant logic.
arXiv Detail & Related papers (2024-10-06T08:33:39Z)
- Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models [6.922021128239465]
Recent advances in AI have been driven by the capabilities of large language models (LLMs).
This paper introduces a framework that is both theoretical and practical, aimed at assessing how effectively LLMs are able to replicate real-world reasoning mechanisms.
arXiv Detail & Related papers (2024-08-15T15:19:11Z)
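"Probabilities of causation" are Pearl-style quantities such as the probability of necessity and sufficiency (PNS). The abstract snippet above does not give the paper's evaluation details, but the standard Tian-Pearl bounds on PNS from experimental quantities make the notion concrete; the numbers below are hypothetical.

```python
# Hypothetical experimental quantities.
p_y_do_x = 0.7     # P(Y=1 | do(X=1))
p_y_do_notx = 0.2  # P(Y=1 | do(X=0))

# Tian-Pearl bounds on the probability of necessity and sufficiency (PNS).
pns_lower = max(0.0, p_y_do_x - p_y_do_notx)
pns_upper = min(p_y_do_x, 1.0 - p_y_do_notx)
print(f"{pns_lower:.2f} <= PNS <= {pns_upper:.2f}")  # 0.50 <= PNS <= 0.70
```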
- Reasoning over Uncertain Text by Generative Large Language Models [18.983753573277596]
This paper considers the challenges Large Language Models (LLMs) face when reasoning over text that includes information involving uncertainty explicitly quantified via probability values. We introduce the Bayesian Linguistic Inference dataset (BLInD), a new dataset designed to test the probabilistic reasoning capabilities of LLMs. We present several prompting strategies that map the problem to different formal representations, including Python code, probabilistic algorithms, and probabilistic logical programming.
arXiv Detail & Related papers (2024-02-14T23:05:44Z)
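The "map the problem to Python code" strategy mentioned for BLInD can be illustrated on a toy problem of the kind such datasets target (problem and numbers invented): given P(rain) = 0.3, P(traffic | rain) = 0.8, and P(traffic | no rain) = 0.2, an LLM following this strategy would emit something like:

```python
# Quantities stated in the (hypothetical) problem text.
p_rain = 0.3
p_traffic_given_rain = 0.8
p_traffic_given_no_rain = 0.2

# Law of total probability, then Bayes' rule.
p_traffic = p_traffic_given_rain * p_rain + p_traffic_given_no_rain * (1 - p_rain)
p_rain_given_traffic = p_traffic_given_rain * p_rain / p_traffic
print(f"P(rain | traffic) = {p_rain_given_traffic:.3f}")  # 0.632
```

Executing the generated code offloads the arithmetic, which is the point of the formal-representation strategies the abstract lists.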
- Incoherent Probability Judgments in Large Language Models [4.307483901449801]
We assess the coherence of probability judgments made by autoregressive Large Language Models (LLMs). Our results show that the judgments produced by these models are often incoherent, displaying human-like systematic deviations from the rules of probability theory.
arXiv Detail & Related papers (2024-01-30T00:40:49Z)
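Two of the probability-theory rules such judgments can violate are the complement rule, P(A) + P(not A) = 1, and the conjunction rule, P(A and B) <= min(P(A), P(B)). A small checker over hypothetical elicited judgments:

```python
# Hypothetical probability judgments elicited from a model for related queries.
judgments = {"A": 0.7, "not A": 0.45, "B": 0.5, "A and B": 0.6}

# Complement rule: P(A) + P(not A) should sum to 1.
complement_gap = judgments["A"] + judgments["not A"] - 1.0

# Conjunction rule: P(A and B) must not exceed min(P(A), P(B)).
conjunction_excess = judgments["A and B"] - min(judgments["A"], judgments["B"])

print(f"complement gap: {complement_gap:+.2f}")         # +0.15 -> incoherent
print(f"conjunction excess: {conjunction_excess:+.2f}") # +0.10 -> conjunction fallacy
```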