Related papers: Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge

URL: http://arxiv.org/abs/2511.03070v1
Date: Tue, 04 Nov 2025 23:34:52 GMT
Title: Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge
Authors: Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim
Abstract summary: Our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Our results demonstrate that LLMs perform poorly overall, and do not seem to internalize real-world statistics naturally.
Score: 69.50062870487349
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines, and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are expected in order to achieve more general types of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, reflecting probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?"). In this paper, our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Given that LLMs are trained on vast amounts of text, it may be plausible that they internalize aspects of these distributions. Indeed, LLMs are touted as powerful universal approximators of real-world distributions. At the same time, classical results in statistics, known as the curse of dimensionality, highlight fundamental challenges in learning distributions in high dimensions, challenging the notion of universal distributional learning. In this work, we develop the first benchmark to directly test this hypothesis, evaluating whether LLMs have access to empirical distributions describing real-world populations across domains such as economics, health, education, and social behavior. Our results demonstrate that LLMs perform poorly overall, and do not seem to internalize real-world statistics naturally. When interpreted in the context of Pearl's Causal Hierarchy (PCH), our benchmark demonstrates that language models do not contain knowledge of observational distributions (Layer 1 of PCH), and thus the Causal Hierarchy Theorem implies that interventional (Layer 2) and counterfactual (Layer 3) knowledge of these models is also limited.

Related papers

Remembering Unequally: Global and Disciplinary Bias in LLM-Generated Co-Authorship Networks [3.179831861897336]
This study examines the impact of Large Language Models (LLMs) on co-authorship networks. We assess effects across three prominent models, DeepSeek R1, Llama 4 Scout, and Mixtral 8x7B. While our global analysis reveals a consistent bias favoring highly cited researchers, this pattern is not uniformly observed. Certain disciplines, such as Clinical Medicine, and regions, including parts of Africa, show more balanced representation.
arXiv Detail & Related papers (2025-11-01T10:05:43Z)
WorldLLM: Improving LLMs' world modeling using curiosity-driven theory-making [17.8062839646513]
Large Language Models (LLMs) possess general world knowledge but often struggle to generate precise predictions in structured, domain-specific contexts such as simulations. We present WorldLLM, a framework that enhances LLM-based world modeling by combining Bayesian inference and autonomous active exploration with reinforcement learning.
arXiv Detail & Related papers (2025-06-07T09:13:34Z)
Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [106.17986469245302]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking. Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability. We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z)
Bayesian Teaching Enables Probabilistic Reasoning in Large Language Models [54.38054999271322]
We show that large language models (LLMs) don't update their beliefs as expected from the Bayesian framework. We teach the LLMs to reason in a Bayesian manner by training them to mimic the predictions of the normative Bayesian model. More generally, our results indicate that LLMs can effectively learn reasoning skills from examples and generalize those skills to new domains.
arXiv Detail & Related papers (2025-03-21T20:13:04Z)
Computation Mechanism Behind LLM Position Generalization [59.013857707250814]
Large language models (LLMs) exhibit flexibility in handling textual positions. They can understand texts with position perturbations and generalize to longer texts. This work connects the linguistic phenomenon with LLMs' computational mechanisms.
arXiv Detail & Related papers (2025-03-17T15:47:37Z)
An Overview of Large Language Models for Statisticians [109.38601458831545]
Large Language Models (LLMs) have emerged as transformative tools in artificial intelligence (AI). This paper explores potential areas where statisticians can make important contributions to the development of LLMs. We focus on issues such as uncertainty quantification, interpretability, fairness, privacy, watermarking and model adaptation.
arXiv Detail & Related papers (2025-02-25T03:40:36Z)
Language Agents Meet Causality -- Bridging LLMs and Causal World Models [50.79984529172807]
We propose a framework that integrates causal representation learning with large language models. This framework learns a causal world model, with causal variables linked to natural language expressions. We evaluate the framework on causal inference and planning tasks across temporal scales and environmental complexities.
arXiv Detail & Related papers (2024-10-25T18:36:37Z)
Unveiling LLMs: The Evolution of Latent Representations in a Dynamic Knowledge Graph [15.129079475322637]
This work unveils the factual information a Large Language Model represents internally for sentence-level claim verification. We propose an end-to-end framework to decode factual knowledge embedded in token representations from a vector space to a set of ground predicates. Our framework employs activation patching, a vector-level technique that alters a token representation during inference, to extract encoded knowledge.
arXiv Detail & Related papers (2024-04-04T17:45:59Z)
Should We Fear Large Language Models? A Structural Analysis of the Human Reasoning System for Elucidating LLM Capabilities and Risks Through the Lens of Heidegger's Philosophy [0.0]
This study investigates the capabilities and risks of Large Language Models (LLMs). It uses the innovative parallels between the statistical patterns of word relationships within LLMs and Martin Heidegger's concepts of "ready-to-hand" and "present-at-hand". Our findings reveal that while LLMs possess the capability for Direct Explicative Reasoning and Pseudo Rational Reasoning, they fall short in authentic rational reasoning and have no creative reasoning capabilities.
arXiv Detail & Related papers (2024-03-05T19:40:53Z)
Brain in a Vat: On Missing Pieces Towards Artificial General Intelligence in Large Language Models [83.63242931107638]
We propose four characteristics of generally intelligent agents. We argue that active engagement with objects in the real world delivers more robust signals for forming conceptual representations. We conclude by outlining promising future research directions in the field of artificial general intelligence.
arXiv Detail & Related papers (2023-07-07T13:58:16Z)
Event knowledge in large language models: the gap between the impossible and the unlikely [46.540380831486125]
We show that pre-trained large language models (LLMs) possess substantial event knowledge. They almost always assign higher likelihood to possible vs. impossible events. However, they show less consistent preferences for likely vs. unlikely events.
arXiv Detail & Related papers (2022-12-02T23:43:18Z)
