Fugu-MT Paper Translation (Summary): Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge

Paper summary: Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge

arxiv url: http://arxiv.org/abs/2511.03070v1
Date: Tue, 04 Nov 2025 23:34:52 GMT
Status: Translation complete
Last updated in system: 2025-11-06 18:19:32.273351
Title: Epidemiology of Large Language Models: A Benchmark for Observational Distribution Knowledge
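Title (reference translation): Epidemiology of Large Language Models: A Benchmark of Observational Distribution Knowledge
Authors: Drago Plecko, Patrik Okanovic, Torsten Hoefler, Elias Bareinboim
Abstract summary: Our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Our results suggest that LLMs perform poorly overall and do not naturally internalize real-world statistics.
Reference score (independently computed attention score): 69.50062870487349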
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Artificial intelligence (AI) systems hold great promise for advancing various scientific disciplines, and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are expected in order to achieve more general types of intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, reflecting probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?"). In this paper, our goal is to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Given that LLMs are trained on vast amounts of text, it may be plausible that they internalize aspects of these distributions. Indeed, LLMs are touted as powerful universal approximators of real-world distributions. At the same time, classical results in statistics, known as curse of dimensionality, highlight fundamental challenges in learning distributions in high dimensions, challenging the notion of universal distributional learning. In this work, we develop the first benchmark to directly test this hypothesis, evaluating whether LLMs have access to empirical distributions describing real-world populations across domains such as economics, health, education, and social behavior. Our results demonstrate that LLMs perform poorly overall, and do not seem to internalize real-world statistics naturally. When interpreted in the context of Pearl's Causal Hierarchy (PCH), our benchmark demonstrates that language models do not contain knowledge on observational distributions (Layer 1 of PCH), and thus the Causal Hierarchy Theorem implies that interventional (Layer 2) and counterfactual (Layer 3) knowledge of these models is also limited.
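Abstract (reference translation): Artificial intelligence (AI) systems hold great potential for advancing various scientific fields and are increasingly used in real-world applications. Despite their remarkable progress, further capabilities are expected in order to achieve more general intelligence. A critical distinction in this context is between factual knowledge, which can be evaluated against true or false answers (e.g., "what is the capital of England?"), and probabilistic knowledge, which reflects probabilistic properties of the real world (e.g., "what is the sex of a computer science graduate in the US?"). In this paper, we aim to build a benchmark for understanding the capabilities of LLMs in terms of knowledge of probability distributions describing the real world. Since LLMs are trained on large amounts of text, it is plausible that they internalize aspects of these distributions. Indeed, LLMs are regarded as powerful universal approximators of real-world distributions. At the same time, classical results in statistics, known as the curse of dimensionality, highlight fundamental challenges in learning distributions in high dimensions and challenge the notion of universal distributional learning. In this work, we develop the first benchmark to directly test this hypothesis, evaluating whether LLMs have access to empirical distributions describing real-world populations across domains such as economics, health, education, and social behavior. Our results suggest that LLMs perform poorly overall and do not naturally internalize real-world statistics. When interpreted in the context of Pearl's Causal Hierarchy (PCH), our benchmark shows that language models do not contain knowledge of observational distributions (Layer 1 of PCH), and thus the Causal Hierarchy Theorem implies that the interventional (Layer 2) and counterfactual (Layer 3) knowledge of these models is also limited.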

Related papers