Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs
- URL: http://arxiv.org/abs/2602.16085v1
- Date: Tue, 17 Feb 2026 23:20:08 GMT
- Title: Language Statistics and False Belief Reasoning: Evidence from 41 Open-Weight LMs
- Authors: Sean Trott, Samuel Taylor, Cameron Jones, James A. Michaelov, Pamela D. Rivière,
- Abstract summary: We assess LM mental state reasoning behavior across 41 open-weight models. We find sensitivity to implied knowledge states in 34% of the LMs tested. Larger LMs show increased sensitivity and also exhibit higher psychometric predictive power.
- Score: 6.600578957536851
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Research on mental state reasoning in language models (LMs) has the potential to inform theories of human social cognition--such as the theory that mental state reasoning emerges in part from language exposure--and our understanding of LMs themselves. Yet much published work on LMs relies on a relatively small sample of closed-source LMs, limiting our ability to rigorously test psychological theories and evaluate LM capacities. Here, we replicate and extend published work on the false belief task by assessing LM mental state reasoning behavior across 41 open-weight models (from distinct model families). We find sensitivity to implied knowledge states in 34% of the LMs tested; however, consistent with prior work, none fully ``explain away'' the effect in humans. Larger LMs show increased sensitivity and also exhibit higher psychometric predictive power. Finally, we use LM behavior to generate and test a novel hypothesis about human cognition: both humans and LMs show a stronger bias towards attributing false beliefs when knowledge states are cued using a non-factive verb (``John thinks...'') than when they are cued indirectly (``John looks in the...''). Unlike the primary effect of knowledge states, where human sensitivity exceeds that of LMs, the magnitude of the human knowledge cue effect falls squarely within the distribution of LM effect sizes--suggesting that distributional statistics of language can in principle account for the latter but not the former in humans. These results demonstrate the value of using larger samples of open-weight LMs to test theories of human cognition and evaluate LM capacities.
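For concreteness, here is a minimal sketch of the kind of probability comparison a false belief evaluation of this sort can rest on, assuming a HuggingFace causal LM. The passage pair, target words, and scoring details are illustrative stand-ins, not the paper's actual stimuli or analysis.

```python
# Minimal sketch of a false-belief probe: score a target completion under
# two knowledge-state contexts and compare log-probabilities.
# Assumes the HuggingFace `transformers` library; stimuli are illustrative.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # stand-in for any of the open-weight LMs
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def completion_logprob(context: str, completion: str) -> float:
    """Sum of log-probabilities the LM assigns to `completion` given `context`."""
    ctx_ids = tokenizer(context, return_tensors="pt").input_ids
    full_ids = tokenizer(context + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    total = 0.0
    # Score only the completion tokens; token i is predicted from position i-1.
    for i in range(ctx_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, i - 1, full_ids[0, i]].item()
    return total

# Knowledge condition: John saw the move, so a true belief is implied.
knowledge = ("John put the ball in the basket. While John watched, "
             "Mark moved the ball to the box. John looks for the ball in the")
# Ignorance condition: John did not see the move, so a false belief is implied.
ignorance = ("John put the ball in the basket. While John was away, "
             "Mark moved the ball to the box. John looks for the ball in the")

for label, ctx in [("knowledge", knowledge), ("ignorance", ignorance)]:
    lp_basket = completion_logprob(ctx, " basket")  # belief-consistent location
    lp_box = completion_logprob(ctx, " box")        # reality-consistent location
    print(f"{label}: log P(basket) - log P(box) = {lp_basket - lp_box:.2f}")
# Sensitivity to implied knowledge states shows up as a stronger preference
# for "basket" in the ignorance condition than in the knowledge condition.
```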
Related papers
- Language Agents Mirror Human Causal Reasoning Biases. How Can We Help Them Think Like Scientists? [42.29911505696807]
Language model (LM) agents are increasingly used as autonomous decision-makers. We examine LMs' ability to explore and infer causal relationships. We find that LMs reliably infer the common, intuitive disjunctive causal relationships but systematically struggle with the unusual, yet equally evidenced, conjunctive ones.
arXiv Detail & Related papers (2025-05-14T17:59:35Z)
- Belief in the Machine: Investigating Epistemological Blind Spots of Language Models [51.63547465454027]
Language models (LMs) are essential for reliable decision-making in fields like healthcare, law, and journalism.
This study systematically evaluates the capabilities of modern LMs, including GPT-4, Claude-3, and Llama-3, using a new dataset, KaBLE.
Our results reveal key limitations. First, while LMs achieve 86% accuracy on factual scenarios, their performance drops significantly with false scenarios.
Second, LMs struggle with recognizing and affirming personal beliefs, especially when those beliefs contradict factual data.
arXiv Detail & Related papers (2024-10-28T16:38:20Z)
- Generative Models, Humans, Predictive Models: Who Is Worse at High-Stakes Decision Making? [10.225573060836478]
Large generative models (LMs) are already being used for decision-making tasks that were previously done by predictive models or humans. We put popular LMs to the test in a high-stakes decision-making task: recidivism prediction.
arXiv Detail & Related papers (2024-10-20T19:00:59Z)
- FactCheckmate: Preemptively Detecting and Mitigating Hallucinations in LMs [21.767886997853022]
We introduce FactCheckmate, which preemptively detects hallucinations by learning a classifier over the LM's hidden states. If a hallucination is detected, FactCheckmate then intervenes by adjusting the LM's hidden states. Our results demonstrate the effectiveness of FactCheckmate, achieving over 70% preemptive detection accuracy.
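The summary above describes a detect-then-intervene pattern: a learned classifier flags likely hallucinations from hidden states, and those hidden states are adjusted before decoding continues. Below is a minimal sketch of that pattern using a HuggingFace GPT-2 model with a forward hook; the placeholder classifier, steering vector, and layer choice are illustrative assumptions, not FactCheckmate's actual components.

```python
# Sketch of a detect-then-intervene hook on one transformer block.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

layer = model.transformer.h[6]             # an arbitrary middle block
steer = torch.zeros(model.config.n_embd)   # placeholder correction direction

def looks_like_hallucination(hidden: torch.Tensor) -> bool:
    # Placeholder for a learned classifier over hidden states.
    return hidden.norm() > 100.0

def hook(module, inputs, output):
    hidden = output[0]  # GPT-2 blocks return a tuple; [0] is the hidden states
    if looks_like_hallucination(hidden[0, -1]):
        hidden = hidden + steer  # adjust hidden states before they propagate
    return (hidden,) + output[1:]

handle = layer.register_forward_hook(hook)
ids = tokenizer("The capital of France is", return_tensors="pt").input_ids
out = model.generate(ids, max_new_tokens=5, do_sample=False)
print(tokenizer.decode(out[0]))
handle.remove()
```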
arXiv Detail & Related papers (2024-10-03T18:45:00Z)
- LLM Internal States Reveal Hallucination Risk Faced With a Query [62.29558761326031]
Humans have a self-awareness process that allows us to recognize what we don't know when faced with queries.
This paper investigates whether Large Language Models can estimate their own hallucination risk before response generation.
Using a probing estimator, we leverage LLM self-assessment, achieving an average hallucination estimation accuracy of 84.32% at run time.
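A probing estimator of this kind is typically a lightweight classifier trained on internal activations. Here is a minimal sketch assuming scikit-learn and synthetic stand-in features and labels; a real pipeline would extract hidden states from the LLM for each query and label queries by whether the generated answer turned out to be hallucinated.

```python
# Sketch of a linear probe that estimates hallucination risk from a query's
# hidden state, before any response is generated. Data here is synthetic.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in data: one hidden-state vector per query (e.g., the last-token
# activation from some layer), with a binary hallucination label.
n_queries, hidden_dim = 2000, 768
X = rng.normal(size=(n_queries, hidden_dim))
y = rng.integers(0, 2, size=n_queries)  # 1 = eventual answer was hallucinated

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)
probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# At run time, the probe scores a new query's hidden state before generation.
risk = probe.predict_proba(X_test[:1])[0, 1]
print(f"estimated hallucination risk: {risk:.2f}")
print(f"held-out probe accuracy: {probe.score(X_test, y_test):.2f}")
```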
arXiv Detail & Related papers (2024-07-03T17:08:52Z)
- WellDunn: On the Robustness and Explainability of Language Models and Large Language Models in Identifying Wellness Dimensions [46.60244609728416]
Language Models (LMs) are being proposed for mental health applications where the heightened risk of adverse outcomes means predictive performance may not be a litmus test of a model's utility in clinical practice.
We introduce an evaluation design that focuses on the robustness and explainability of LMs in identifying Wellness Dimensions (WDs).
We reveal four surprising results about LMs/LLMs.
arXiv Detail & Related papers (2024-06-17T19:50:40Z)
- Divergences between Language Models and Human Brains [59.100552839650774]
We systematically explore the divergences between human and machine language processing. We identify two domains that LMs do not capture well: social/emotional intelligence and physical commonsense. Our results show that fine-tuning LMs on these domains can improve their alignment with human brain responses.
arXiv Detail & Related papers (2023-11-15T19:02:40Z)
- Language Models Hallucinate, but May Excel at Fact Verification [89.0833981569957]
Large language models (LLMs) frequently "hallucinate," resulting in non-factual outputs.
Even GPT-3.5 produces factual outputs less than 25% of the time.
This underscores the importance of fact verifiers in order to measure and incentivize progress.
arXiv Detail & Related papers (2023-10-23T04:39:01Z)