Related papers: Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models

Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models

URL: http://arxiv.org/abs/2510.13395v1
Date: Wed, 15 Oct 2025 10:48:31 GMT
Title: Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models
Authors: Agnese Lombardi, Alessandro Lenci,
Abstract summary: This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments.<n>We assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context.
Score: 48.815314312823006
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Language is fundamental to human cooperation, facilitating not only the exchange of information but also the coordination of actions through shared interpretations of situational contexts. This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments. Specifically, we assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context, rather than relying on linguistic memorization. Our findings reveal a critical limitation: GPT-4 frequently fails to select actions based on belief attribution, suggesting that apparent ToM-like abilities observed in previous studies may stem from shallow statistical associations rather than true reasoning. Additionally, the model struggles to generate coherent causal effects from agent actions, exposing difficulties in processing complex social interactions. These results challenge current statements about emergent ToM-like capabilities in LLMs and highlight the need for more rigorous, action-based evaluation frameworks.

Related papers

Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind [8.740788873949471]
Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks.<n>They still struggle to comprehend and respond to the true user needs when intentions and instructions are imprecisely conveyed.
arXiv Detail & Related papers (2026-02-14T16:01:59Z)
On Emergent Social World Models -- Evidence for Functional Integration of Theory of Mind and Pragmatic Reasoning in Language Models [4.5373666852176715]
This paper investigates whether LMs recruit shared computational mechanisms for general Theory of Mind (ToM) and language-specific pragmatic reasoning.<n>We analyze LMs' performance across seven subcategories of ToM abilities on a substantially larger localizer dataset.<n>Results from stringent hypothesis-driven statistical testing offer suggestive evidence for the functional integration hypothesis.
arXiv Detail & Related papers (2026-02-10T21:12:12Z)
Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents [52.14392337070763]
We introduce CFG-Bench, a new benchmark designed to systematically evaluate fine-grained action intelligence.<n>CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities.<n>Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions.
arXiv Detail & Related papers (2025-11-24T02:02:29Z)
Large Language Models as Theory of Mind Aware Generative Agents with Counterfactual Reflection [31.38516078163367]
ToM-agent is designed to empower LLMs-based generative agents to simulate ToM in open-domain conversational interactions.<n>ToM-agent disentangles the confidence from mental states, facilitating the emulation of an agent's perception of its counterpart's mental states.<n>Our findings indicate that the ToM-agent can grasp the underlying reasons for their counterpart's behaviors beyond mere semantic-emotional supporting or decision-making based on common sense.
arXiv Detail & Related papers (2025-01-26T00:32:38Z)
Chain-of-Reasoning: Towards Unified Mathematical Reasoning in Large Language Models via a Multi-Paradigm Perspective [98.29190911211053]
Chain-of-Reasoning (CoR) is a novel unified framework integrating multiple reasoning paradigms.<n>CoR generates multiple potential answers via different reasoning paradigms and synthesizes them into a coherent final solution.<n> Experimental results demonstrate that CoR-Math-7B significantly outperforms current SOTA models.
arXiv Detail & Related papers (2025-01-19T16:53:26Z)
Failure Modes of LLMs for Causal Reasoning on Narratives [51.19592551510628]
We investigate the interaction between world knowledge and logical reasoning.<n>We find that state-of-the-art large language models (LLMs) often rely on superficial generalizations.<n>We show that simple reformulations of the task can elicit more robust reasoning behavior.
arXiv Detail & Related papers (2024-10-31T12:48:58Z)
Does Reasoning Emerge? Examining the Probabilities of Causation in Large Language Models [6.922021128239465]
Recent advances in AI have been driven by the capabilities of large language models (LLMs) This paper introduces a framework that is both theoretical and practical, aimed at assessing how effectively LLMs are able to replicate real-world reasoning mechanisms.
arXiv Detail & Related papers (2024-08-15T15:19:11Z)
Shall We Team Up: Exploring Spontaneous Cooperation of Competing LLM Agents [18.961470450132637]
This paper emphasizes the importance of spontaneous phenomena, wherein agents deeply engage in contexts and make adaptive decisions without explicit directions. We explored spontaneous cooperation across three competitive scenarios and successfully simulated the gradual emergence of cooperation.
arXiv Detail & Related papers (2024-02-19T18:00:53Z)
MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs) This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation. By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z)
Exchange-of-Thought: Enhancing Large Language Model Capabilities through Cross-Model Communication [76.04373033082948]
Large Language Models (LLMs) have recently made significant strides in complex reasoning tasks through the Chain-of-Thought technique. We propose Exchange-of-Thought (EoT), a novel framework that enables cross-model communication during problem-solving.
arXiv Detail & Related papers (2023-12-04T11:53:56Z)
DiPlomat: A Dialogue Dataset for Situated Pragmatic Reasoning [89.92601337474954]
Pragmatic reasoning plays a pivotal role in deciphering implicit meanings that frequently arise in real-life conversations. We introduce a novel challenge, DiPlomat, aiming at benchmarking machines' capabilities on pragmatic reasoning and situated conversational understanding.
arXiv Detail & Related papers (2023-06-15T10:41:23Z)
The Machine Psychology of Cooperation: Can GPT models operationalise prompts for altruism, cooperation, competitiveness and selfishness in economic games? [0.0]
We investigated the capability of the GPT-3.5 large language model (LLM) to operationalize natural language descriptions of cooperative, competitive, altruistic, and self-interested behavior. We used a prompt to describe the task environment using a similar protocol to that used in experimental psychology studies with human subjects. Our results provide evidence that LLMs can, to some extent, translate natural language descriptions of different cooperative stances into corresponding descriptions of appropriate task behaviour.
arXiv Detail & Related papers (2023-05-13T17:23:16Z)

This list is automatically generated from the titles and abstracts of the papers in this site.