Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task
- URL: http://arxiv.org/abs/2507.16196v1
- Date: Tue, 22 Jul 2025 03:15:27 GMT
- Title: Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task
- Authors: Jared Moore, Ned Cooper, Rasmus Overmark, Beba Cibralic, Nick Haber, Cameron R. Jones
- Abstract summary: We present MindGames: a novel `planning theory of mind' (PToM) task. We find that humans significantly outperform o1-preview (an LLM) at our PToM task. These results suggest a significant gap between human-like social reasoning and LLM abilities.
- Score: 1.9998928079358735
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Recent evidence suggests Large Language Models (LLMs) display Theory of Mind (ToM) abilities. Most ToM experiments place participants in a spectatorial role, wherein they predict and interpret other agents' behavior. However, human ToM also contributes to dynamically planning action and strategically intervening on others' mental states. We present MindGames: a novel `planning theory of mind' (PToM) task which requires agents to infer an interlocutor's beliefs and desires to persuade them to alter their behavior. Unlike previous evaluations, we explicitly evaluate use cases of ToM. We find that humans significantly outperform o1-preview (an LLM) at our PToM task (11% higher; $p=0.006$). We hypothesize this is because humans have an implicit causal model of other agents (e.g., they know, as our task requires, to ask about people's preferences). In contrast, o1-preview outperforms humans in a baseline condition which requires a similar amount of planning but minimal mental state inferences (e.g., o1-preview is better than humans at planning when already given someone's preferences). These results suggest a significant gap between human-like social reasoning and LLM abilities.
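The contrast between the two conditions described in the abstract can be made concrete with a minimal sketch. The harness below is purely illustrative and not the MindGames implementation: the `Interlocutor` class, the `run_episode` loop, and the acceptance rule are all assumptions introduced here. In the PToM condition the persuader must first elicit the target's hidden preferences before proposing; in the baseline condition those preferences are supplied up front, so only planning is required.

```python
# Hypothetical sketch of a PToM-style persuasion episode (not the authors' code).
from dataclasses import dataclass
from typing import Callable

@dataclass
class Interlocutor:
    """A simulated partner with preferences the persuader may need to infer."""
    preferences: dict[str, float]  # e.g. {"price": 0.8, "convenience": 0.2}

    def answer(self, question: str) -> str:
        # Toy oracle: reveal a preference only if the persuader asks about it.
        for key, weight in self.preferences.items():
            if key in question.lower():
                return f"I care about {key} (weight {weight})."
        return "I'm not sure."

    def accepts(self, proposal: str) -> bool:
        # Toy acceptance rule: the proposal must address the dominant preference.
        top = max(self.preferences, key=self.preferences.get)
        return top in proposal.lower()

def run_episode(persuader: Callable[[str], str], target: Interlocutor,
                preferences_given: bool, max_turns: int = 5) -> bool:
    """Return True if the persuader's final proposal is accepted.

    preferences_given=True mirrors the baseline (planning-only) condition;
    False mirrors the PToM condition, where preferences must be elicited."""
    context = ""
    if preferences_given:
        context = f"Known preferences: {target.preferences}\n"
    for _ in range(max_turns):
        utterance = persuader(context)
        if utterance.startswith("PROPOSE:"):
            return target.accepts(utterance)
        # Otherwise treat the utterance as a question and record the reply.
        context += f"Q: {utterance}\nA: {target.answer(utterance)}\n"
    return False
```

Under this toy setup, a persuader that never asks about preferences can still succeed when `preferences_given=True` but will mostly fail in the hidden-preference condition, which loosely mirrors the human/o1-preview contrast the abstract reports.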
Related papers
- The Decrypto Benchmark for Multi-Agent Reasoning and Theory of Mind [8.341160422849969]
Decrypto is a game-based benchmark for multi-agent reasoning and ToM. It is the first platform for designing interactive ToM experiments. We find that LLM game-playing abilities lag behind humans and simple word-embedding baselines.
arXiv Detail & Related papers (2025-06-25T17:55:27Z) - The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters [67.61587661660852]
Theory-of-Mind (ToM) allows humans to understand and interpret the mental states of others. In this paper, we verify the importance of comprehensive contextual understanding of personal backgrounds in ToM. We introduce the CharToM benchmark, comprising 1,035 ToM questions based on characters from classic novels.
arXiv Detail & Related papers (2025-01-03T09:04:45Z) - Position: Theory of Mind Benchmarks are Broken for Large Language Models [41.832853832803046]
Our paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models adapt to new partners. This problem stems from the fact that theory of mind benchmarks are overwhelmingly inspired by the methods used to test theory of mind in humans. We introduce the concept of functional theory of mind: the ability to adapt to agents in-context following a rational response to their behavior.
arXiv Detail & Related papers (2024-12-27T16:30:12Z) - SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs [72.06808538971487]
We test whether large language models (LLMs) can implicitly apply a "theory of mind" (ToM) to predict behavior.
We create a new dataset, SimpleToM, containing stories with three questions that test different degrees of ToM reasoning.
To our knowledge, SimpleToM is the first dataset to explore downstream reasoning requiring knowledge of mental states in realistic scenarios.
arXiv Detail & Related papers (2024-10-17T15:15:00Z) - Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models [52.894048516550065]
We develop a pipeline for multimodal ToM reasoning using video and text.
We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question.
arXiv Detail & Related papers (2024-06-19T18:24:31Z) - Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses [11.121931601655174]
Theory of Mind (ToM) reasoning entails recognizing that other individuals possess their own intentions, emotions, and thoughts.
Large language models (LLMs) excel in tasks such as summarization, question answering, and translation.
Despite advancements, the extent to which LLMs truly understand ToM reasoning remains inadequately explored in open-ended scenarios.
arXiv Detail & Related papers (2024-06-09T05:57:59Z) - Theory of Mind abilities of Large Language Models in Human-Robot Interaction: An Illusion? [18.770522926093786]
Large Language Models have shown exceptional generative abilities in various natural language and generation tasks.
We study a special application of ToM abilities that has higher stakes and possibly irreversible consequences.
We focus on the task of Perceived Behavior Recognition, where a robot employs a Large Language Model (LLM) to assess the robot's generated behavior in a manner similar to a human observer.
arXiv Detail & Related papers (2024-01-10T18:09:36Z) - Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking.
Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
arXiv Detail & Related papers (2023-11-16T22:49:27Z) - HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models [31.831042765744204]
Theory of Mind (ToM) is the ability to reason about one's own and others' mental states.
We introduce HI-TOM, a Higher Order Theory of Mind benchmark.
Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks.
arXiv Detail & Related papers (2023-10-25T16:41:15Z) - The Neuro-Symbolic Inverse Planning Engine (NIPE): Modeling Probabilistic Social Inferences from Linguistic Inputs [50.32802502923367]
We study how language drives and influences social reasoning in a probabilistic goal-inference domain.
We propose a neuro-symbolic model that carries out goal inference from linguistic inputs of agent scenarios.
Our model closely matches human response patterns and better predicts human judgements than using an LLM alone.
arXiv Detail & Related papers (2023-06-25T19:38:01Z) - Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks [3.3178024597495903]
Evaluations of large language models on Theory-of-Mind tasks have shown both successes and failures.
Small variations that maintain the principles of ToM turn the results on their head.
We argue that, in general, the zero-hypothesis for model evaluation in intuitive psychology should be skepticism.
arXiv Detail & Related papers (2023-02-16T16:18:03Z) - Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs [77.88043871260466]
We show that one of today's largest language models lacks this kind of social intelligence out of the box.
We conclude that person-centric NLP approaches might be more effective towards neural Theory of Mind.
arXiv Detail & Related papers (2022-10-24T14:58:58Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information provided (including all content above) and is not responsible for any consequences of its use.