Related papers: Re-evaluating Theory of Mind evaluation in large language models

URL: http://arxiv.org/abs/2502.21098v1
Date: Fri, 28 Feb 2025 14:36:57 GMT
Title: Re-evaluating Theory of Mind evaluation in large language models
Authors: Jennifer Hu, Felix Sosa, Tomer Ullman
Abstract summary: We take inspiration from cognitive science to re-evaluate the state of ToM evaluation in large language models. A major reason for the disagreement on whether LLMs have ToM is a lack of clarity on whether models should be expected to match human behaviors. We conclude by discussing several directions for future research, including the relationship between ToM and pragmatic communication.
Score: 3.262532929657758
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The question of whether large language models (LLMs) possess Theory of Mind (ToM) -- often defined as the ability to reason about others' mental states -- has sparked significant scientific and public interest. However, the evidence as to whether LLMs possess ToM is mixed, and the recent growth in evaluations has not resulted in a convergence. Here, we take inspiration from cognitive science to re-evaluate the state of ToM evaluation in LLMs. We argue that a major reason for the disagreement on whether LLMs have ToM is a lack of clarity on whether models should be expected to match human behaviors, or the computations underlying those behaviors. We also highlight ways in which current evaluations may be deviating from "pure" measurements of ToM abilities, which also contributes to the confusion. We conclude by discussing several directions for future research, including the relationship between ToM and pragmatic communication, which could advance our understanding of artificial systems as well as human cognition.

Related papers

GPT-4o Lacks Core Features of Theory of Mind [0.09320657506524145]
We use a cognitively-grounded definition of ToM to develop and test a new evaluation framework. We find that even though LLMs succeed in approximating human judgments in a simple ToM paradigm, they fail at a logically equivalent task.
arXiv Detail & Related papers (2026-02-12T16:33:58Z)
Doing Things with Words: Rethinking Theory of Mind Simulation in Large Language Models [48.815314312823006]
This study explores whether the Generative Agent-Based Model (GABM) Concordia can effectively model Theory of Mind (ToM) within simulated real-world environments. We assess whether this framework successfully simulates ToM abilities and whether GPT-4 can perform tasks by making genuine inferences from social context.
arXiv Detail & Related papers (2025-10-15T10:48:31Z)
XToM: Exploring the Multilingual Theory of Mind for Large Language Models [57.9821865189077]
Existing evaluations of Theory of Mind in LLMs are largely limited to English. We present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages. Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.
arXiv Detail & Related papers (2025-06-03T05:23:25Z)
Effects of Theory of Mind and Prosocial Beliefs on Steering Human-Aligned Behaviors of LLMs in Ultimatum Games [17.615991993062455]
We investigate the role of theory-of-mind (ToM) reasoning in aligning agentic behaviors with human norms in negotiation tasks. ToM reasoning enhances behavior alignment, decision-making consistency, and negotiation outcomes. Our findings contribute to the understanding of ToM's role in enhancing human-AI interaction and cooperative decision-making.
arXiv Detail & Related papers (2025-05-30T06:23:52Z)
Rethinking Theory of Mind Benchmarks for LLMs: Towards A User-Centered Perspective [24.27038998164743]
Theory-of-Mind (ToM) tasks designed for humans are being used to benchmark LLMs' ToM capabilities. This approach has a number of limitations. Taking a human-computer interaction (HCI) perspective, these limitations prompt us to rethink the definition and criteria of ToM in ToM benchmarks.
arXiv Detail & Related papers (2025-04-15T03:44:43Z)
Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models [51.91448005607405]
We evaluate key human ToM precursors by annotating characters' perceptions on ToMi and FANToM. We present PercepToM, a novel ToM method leveraging LLMs' strong perception inference capability while supplementing their limited perception-to-belief inference.
arXiv Detail & Related papers (2024-07-08T14:58:29Z)
Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses [11.121931601655174]
Theory of Mind (ToM) reasoning entails recognizing that other individuals possess their own intentions, emotions, and thoughts. Large language models (LLMs) excel in tasks such as summarization, question answering, and translation. Despite advancements, the extent to which LLMs truly understand ToM reasoning remains inadequately explored in open-ended scenarios.
arXiv Detail & Related papers (2024-06-09T05:57:59Z)
NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding [55.38254464415964]
Theory of mind evaluations currently focus on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation settings covering multi-dimensional mental states.
arXiv Detail & Related papers (2024-04-21T11:51:13Z)
ToMBench: Benchmarking Theory of Mind in Large Language Models [41.565202027904476]
ToM is the cognitive capability to perceive and ascribe mental states to oneself and others. Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination. We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
arXiv Detail & Related papers (2024-02-23T02:05:46Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking. Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
arXiv Detail & Related papers (2023-11-16T22:49:27Z)
HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models [31.831042765744204]
Theory of Mind (ToM) is the ability to reason about one's own and others' mental states. We introduce HI-TOM, a Higher Order Theory of Mind benchmark. Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks.
arXiv Detail & Related papers (2023-10-25T16:41:15Z)
FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z)
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM). We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)

This list is automatically generated from the titles and abstracts of the papers on this site.