Related papers: Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks

Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks

URL: http://arxiv.org/abs/2302.08399v3
Date: Mon, 20 Feb 2023 03:46:17 GMT
Title: Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks
Authors: Tomer Ullman
Abstract summary: Theory-of-Mind tasks have shown both successes and failures. Small variations that maintain the principles of ToM turn the results on their head. We argue that in general, the zero-hypothesis for model evaluation in intuitive psychology should be skeptical.
Score: 3.3178024597495903
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Intuitive psychology is a pillar of common-sense reasoning. The replication of this reasoning in machine intelligence is an important stepping-stone on the way to human-like artificial intelligence. Several recent tasks and benchmarks for examining this reasoning in Large-Large Models have focused in particular on belief attribution in Theory-of-Mind tasks. These tasks have shown both successes and failures. We consider in particular a recent purported success case, and show that small variations that maintain the principles of ToM turn the results on their head. We argue that in general, the zero-hypothesis for model evaluation in intuitive psychology should be skeptical, and that outlying failure cases should outweigh average success rates. We also consider what possible future successes on Theory-of-Mind tasks by more powerful LLMs would mean for ToM tasks with people.

Related papers

FairReason: Balancing Reasoning and Social Bias in MLLMs [50.618158642714505]
Multimodal Large Language Models (MLLMs) already achieve state-of-the-art results across a wide range of tasks and modalities.<n>Recent studies explore advanced prompting schemes and post-training fine-tuning to push their reasoning ability further.
arXiv Detail & Related papers (2025-07-30T19:57:22Z)
Do Large Language Models Have a Planning Theory of Mind? Evidence from MindGames: a Multi-Step Persuasion Task [1.9998928079358735]
We present MindGames: a novel planning theory of mind' (PToM) task.<n>We find that humans significantly outperform o1-preview (an LLM) at our PToM task.<n>These results suggest a significant gap between human-like social reasoning and Theory of Mind abilities.
arXiv Detail & Related papers (2025-07-22T03:15:27Z)
Benchmarking Systematic Relational Reasoning with Large Language and Reasoning Models [15.56445409535547]
Large Language Models (LLMs) have been found to struggle with systematic reasoning. This paper focuses on tasks that require systematic reasoning about relational compositions. We find that the considered LLMs and LRMs overall perform poorly overall, albeit better than random chance.
arXiv Detail & Related papers (2025-03-30T15:41:55Z)
Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models [76.6028674686018]
We introduce thought-tracing, an inference-time reasoning algorithm to trace the mental states of agents. Our algorithm is modeled after the Bayesian theory-of-mind framework. We evaluate thought-tracing on diverse theory-of-mind benchmarks, demonstrating significant performance improvements.
arXiv Detail & Related papers (2025-02-17T15:08:50Z)
Position: Theory of Mind Benchmarks are Broken for Large Language Models [41.832853832803046]
This position paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models adapt to new partners. We call this functional theory of mind: the ability to adapt to agents in-context following a rational response to predictions about their behavior.
arXiv Detail & Related papers (2024-12-27T16:30:12Z)
Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data. Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios. Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z)
NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding [55.38254464415964]
Theory of mind evaluations currently focuses on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations. We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation surrounding covered multi-dimensional mental states.
arXiv Detail & Related papers (2024-04-21T11:51:13Z)
Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking. Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
arXiv Detail & Related papers (2023-11-16T22:49:27Z)
HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models [31.831042765744204]
Theory of Mind (ToM) is the ability to reason about one's own and others' mental states. We introduce HI-TOM, a Higher Order Theory of Mind benchmark. Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks.
arXiv Detail & Related papers (2023-10-25T16:41:15Z)
FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity. We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z)
Minding Language Models' (Lack of) Theory of Mind: A Plug-and-Play Multi-Character Belief Tracker [72.09076317574238]
ToM is a plug-and-play approach to investigate the belief states of characters in reading comprehension. We show that ToM enhances off-the-shelf neural network theory mind in a zero-order setting while showing robust out-of-distribution performance compared to supervised baselines.
arXiv Detail & Related papers (2023-06-01T17:24:35Z)
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM) We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)
Can Pretrained Language Models (Yet) Reason Deductively? [72.9103833294272]
We conduct a comprehensive evaluation of the learnable deductive (also known as explicit) reasoning capability of PLMs. Our main results suggest that PLMs cannot yet perform reliable deductive reasoning. We reach beyond (misleading) task performance, revealing that PLMs are still far from human-level reasoning capabilities.
arXiv Detail & Related papers (2022-10-12T17:44:15Z)

This list is automatically generated from the titles and abstracts of the papers in this site.