Large Language Models Fail on Trivial Alterations to Theory-of-Mind
Tasks
- URL: http://arxiv.org/abs/2302.08399v3
- Date: Mon, 20 Feb 2023 03:46:17 GMT
- Title: Large Language Models Fail on Trivial Alterations to Theory-of-Mind
Tasks
- Authors: Tomer Ullman
- Abstract summary: Theory-of-Mind tasks have shown both successes and failures.
Small variations that maintain the principles of ToM turn the results on their head.
We argue that in general, the zero-hypothesis for model evaluation in intuitive psychology should be skeptical.
- Score: 3.3178024597495903
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Intuitive psychology is a pillar of common-sense reasoning. The replication
of this reasoning in machine intelligence is an important stepping-stone on the
way to human-like artificial intelligence. Several recent tasks and benchmarks
for examining this reasoning in Large-Large Models have focused in particular
on belief attribution in Theory-of-Mind tasks. These tasks have shown both
successes and failures. We consider in particular a recent purported success
case, and show that small variations that maintain the principles of ToM turn
the results on their head. We argue that in general, the zero-hypothesis for
model evaluation in intuitive psychology should be skeptical, and that outlying
failure cases should outweigh average success rates. We also consider what
possible future successes on Theory-of-Mind tasks by more powerful LLMs would
mean for ToM tasks with people.
Related papers
- Hypothesis-Driven Theory-of-Mind Reasoning for Large Language Models [76.6028674686018]
We introduce thought-tracing, an inference-time reasoning algorithm to trace the mental states of agents.
Our algorithm is modeled after the Bayesian theory-of-mind framework.
We evaluate thought-tracing on diverse theory-of-mind benchmarks, demonstrating significant performance improvements.
arXiv Detail & Related papers (2025-02-17T15:08:50Z) - Position: Theory of Mind Benchmarks are Broken for Large Language Models [41.832853832803046]
This position paper argues that the majority of theory of mind benchmarks are broken because of their inability to directly test how large language models adapt to new partners.
We call this functional theory of mind: the ability to adapt to agents in-context following a rational response to predictions about their behavior.
arXiv Detail & Related papers (2024-12-27T16:30:12Z) - Explore Theory of Mind: Program-guided adversarial data generation for theory of mind reasoning [88.68573198200698]
We introduce ExploreToM, the first framework to allow large-scale generation of diverse and challenging theory of mind data.
Our approach leverages an A* search over a custom domain-specific language to produce complex story structures and novel, diverse, yet plausible scenarios.
Our evaluation reveals that state-of-the-art LLMs, such as Llama-3.1-70B and GPT-4o, show accuracies as low as 0% and 9% on ExploreToM-generated data.
arXiv Detail & Related papers (2024-12-12T21:29:00Z) - NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding [55.38254464415964]
Theory of mind evaluations currently focuses on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations.
We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation surrounding covered multi-dimensional mental states.
arXiv Detail & Related papers (2024-04-21T11:51:13Z) - Think Twice: Perspective-Taking Improves Large Language Models'
Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking.
Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
arXiv Detail & Related papers (2023-11-16T22:49:27Z) - HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning
in Large Language Models [31.831042765744204]
Theory of Mind (ToM) is the ability to reason about one's own and others' mental states.
We introduce HI-TOM, a Higher Order Theory of Mind benchmark.
Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks.
arXiv Detail & Related papers (2023-10-25T16:41:15Z) - FANToM: A Benchmark for Stress-testing Machine Theory of Mind in
Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity.
We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z) - Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in
Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM)
We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.