Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in
Large Language Models
- URL: http://arxiv.org/abs/2305.14763v1
- Date: Wed, 24 May 2023 06:14:31 GMT
- Authors: Natalie Shapira, Mosh Levy, Seyed Hossein Alavi, Xuhui Zhou, Yejin
Choi, Yoav Goldberg, Maarten Sap, Vered Shwartz
- Abstract summary: Many anecdotal examples have been used to suggest that newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM).
We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The escalating debate on AI's capabilities warrants developing reliable
metrics to assess machine "intelligence". Recently, many anecdotal examples
were used to suggest that newer large language models (LLMs) like ChatGPT and
GPT-4 exhibit Neural Theory-of-Mind (N-ToM); however, prior work reached
conflicting conclusions regarding those abilities. We investigate the extent of
LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs
exhibit certain N-ToM abilities, this behavior is far from being robust. We
further examine the factors impacting performance on N-ToM tasks and discover
that LLMs struggle with adversarial examples, indicating reliance on shallow
heuristics rather than robust ToM abilities. We caution against drawing
conclusions from anecdotal examples, limited benchmark testing, and using
human-designed psychological tests to evaluate models.
Related papers
- Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models [51.91448005607405]
We evaluate key human ToM precursors by annotating characters' perceptions on ToMi and FANToM.
We present PercepToM, a novel ToM method leveraging LLMs' strong perception inference capability while supplementing their limited perception-to-belief inference.
arXiv Detail & Related papers (2024-07-08T14:58:29Z)
- NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding [55.38254464415964]
Theory of mind evaluations currently focus on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations.
We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation surroundings, covering multi-dimensional mental states.
arXiv Detail & Related papers (2024-04-21T11:51:13Z)
- ToMBench: Benchmarking Theory of Mind in Large Language Models [42.80231362967291]
ToM is the cognitive capability to perceive and ascribe mental states to oneself and others.
Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination.
We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
arXiv Detail & Related papers (2024-02-23T02:05:46Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity.
We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z)
- Understanding Social Reasoning in Language Models with Language Models [34.068368860882586]
We present a novel framework for generating evaluations with Large Language Models (LLMs) by populating causal templates.
We create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations.
We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations.
arXiv Detail & Related papers (2023-06-21T16:42:15Z)
- Revisiting the Reliability of Psychological Scales on Large Language Models [62.57981196992073]
This study aims to determine the reliability of applying personality assessments to Large Language Models.
Analysis of 2,500 settings per model, including GPT-3.5, GPT-4, Gemini-Pro, and LLaMA-3.1, reveals that various LLMs show consistency in responses to the Big Five Inventory.
arXiv Detail & Related papers (2023-05-31T15:03:28Z)
- ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind [3.9599054392856483]
We present ToMChallenges, a dataset for comprehensively evaluating the Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks.
Our evaluation results and error analyses show that LLMs have inconsistent behaviors across prompts and tasks.
arXiv Detail & Related papers (2023-05-24T11:54:07Z)
- Large Language Models Fail on Trivial Alterations to Theory-of-Mind Tasks [3.3178024597495903]
Evaluations of LLMs on Theory-of-Mind tasks have shown both successes and failures.
Small variations that maintain the principles of ToM turn the results on their head.
We argue that, in general, the null hypothesis for model evaluation in intuitive psychology should be a skeptical one.
arXiv Detail & Related papers (2023-02-16T16:18:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.