HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning
in Large Language Models
- URL: http://arxiv.org/abs/2310.16755v1
- Date: Wed, 25 Oct 2023 16:41:15 GMT
- Title: HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning
in Large Language Models
- Authors: Yinghui He, Yufan Wu, Yilin Jia, Rada Mihalcea, Yulong Chen, Naihao
Deng
- Abstract summary: Theory of Mind (ToM) is the ability to reason about one's own and others' mental states.
We introduce HI-TOM, a Higher Order Theory of Mind benchmark.
Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks.
- Score: 31.831042765744204
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Theory of Mind (ToM) is the ability to reason about one's own and others'
mental states. ToM plays a critical role in the development of intelligence,
language understanding, and cognitive processes. While previous work has
primarily focused on first and second-order ToM, we explore higher-order ToM,
which involves recursive reasoning on others' beliefs. We introduce HI-TOM, a
Higher Order Theory of Mind benchmark. Our experimental evaluation using
various Large Language Models (LLMs) indicates a decline in performance on
higher-order ToM tasks, demonstrating the limitations of current LLMs. We
conduct a thorough analysis of different failure cases of LLMs, and share our
thoughts on the implications of our findings on the future of NLP.
Related papers
- Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models [51.91448005607405]
We evaluate key human ToM precursors by annotating characters' perceptions on ToMi and FANToM.
We present PercepToM, a novel ToM method leveraging LLMs' strong perception inference capability while supplementing their limited perception-to-belief inference.
arXiv Detail & Related papers (2024-07-08T14:58:29Z) - Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models [52.894048516550065]
We develop a pipeline for multimodal ToM reasoning using video and text.
We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question.
arXiv Detail & Related papers (2024-06-19T18:24:31Z) - Do LLMs Exhibit Human-Like Reasoning? Evaluating Theory of Mind in LLMs for Open-Ended Responses [11.121931601655174]
Theory of Mind (ToM) reasoning entails recognizing that other individuals possess their own intentions, emotions, and thoughts.
Large language models (LLMs) excel in tasks such as summarization, question answering, and translation.
Despite advancements, the extent to which LLMs truly understand ToM reasoning remains inadequately explored in open-ended scenarios.
arXiv Detail & Related papers (2024-06-09T05:57:59Z) - NegotiationToM: A Benchmark for Stress-testing Machine Theory of Mind on Negotiation Surrounding [55.38254464415964]
Theory of mind evaluations currently focuses on testing models using machine-generated data or game settings prone to shortcuts and spurious correlations.
We introduce NegotiationToM, a new benchmark designed to stress-test machine ToM in real-world negotiation surrounding covered multi-dimensional mental states.
arXiv Detail & Related papers (2024-04-21T11:51:13Z) - ToMBench: Benchmarking Theory of Mind in Large Language Models [42.80231362967291]
ToM is the cognitive capability to perceive and ascribe mental states to oneself and others.
Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination.
We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
arXiv Detail & Related papers (2024-02-23T02:05:46Z) - Think Twice: Perspective-Taking Improves Large Language Models'
Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking.
Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
arXiv Detail & Related papers (2023-11-16T22:49:27Z) - FANToM: A Benchmark for Stress-testing Machine Theory of Mind in
Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity.
We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z) - ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind [3.9599054392856483]
We present ToMChallenges, a dataset for comprehensively evaluating the Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks.
Our evaluation results and error analyses show that LLMs have inconsistent behaviors across prompts and tasks.
arXiv Detail & Related papers (2023-05-24T11:54:07Z) - Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in
Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM)
We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.