OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models
- URL: http://arxiv.org/abs/2402.06044v3
- Date: Mon, 3 Jun 2024 10:48:16 GMT
- Title: OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models
- Authors: Hainiu Xu, Runcong Zhao, Lixing Zhu, Jinhua Du, Yulan He
- Abstract summary: Neural Theory-of-Mind (N-ToM), a machine's ability to understand and keep track of the mental states of others, is pivotal in developing socially intelligent agents.
OpenToM is a new benchmark for assessing N-ToM with longer and clearer narrative stories, explicit personality traits, and actions triggered by character intentions.
We reveal that state-of-the-art LLMs thrive at modeling certain aspects of mental states in the physical world but fall short when tracking characters' mental states in the psychological world.
- Score: 17.042114879350788
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Neural Theory-of-Mind (N-ToM), a machine's ability to understand and keep track of the mental states of others, is pivotal in developing socially intelligent agents. However, prevalent N-ToM benchmarks have several shortcomings, including the presence of ambiguous and artificial narratives, the absence of personality traits and preferences, a lack of questions addressing characters' psychological mental states, and limited diversity in the questions posed. In response to these issues, we construct OpenToM, a new benchmark for assessing N-ToM with (1) longer and clearer narrative stories, (2) characters with explicit personality traits, (3) actions that are triggered by character intentions, and (4) questions designed to challenge LLMs' capabilities of modeling characters' mental states of both the physical and psychological world. Using OpenToM, we reveal that state-of-the-art LLMs thrive at modeling certain aspects of mental states in the physical world but fall short when tracking characters' mental states in the psychological world.
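The abstract describes OpenToM as a set of narratives paired with multiple-choice questions about characters' physical and psychological mental states. As a rough illustration of how such a benchmark might be scored, the Python sketch below assumes a simplified JSON schema (`narrative`, `question`, `choices`, `gold_index`, `question_type`) and a user-supplied `query_model` callable; neither the schema nor the file name reflects the authors' actual data format or official evaluation code.

```python
import json
from typing import Callable, Dict, List


def load_items(path: str) -> List[Dict]:
    """Load OpenToM-style items from a JSON file (schema assumed, not official)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def build_prompt(item: Dict) -> str:
    """Concatenate the narrative with one question and its answer choices."""
    choices = "\n".join(f"({i}) {c}" for i, c in enumerate(item["choices"]))
    return (
        f"Story:\n{item['narrative']}\n\n"
        f"Question: {item['question']}\n"
        f"Choices:\n{choices}\n"
        "Reply with the number of the correct choice only."
    )


def evaluate(items: List[Dict], query_model: Callable[[str], str]) -> Dict[str, float]:
    """Report accuracy separately for physical- and psychological-state questions."""
    correct = {"physical": 0, "psychological": 0}
    total = {"physical": 0, "psychological": 0}
    for item in items:
        kind = item["question_type"]  # assumed field: "physical" or "psychological"
        total[kind] += 1
        reply = query_model(build_prompt(item)).strip()
        if reply.startswith(str(item["gold_index"])):  # assumed gold label field
            correct[kind] += 1
    return {k: correct[k] / total[k] for k in total if total[k]}


if __name__ == "__main__":
    def stub(prompt: str) -> str:
        # Placeholder "model" that always answers choice 0; swap in a real LLM call.
        return "0"

    print(evaluate(load_items("opentom_sample.json"), stub))  # hypothetical file name
```

Splitting accuracy by question type mirrors the paper's central contrast between mental states grounded in the physical world and those in the psychological world.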
Related papers
- EnigmaToM: Improve LLMs' Theory-of-Mind Reasoning Capabilities with Neural Knowledge Base of Entity States [15.557449564031975]
Theory-of-Mind (ToM) is fundamental to human interaction but remains a challenging task for Large Language Models (LLMs).
We present EnigmaToM, a novel neuro-symbolic framework that enhances ToM reasoning by integrating a Neural Knowledge Base of entity states (Enigma).
Experimental results on multiple benchmarks, including ToMi, HiToM, and FANToM, show that EnigmaToM significantly improves ToM reasoning across LLMs of varying sizes.
arXiv Detail & Related papers (2025-03-05T10:13:05Z) - PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues [27.231701486961917]
The ability to understand and predict the mental states of oneself and others, known as the Theory of Mind (ToM), is crucial for effective social interactions.
Recent research has emerged to evaluate whether Large Language Models (LLMs) exhibit a form of ToM.
We propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of LLMs in persuasive dialogues.
arXiv Detail & Related papers (2025-02-28T13:04:04Z) - ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind [25.524355451378593]
ToMATO is a new ToM benchmark formulated as multiple-choice QA over conversations.
We capture both first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge.
ToMATO consists of 5.4k questions, 753 conversations, and 15 personality trait patterns.
arXiv Detail & Related papers (2025-01-15T14:47:02Z) - The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters [67.61587661660852]
Theory-of-Mind (ToM) allows humans to understand and interpret the mental states of others.
In this paper, we verify the importance of comprehensive contextual understanding about personal backgrounds in ToM.
We introduce the CharToM benchmark, comprising 1,035 ToM questions based on characters from classic novels.
arXiv Detail & Related papers (2025-01-03T09:04:45Z) - Explore the Hallucination on Low-level Perception for MLLMs [83.12180878559295]
We aim to define and evaluate the self-awareness of MLLMs in low-level visual perception and understanding tasks.
We present QL-Bench, a benchmark setting to simulate human responses to low-level vision.
We demonstrate that while some models exhibit robust low-level visual capabilities, their self-awareness remains relatively underdeveloped.
arXiv Detail & Related papers (2024-09-15T14:38:29Z) - Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants.
This paper presents a framework for investigating the psychological dimensions of LLMs, including psychological dimension identification, assessment dataset curation, and assessment with result validation.
We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z) - Towards A Holistic Landscape of Situated Theory of Mind in Large Language Models [14.491223187047378]
Large Language Models (LLMs) have generated considerable interest and debate regarding the potential emergence of Theory of Mind (ToM).
Several recent inquiries reveal a lack of robust ToM in these models and pose a pressing demand to develop new benchmarks.
We taxonomize machine ToM into 7 mental state categories and delineate existing benchmarks to identify under-explored aspects of ToM.
arXiv Detail & Related papers (2023-10-30T15:12:09Z) - FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity.
We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z) - Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models [82.50173296858377]
Many anecdotal examples have been used to suggest that newer large language models (LLMs) such as ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM).
We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z) - Few-Shot Character Understanding in Movies as an Assessment to Meta-Learning of Theory-of-Mind [47.13015852330866]
Humans can quickly understand new fictional characters with a few observations, mainly by drawing analogies to fictional and real people they already know.
This reflects the few-shot and meta-learning essence of humans' inference of characters' mental states, i.e., theory-of-mind (ToM).
We fill this gap with a novel NLP dataset, ToM-in-AMC, the first assessment of machines' meta-learning of ToM in a realistic narrative understanding scenario.
arXiv Detail & Related papers (2022-11-09T05:06:12Z) - Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs [77.88043871260466]
We show that one of today's largest language models lacks this kind of social intelligence out of the box.
We conclude that person-centric NLP approaches might be more effective towards neural Theory of Mind.
arXiv Detail & Related papers (2022-10-24T14:58:58Z)