Related papers: ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind

ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind

URL: http://arxiv.org/abs/2501.08838v1
Date: Wed, 15 Jan 2025 14:47:02 GMT
Title: ToMATO: Verbalizing the Mental States of Role-Playing LLMs for Benchmarking Theory of Mind
Authors: Kazutoshi Shinoda, Nobukatsu Hojo, Kyosuke Nishida, Saki Mizuno, Keita Suzuki, Ryo Masumura, Hiroaki Sugiyama, Kuniko Saito,
Abstract summary: ToMATO is a new ToM benchmark formulated as multiple-choice QA over conversations.<n>We capture both first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge.<n>ToMATO consists of 5.4k questions, 753 conversations, and 15 personality trait patterns.
Score: 25.524355451378593
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Existing Theory of Mind (ToM) benchmarks diverge from real-world scenarios in three aspects: 1) they assess a limited range of mental states such as beliefs, 2) false beliefs are not comprehensively explored, and 3) the diverse personality traits of characters are overlooked. To address these challenges, we introduce ToMATO, a new ToM benchmark formulated as multiple-choice QA over conversations. ToMATO is generated via LLM-LLM conversations featuring information asymmetry. By employing a prompting method that requires role-playing LLMs to verbalize their thoughts before each utterance, we capture both first- and second-order mental states across five categories: belief, intention, desire, emotion, and knowledge. These verbalized thoughts serve as answers to questions designed to assess the mental states of characters within conversations. Furthermore, the information asymmetry introduced by hiding thoughts from others induces the generation of false beliefs about various mental states. Assigning distinct personality traits to LLMs further diversifies both utterances and thoughts. ToMATO consists of 5.4k questions, 753 conversations, and 15 personality trait patterns. Our analysis shows that this dataset construction approach frequently generates false beliefs due to the information asymmetry between role-playing LLMs, and effectively reflects diverse personalities. We evaluate nine LLMs on ToMATO and find that even GPT-4o mini lags behind human performance, especially in understanding false beliefs, and lacks robustness to various personality traits.

Related papers

XToM: Exploring the Multilingual Theory of Mind for Large Language Models [57.9821865189077]
Existing evaluations of Theory of Mind in LLMs are largely limited to English.<n>We present XToM, a rigorously validated multilingual benchmark that evaluates ToM across five languages.<n>Our findings expose limitations in LLMs' ability to replicate human-like mentalizing across linguistic contexts.
arXiv Detail & Related papers (2025-06-03T05:23:25Z)
PersuasiveToM: A Benchmark for Evaluating Machine Theory of Mind in Persuasive Dialogues [27.231701486961917]
The ability to understand and predict the mental states of oneself and others, known as the Theory of Mind (ToM), is crucial for effective social interactions. Recent research has emerged to evaluate whether Large Language Models (LLMs) exhibit a form of ToM. We propose PersuasiveToM, a benchmark designed to evaluate the ToM abilities of LLMs in persuasive dialogues.
arXiv Detail & Related papers (2025-02-28T13:04:04Z)
The Essence of Contextual Understanding in Theory of Mind: A Study on Question Answering with Story Characters [67.61587661660852]
Theory-of-Mind (ToM) allows humans to understand and interpret the mental states of others.<n>In this paper, we verify the importance of understanding long personal backgrounds in ToM.<n>We assess the performance of machines' ToM capabilities in realistic evaluation scenarios.
arXiv Detail & Related papers (2025-01-03T09:04:45Z)
Does ChatGPT Have a Mind? [0.0]
This paper examines whether Large Language Models (LLMs) like ChatGPT possess minds, focusing specifically on whether they have a genuine folk psychology encompassing beliefs, desires, and intentions. First, we survey various philosophical theories of representation, including informational, causal, structural, and teleosemantic accounts, arguing that LLMs satisfy key conditions proposed by each. Second, we explore whether LLMs exhibit robust dispositions to perform actions, a necessary component of folk psychology.
arXiv Detail & Related papers (2024-06-27T00:21:16Z)
Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants. This paper presents a framework for investigating psychology dimension in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation. We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z)
Do LLMs Have Distinct and Consistent Personality? TRAIT: Personality Testset designed for LLMs with Psychometrics [29.325576963215163]
Large Language Models (LLMs) have led to their adaptation in various domains as conversational agents. We introduce TRAIT, a new benchmark consisting of 8K multi-choice questions designed to assess the personality of LLMs. LLMs exhibit distinct and consistent personality, which is highly influenced by their training data.
arXiv Detail & Related papers (2024-06-20T19:50:56Z)
Through the Theory of Mind's Eye: Reading Minds with Multimodal Video Large Language Models [52.894048516550065]
We develop a pipeline for multimodal ToM reasoning using video and text. We also enable explicit ToM reasoning by retrieving key frames for answering a ToM question.
arXiv Detail & Related papers (2024-06-19T18:24:31Z)
PHAnToM: Persona-based Prompting Has An Effect on Theory-of-Mind Reasoning in Large Language Models [25.657579792829743]
We empirically evaluate how role-playing prompting influences Theory-of-Mind (ToM) reasoning capabilities. We propose the mechanism that, beyond the inherent variance in the complexity of reasoning tasks, performance differences arise because of socially-motivated prompting differences.
arXiv Detail & Related papers (2024-03-04T17:34:34Z)
Identifying Multiple Personalities in Large Language Models with External Evaluation [6.657168333238573]
Large Language Models (LLMs) are integrated with human daily applications rapidly. Many recent studies quantify LLMs' personalities using self-assessment tests that are created for humans. Yet many critiques question the applicability and reliability of these self-assessment tests when applied to LLMs.
arXiv Detail & Related papers (2024-02-22T18:57:20Z)
OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models [17.042114879350788]
Neural Theory-of-Mind (N-ToM) machine's ability to understand and keep track of the mental states of others is pivotal in developing socially intelligent agents. OpenToM is a new benchmark for assessing N-ToM with longer and clearer narrative stories, explicit personality traits, and actions triggered by character intentions. We reveal that state-of-the-art LLMs thrive at modeling certain aspects of mental states in the physical world but fall short when tracking characters' mental states in the psychological world.
arXiv Detail & Related papers (2024-02-08T20:35:06Z)
PsyCoT: Psychological Questionnaire as Powerful Chain-of-Thought for Personality Detection [50.66968526809069]
We propose a novel personality detection method, called PsyCoT, which mimics the way individuals complete psychological questionnaires in a multi-turn dialogue manner. Our experiments demonstrate that PsyCoT significantly improves the performance and robustness of GPT-3.5 in personality detection.
arXiv Detail & Related papers (2023-10-31T08:23:33Z)
How FaR Are Large Language Models From Agents with Theory-of-Mind? [69.41586417697732]
We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D) T4D requires models to connect inferences about others' mental states to actions in social scenarios. We introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges.
arXiv Detail & Related papers (2023-10-04T06:47:58Z)
Who is ChatGPT? Benchmarking LLMs' Psychological Portrayal Using PsychoBench [83.41621219298489]
We propose a framework, PsychoBench, for evaluating diverse psychological aspects of Large Language Models (LLMs) PsychoBench classifies these scales into four distinct categories: personality traits, interpersonal relationships, motivational tests, and emotional abilities. We employ a jailbreak approach to bypass the safety alignment protocols and test the intrinsic natures of LLMs.
arXiv Detail & Related papers (2023-10-02T17:46:09Z)
Evaluating and Inducing Personality in Pre-trained Language Models [78.19379997967191]
We draw inspiration from psychometric studies by leveraging human personality theory as a tool for studying machine behaviors. To answer these questions, we introduce the Machine Personality Inventory (MPI) tool for studying machine behaviors. MPI follows standardized personality tests, built upon the Big Five Personality Factors (Big Five) theory and personality assessment inventories. We devise a Personality Prompting (P2) method to induce LLMs with specific personalities in a controllable way.
arXiv Detail & Related papers (2022-05-20T07:32:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.