Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
- URL: http://arxiv.org/abs/2505.17663v2
- Date: Sun, 08 Jun 2025 05:41:33 GMT
- Title: Towards Dynamic Theory of Mind: Evaluating LLM Adaptation to Temporal Evolution of Human States
- Authors: Yang Xiao, Jiashuo Wang, Qiancheng Xu, Changhe Song, Chunpu Xu, Yi Cheng, Wenjie Li, Pengfei Liu,
- Abstract summary: We present textscDynToM, a novel benchmark designed to evaluate Large Language Models' ability to understand and track the temporal progression of mental states.<n>We generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality.<n>Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance underperforms humans by 44.7%, with performance degrading significantly when tracking and reasoning about the shift of mental states.
- Score: 34.30614600300616
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: As Large Language Models (LLMs) increasingly participate in human-AI interactions, evaluating their Theory of Mind (ToM) capabilities - particularly their ability to track dynamic mental states - becomes crucial. While existing benchmarks assess basic ToM abilities, they predominantly focus on static snapshots of mental states, overlooking the temporal evolution that characterizes real-world social interactions. We present \textsc{DynToM}, a novel benchmark specifically designed to evaluate LLMs' ability to understand and track the temporal progression of mental states across interconnected scenarios. Through a systematic four-step framework, we generate 1,100 social contexts encompassing 5,500 scenarios and 78,100 questions, each validated for realism and quality. Our comprehensive evaluation of ten state-of-the-art LLMs reveals that their average performance underperforms humans by 44.7\%, with performance degrading significantly when tracking and reasoning about the shift of mental states. This performance gap highlights fundamental limitations in current LLMs' ability to model the dynamic nature of human mental states.
Related papers
- Truly Assessing Fluid Intelligence of Large Language Models through Dynamic Reasoning Evaluation [75.26829371493189]
Large language models (LLMs) have demonstrated impressive reasoning capacities that mirror human-like thinking.<n>Existing reasoning benchmarks either focus on domain-specific knowledge (crystallized intelligence) or lack interpretability.<n>We propose DRE-Bench, a dynamic reasoning evaluation benchmark grounded in a hierarchical cognitive framework.
arXiv Detail & Related papers (2025-06-03T09:01:08Z) - SocialEval: Evaluating Social Intelligence of Large Language Models [70.90981021629021]
Social Intelligence (SI) equips humans with interpersonal abilities to behave wisely in navigating social interactions to achieve social goals.<n>This presents an operational evaluation paradigm: outcome-oriented goal achievement evaluation and process-oriented interpersonal ability evaluation.<n>We propose SocialEval, a script-based bilingual SI benchmark, integrating outcome- and process-oriented evaluation by manually crafting narrative scripts.
arXiv Detail & Related papers (2025-06-01T08:36:51Z) - Humanizing LLMs: A Survey of Psychological Measurements with Tools, Datasets, and Human-Agent Applications [25.38031971196831]
Large language models (LLMs) are increasingly used in human-centered tasks.<n>Assessing their psychological traits is crucial for understanding their social impact and ensuring trustworthy AI alignment.<n>This study aims to propose future directions for developing more interpretable, robust, and generalizable psychological assessment frameworks for LLMs.
arXiv Detail & Related papers (2025-04-30T06:09:40Z) - Theory of Mind in Large Language Models: Assessment and Enhancement [14.41464477095448]
Large Language Models (LLMs) become increasingly integrated into daily life.<n>It is crucial to assess and enhance their capacity to interpret and respond to human mental states.
arXiv Detail & Related papers (2025-04-26T10:17:48Z) - Measurement of LLM's Philosophies of Human Nature [113.47929131143766]
We design the standardized psychological scale specifically targeting large language models (LLM)<n>We show that current LLMs exhibit a systemic lack of trust in humans.<n>We propose a mental loop learning framework, which enables LLM to continuously optimize its value system.
arXiv Detail & Related papers (2025-04-03T06:22:19Z) - A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks [0.0]
This systematic review synthesizes current efforts to assess large language models' (LLMs) ability to perform ToM tasks.<n>A recurring theme in the literature reveals that while LLMs demonstrate emerging competence in ToM tasks, significant gaps persist in their emulation of human cognitive abilities.
arXiv Detail & Related papers (2025-02-12T21:19:30Z) - A Survey on Human-Centric LLMs [11.49752599240738]
Large language models (LLMs) can simulate human cognition and behavior.<n>This survey focuses on their performance in both individual tasks and collective tasks.
arXiv Detail & Related papers (2024-11-20T12:34:44Z) - Quantifying AI Psychology: A Psychometrics Benchmark for Large Language Models [57.518784855080334]
Large Language Models (LLMs) have demonstrated exceptional task-solving capabilities, increasingly adopting roles akin to human-like assistants.
This paper presents a framework for investigating psychology dimension in LLMs, including psychological identification, assessment dataset curation, and assessment with results validation.
We introduce a comprehensive psychometrics benchmark for LLMs that covers six psychological dimensions: personality, values, emotion, theory of mind, motivation, and intelligence.
arXiv Detail & Related papers (2024-06-25T16:09:08Z) - OpenToM: A Comprehensive Benchmark for Evaluating Theory-of-Mind Reasoning Capabilities of Large Language Models [17.042114879350788]
Neural Theory-of-Mind (N-ToM) machine's ability to understand and keep track of the mental states of others is pivotal in developing socially intelligent agents.
OpenToM is a new benchmark for assessing N-ToM with longer and clearer narrative stories, explicit personality traits, and actions triggered by character intentions.
We reveal that state-of-the-art LLMs thrive at modeling certain aspects of mental states in the physical world but fall short when tracking characters' mental states in the psychological world.
arXiv Detail & Related papers (2024-02-08T20:35:06Z) - MR-GSM8K: A Meta-Reasoning Benchmark for Large Language Model Evaluation [60.65820977963331]
We introduce a novel evaluation paradigm for Large Language Models (LLMs)
This paradigm shifts the emphasis from result-oriented assessments, which often neglect the reasoning process, to a more comprehensive evaluation.
By applying this paradigm in the GSM8K dataset, we have developed the MR-GSM8K benchmark.
arXiv Detail & Related papers (2023-12-28T15:49:43Z) - Neural Theory-of-Mind? On the Limits of Social Intelligence in Large LMs [77.88043871260466]
We show that one of today's largest language models lacks this kind of social intelligence out-of-the box.
We conclude that person-centric NLP approaches might be more effective towards neural Theory of Mind.
arXiv Detail & Related papers (2022-10-24T14:58:58Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.