Theory of Mind in Large Language Models: Assessment and Enhancement
- URL: http://arxiv.org/abs/2505.00026v1
- Date: Sat, 26 Apr 2025 10:17:48 GMT
- Title: Theory of Mind in Large Language Models: Assessment and Enhancement
- Authors: Ruirui Chen, Weifeng Jiang, Chengwei Qin, Cheston Tan
- Abstract summary: As Large Language Models (LLMs) become increasingly integrated into daily life, it is crucial to assess and enhance their capacity to interpret and respond to human mental states.
- Score: 14.41464477095448
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Theory of Mind (ToM)-the ability to infer and reason about others' mental states-is fundamental to human social intelligence. As Large Language Models (LLMs) become increasingly integrated into daily life, it is crucial to assess and enhance their capacity to interpret and respond to human mental states. In this paper, we review LLMs' ToM capabilities by examining both evaluation benchmarks and the strategies designed to improve them. We focus on widely adopted story-based benchmarks and provide an in-depth analysis of methods aimed at enhancing ToM in LLMs. Furthermore, we outline promising future research directions informed by recent benchmarks and state-of-the-art approaches. Our survey serves as a valuable resource for researchers interested in advancing LLMs' ToM capabilities.
Related papers
- Rethinking Theory of Mind Benchmarks for LLMs: Towards A User-Centered Perspective [24.27038998164743]
Theory-of-Mind (ToM) tasks designed for humans are used to benchmark LLMs' ToM capabilities. This approach has a number of limitations. Taking a human-computer interaction (HCI) perspective, these limitations prompt us to rethink the definition and criteria of ToM in ToM benchmarks.
arXiv Detail & Related papers (2025-04-15T03:44:43Z)
- Revisiting LLM Evaluation through Mechanism Interpretability: a New Metric and Model Utility Law [99.56567010306807]
Large Language Models (LLMs) have become indispensable across academia, industry, and daily applications.
We propose a novel metric, the Model Utilization Index (MUI), which introduces mechanism interpretability techniques to complement traditional performance metrics.
arXiv Detail & Related papers (2025-04-10T04:09:47Z)
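The abstract does not spell out how the Model Utilization Index is computed, so the snippet below is a purely hypothetical sketch of a utilization-style metric, not the paper's definition: it treats the fraction of above-threshold activations as a proxy for how much of the network a task exercises, and relates accuracy to that proxy.

```python
import numpy as np

def utilization_index(activations: np.ndarray, threshold: float = 0.0) -> float:
    """Fraction of (input, neuron) activations above `threshold` -- a crude
    proxy for how much of the network a task exercises. The real MUI
    definition lives in the paper and is not reproduced here."""
    return float((activations > threshold).mean())

def utility(accuracy: float, mui: float) -> float:
    """Toy 'performance per unit of capacity used': a model that reaches
    the same accuracy while firing fewer neurons scores higher."""
    return accuracy / max(mui, 1e-8)

# Two tasks with equal accuracy but different activation density.
rng = np.random.default_rng(0)
sparse = rng.normal(-1.0, 1.0, size=(128, 4096))  # few activations > 0
dense = rng.normal(0.5, 1.0, size=(128, 4096))    # many activations > 0
print(utility(0.80, utilization_index(sparse)))   # higher utility
print(utility(0.80, utilization_index(dense)))    # lower utility
```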
- A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks [0.0]
This systematic review synthesizes current efforts to assess large language models' (LLMs) ability to perform ToM tasks. A recurring theme in the literature reveals that while LLMs demonstrate emerging competence in ToM tasks, significant gaps persist in their emulation of human cognitive abilities.
arXiv Detail & Related papers (2025-02-12T21:19:30Z)
- MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs [97.94579295913606]
Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia.
In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models.
This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods.
arXiv Detail & Related papers (2024-11-22T18:59:54Z)
- DnA-Eval: Enhancing Large Language Model Evaluation through Decomposition and Aggregation [75.81096662788254]
Large Language Models (LLMs) are scalable and economical evaluators.
How reliable these evaluators are has emerged as a crucial research question.
We propose Decompose and Aggregate, which breaks down the evaluation process into different stages based on pedagogical practices.
arXiv Detail & Related papers (2024-05-24T08:12:30Z)
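A minimal sketch of the decompose-then-aggregate idea described above, assuming a generic `ask_llm` callable for the judge model; the criteria and prompts are illustrative placeholders, not the paper's actual pedagogically derived stages.

```python
from statistics import mean
from typing import Callable

# Illustrative criteria; DnA-Eval's stages are based on pedagogical
# grading practices and may differ from this list.
CRITERIA = ["factual accuracy", "relevance to the question", "coherence"]

def decompose_and_aggregate(question: str, answer: str,
                            ask_llm: Callable[[str], str]) -> float:
    """Score each criterion with a separate judge prompt (decompose),
    then average the per-criterion scores (aggregate)."""
    scores = []
    for criterion in CRITERIA:
        prompt = (
            f"Question: {question}\nAnswer: {answer}\n"
            f"Rate only the {criterion} of the answer from 1 to 5. "
            "Reply with a single integer."
        )
        scores.append(int(ask_llm(prompt).strip()))
    return mean(scores)  # unweighted aggregation for simplicity

# Toy run with a stubbed judge that always replies "4".
print(decompose_and_aggregate("What is 2+2?", "4", lambda p: "4"))  # -> 4
```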
- LLM as a Mastermind: A Survey of Strategic Reasoning with Large Language Models [75.89014602596673]
Strategic reasoning requires understanding and predicting adversary actions in multi-agent settings while adjusting strategies accordingly.
We explore the scope, applications, methodologies, and evaluation metrics related to strategic reasoning with Large Language Models.
The survey underscores the importance of strategic reasoning as a critical cognitive capability and offers insights into future research directions and potential improvements.
arXiv Detail & Related papers (2024-04-01T16:50:54Z)
- ToMBench: Benchmarking Theory of Mind in Large Language Models [41.565202027904476]
ToM is the cognitive capability to perceive and ascribe mental states to oneself and others.
Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination.
We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
arXiv Detail & Related papers (2024-02-23T02:05:46Z)
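ToMBench's multiple-choice format is what makes its evaluation automated and judge-free; a minimal harness of that kind might look like the sketch below, where `MCQItem` and `ask_llm` are hypothetical stand-ins, not ToMBench's actual schema or code.

```python
import re
from dataclasses import dataclass
from typing import Callable

@dataclass
class MCQItem:
    story: str
    question: str
    options: list[str]   # e.g. ["A) in the basket", "B) in the box"]
    answer: str          # gold letter, e.g. "A"

def mcq_accuracy(items: list[MCQItem], ask_llm: Callable[[str], str]) -> float:
    """Exact-match accuracy over answer letters: fully automated,
    with no free-form judging involved."""
    correct = 0
    for item in items:
        prompt = (f"{item.story}\n\nQuestion: {item.question}\n"
                  + "\n".join(item.options)
                  + "\nAnswer with a single letter.")
        match = re.search(r"[A-D]", ask_llm(prompt).upper())
        correct += bool(match and match.group() == item.answer)
    return correct / len(items)

# Toy run with a stubbed model that always answers "A".
item = MCQItem("Sally puts her ball in the basket and leaves. "
               "Anne moves it to the box.",
               "Where will Sally look for the ball?",
               ["A) in the basket", "B) in the box"], "A")
print(mcq_accuracy([item], lambda p: "A"))  # -> 1.0
```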
- Think Twice: Perspective-Taking Improves Large Language Models' Theory-of-Mind Capabilities [63.90227161974381]
SimToM is a novel prompting framework inspired by Simulation Theory's notion of perspective-taking.
Our approach, which requires no additional training and minimal prompt-tuning, shows substantial improvement over existing methods.
arXiv Detail & Related papers (2023-11-16T22:49:27Z)
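A minimal sketch of SimToM-style two-stage prompting: first filter the story down to what the character knows (perspective-taking), then answer from that filtered view alone. The prompts are paraphrased, not the paper's exact wording, and `ask_llm` is a hypothetical stand-in for any chat-completion call.

```python
from typing import Callable

def simtom_answer(story: str, character: str, question: str,
                  ask_llm: Callable[[str], str]) -> str:
    """Two-stage prompting in the spirit of SimToM.

    Stage 1 (perspective-taking): retell the story keeping only events
    the character actually witnessed or knows about.
    Stage 2: answer the question from that filtered story alone.
    """
    filtered = ask_llm(
        f"Here is a story:\n{story}\n"
        f"Retell it keeping only the events that {character} "
        f"directly witnessed or knows about."
    )
    return ask_llm(
        f"{filtered}\n\nYou are {character}. Answer from "
        f"{character}'s point of view: {question}"
    )
```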
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Understanding Social Reasoning in Language Models with Language Models [34.068368860882586]
We present a novel framework for generating evaluations with Large Language Models (LLMs) by populating causal templates.
We create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations.
We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations.
arXiv Detail & Related papers (2023-06-21T16:42:15Z)
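A minimal sketch of generating evaluation items by populating a causal template, as BigToM does at scale; the template, filler values, and gold-answer rule below are invented for illustration and are much simpler than the paper's actual templates.

```python
from itertools import product
from string import Template

# Toy causal template: the agent's belief depends on whether they observe
# the change. BigToM's real templates encode richer causal structure over
# percepts, desires, beliefs, and actions; these fields are invented.
TEMPLATE = Template(
    "$agent puts the $obj in the $loc1. While $agent is away, the $obj "
    "is moved to the $loc2. $agent $sees the move. "
    "Where does $agent think the $obj is?"
)
FILLERS = {
    "agent": ["Noor", "Kai"],
    "obj": ["letter", "phone"],
    "loc1": ["drawer"],
    "loc2": ["backpack"],
    "sees": ["does not see", "sees"],  # flips the causal condition
}

def populate() -> list[dict]:
    """Cross all filler values; derive the gold answer from the causal
    condition so every generated item is self-consistent."""
    items = []
    for values in product(*FILLERS.values()):
        v = dict(zip(FILLERS, values))
        gold = v["loc1"] if v["sees"] == "does not see" else v["loc2"]
        items.append({"question": TEMPLATE.substitute(v), "answer": gold})
    return items

print(len(populate()))  # 2 * 2 * 1 * 1 * 2 = 8 items
```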