Related papers: ToMBench: Benchmarking Theory of Mind in Large Language Models

ToMBench: Benchmarking Theory of Mind in Large Language Models

URL: http://arxiv.org/abs/2402.15052v1
Date: Fri, 23 Feb 2024 02:05:46 GMT
Title: ToMBench: Benchmarking Theory of Mind in Large Language Models
Authors: Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, Minlie Huang
Abstract summary: ToM is the cognitive capability to perceive and ascribe mental states to oneself and others. Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination. We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
Score: 42.80231362967291
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others. Recent research has sparked a debate over whether large language models (LLMs) exhibit a form of ToM. However, existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination, yielding inadequate assessments. To address this gap, we introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage. Based on ToMBench, we conduct extensive experiments to evaluate the ToM performance of 10 popular LLMs across tasks and abilities. We find that even the most advanced LLMs like GPT-4 lag behind human performance by over 10% points, indicating that LLMs have not achieved a human-level theory of mind yet. Our aim with ToMBench is to enable an efficient and effective evaluation of LLMs' ToM capabilities, thereby facilitating the development of LLMs with inherent social intelligence.

Related papers

GPT-4o Lacks Core Features of Theory of Mind [0.09320657506524145]
We use a cognitively-grounded definition of ToM to develop and test a new evaluation framework.<n>We find that even though LLMs succeed in approxing human judgments in a simple ToM paradigm, they fail at a logically equivalent task.
arXiv Detail & Related papers (2026-02-12T16:33:58Z)
Rethinking Theory of Mind Benchmarks for LLMs: Towards A User-Centered Perspective [24.27038998164743]
Theory-of-Mind (ToM) tasks are designed for humans to benchmark LLM's ToM capabilities. This approach has a number of limitations. Taking a human-computer interaction (HCI) perspective, these limitations prompt us to rethink the definition and criteria of ToM in ToM benchmarks.
arXiv Detail & Related papers (2025-04-15T03:44:43Z)
Re-evaluating Theory of Mind evaluation in large language models [3.262532929657758]
We take inspiration from cognitive science to re-evaluate the state of ToM evaluation in large language models. A major reason for the disagreement on whether LLMs have ToM is a lack of clarity on whether models should be expected to match human behaviors. We conclude by discussing several directions for future research, including the relationship between ToM and pragmatic communication.
arXiv Detail & Related papers (2025-02-28T14:36:57Z)
A Systematic Review on the Evaluation of Large Language Models in Theory of Mind Tasks [0.0]
This systematic review synthesizes current efforts to assess large language models' (LLMs) ability to perform ToM tasks. A recurring theme in the literature reveals that while LLMs demonstrate emerging competence in ToM tasks, significant gaps persist in their emulation of human cognitive abilities.
arXiv Detail & Related papers (2025-02-12T21:19:30Z)
Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models [51.91448005607405]
We evaluate key human ToM precursors by annotating characters' perceptions on ToMi and FANToM. We present PercepToM, a novel ToM method leveraging LLMs' strong perception inference capability while supplementing their limited perception-to-belief inference.
arXiv Detail & Related papers (2024-07-08T14:58:29Z)
CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark. In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship. We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications. Existing automatic evaluation methodologies for MLLMs are mainly limited in evaluating queries without considering user experiences. We propose a new evaluation paradigm for MLLMs, which is evaluating MLLMs with per-sample criteria using potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests [1.099532646524593]
We test 11 base- and instruction-tuned Large Language Models (LLMs) on capabilities relevant to Theory of Mind (ToM) We find that instruction-tuned LLMs from the GPT family outperform other models, and often also children. We suggest that the interlinked evolution and development of language and ToM may help explain what instruction-tuning adds.
arXiv Detail & Related papers (2023-10-31T09:55:07Z)
A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry. This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
Understanding Social Reasoning in Language Models with Language Models [34.068368860882586]
We present a novel framework for generating evaluations with Large Language Models (LLMs) by populating causal templates. We create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations. We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations.
arXiv Detail & Related papers (2023-06-21T16:42:15Z)
ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind [3.9599054392856483]
We present ToMChallenges, a dataset for comprehensively evaluating the Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks. Our evaluation results and error analyses show that LLMs have inconsistent behaviors across prompts and tasks.
arXiv Detail & Related papers (2023-05-24T11:54:07Z)
Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM) We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)

This list is automatically generated from the titles and abstracts of the papers in this site.