ToMBench: Benchmarking Theory of Mind in Large Language Models
- URL: http://arxiv.org/abs/2402.15052v1
- Date: Fri, 23 Feb 2024 02:05:46 GMT
- Title: ToMBench: Benchmarking Theory of Mind in Large Language Models
- Authors: Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao
Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, Minlie Huang
- Abstract summary: ToM is the cognitive capability to perceive and ascribe mental states to oneself and others.
Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination.
We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
- Score: 42.80231362967291
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Theory of Mind (ToM) is the cognitive capability to perceive and ascribe
mental states to oneself and others. Recent research has sparked a debate over
whether large language models (LLMs) exhibit a form of ToM. However, existing
ToM evaluations are hindered by challenges such as constrained scope,
subjective judgment, and unintended contamination, yielding inadequate
assessments. To address this gap, we introduce ToMBench with three key
characteristics: a systematic evaluation framework encompassing 8 tasks and 31
abilities in social cognition, a multiple-choice question format to support
automated and unbiased evaluation, and a build-from-scratch bilingual inventory
to strictly avoid data leakage. Based on ToMBench, we conduct extensive
experiments to evaluate the ToM performance of 10 popular LLMs across tasks and
abilities. We find that even the most advanced LLMs like GPT-4 lag behind human
performance by over 10% points, indicating that LLMs have not achieved a
human-level theory of mind yet. Our aim with ToMBench is to enable an efficient
and effective evaluation of LLMs' ToM capabilities, thereby facilitating the
development of LLMs with inherent social intelligence.
Related papers
- FAC$^2$E: Better Understanding Large Language Model Capabilities by
Dissociating Language and Cognition [57.747888532651]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z) - CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z) - MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [44.401826163314716]
We propose a new evaluation paradigm for MLLMs using potent MLLM as the judge.
We benchmark 21 popular MLLMs in a pairwise-comparison fashion, showing diverse performance across models.
The validity of our benchmark manifests itself in reaching 88.02% agreement with human evaluation.
arXiv Detail & Related papers (2023-11-23T12:04:25Z) - Theory of Mind in Large Language Models: Examining Performance of 11
State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests [1.099532646524593]
We test 11 base- and instruction-tuned Large Language Models (LLMs) on capabilities relevant to Theory of Mind (ToM)
We find that instruction-tuned LLMs from the GPT family outperform other models, and often also children.
We suggest that the interlinked evolution and development of language and ToM may help explain what instruction-tuning adds.
arXiv Detail & Related papers (2023-10-31T09:55:07Z) - Collaborative Evaluation: Exploring the Synergy of Large Language Models
and Humans for Open-ended Generation Evaluation [71.76872586182981]
Large language models (LLMs) have emerged as a scalable and cost-effective alternative to human evaluations.
We propose a Collaborative Evaluation pipeline CoEval, involving the design of a checklist of task-specific criteria and the detailed evaluation of texts.
arXiv Detail & Related papers (2023-10-30T17:04:35Z) - A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z) - Understanding Social Reasoning in Language Models with Language Models [34.068368860882586]
We present a novel framework for generating evaluations with Large Language Models (LLMs) by populating causal templates.
We create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations.
We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations.
arXiv Detail & Related papers (2023-06-21T16:42:15Z) - ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks
for Exploring Theory of Mind [4.450536872346658]
We present ToMChallenges, a dataset for comprehensively evaluating the Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks.
Our evaluation results and error analyses show that LLMs have inconsistent behaviors across prompts and tasks.
arXiv Detail & Related papers (2023-05-24T11:54:07Z) - Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in
Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM)
We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.