ToMBench: Benchmarking Theory of Mind in Large Language Models
- URL: http://arxiv.org/abs/2402.15052v1
- Date: Fri, 23 Feb 2024 02:05:46 GMT
- Title: ToMBench: Benchmarking Theory of Mind in Large Language Models
- Authors: Zhuang Chen, Jincenzi Wu, Jinfeng Zhou, Bosi Wen, Guanqun Bi, Gongyao Jiang, Yaru Cao, Mengting Hu, Yunghwei Lai, Zexuan Xiong, Minlie Huang
- Abstract summary: Theory of Mind (ToM) is the cognitive capability to perceive and ascribe mental states to oneself and others.
Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination.
We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
- Score: 42.80231362967291
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Theory of Mind (ToM) is the cognitive capability to perceive and ascribe
mental states to oneself and others. Recent research has sparked a debate over
whether large language models (LLMs) exhibit a form of ToM. However, existing
ToM evaluations are hindered by challenges such as constrained scope,
subjective judgment, and unintended contamination, yielding inadequate
assessments. To address this gap, we introduce ToMBench with three key
characteristics: a systematic evaluation framework encompassing 8 tasks and 31
abilities in social cognition, a multiple-choice question format to support
automated and unbiased evaluation, and a build-from-scratch bilingual inventory
to strictly avoid data leakage. Based on ToMBench, we conduct extensive
experiments to evaluate the ToM performance of 10 popular LLMs across tasks and
abilities. We find that even the most advanced LLMs like GPT-4 lag behind human
performance by over 10 percentage points, indicating that LLMs have not achieved a
human-level theory of mind yet. Our aim with ToMBench is to enable an efficient
and effective evaluation of LLMs' ToM capabilities, thereby facilitating the
development of LLMs with inherent social intelligence.
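The multiple-choice format mentioned in the abstract lends itself to automated scoring. The following is a minimal sketch of that idea, not the authors' evaluation code: the function names `accuracy` and `gap_in_points` are hypothetical, and the toy answer lists are made up for illustration rather than taken from ToMBench.

```python
from typing import List


def accuracy(predictions: List[str], gold: List[str]) -> float:
    """Fraction of questions whose predicted option letter matches the gold letter."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("no questions to score")
    # Compare option letters case-insensitively, ignoring surrounding whitespace.
    correct = sum(
        p.strip().upper() == g.strip().upper() for p, g in zip(predictions, gold)
    )
    return correct / len(gold)


def gap_in_points(model_acc: float, human_acc: float) -> float:
    """Human-minus-model accuracy gap, expressed in percentage points."""
    return (human_acc - model_acc) * 100


# Hypothetical toy data (assumption: four questions with options A-D).
gold = ["A", "C", "B", "D"]
model = ["A", "C", "D", "D"]
human = ["A", "C", "B", "D"]

model_acc = accuracy(model, gold)
human_acc = accuracy(human, gold)
print(f"model={model_acc:.2f} human={human_acc:.2f} "
      f"gap={gap_in_points(model_acc, human_acc):.1f} points")
```

Because each answer is a single option letter, scoring reduces to exact matching and needs no human judge, which is the property the abstract describes as automated and unbiased evaluation.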
Related papers
- Perceptions to Beliefs: Exploring Precursory Inferences for Theory of Mind in Large Language Models [51.91448005607405]
We evaluate key human ToM precursors by annotating characters' perceptions on ToMi and FANToM.
We present PercepToM, a novel ToM method leveraging LLMs' strong perception inference capability while supplementing their limited perception-to-belief inference.
arXiv Detail & Related papers (2024-07-08T14:58:29Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- MLLM-Bench: Evaluating Multimodal LLMs with Per-sample Criteria [49.500322937449326]
Multimodal large language models (MLLMs) have broadened the scope of AI applications.
Existing automatic evaluation methodologies for MLLMs are mainly limited in evaluating queries without considering user experiences.
We propose a new evaluation paradigm that assesses MLLMs against per-sample criteria, using a potent MLLM as the judge.
arXiv Detail & Related papers (2023-11-23T12:04:25Z)
- Theory of Mind in Large Language Models: Examining Performance of 11 State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests [1.099532646524593]
We test 11 base- and instruction-tuned Large Language Models (LLMs) on capabilities relevant to Theory of Mind (ToM).
We find that instruction-tuned LLMs from the GPT family outperform other models, and often also children.
We suggest that the interlinked evolution and development of language and ToM may help explain what instruction-tuning adds.
arXiv Detail & Related papers (2023-10-31T09:55:07Z)
- A Survey on Evaluation of Large Language Models [87.60417393701331]
Large language models (LLMs) are gaining increasing popularity in both academia and industry.
This paper focuses on three key dimensions: what to evaluate, where to evaluate, and how to evaluate.
arXiv Detail & Related papers (2023-07-06T16:28:35Z)
- Understanding Social Reasoning in Language Models with Language Models [34.068368860882586]
We present a novel framework for generating evaluations with Large Language Models (LLMs) by populating causal templates.
We create a new social reasoning benchmark (BigToM) for LLMs which consists of 25 controls and 5,000 model-written evaluations.
We find that human participants rate the quality of our benchmark higher than previous crowd-sourced evaluations and comparable to expert-written evaluations.
arXiv Detail & Related papers (2023-06-21T16:42:15Z)
- ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind [3.9599054392856483]
We present ToMChallenges, a dataset for comprehensively evaluating Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks.
Our evaluation results and error analyses show that LLMs have inconsistent behaviors across prompts and tasks.
arXiv Detail & Related papers (2023-05-24T11:54:07Z)
- Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest that newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM).
We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)