Theory of Mind in Large Language Models: Examining Performance of 11
State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests
- URL: http://arxiv.org/abs/2310.20320v1
- Date: Tue, 31 Oct 2023 09:55:07 GMT
- Title: Theory of Mind in Large Language Models: Examining Performance of 11
State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests
- Authors: Max J. van Duijn, Bram M.A. van Dijk, Tom Kouwenhoven, Werner de Valk,
Marco R. Spruit, and Peter van der Putten
- Abstract summary: We test 11 base- and instruction-tuned Large Language Models (LLMs) on capabilities relevant to Theory of Mind (ToM).
We find that instruction-tuned LLMs from the GPT family outperform other models, and often also children.
We suggest that the interlinked evolution and development of language and ToM may help explain what instruction-tuning adds.
- Score: 1.099532646524593
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: To what degree should we ascribe cognitive capacities to Large Language
Models (LLMs), such as the ability to reason about intentions and beliefs known
as Theory of Mind (ToM)? Here we add to this emerging debate by (i) testing 11
base- and instruction-tuned LLMs on capabilities relevant to ToM beyond the
dominant false-belief paradigm, including non-literal language usage and
recursive intentionality; (ii) using newly rewritten versions of standardized
tests to gauge LLMs' robustness; (iii) prompting and scoring for open besides
closed questions; and (iv) benchmarking LLM performance against that of
children aged 7-10 on the same tasks. We find that instruction-tuned LLMs from
the GPT family outperform other models, and often also children. Base-LLMs are
mostly unable to solve ToM tasks, even with specialized prompting. We suggest
that the interlinked evolution and development of language and ToM may help
explain what instruction-tuning adds: rewarding cooperative communication that
takes into account interlocutor and context. We conclude by arguing for a
nuanced perspective on ToM in LLMs.
Related papers
- ToM-LM: Delegating Theory of Mind Reasoning to External Symbolic Executors in Large Language Models [5.455744338342196]
Theory of Mind (ToM) refers to the ability of individuals to attribute mental states to others.
Large Language Models (LLMs) have shown some promise with ToM ability, but they still struggle with complex ToM reasoning.
arXiv Detail & Related papers (2024-04-23T20:59:03Z)
- FAC$^2$E: Better Understanding Large Language Model Capabilities by Dissociating Language and Cognition [56.76951887823882]
Large language models (LLMs) are primarily evaluated by overall performance on various text understanding and generation tasks.
We present FAC$^2$E, a framework for Fine-grAined and Cognition-grounded LLMs' Capability Evaluation.
arXiv Detail & Related papers (2024-02-29T21:05:37Z)
- ToMBench: Benchmarking Theory of Mind in Large Language Models [42.80231362967291]
ToM is the cognitive capability to perceive and ascribe mental states to oneself and others.
Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination.
We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
arXiv Detail & Related papers (2024-02-23T02:05:46Z)
- CLOMO: Counterfactual Logical Modification with Large Language Models [109.60793869938534]
We introduce a novel task, Counterfactual Logical Modification (CLOMO), and a high-quality human-annotated benchmark.
In this task, LLMs must adeptly alter a given argumentative text to uphold a predetermined logical relationship.
We propose an innovative evaluation metric, the Self-Evaluation Score (SES), to directly evaluate the natural language output of LLMs.
arXiv Detail & Related papers (2023-11-29T08:29:54Z)
- AlignedCoT: Prompting Large Language Models via Native-Speaking Demonstrations [52.43593893122206]
AlignedCoT is an in-context learning technique for invoking Large Language Models.
It achieves consistent and correct step-wise prompts in zero-shot scenarios.
We conduct experiments on mathematical reasoning and commonsense reasoning.
arXiv Detail & Related papers (2023-11-22T17:24:21Z)
- Large Language Models: The Need for Nuance in Current Debates and a Pragmatic Perspective on Understanding [1.3654846342364308]
Large Language Models (LLMs) are unparalleled in their ability to generate grammatically correct, fluent text.
This position paper critically assesses three points recurring in critiques of LLM capacities.
We outline a pragmatic perspective on the issue of 'real' understanding and intentionality in LLMs.
arXiv Detail & Related papers (2023-10-30T15:51:04Z)
- HI-TOM: A Benchmark for Evaluating Higher-Order Theory of Mind Reasoning in Large Language Models [31.831042765744204]
Theory of Mind (ToM) is the ability to reason about one's own and others' mental states.
We introduce HI-TOM, a Higher Order Theory of Mind benchmark.
Our experimental evaluation using various Large Language Models (LLMs) indicates a decline in performance on higher-order ToM tasks.
arXiv Detail & Related papers (2023-10-25T16:41:15Z)
- FANToM: A Benchmark for Stress-testing Machine Theory of Mind in Interactions [94.61530480991627]
Theory of mind evaluations currently focus on testing models using passive narratives that inherently lack interactivity.
We introduce FANToM, a new benchmark designed to stress-test ToM within information-asymmetric conversational contexts via question answering.
arXiv Detail & Related papers (2023-10-24T00:24:11Z)
- Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z)
- ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind [3.9599054392856483]
We present ToMChallenges, a dataset for comprehensively evaluating the Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks.
Our evaluation results and error analyses show that LLMs have inconsistent behaviors across prompts and tasks.
arXiv Detail & Related papers (2023-05-24T11:54:07Z)
- Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM).
We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)