ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind
- URL: http://arxiv.org/abs/2305.15068v3
- Date: Mon, 28 Oct 2024 00:10:38 GMT
- Title: ToMChallenges: A Principle-Guided Dataset and Diverse Evaluation Tasks for Exploring Theory of Mind
- Authors: Xiaomeng Ma, Lingyu Gao, Qihui Xu,
- Abstract summary: We present ToMChallenges, a dataset for comprehensively evaluating the Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks.
Our evaluation results and error analyses show that LLMs have inconsistent behaviors across prompts and tasks.
- Score: 3.9599054392856483
- License:
- Abstract: Theory of Mind (ToM), the capacity to comprehend the mental states of distinct individuals, is essential for numerous practical applications. With the development of large language models (LLMs), there is a heated debate about whether they are able to perform ToM tasks. Previous studies have used different tasks and prompts to test the ToM on LLMs and the results are inconsistent: some studies asserted these models are capable of exhibiting ToM, while others suggest the opposite. In this study, We present ToMChallenges, a dataset for comprehensively evaluating the Theory of Mind based on the Sally-Anne and Smarties tests with a diverse set of tasks. In addition, we also propose an auto-grader to streamline the answer evaluation process. We tested three models: davinci, turbo, and gpt-4. Our evaluation results and error analyses show that LLMs have inconsistent behaviors across prompts and tasks. Performing the ToM tasks robustly remains a challenge for the LLMs. In addition, our paper wants to raise awareness in evaluating the ToM in LLMs and we want to invite more discussion on how to design the prompts and tasks for ToM tasks that can better assess the LLMs' ability.
Related papers
- Probing the Robustness of Theory of Mind in Large Language Models [6.7932860553262415]
We introduce a novel dataset of 68 tasks for probing ToM in LLMs.
We evaluate the ToM performance of four SotA open source LLMs on our dataset and the dataset introduced by (Kosinski, 2023)
We find a consistent tendency in all tested LLMs to perform poorly on tasks that require the realization that an agent has knowledge of automatic state changes in its environment.
arXiv Detail & Related papers (2024-10-08T18:13:27Z) - Toward Self-Improvement of LLMs via Imagination, Searching, and Criticizing [56.75702900542643]
We introduce AlphaLLM for the self-improvements of Large Language Models.
It integrates Monte Carlo Tree Search (MCTS) with LLMs to establish a self-improving loop.
Our experimental results show that AlphaLLM significantly enhances the performance of LLMs without additional annotations.
arXiv Detail & Related papers (2024-04-18T15:21:34Z) - How Far Are We from Intelligent Visual Deductive Reasoning? [41.4377002379162]
We dig into vision-based deductive reasoning, a more sophisticated but less explored realm.
We find previously unexposed blindspots in the current SOTA VLMs.
We find that certain standard strategies that are effective when applied to LLMs do not seamlessly translate to the challenges presented by visual reasoning tasks.
arXiv Detail & Related papers (2024-03-07T18:35:54Z) - ToMBench: Benchmarking Theory of Mind in Large Language Models [42.80231362967291]
ToM is the cognitive capability to perceive and ascribe mental states to oneself and others.
Existing ToM evaluations are hindered by challenges such as constrained scope, subjective judgment, and unintended contamination.
We introduce ToMBench with three key characteristics: a systematic evaluation framework encompassing 8 tasks and 31 abilities in social cognition, a multiple-choice question format to support automated and unbiased evaluation, and a build-from-scratch bilingual inventory to strictly avoid data leakage.
arXiv Detail & Related papers (2024-02-23T02:05:46Z) - Theory of Mind in Large Language Models: Examining Performance of 11
State-of-the-Art models vs. Children Aged 7-10 on Advanced Tests [1.099532646524593]
We test 11 base- and instruction-tuned Large Language Models (LLMs) on capabilities relevant to Theory of Mind (ToM)
We find that instruction-tuned LLMs from the GPT family outperform other models, and often also children.
We suggest that the interlinked evolution and development of language and ToM may help explain what instruction-tuning adds.
arXiv Detail & Related papers (2023-10-31T09:55:07Z) - TRACE: A Comprehensive Benchmark for Continual Learning in Large
Language Models [52.734140807634624]
Aligned large language models (LLMs) demonstrate exceptional capabilities in task-solving, following instructions, and ensuring safety.
Existing continual learning benchmarks lack sufficient challenge for leading aligned LLMs.
We introduce TRACE, a novel benchmark designed to evaluate continual learning in LLMs.
arXiv Detail & Related papers (2023-10-10T16:38:49Z) - MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities [159.9847317300497]
We propose MM-Vet, an evaluation benchmark that examines large multimodal models (LMMs) on complicated multimodal tasks.
Recent LMMs have shown various intriguing abilities, such as solving math problems written on the blackboard, reasoning about events and celebrities in news images, and explaining visual jokes.
arXiv Detail & Related papers (2023-08-04T17:59:47Z) - Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate [85.3444184685235]
We propose a Multi-Agent Debate (MAD) framework, in which multiple agents express their arguments in the state of "tit for tat" and a judge manages the debate process to obtain a final solution.
Our framework encourages divergent thinking in LLMs which would be helpful for tasks that require deep levels of contemplation.
arXiv Detail & Related papers (2023-05-30T15:25:45Z) - Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in
Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM)
We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z) - Multi-Task Instruction Tuning of LLaMa for Specific Scenarios: A
Preliminary Study on Writing Assistance [60.40541387785977]
Small foundational models can display remarkable proficiency in tackling diverse tasks when fine-tuned using instruction-driven data.
In this work, we investigate a practical problem setting where the primary focus is on one or a few particular tasks rather than general-purpose instruction following.
Experimental results show that fine-tuning LLaMA on writing instruction data significantly improves its ability on writing tasks.
arXiv Detail & Related papers (2023-05-22T16:56:44Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.