Evaluating Large Language Models in Theory of Mind Tasks
- URL: http://arxiv.org/abs/2302.02083v7
- Date: Mon, 04 Nov 2024 19:51:53 GMT
- Title: Evaluating Large Language Models in Theory of Mind Tasks
- Authors: Michal Kosinski
- Abstract summary: Eleven Large Language Models (LLMs) were assessed using a custom-made battery of false-belief tasks.
The battery included 640 prompts spread across 40 diverse tasks, each one including a false-belief scenario.
To solve a single task, a model needed to correctly answer 16 prompts across all eight scenarios.
- Score: 11.622327857276389
- License:
- Abstract: Eleven Large Language Models (LLMs) were assessed using a custom-made battery of false-belief tasks, considered a gold standard in testing Theory of Mind (ToM) in humans. The battery included 640 prompts spread across 40 diverse tasks, each one including a false-belief scenario, three closely matched true-belief control scenarios, and the reversed versions of all four. To solve a single task, a model needed to correctly answer 16 prompts across all eight scenarios. Smaller and older models solved no tasks; GPT-3-davinci-003 (from November 2022) and ChatGPT-3.5-turbo (from March 2023) solved 20% of the tasks; ChatGPT-4 (from June 2023) solved 75% of the tasks, matching the performance of six-year-old children observed in past studies. We explore the potential interpretation of these findings, including the intriguing possibility that ToM, previously considered exclusive to humans, may have spontaneously emerged as a byproduct of LLMs' improving language skills.
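To make the scoring rule concrete, here is a minimal sketch of the all-or-nothing criterion described in the abstract. The data layout and function names are illustrative, not the authors' code, and the two-prompts-per-scenario grouping is inferred from the counts (640 prompts / 40 tasks / 8 scenarios).
```python
# Illustrative sketch of the all-or-nothing scoring criterion; data layout and
# names are assumptions, not the authors' code.
from typing import Dict, List

# One task = eight scenarios (1 false-belief, 3 true-belief controls, and the
# reversed versions of all four), each graded on two prompts -> 16 prompts.
TaskGrades = Dict[str, List[bool]]  # scenario name -> per-prompt correctness

def task_solved(task: TaskGrades) -> bool:
    """A task counts as solved only if all 16 prompts across all eight
    scenarios were answered correctly."""
    prompts = [ok for grades in task.values() for ok in grades]
    assert len(task) == 8 and len(prompts) == 16
    return all(prompts)

def battery_score(tasks: List[TaskGrades]) -> float:
    """Fraction of the 40 tasks solved, e.g. 0.75 for ChatGPT-4."""
    return sum(task_solved(t) for t in tasks) / len(tasks)
```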
Related papers
- Evaluating Large Language Models on Spatial Tasks: A Multi-Task Benchmarking Study [4.80612909282198]
This study introduces a novel multi-task spatial evaluation dataset.
The dataset encompasses twelve distinct task types, including spatial understanding and path planning.
The study highlights the impact of prompt strategies on model performance in specific tasks.
arXiv Detail & Related papers (2024-08-26T17:25:16Z)
- Do Large Language Models Understand Conversational Implicature -- A case study with a chinese sitcom [4.142301960178498]
SwordsmanImp is the first Chinese multi-turn-dialogue-based dataset aimed at conversational implicature.
It includes 200 carefully handcrafted questions, all annotated on which Gricean maxims have been violated.
Our results show that GPT-4 attains human-level accuracy (94%) on multiple-choice questions.
Other models, including GPT-3.5 and several open-source models, show lower accuracy, ranging from 20% to 60%, on multiple-choice questions.
arXiv Detail & Related papers (2024-04-30T12:43:53Z)
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models [52.31950122881687]
We introduce a new framework for language model inference, Tree of Thoughts (ToT).
ToT generalizes over the popular Chain of Thought approach to prompting language models.
Our experiments show that ToT significantly enhances language models' problem-solving abilities (a rough sketch of the search loop follows this entry).
arXiv Detail & Related papers (2023-05-17T23:16:17Z)
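As a rough illustration of the Tree of Thoughts idea above (not the authors' implementation), the sketch below runs a bounded, beam-pruned search over partial "thoughts", using the language model both to propose continuations and to score them; `propose` and `score` are hypothetical stand-ins for model calls.
```python
# Minimal Tree-of-Thoughts-style search loop. `propose` and `score` are
# hypothetical LLM-backed callables, not part of any real API.
from typing import Callable, List

def tot_search(
    problem: str,
    propose: Callable[[str, List[str]], List[str]],  # propose next thoughts
    score: Callable[[str, List[str]], float],        # rate a partial chain
    depth: int = 3,
    beam: int = 5,
) -> List[str]:
    """Return the highest-scoring chain of thoughts after a bounded search."""
    frontier: List[List[str]] = [[]]  # each entry is a partial chain of thoughts
    for _ in range(depth):
        candidates = [path + [t] for path in frontier for t in propose(problem, path)]
        # Keep only the `beam` most promising partial chains (pruning step).
        candidates.sort(key=lambda path: score(problem, path), reverse=True)
        frontier = candidates[:beam] or frontier
    return max(frontier, key=lambda path: score(problem, path))
```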
- Boosting Theory-of-Mind Performance in Large Language Models via Prompting [2.538209532048867]
This study measures the ToM performance of GPT-4 and three GPT-3.5 variants.
We investigated the effectiveness of in-context learning in improving ToM comprehension (a sketch of such a few-shot prompt follows this entry).
arXiv Detail & Related papers (2023-04-22T22:50:50Z)
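In-context learning in this setting means prepending worked examples, typically with step-by-step reasoning, to the target false-belief question. The sketch below builds such a prompt; the demonstration text and helper names are invented for illustration and are not taken from the paper.
```python
# Hypothetical few-shot prompt builder for a false-belief question; the
# demonstration text is made up for illustration, not taken from the paper.
FEW_SHOT_EXAMPLES = [
    {
        "story": "Anna puts her keys in the drawer and leaves. Ben moves them to the shelf.",
        "question": "Where will Anna look for her keys?",
        "reasoning": "Anna did not see Ben move the keys, so she still believes they are in the drawer.",
        "answer": "In the drawer.",
    },
]

def build_tom_prompt(story: str, question: str) -> str:
    """Prepend reasoned demonstrations (in-context examples) to the target question."""
    parts = []
    for ex in FEW_SHOT_EXAMPLES:
        parts.append(
            f"Story: {ex['story']}\nQuestion: {ex['question']}\n"
            f"Reasoning: {ex['reasoning']}\nAnswer: {ex['answer']}\n"
        )
    parts.append(f"Story: {story}\nQuestion: {question}\nReasoning:")
    return "\n".join(parts)
```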
- How Robust is GPT-3.5 to Predecessors? A Comprehensive Study on Language Understanding Tasks [65.7949334650854]
GPT-3.5 models have demonstrated impressive performance in various Natural Language Processing (NLP) tasks.
However, their robustness and abilities to handle various complexities of the open world have yet to be explored.
We show that GPT-3.5 faces some specific robustness challenges, including instability, prompt sensitivity, and number sensitivity.
arXiv Detail & Related papers (2023-03-01T07:39:01Z)
- True Detective: A Deep Abductive Reasoning Benchmark Undoable for GPT-3 and Challenging for GPT-4 [0.0]
Large language models (LLMs) have demonstrated solid zero-shot reasoning capabilities.
In this paper, we introduce a benchmark consisting of 191 long-form (1200 words on average) mystery narratives constructed as detective puzzles.
We show that GPT-3 models barely outperform random guessing on this benchmark (28% accuracy), while the state-of-the-art GPT-4 solves only 38% of the puzzles.
arXiv Detail & Related papers (2022-12-20T09:34:43Z)
- Holistic Evaluation of Language Models [183.94891340168175]
Language models (LMs) are becoming the foundation for almost all major language technologies, but their capabilities, limitations, and risks are not well understood.
We present Holistic Evaluation of Language Models (HELM) to improve the transparency of language models.
arXiv Detail & Related papers (2022-11-16T18:51:34Z)
- Making Large Language Models Better Reasoners with Step-Aware Verifier [49.16750018427259]
DIVERSE (Diverse Verifier on Reasoning Step) is a novel approach that further enhances the reasoning capability of language models.
We evaluate DIVERSE on the code-davinci language model and show that it achieves new state-of-the-art results on six of eight reasoning benchmarks (a sketch of the verifier-weighted voting step follows this entry).
arXiv Detail & Related papers (2022-06-06T03:38:36Z)
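At a high level, the DIVERSE approach samples multiple reasoning paths and aggregates their answers with a learned verifier. The sketch below shows only a verifier-weighted voting step, with `verifier_score` as a hypothetical stand-in for the trained verifier model, so it illustrates the general idea rather than the authors' pipeline.
```python
# Verifier-weighted voting over sampled reasoning paths (illustrative only;
# `verifier_score` stands in for a trained verifier model).
from collections import defaultdict
from typing import Callable, List, Tuple

def weighted_vote(
    paths: List[Tuple[str, str]],            # (reasoning path, final answer)
    verifier_score: Callable[[str], float],  # higher = more trustworthy path
) -> str:
    """Pick the answer whose supporting reasoning paths get the most verifier mass."""
    votes = defaultdict(float)
    for reasoning, answer in paths:
        votes[answer] += verifier_score(reasoning)
    return max(votes, key=votes.get)
```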
- Few-shot Learning with Multilingual Language Models [66.49496434282564]
We train multilingual autoregressive language models on a balanced corpus covering a diverse set of languages.
Our largest model sets a new state of the art in few-shot learning in more than 20 representative languages.
We present a detailed analysis of where the model succeeds and fails, showing in particular that it enables cross-lingual in-context learning.
arXiv Detail & Related papers (2021-12-20T16:52:35Z)
- Measuring Massive Multitask Language Understanding [79.6985576698597]
The proposed test covers 57 tasks, including elementary mathematics, US history, computer science, law, and more.
The largest GPT-3 model improves over random chance by almost 20 percentage points on average.
Models also have lopsided performance and frequently do not know when they are wrong.
arXiv Detail & Related papers (2020-09-07T17:59:25Z)