How FaR Are Large Language Models From Agents with Theory-of-Mind?
- URL: http://arxiv.org/abs/2310.03051v1
- Date: Wed, 4 Oct 2023 06:47:58 GMT
- Title: How FaR Are Large Language Models From Agents with Theory-of-Mind?
- Authors: Pei Zhou, Aman Madaan, Srividya Pranavi Potharaju, Aditya Gupta, Kevin R. McKee, Ari Holtzman, Jay Pujara, Xiang Ren, Swaroop Mishra, Aida Nematzadeh, Shyam Upadhyay, Manaal Faruqui
- Abstract summary: We propose a new evaluation paradigm for large language models (LLMs): Thinking for Doing (T4D).
T4D requires models to connect inferences about others' mental states to actions in social scenarios.
We introduce a zero-shot prompting framework, Foresee and Reflect (FaR), which provides a reasoning structure that encourages LLMs to anticipate future challenges.
- Score: 69.41586417697732
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: "Thinking is for Doing." Humans can infer other people's mental states from
observations--an ability called Theory-of-Mind (ToM)--and subsequently act
pragmatically on those inferences. Existing question answering benchmarks such
as ToMi ask models questions to make inferences about beliefs of characters in
a story, but do not test whether models can then use these inferences to guide
their actions. We propose a new evaluation paradigm for large language models
(LLMs): Thinking for Doing (T4D), which requires models to connect inferences
about others' mental states to actions in social scenarios. Experiments on T4D
demonstrate that LLMs such as GPT-4 and PaLM 2 seemingly excel at tracking
characters' beliefs in stories, but they struggle to translate this capability
into strategic action. Our analysis reveals that the core challenge for LLMs
lies in identifying the implicit inferences about mental states that lead to
choosing the correct action in T4D, without being explicitly asked about them
as in ToMi. To bridge this gap, we introduce a zero-shot prompting framework, Foresee
and Reflect (FaR), which provides a reasoning structure that encourages LLMs to
anticipate future challenges and reason about potential actions. FaR boosts
GPT-4's performance from 50% to 71% on T4D, outperforming other prompting
methods such as Chain-of-Thought and Self-Ask. Moreover, FaR generalizes to
diverse out-of-distribution story structures and scenarios that also require
ToM inferences to choose an action, consistently outperforming other methods
including few-shot in-context learning.
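As a rough illustration of the kind of reasoning structure the abstract describes, the sketch below assembles a two-step zero-shot prompt: first anticipate what may happen to each character, then reason about which action would help. The function name `build_far_prompt`, the wording of both steps, and the example story are assumptions made for illustration; they are not the prompt used in the paper.

```python
from __future__ import annotations


def build_far_prompt(story: str, question: str, choices: list[str]) -> str:
    """Assemble a hypothetical Foresee-and-Reflect style zero-shot prompt.

    The step wording below is illustrative only, not the paper's prompt.
    """
    # Label the candidate actions A, B, C, ...
    options = "\n".join(
        f"{chr(ord('A') + i)}. {choice}" for i, choice in enumerate(choices)
    )
    return (
        f"Story:\n{story}\n\n"
        f"Question: {question}\n{options}\n\n"
        # Foresee: anticipate likely future events and challenges.
        "Step 1 (Foresee): For each character, describe what they are likely "
        "to do next and what challenge they might run into.\n"
        # Reflect: connect those anticipated challenges to a concrete action.
        "Step 2 (Reflect): For each option, explain whether it would help a "
        "character with the challenge identified in Step 1.\n"
        "Answer with the letter of the chosen option."
    )


if __name__ == "__main__":
    # A made-up ToMi-style story, used only to show the prompt layout.
    demo = build_far_prompt(
        story=(
            "Tom puts his backpack in the closet and leaves. "
            "Ella moves the backpack to the garage. "
            "Tom and Ella both plan to use the backpack soon."
        ),
        question="Who would benefit most from receiving helpful information?",
        choices=["Tom", "Ella", "Neither"],
    )
    print(demo)
```

The resulting string would be sent to a model as a single zero-shot query; how the answer is parsed and scored is left out of this sketch.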
Related papers
- SimpleToM: Exposing the Gap between Explicit ToM Inference and Implicit ToM Application in LLMs [72.06808538971487]
We test whether large language models (LLMs) can implicitly apply a "theory of mind" (ToM) to predict behavior.
We create a new dataset, SimpleToM, containing stories with three questions that test different degrees of ToM reasoning.
To our knowledge, SimpleToM is the first dataset to explore downstream reasoning requiring knowledge of mental states in realistic scenarios.
arXiv Detail & Related papers (2024-10-17T15:15:00Z)
- TypedThinker: Typed Thinking Improves Large Language Model Reasoning [44.8904486513791]
We propose TypedThinker, a framework that enhances Large Language Models' problem-solving abilities.
TypedThinker addresses two key challenges: selecting appropriate reasoning types for given problems and effectively implementing specific reasoning types.
Experimental results demonstrate significant improvements over baseline models, with accuracy increases of 3.4% for Mistral 7B and 16.7% for LLaMA3 8B.
arXiv Detail & Related papers (2024-10-02T18:54:45Z)
- An Incomplete Loop: Deductive, Inductive, and Abductive Learning in Large Language Models [99.31449616860291]
Modern language models (LMs) can learn to perform new tasks in different ways.
In instruction following, the target task is described explicitly in natural language; in few-shot prompting, the task is specified implicitly.
In instruction inference, LMs are presented with in-context examples and are then prompted to generate a natural language task description.
arXiv Detail & Related papers (2024-04-03T19:31:56Z)
- What's Next in Affective Modeling? Large Language Models [3.0902630634005797]
GPT-4 performs well across multiple emotion tasks.
It can distinguish emotion theories and come up with emotional stories.
We suggest that LLMs could play an important role in affective modeling.
arXiv Detail & Related papers (2023-10-03T16:39:20Z)
- Probing the Multi-turn Planning Capabilities of LLMs via 20 Question Games [14.063311955315077]
Large language models (LLMs) are effective at answering questions that are clearly asked.
When faced with ambiguous queries they can act unpredictably and produce incorrect outputs.
This underscores the need for the development of intelligent agents capable of asking clarification questions to resolve ambiguities effectively.
arXiv Detail & Related papers (2023-10-02T16:55:37Z)
- Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models [82.50173296858377]
Many anecdotal examples were used to suggest that newer large language models (LLMs) like ChatGPT and GPT-4 exhibit Neural Theory-of-Mind (N-ToM).
We investigate the extent of LLMs' N-ToM through an extensive evaluation on 6 tasks and find that while LLMs exhibit certain N-ToM abilities, this behavior is far from being robust.
arXiv Detail & Related papers (2023-05-24T06:14:31Z)
- Tree of Thoughts: Deliberate Problem Solving with Large Language Models [52.31950122881687]
We introduce a new framework for language model inference, Tree of Thoughts (ToT).
ToT generalizes over the popular Chain of Thought approach to prompting language models.
Our experiments show that ToT significantly enhances language models' problem-solving abilities.
arXiv Detail & Related papers (2023-05-17T23:16:17Z)
- Document-Level Machine Translation with Large Language Models [91.03359121149595]
Large language models (LLMs) can produce coherent, cohesive, relevant, and fluent answers for various natural language processing (NLP) tasks.
This paper provides an in-depth evaluation of LLMs' ability on discourse modeling.
arXiv Detail & Related papers (2023-04-05T03:49:06Z)
- The Goldilocks of Pragmatic Understanding: Fine-Tuning Strategy Matters for Implicature Resolution by LLMs [26.118193748582197]
We evaluate four categories of widely used state-of-the-art models.
We find that, despite only evaluating on utterances that require a binary inference, models in three of these categories perform close to random.
These results suggest that certain fine-tuning strategies are far better at inducing pragmatic understanding in models.
arXiv Detail & Related papers (2022-10-26T19:04:23Z)