Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning
- URL: http://arxiv.org/abs/2510.07974v2
- Date: Sat, 11 Oct 2025 05:57:45 GMT
- Title: Active Confusion Expression in Large Language Models: Leveraging World Models toward Better Social Reasoning
- Authors: Jialu Du, Guiyang Hou, Yihui Fu, Chen Wu, Wenqi Zhang, Yongliang Shen, Weiming Lu
- Abstract summary: Large language models (LLMs) exhibit cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. We propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences.
- Score: 31.08532996770416
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While large language models (LLMs) excel in mathematical and code reasoning, we observe that they struggle with social reasoning tasks, exhibiting cognitive confusion, logical inconsistencies, and conflation between objective world states and subjective belief states. Through detailed analysis of DeepSeek-R1's reasoning trajectories, we find that LLMs frequently encounter reasoning impasses and tend to output contradictory terms like "tricky" and "confused" when processing scenarios with multiple participants and timelines, leading to erroneous reasoning or infinite loops. The core issue is their inability to disentangle objective reality from agents' subjective beliefs. To address this, we propose an adaptive world model-enhanced reasoning mechanism that constructs a dynamic textual world model to track entity states and temporal sequences. It dynamically monitors reasoning trajectories for confusion indicators and promptly intervenes by providing clear world state descriptions, helping models navigate through cognitive dilemmas. The mechanism mimics how humans use implicit world models to distinguish between external events and internal beliefs. Evaluations on three social benchmarks demonstrate significant improvements in accuracy (e.g., +10% in Hi-ToM) while reducing computational costs (up to 33.8% token reduction), offering a simple yet effective solution for deploying LLMs in social contexts.
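The mechanism, as described, is a monitor-and-intervene loop: generate a reasoning trace, scan it for confusion indicators, and, if any appear, re-prompt with an explicit textual description of the objective world state. Below is a minimal Python sketch of that loop; the `generate` callable, the marker list, and the event/state format are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of a confusion-monitoring, world-model-augmented loop.
# Assumes a black-box generate(prompt) -> str callable for the LLM;
# names and formats here are hypothetical, for illustration only.

CONFUSION_MARKERS = ("tricky", "confused")  # indicators observed in the paper's trace analysis

def build_world_state(events):
    """Render tracked entity states and the event timeline as plain text."""
    lines = [f"t={t}: {agent} -> {fact}" for t, agent, fact in events]
    return "Objective world state so far:\n" + "\n".join(lines)

def reason_with_world_model(question, events, generate, max_rounds=3):
    """Re-prompt with an explicit world state whenever the model's
    reasoning trace contains confusion indicators."""
    prompt = question
    trace = ""
    for _ in range(max_rounds):
        trace = generate(prompt)
        if not any(m in trace.lower() for m in CONFUSION_MARKERS):
            return trace  # no impasse detected; accept this trace
        # Intervene: prepend a clear world-state description and retry.
        prompt = build_world_state(events) + "\n\n" + question
    return trace
```

In the paper the world model is constructed and updated dynamically from the scenario text and the intervention happens mid-trajectory; here a static `events` list and whole-trace retries stand in for that behavior.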
Related papers
- Beyond Words: Evaluating and Bridging Epistemic Divergence in User-Agent Interaction via Theory of Mind [8.740788873949471]
Large Language Models (LLMs) have developed rapidly and are widely applied to both general-purpose and professional tasks. They still struggle to comprehend and respond to users' true needs when intentions and instructions are imprecisely conveyed.
arXiv Detail & Related papers (2026-02-14T16:01:59Z) - To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks [56.11584171938381]
Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions. Recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models.
arXiv Detail & Related papers (2026-02-11T08:16:13Z) - The Imperfective Paradox in Large Language Models [19.058068907991277]
We investigate the Imperfective Paradox, where the past progressive aspect entails event realization for activities but not for accomplishments. We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. We uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation.
arXiv Detail & Related papers (2026-01-14T10:57:16Z) - Analyzing Reasoning Consistency in Large Multimodal Models under Cross-Modal Conflicts [74.47786985522762]
We identify a critical failure mode termed textual inertia, where models tend to blindly adhere to the erroneous text while neglecting conflicting visual evidence. We propose the LogicGraph Perturbation Protocol that structurally injects perturbations into the reasoning chains of diverse LMMs. Results reveal that models successfully self-correct in less than 10% of cases and predominantly succumb to blind textual error propagation.
arXiv Detail & Related papers (2026-01-07T16:39:34Z) - Revisiting the UID Hypothesis in LLM Reasoning Traces [10.833681318622467]
Large language models (LLMs) often solve problems using step-by-step Chain-of-Thought (CoT) reasoning. We introduce entropy-based metrics to analyze the information flow within reasoning traces. We find that successful reasoning in LLMs is globally non-uniform.
arXiv Detail & Related papers (2025-10-11T21:19:17Z) - Language Models Do Not Follow Occam's Razor: A Benchmark for Inductive and Abductive Reasoning [6.06071622429429]
This work focuses on evaluating large language models' inductive and abductive reasoning capabilities. We introduce a programmable and synthetic dataset, InAbHyD, where each reasoning example consists of an incomplete world model and a set of observations. We propose a new metric to evaluate the quality of hypotheses based on Occam's Razor.
arXiv Detail & Related papers (2025-09-03T14:22:42Z) - Implicit Reasoning in Large Language Models: A Comprehensive Survey [67.53966514728383]
Large Language Models (LLMs) have demonstrated strong generalization across a wide range of tasks. Recent studies have shifted attention from explicit chain-of-thought prompting toward implicit reasoning. This survey introduces a taxonomy centered on execution paradigms, shifting the focus from representational forms to computational strategies.
arXiv Detail & Related papers (2025-09-02T14:16:02Z) - From Passive to Active Reasoning: Can Large Language Models Ask the Right Questions under Incomplete Information? [34.959850282872594]
We present AR-Bench, a novel benchmark designed explicitly to evaluate an LLM's active reasoning skills. AR-Bench comprises three task families: detective cases, situation puzzles, and guessing numbers. Empirical evaluation on AR-Bench demonstrates that contemporary LLMs exhibit pronounced difficulties with active reasoning.
arXiv Detail & Related papers (2025-06-09T23:56:41Z) - Embodied-Reasoner: Synergizing Visual Search, Reasoning, and Action for Embodied Interactive Tasks [42.022527376404476]
Embodied Reasoner is a model that extends o1-style reasoning to interactive embodied search tasks. We synthesize 9.3k coherent Observation-Thought-Action trajectories containing 64k interactive images and 90k diverse thinking processes. We develop a three-stage training pipeline that progressively enhances the model's capabilities.
arXiv Detail & Related papers (2025-03-27T17:00:51Z) - Visualizing Thought: Conceptual Diagrams Enable Robust Planning in LMMs [59.66595230543127]
Conceptual diagrams externalize mental models, abstracting away irrelevant details to efficiently capture how entities interact. Large Language Models (LLMs) and Large Multimodal Models (LMMs) predominantly reason through text. We propose Visual Thinking, a generalizable framework that enables LMMs to reason through multiple chains of self-generated conceptual diagrams.
arXiv Detail & Related papers (2025-03-14T18:27:02Z) - Failure Modes of LLMs for Causal Reasoning on Narratives [51.19592551510628]
We investigate the interaction between world knowledge and logical reasoning. We find that state-of-the-art large language models (LLMs) often rely on superficial generalizations. We show that simple reformulations of the task can elicit more robust reasoning behavior.
arXiv Detail & Related papers (2024-10-31T12:48:58Z) - A Closer Look at the Self-Verification Abilities of Large Language Models in Logical Reasoning [73.77088902676306]
We take a closer look at the self-verification abilities of large language models (LLMs) in the context of logical reasoning.
Our main findings suggest that existing LLMs could struggle to identify fallacious reasoning steps accurately and may fall short of guaranteeing the validity of self-verification methods.
arXiv Detail & Related papers (2023-11-14T07:13:10Z) - From Heuristic to Analytic: Cognitively Motivated Strategies for Coherent Physical Commonsense Reasoning [66.98861219674039]
Heuristic-Analytic Reasoning (HAR) strategies drastically improve the coherence of rationalizations for model decisions.
Our findings suggest that human-like reasoning strategies can effectively improve the coherence and reliability of PLM reasoning.
arXiv Detail & Related papers (2023-10-24T19:46:04Z)