To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks
- URL: http://arxiv.org/abs/2602.10625v1
- Date: Wed, 11 Feb 2026 08:16:13 GMT
- Title: To Think or Not To Think, That is The Question for Large Reasoning Models in Theory of Mind Tasks
- Authors: Nanxu Gong, Haotian Li, Sixun Dong, Jianxun Lian, Yanjie Fu, Xing Xie
- Abstract summary: Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions. Recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models.
- Score: 56.11584171938381
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Theory of Mind (ToM) assesses whether models can infer hidden mental states such as beliefs, desires, and intentions, which is essential for natural social interaction. Although recent progress in Large Reasoning Models (LRMs) has boosted step-by-step inference in mathematics and coding, it remains underexplored whether this benefit transfers to socio-cognitive skills. We present a systematic study of nine advanced Large Language Models (LLMs), comparing reasoning models with non-reasoning models on three representative ToM benchmarks. The results show that reasoning models do not consistently outperform non-reasoning models and sometimes perform worse. A fine-grained analysis reveals three insights. First, slow thinking collapses: accuracy drops significantly as responses grow longer, and larger reasoning budgets hurt performance. Second, moderate and adaptive reasoning helps: constraining reasoning length mitigates failure, while distinct success patterns demonstrate the need for dynamic adaptation. Third, reasoning models exploit an option-matching shortcut: when multiple-choice options are removed, they improve markedly, indicating reliance on option matching rather than genuine deduction. We also design two interventions, Slow-to-Fast (S2F) adaptive reasoning and Think-to-Match (T2M) shortcut prevention, to further verify and mitigate these problems. Taken together, our results highlight that the advances of LRMs in formal reasoning (e.g., math, code) do not fully transfer to ToM, a representative social-reasoning task. We conclude that achieving robust ToM requires developing capabilities beyond existing reasoning methods.
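S2F and T2M are only named in the abstract, so the following is a minimal Python sketch of one plausible reading of each, assuming a hypothetical `generate(prompt, thinking=..., max_tokens=...)` wrapper around a reasoning model. The `FINAL:` marker, token budget, and string-similarity matching are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketches of the paper's two interventions. `generate` is
# a placeholder for any LLM call; the "FINAL:" convention, the budget,
# and the matching rule are illustrative assumptions.
from difflib import SequenceMatcher
from typing import Callable, Sequence

def s2f_answer(question: str,
               generate: Callable[..., str],
               max_thinking_tokens: int = 256) -> str:
    """Slow-to-Fast (S2F): cap the reasoning budget, fall back to fast mode."""
    # Slow pass with a hard cap, since the paper finds long traces hurt ToM.
    trace = generate(question, thinking=True, max_tokens=max_thinking_tokens)
    if "FINAL:" in trace:
        return trace.split("FINAL:", 1)[1].strip()
    # Budget exhausted without an answer: answer directly, without thinking.
    return generate(question, thinking=False).strip()

def t2m_answer(question: str,
               options: Sequence[str],
               generate: Callable[..., str]) -> str:
    """Think-to-Match (T2M): reason without seeing the options, match after."""
    # Hiding the options removes the option-matching shortcut the paper reports.
    free_answer = generate(f"Answer in one sentence:\n{question}", thinking=True)
    # Map the free-form answer to the closest option; string similarity is a
    # stand-in here (an embedding model or LLM judge could match instead).
    def sim(opt: str) -> float:
        return SequenceMatcher(None, free_answer.lower(), opt.lower()).ratio()
    return max(options, key=sim)
```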
Related papers
- Reasoning or Retrieval? A Study of Answer Attribution on Large Reasoning Models [15.797612515648412]
Large reasoning models (LRMs) exhibit unprecedented capabilities in solving complex problems through Chain-of-Thought (CoT) reasoning. Recent studies reveal that their final answers often contradict their own reasoning traces. We hypothesize that this inconsistency stems from two competing mechanisms for generating answers: CoT reasoning and memory retrieval. We introduce FARL, a novel fine-tuning framework that integrates memory unlearning with reinforcement learning.
arXiv Detail & Related papers (2025-09-29T01:13:33Z)
- Thinking with Nothinking Calibration: A New In-Context Learning Paradigm in Reasoning Large Language Models [28.756240721942138]
Reasoning large language models (RLLMs) have recently demonstrated remarkable capabilities through structured and multi-step reasoning. We propose Thinking with Nothinking (JointThinking), a new ICL paradigm that prompts the model to generate two answers in parallel. JointThinking significantly outperforms few-shot chain-of-thought (CoT), thinking twice, and majority voting.
arXiv Detail & Related papers (2025-08-05T12:09:55Z)
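The abstract does not spell out the mechanism, but the title suggests the two parallel answers come from thinking and no-thinking modes. A minimal sketch under that assumption, with a hypothetical `generate(question, thinking=...)` callable and an assumed rethink-on-disagreement policy:

```python
# Minimal sketch of the JointThinking idea, assuming the two parallel
# answers come from thinking and no-thinking modes of the same model.
# `generate` and the rethink-on-disagreement rule are illustrative
# assumptions, not the paper's exact procedure.
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def joint_thinking(question: str, generate: Callable[..., str]) -> str:
    with ThreadPoolExecutor(max_workers=2) as pool:
        slow = pool.submit(generate, question, thinking=True)
        fast = pool.submit(generate, question, thinking=False)
        a_slow, a_fast = slow.result().strip(), fast.result().strip()
    if a_slow == a_fast:
        return a_slow  # the two modes agree: accept the answer
    # Modes disagree: spend one more thinking pass as a tie-breaker.
    return generate(question, thinking=True).strip()
```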
- Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models [49.598776427454176]
Large Reasoning Models (LRMs) have become a research hotspot due to their outstanding performance on complex tasks. However, with their widespread application, the problem of overthinking has emerged. Various efficient reasoning methods have been proposed that aim to shorten reasoning paths without compromising model performance and reasoning capability.
arXiv Detail & Related papers (2025-08-04T06:54:31Z)
- Think How to Think: Mitigating Overthinking with Autonomous Difficulty Cognition in Large Reasoning Models [22.57102686737925]
Recent Large Reasoning Models (LRMs) excel at complex reasoning tasks but often suffer from overthinking. We propose Think-How-to-Think (TH2T), a novel two-stage fine-tuning strategy that progressively instills difficulty cognition and redundancy cognition in LRMs.
arXiv Detail & Related papers (2025-07-03T14:24:26Z)
- Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories [0.0]
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought (CoTs). We present new evidence of overthinking, where models disregard correct solutions even when explicitly provided, instead continuing to generate unnecessary reasoning steps.
arXiv Detail & Related papers (2025-07-01T12:14:22Z)
- Socratic-MCTS: Test-Time Visual Reasoning by Asking the Right Questions [100.41062461003389]
We show that framing reasoning as a search process helps the model "connect the dots" between fragmented knowledge and produce extended reasoning traces in non-reasoning models. We evaluate our method across three benchmarks and observe consistent improvements.
arXiv Detail & Related papers (2025-06-10T15:51:16Z)
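The abstract gives only the framing, so below is a heavily simplified, non-search sketch of the Socratic pattern (pose sub-questions, answer them, then answer the main question). The actual method searches over such steps with MCTS; the greedy loop, the prompts, and the `generate` callable here are illustrative assumptions:

```python
# Heavily simplified, non-MCTS sketch of the Socratic pattern: a
# (non-reasoning) model poses and answers sub-questions before tackling
# the main question, so fragmented knowledge gets connected. All prompts
# and the `generate` callable are illustrative assumptions.
from typing import Callable

def socratic_answer(question: str,
                    generate: Callable[[str], str],
                    n_subquestions: int = 3) -> str:
    notes: list[str] = []
    for _ in range(n_subquestions):
        context = "\n".join(notes)
        sub_q = generate(
            f"Main question: {question}\nNotes so far:\n{context}\n"
            "Pose one sub-question that would help answer the main question:")
        sub_a = generate(f"{context}\nQuestion: {sub_q}\nAnswer briefly:")
        notes.append(f"Q: {sub_q}\nA: {sub_a}")
    # Final pass: answer the main question with the collected notes.
    return generate("Notes:\n" + "\n".join(notes) + f"\nNow answer: {question}")
```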
- Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models [130.5487886246353]
Extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance. This raises a natural question: does thinking more at test time truly lead to better reasoning? We show a consistent pattern of initial performance improvements from additional thinking, followed by a decline due to "overthinking".
arXiv Detail & Related papers (2025-06-04T17:55:09Z)
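A small probe in the spirit of this abstract: force extra thinking rounds by appending a cue like "Wait" and record the answer after each round, so the improve-then-decline pattern can be observed. The `generate` callable (assumed to return the input text extended with the model's continuation) and the crude answer read-out are illustrative assumptions:

```python
# Probe sketch: force extra thinking rounds with a "Wait" cue and keep
# the answer after each round. `generate` is a hypothetical
# continuation-style callable that returns the input text extended with
# the model's continuation; the last-line answer read-out is crude.
from typing import Callable, List

def answers_per_thinking_round(question: str,
                               generate: Callable[[str], str],
                               extra_rounds: int = 2) -> List[str]:
    transcript = f"Question: {question}\nLet's think step by step."
    answers: List[str] = []
    for _ in range(extra_rounds + 1):
        transcript = generate(transcript)             # model keeps writing
        answers.append(transcript.splitlines()[-1])   # crude answer read-out
        transcript += "\nWait, let me rethink."       # force another round
    return answers
```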
- The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models [54.88805865447848]
We show that instruct models achieve higher efficiency overall and that problem difficulty affects efficiency. We propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it. On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while preserving accuracy on four thinking models, and remains competitive with strong efficiency baselines.
arXiv Detail & Related papers (2025-05-28T06:24:45Z)
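The two-stage pipeline is stated directly in the abstract, so a minimal sketch is straightforward; `instruct_model` and `thinking_model` are hypothetical callables and the prompts are illustrative, not the paper's:

```python
# Minimal sketch of the two-stage COTHINK pipeline as the abstract
# outlines it: an instruct model drafts a brief outline, then a thinking
# model expands it. Callables and prompts are illustrative assumptions.
from typing import Callable

def cothink(problem: str,
            instruct_model: Callable[[str], str],
            thinking_model: Callable[[str], str]) -> str:
    # Stage 1: cheap, brief outline from the instruct model.
    outline = instruct_model(
        f"Give a brief solution outline (no full working):\n{problem}")
    # Stage 2: the thinking model expands the outline into a full
    # solution, anchoring its trace and cutting tokens vs. free-form CoT.
    return thinking_model(
        f"Problem:\n{problem}\n\nExpand this outline into a full solution:\n{outline}")
```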
- Think or Not? Selective Reasoning via Reinforcement Learning for Vision-Language Models [67.87579664988199]
TON is a two-stage training strategy for vision-language models (VLMs). It introduces a think-or-not format that serves as a cold start for selective reasoning. TON can reduce completion length by up to 90% compared to vanilla GRPO.
arXiv Detail & Related papers (2025-05-22T16:13:29Z)
- Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [49.61246073215651]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains. However, they also introduce significant computational overhead due to verbose and redundant outputs.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)