Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- URL: http://arxiv.org/abs/2506.09250v2
- Date: Mon, 16 Jun 2025 18:04:21 GMT
- Title: Comment on The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity
- Authors: A. Lawsen
- Abstract summary: Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Shojaee et al. (2025) report that Large Reasoning Models (LRMs) exhibit "accuracy collapse" on planning puzzles beyond certain complexity thresholds. We demonstrate that their findings primarily reflect experimental design limitations rather than fundamental reasoning failures. Our analysis reveals three critical issues: (1) Tower of Hanoi experiments risk exceeding model output token limits, with models explicitly acknowledging these constraints in their outputs; (2) The authors' automated evaluation framework fails to distinguish between reasoning failures and practical constraints, leading to misclassification of model capabilities; (3) Most concerningly, their River Crossing benchmarks include mathematically impossible instances for N > 5 due to insufficient boat capacity, yet models are scored as failures for not solving these unsolvable problems. When we control for these experimental artifacts, by requesting generating functions instead of exhaustive move lists, preliminary experiments across multiple models indicate high accuracy on Tower of Hanoi instances previously reported as complete failures. These findings highlight the importance of careful experimental design when evaluating AI reasoning capabilities.
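To make the first point concrete, the sketch below (an illustration, not the paper's code) contrasts the exponential length of an exhaustive Tower of Hanoi move list with a short generating function of the kind the comment requests instead; the tokens-per-move cost and the output-token budget are assumed values chosen for illustration only.

```python
# Minimal sketch, assuming illustrative values for tokens-per-move and the
# model's output budget; not taken from the paper or any specific model.

def hanoi_moves(n, src="A", aux="B", dst="C"):
    """Yield the optimal (2**n - 1)-move Tower of Hanoi solution recursively."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, src, dst, aux)   # move n-1 disks onto the spare peg
    yield (src, dst)                               # move the largest disk
    yield from hanoi_moves(n - 1, aux, src, dst)   # move n-1 disks onto the target peg

TOKENS_PER_MOVE = 5      # assumed cost of printing one move in the required format
OUTPUT_BUDGET = 64_000   # assumed output-token limit

for n in (10, 15, 20):
    moves = 2**n - 1
    tokens = moves * TOKENS_PER_MOVE
    verdict = "exceeds" if tokens > OUTPUT_BUDGET else "fits within"
    print(f"N={n}: {moves} moves, ~{tokens} output tokens ({verdict} the assumed budget)")

# Sanity check: the generator emits exactly the optimal number of moves.
assert sum(1 for _ in hanoi_moves(15)) == 2**15 - 1
```

At N = 15 the optimal solution already requires 32,767 moves, so even a modest per-move token cost pushes a full listing past typical output budgets, while the generating function itself stays a few lines long regardless of N.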
Related papers
- Inverse Scaling in Test-Time Compute [51.16323216811257]
Extending the reasoning length of Large Reasoning Models (LRMs) degrades performance. We identify five distinct failure modes when models reason for longer. These findings suggest that while test-time compute scaling remains promising for improving model capabilities, it may inadvertently reinforce problematic reasoning patterns.
arXiv Detail & Related papers (2025-07-19T00:06:13Z) - Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories [0.0]
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought (CoTs). We present new evidence of overthinking, where models disregard correct solutions even when explicitly provided, instead continuing to generate unnecessary reasoning steps.
arXiv Detail & Related papers (2025-07-01T12:14:22Z) - Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction. We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps. We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
arXiv Detail & Related papers (2025-06-27T09:53:57Z) - A Comment On "The Illusion of Thinking": Reframing the Reasoning Cliff as an Agentic Gap [0.39073867995073247]
We argue that the observed failure is not evidence of a fundamental cognitive boundary, but rather a predictable outcome of system-level constraints. A model, initially declaring a puzzle impossible when confined to text-only generation, now employs agentic tools to not only solve it but also master variations of complexity far beyond the reasoning cliff it previously failed to surmount.
arXiv Detail & Related papers (2025-06-23T17:14:21Z) - Mathematical Proof as a Litmus Test: Revealing Failure Modes of Advanced Large Reasoning Models [11.250861762443801]
We introduce the RFMDataset (Reveal Failure Modes), a collection of 200 diverse mathematical proof problems, and thoroughly evaluate advanced models' performance on it. Our analysis uncovers 10 fine-grained error types, which reveal fundamental limitations in current large reasoning models.
arXiv Detail & Related papers (2025-06-20T16:14:18Z) - Discovering Hierarchical Latent Capabilities of Language Models via Causal Representation Learning [22.32435186013626]
We propose a causal representation learning framework wherein observed benchmark performance is modeled as a linear transformation of a few latent capability factors. Applying this approach to a comprehensive dataset encompassing over 1500 models evaluated across six benchmarks, we identify a concise three-node linear causal structure that reliably explains the observed performance variations.
arXiv Detail & Related papers (2025-06-12T06:07:42Z) - Reasoning Models Are More Easily Gaslighted Than You Think [85.84943447589511]
We evaluate three state-of-the-art reasoning models: OpenAI's o4-mini, Claude-3.7-Sonnet, and Gemini-2.5-Flash. Our evaluation reveals significant accuracy drops following gaslighting negation prompts. We introduce GaslightingBench-R, a new diagnostic benchmark designed to evaluate reasoning models' ability to defend their beliefs under such prompts.
arXiv Detail & Related papers (2025-06-11T12:52:25Z) - The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity [16.266145641151375]
Large Reasoning Models generate detailed thinking processes before providing answers. We show that LRMs face a complete accuracy collapse beyond certain complexities. We also investigate the reasoning traces in more depth, studying the patterns of explored solutions.
arXiv Detail & Related papers (2025-06-07T22:42:29Z) - CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models [56.40065909544213]
Large language models (LLMs) benefit from increased test-time compute, a phenomenon known as test-time scaling. However, reasoning-optimized models often overthink even simple problems, producing excessively verbose outputs and leading to low token efficiency. We identify two key causes of this verbosity: (1) reinforcement learning reduces the information density of forward reasoning, and (2) backward chain-of-thought training encourages redundant and often unnecessary verification steps.
arXiv Detail & Related papers (2025-05-28T06:24:45Z) - Reasoning Model is Stubborn: Diagnosing Instruction Overriding in Reasoning Models [27.437685534830457]
Large language models frequently exhibit a problematic reliance on familiar reasoning patterns. Despite explicit instructions from users, these models often override clearly stated conditions and default to habitual reasoning trajectories. This behavior presents significant challenges, particularly in domains such as mathematics and logic puzzles.
arXiv Detail & Related papers (2025-05-22T19:00:01Z) - Text-to-Image Diffusion Models Cannot Count, and Prompt Refinement Cannot Help [18.70937620674227]
We introduce T2ICountBench, a novel benchmark designed to rigorously evaluate the counting ability of state-of-the-art text-to-image diffusion models. Our evaluations reveal that all state-of-the-art diffusion models fail to generate the correct number of objects, with accuracy dropping significantly as the number of objects increases.
arXiv Detail & Related papers (2025-03-10T03:28:18Z) - Causality can systematically address the monsters under the bench(marks) [64.36592889550431]
Benchmarks are plagued by various biases, artifacts, or leakage. Models may behave unreliably due to poorly explored failure modes. Causality offers an ideal framework to systematically address these challenges.
arXiv Detail & Related papers (2025-02-07T17:01:37Z) - A Comprehensive Evaluation and Analysis Study for Chinese Spelling Check [53.152011258252315]
We show that making reasonable use of phonetic and graphic information is effective for Chinese Spelling Check.
Models are sensitive to the error distribution of the test set, which reflects their shortcomings.
The commonly used benchmark, SIGHAN, cannot reliably evaluate models' performance.
arXiv Detail & Related papers (2023-07-25T17:02:38Z) - Causal Triplet: An Open Challenge for Intervention-centric Causal Representation Learning [98.78136504619539]
Causal Triplet is a causal representation learning benchmark featuring visually more complex scenes.
We show that models built with the knowledge of disentangled or object-centric representations significantly outperform their distributed counterparts.
arXiv Detail & Related papers (2023-01-12T17:43:38Z) - Shortcomings of Question Answering Based Factuality Frameworks for Error Localization [51.01957350348377]
We show that question answering (QA)-based factuality metrics fail to correctly identify error spans in generated summaries.
Our analysis reveals a major reason for such poor localization: questions generated by the QG module often inherit errors from non-factual summaries which are then propagated further into downstream modules.
Our experiments conclusively show that there exist fundamental issues with localization using the QA framework which cannot be fixed solely by stronger QA and QG models.
arXiv Detail & Related papers (2022-10-13T05:23:38Z)