Related papers: When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs

URL: http://arxiv.org/abs/2505.11423v2
Date: Tue, 20 May 2025 05:31:43 GMT
Title: When Thinking Fails: The Pitfalls of Reasoning for Instruction-Following in LLMs
Authors: Xiaomin Li, Zhou Yu, Zhiwei Zhang, Xupeng Chen, Ziji Zhang, Yingying Zhuang, Narayanan Sadagopan, Anurag Beniwal
Abstract summary: Chain-of-thought reasoning can significantly degrade instruction-following accuracy. This is the first work to systematically expose reasoning-induced failures in instruction-following.
Score: 16.659986373052217
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Reasoning-enhanced large language models (RLLMs), whether explicitly trained for reasoning or prompted via chain-of-thought (CoT), have achieved state-of-the-art performance on many complex reasoning tasks. However, we uncover a surprising and previously overlooked phenomenon: explicit CoT reasoning can significantly degrade instruction-following accuracy. Evaluating 15 models on two benchmarks: IFEval (with simple, rule-verifiable constraints) and ComplexBench (with complex, compositional constraints), we consistently observe performance drops when CoT prompting is applied. Through large-scale case studies and an attention-based analysis, we identify common patterns where reasoning either helps (e.g., with formatting or lexical precision) or hurts (e.g., by neglecting simple constraints or introducing unnecessary content). We propose a metric, constraint attention, to quantify model focus during generation and show that CoT reasoning often diverts attention away from instruction-relevant tokens. To mitigate these effects, we introduce and evaluate four strategies: in-context learning, self-reflection, self-selective reasoning, and classifier-selective reasoning. Our results demonstrate that selective reasoning strategies, particularly classifier-selective reasoning, can substantially recover lost performance. To our knowledge, this is the first work to systematically expose reasoning-induced failures in instruction-following and offer practical mitigation strategies.
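The abstract describes constraint attention only at a high level, as a measure of how much the model focuses on instruction-relevant tokens during generation. The following is a minimal sketch of one way such a score could be computed, not the authors' definition. It assumes the attention weights have already been averaged over layers and heads, and the function name, argument names, and input layout are all hypothetical.

```python
from typing import Sequence


def constraint_attention(
    attn_rows: Sequence[Sequence[float]],
    constraint_positions: Sequence[int],
) -> float:
    """Mean share of attention mass placed on constraint tokens.

    ASSUMPTION (not from the paper): attn_rows[t][j] is the attention weight
    that generation step t assigns to prompt token j, already averaged over
    layers and heads. constraint_positions lists the prompt indices that
    belong to the instruction's constraint.
    """
    positions = set(constraint_positions)
    shares = []
    for row in attn_rows:
        total = sum(row)
        if total == 0:
            continue  # skip rows that carry no attention mass
        on_constraint = sum(w for j, w in enumerate(row) if j in positions)
        shares.append(on_constraint / total)
    # Average the per-step shares; an empty input scores 0.0.
    return sum(shares) / len(shares) if shares else 0.0


# Toy weights (made up for illustration): token 1 is the constraint token.
direct_rows = [[0.25, 0.5, 0.25], [0.125, 0.75, 0.125]]
cot_rows = [[0.5, 0.25, 0.25], [0.75, 0.125, 0.125]]

direct_score = constraint_attention(direct_rows, [1])
cot_score = constraint_attention(cot_rows, [1])
```

In this toy input the second set of rows places less weight on the constraint token, so its score is lower. The numbers are invented and say nothing about the paper's measured values.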

Related papers

Learning Deliberately, Acting Intuitively: Unlocking Test-Time Reasoning in Multimodal LLMs [7.501387372794562]
Deliberate-to-Intuitive reasoning framework (D2I) improves understanding and reasoning ability of multimodal language models. Our method sets deliberate reasoning strategies to enhance modality alignment only through the rule-based format reward during training. While evaluating, the reasoning style shifts to intuitive, which removes deliberate reasoning strategies during training and implicitly reflects the model's acquired abilities in the response.
arXiv Detail & Related papers (2025-07-09T16:25:44Z)
Think Clearly: Improving Reasoning via Redundant Token Pruning [57.01254508252785]
We show that deliberately removing redundancy in the reasoning process significantly improves performance. We demonstrate that our method significantly improves overall accuracy across reasoning-intensive benchmarks without any training.
arXiv Detail & Related papers (2025-06-17T06:04:01Z)
PixelThink: Towards Efficient Chain-of-Pixel Reasoning [70.32510083790069]
PixelThink is a simple yet effective scheme that integrates externally estimated task difficulty and internally measured model uncertainty. It learns to compress reasoning length in accordance with scene complexity and predictive confidence. Experimental results demonstrate that the proposed approach improves both reasoning efficiency and overall segmentation performance.
arXiv Detail & Related papers (2025-05-29T17:55:49Z)
Benchmarking Abstract and Reasoning Abilities Through A Theoretical Perspective [59.7140089198992]
We develop a mathematical framework that defines abstract reasoning as the ability to extract essential patterns. We introduce two novel complementary metrics: (scoreGamma) measures basic reasoning accuracy, while (scoreDelta) quantifies a model's reliance on specific symbols.
arXiv Detail & Related papers (2025-05-28T09:02:45Z)
Scaling Reasoning, Losing Control: Evaluating Instruction Following in Large Reasoning Models [27.142703756752997]
We introduce MathIF, a benchmark for evaluating instruction-following in mathematical reasoning tasks. Our empirical analysis reveals a consistent tension between scaling up reasoning capacity and maintaining controllability. We show that even simple interventions can partially recover obedience, though at the cost of reasoning performance.
arXiv Detail & Related papers (2025-05-20T18:18:01Z)
The Curse of CoT: On the Limitations of Chain-of-Thought in In-Context Learning [39.613595533503144]
Chain-of-Thought (CoT) prompting has been widely recognized for its ability to enhance reasoning capabilities in large language models. We show that CoT consistently underperforms direct answering across varying model scales and benchmark complexities. Our analysis uncovers a fundamental explicit-implicit duality driving CoT's performance in pattern-based ICL.
arXiv Detail & Related papers (2025-04-07T13:51:06Z)
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models [54.04678363287392]
Large Language Models (LLMs) have demonstrated remarkable capabilities in complex tasks. Recent advancements in OpenAI o1 and DeepSeek-R1 have further improved performance in System-2 reasoning domains.
arXiv Detail & Related papers (2025-03-20T17:59:38Z)
Attention Reveals More Than Tokens: Training-Free Long-Context Reasoning with Attention-guided Retrieval [33.84832445715185]
Large Language Models (LLMs) often exhibit substantially shorter effective context lengths than their claimed capacities. We propose a novel training-free algorithm, Attrieval, which leverages attention weights to retrieve relevant facts from the long context. Our results demonstrate that Attrieval enhances long-context reasoning capability notably on both synthetic and real-world QA datasets.
arXiv Detail & Related papers (2025-03-12T20:34:14Z)
Make LLMs better zero-shot reasoners: Structure-orientated autonomous reasoning [52.83539473110143]
We introduce a novel structure-oriented analysis method to help Large Language Models (LLMs) better understand a question. To further improve the reliability in complex question-answering tasks, we propose a multi-agent reasoning system, Structure-oriented Autonomous Reasoning Agents (SARA). Extensive experiments verify the effectiveness of the proposed reasoning system. Surprisingly, in some cases, the system even surpasses few-shot methods.
arXiv Detail & Related papers (2024-10-18T05:30:33Z)
Break the Chain: Large Language Models Can be Shortcut Reasoners [18.047917626825548]
Chain-of-Thought (CoT) reasoning methods utilize complex modules but are hampered by high token consumption, limited applicability, and challenges in thinking. This paper conducts a critical evaluation of CoT prompting, extending beyond arithmetic to include complex logical and commonsense reasoning tasks. We propose the integration of human-like heuristics and shortcuts into language models (LMs) through "break the chain" strategies.
arXiv Detail & Related papers (2024-06-04T14:02:53Z)
Distilling Reasoning Ability from Large Language Models with Adaptive Thinking [54.047761094420174]
Chain of thought finetuning (cot-finetuning) aims to endow small language models (SLM) with reasoning ability to improve their performance towards specific tasks. Most existing cot-finetuning methods adopt a pre-thinking mechanism, allowing the SLM to generate a rationale before providing an answer. This mechanism enables SLM to analyze and think about complex questions, but it also makes answer correctness highly sensitive to minor errors in rationale. We propose a robust post-thinking mechanism to generate answers before rationale.
arXiv Detail & Related papers (2024-04-14T07:19:27Z)
LogicAsker: Evaluating and Improving the Logical Reasoning Ability of Large Language Models [63.14196038655506]
We introduce LogicAsker, a novel approach for evaluating and enhancing the logical reasoning capabilities of large language models (LLMs). Our methodology reveals significant gaps in LLMs' learning of logical rules, with identified reasoning failures ranging from 29% to 90% across different models. We leverage these findings to construct targeted demonstration examples and fine-tune data, notably enhancing logical reasoning in models like GPT-4o by up to 5%.
arXiv Detail & Related papers (2024-01-01T13:53:53Z)
Concise and Organized Perception Facilitates Reasoning in Large Language Models [31.238220405009617]
Exploiting large language models (LLMs) to tackle reasoning has garnered growing attention. It still remains highly challenging to achieve satisfactory results in complex logical problems, characterized by plenty of premises within the context and requiring multi-hop reasoning. In this work, we first examine the mechanism from the perspective of information flow and reveal that LLMs confront difficulties akin to human-like cognitive biases when dealing with disordered and irrelevant content in reasoning tasks.
arXiv Detail & Related papers (2023-10-05T04:47:49Z)

This list is automatically generated from the titles and abstracts of the papers on this site.