Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models
- URL: http://arxiv.org/abs/2601.21214v1
- Date: Thu, 29 Jan 2026 03:24:32 GMT
- Title: Scaling Reasoning Hop Exposes Weaknesses: Demystifying and Improving Hop Generalization in Large Language Models
- Authors: Zhaoyi Li, Jiatong Li, Gangwei Jiang, Linqi Song, Defu Lian, Ying Wei
- Abstract summary: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. Recent studies reveal a sharp performance drop in reasoning hop generalization scenarios. We propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates erroneous processing heads (ep heads) during the reasoning process.
- Score: 66.36240676392502
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-thought (CoT) reasoning has become the standard paradigm for enabling Large Language Models (LLMs) to solve complex problems. However, recent studies reveal a sharp performance drop in reasoning hop generalization scenarios, where the required number of reasoning steps exceeds the training distribution while the underlying algorithm remains unchanged. The internal mechanisms driving this failure remain poorly understood. In this work, we conduct a systematic study on tasks from multiple domains, and find that errors concentrate at token positions of a few critical error types, rather than being uniformly distributed. Closer inspection reveals that these token-level erroneous predictions stem from internal competition mechanisms: certain attention heads, termed erroneous processing heads (ep heads), tip the balance by amplifying incorrect reasoning trajectories while suppressing correct ones. Notably, removing individual ep heads during inference can often restore the correct predictions. Motivated by these insights, we propose test-time correction of reasoning, a lightweight intervention method that dynamically identifies and deactivates ep heads in the reasoning process. Extensive experiments across different tasks and LLMs show that it consistently improves reasoning hop generalization, highlighting both its effectiveness and potential.
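To make the intervention concrete, below is a minimal sketch of deactivating a single attention head at inference time, assuming a PyTorch / Hugging Face transformers stack with GPT-2 as a stand-in model. The layer and head indices are hypothetical placeholders, and the paper's dynamic identification of ep heads is omitted; this only shows the deactivation mechanics, not the authors' implementation.

```python
# Sketch: silence one attention head during generation by zeroing its slice
# of the concatenated per-head outputs before the attention output projection.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

LAYER, HEAD = 5, 3  # hypothetical coordinates of an "ep head"
head_dim = model.config.n_embd // model.config.n_head

def ablate_head(module, args):
    # args[0]: concatenated per-head outputs, shape (batch, seq, n_embd),
    # captured just before the attention output projection (c_proj).
    hidden = args[0].clone()
    hidden[..., HEAD * head_dim:(HEAD + 1) * head_dim] = 0.0  # zero one head
    return (hidden,)

# A forward pre-hook on c_proj removes the chosen head's contribution.
handle = model.transformer.h[LAYER].attn.c_proj.register_forward_pre_hook(ablate_head)

prompt = "If A implies B and B implies C and C implies D, then A implies"
ids = tok(prompt, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**ids, max_new_tokens=5, do_sample=False,
                         pad_token_id=tok.eos_token_id)
print(tok.decode(out[0], skip_special_tokens=True))

handle.remove()  # restore the unmodified model
```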
Related papers
- Beyond What Seems Necessary: Hidden Gains from Scaling Training-Time Reasoning Length under Outcome Supervision [30.75583081407994]
Training LLMs to think and reason for longer has become a key ingredient in building state-of-the-art models. Recent efforts pursue this in different ways, such as RL fine-tuning to elicit long CoT or scaling latent reasoning through architectural recurrence. Under outcome-only supervision, out-of-distribution (OOD) performance can continue improving as training-time reasoning length increases.
arXiv Detail & Related papers (2026-01-31T22:54:45Z)
- Mitigating Cognitive Inertia in Large Reasoning Models via Latent Spike Steering [12.332146893333949]
Large Reasoning Models (LRMs) have achieved remarkable performance by scaling test-time compute. LRMs frequently suffer from Cognitive Inertia, a failure pattern manifesting as either overthinking (inertia of motion) or rigidity (inertia of direction).
arXiv Detail & Related papers (2026-01-30T02:47:12Z)
- Addressing Overthinking in Large Vision-Language Models via Gated Perception-Reasoning Optimization [56.59356959631999]
Gated Perception-Reasoning Optimization (GPRO) is a meta-reasoning controller that dynamically routes computation among three decision paths. GPRO substantially improves both accuracy and efficiency, outperforming recent slow-thinking methods.
arXiv Detail & Related papers (2026-01-07T23:05:17Z)
- How and Why LLMs Generalize: A Fine-Grained Analysis of LLM Reasoning from Cognitive Behaviors to Low-Level Patterns [51.02752099869218]
Large Language Models (LLMs) display strikingly different generalization behaviors. We introduce a novel benchmark that decomposes reasoning into atomic core skills. We show that RL-tuned models maintain more stable behavioral profiles and resist collapse in reasoning skills, whereas SFT models exhibit sharper drift and overfit to surface patterns.
arXiv Detail & Related papers (2025-12-30T08:16:20Z)
- Failure by Interference: Language Models Make Balanced Parentheses Errors When Faulty Mechanisms Overshadow Sound Ones [13.381339115567288]
Language models (LMs) still struggle with simple syntactic tasks such as generating balanced parentheses. Our study reveals that LMs rely on a number of components that independently make their own predictions. We introduce RASteer, a steering method to systematically identify and increase the contribution of reliable components for improving model performance.
arXiv Detail & Related papers (2025-06-30T23:35:19Z)
- Lost at the Beginning of Reasoning [85.17612793300238]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction. We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps; a sketch of the selection loop follows below.
arXiv Detail & Related papers (2025-06-27T09:53:57Z)
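The selection idea above lends itself to a short sketch: sample several candidate first steps, score them with a reward model, and continue generation only from the best one. Here `generate_step` and `reward_model` are hypothetical stubs standing in for an LLM sampler and the paper's reward model, not its actual components.

```python
# Best-of-k selection over first reasoning steps; an illustrative sketch,
# not the paper's implementation. Both helpers below are hypothetical stubs.
import random

def generate_step(prompt: str) -> str:
    # Stand-in for sampling one candidate first reasoning step from an LLM.
    return random.choice([
        "First, restate the problem in symbols.",
        "First, guess the final answer directly.",
        "First, list the given quantities.",
    ])

def reward_model(prompt: str, step: str) -> float:
    # Stand-in scorer; a real reward model would rate step quality.
    return 1.0 if "restate" in step or "list" in step else 0.0

def best_first_step(prompt: str, k: int = 8) -> str:
    # Sample k candidate first steps and keep the highest-scoring one.
    candidates = [generate_step(prompt) for _ in range(k)]
    return max(candidates, key=lambda s: reward_model(prompt, s))

prompt = "A train travels 120 km in 2 hours; find its speed."
first = best_first_step(prompt)
print(first)  # the full CoT would then be generated conditioned on this step
```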
- Learning to Focus: Causal Attention Distillation via Gradient-Guided Token Pruning [62.23671919314693]
Large language models (LLMs) have demonstrated significant improvements in contextual understanding. However, their ability to attend to truly critical information during long-context reasoning and generation still lags behind. We introduce a two-stage framework called Learning to Focus (LeaF) to mitigate confounding factors.
arXiv Detail & Related papers (2025-06-09T15:16:39Z)
- Bridging Internal Probability and Self-Consistency for Effective and Efficient LLM Reasoning [53.25336975467293]
We present the first theoretical error decomposition analysis of methods such as perplexity and self-consistency. Our analysis reveals a fundamental trade-off: perplexity methods suffer from substantial model error due to the absence of a proper consistency function. We propose Reasoning-Pruning Perplexity Consistency (RPC), which integrates perplexity with self-consistency, and Reasoning Pruning, which eliminates low-probability reasoning paths.
arXiv Detail & Related papers (2025-02-01T18:09:49Z)
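As a rough illustration of how internal probability and self-consistency can be combined, the sketch below prunes low-probability reasoning paths and then takes a probability-weighted vote over the surviving answers. The sampled paths, log-probabilities, and pruning threshold are invented placeholders rather than RPC's actual procedure.

```python
# Probability-weighted self-consistency with path pruning; a sketch in the
# spirit of RPC, with made-up numbers and an assumed pruning threshold.
import math
from collections import defaultdict

# Each tuple: (final answer, total log-probability of the sampled CoT path).
paths = [("42", -12.3), ("42", -14.1), ("41", -25.7), ("42", -13.0), ("40", -30.2)]

# Reasoning pruning: drop paths far less probable than the best path.
best = max(lp for _, lp in paths)
kept = [(ans, lp) for ans, lp in paths if lp > best - 10.0]  # threshold assumed

# Weight each surviving vote by its path probability.
votes = defaultdict(float)
for ans, lp in kept:
    votes[ans] += math.exp(lp - best)  # shift by `best` for numerical stability

print(max(votes, key=votes.get))  # -> "42"
```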
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated content (including all information) and is not responsible for any consequences of its use.