Thinking Traps in Long Chain-of-Thought: A Measurable Study and Trap-Aware Adaptive Restart
- URL: http://arxiv.org/abs/2601.11940v1
- Date: Sat, 17 Jan 2026 07:26:02 GMT
- Title: Thinking Traps in Long Chain-of-Thought: A Measurable Study and Trap-Aware Adaptive Restart
- Authors: Kang Chen, Fan Yu, Junjie Nian, Shihan Zhao, Zhuoka Feng, Zijun Yao, Heng Wang, Minshen Yu, Yixin Cao,
- Abstract summary: We introduce TAAR (Trap-Aware Adaptive Restart), a test-time control framework that trains a diagnostic policy to predict two signals from partial trajectories.<n>At inference time, TAAR truncates the trajectory before the predicted trap segment and adaptively restarts decoding.<n>Experiments show that TAAR improves reasoning performance without fine-tuning base model parameters.
- Score: 27.904791075662896
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling test-time compute via Long Chain-of-Thought (Long-CoT) significantly enhances reasoning capabilities, yet extended generation does not guarantee correctness: after an early wrong commitment, models may keep elaborating a self-consistent but incorrect prefix. Through fine-grained trajectory analysis, we identify Thinking Traps, prefix-dominant deadlocks where later reflection, alternative attempts, or verification fails to revise the root error. On a curated subset of DAPO-MATH, 89\% of failures exhibit such traps. To solve this problem, we introduce TAAR (Trap-Aware Adaptive Restart), a test-time control framework that trains a diagnostic policy to predict two signals from partial trajectories: a trap index for where to truncate and an escape probability for whether and how strongly to intervene. At inference time, TAAR truncates the trajectory before the predicted trap segment and adaptively restarts decoding; for severely trapped cases, it applies stronger perturbations, including higher-temperature resampling and an optional structured reboot suffix. Experiments on challenging mathematical and scientific reasoning benchmarks (AIME24, AIME25, GPQA-Diamond, HMMT25, BRUMO25) show that TAAR improves reasoning performance without fine-tuning base model parameters.
Related papers
- Decoding Answers Before Chain-of-Thought: Evidence from Pre-CoT Probes and Activation Steering [5.427346259545067]
Chain-of-thought (CoT) has become central to scaling reasoning capabilities in large language models.<n>We show that instruction-tuned models often determine their answer before generating CoT.
arXiv Detail & Related papers (2026-03-02T04:33:55Z) - Precedent-Informed Reasoning: Mitigating Overthinking in Large Reasoning Models via Test-Time Precedent Learning [37.40951956513094]
Reasoning in Large Language Models (LLMs) often suffers from inefficient long chain-of-thought traces with redundant self-exploration and validation.<n>Inspired by human reasoning patterns where people solve new problems by leveraging past related cases to constrain search spaces and reduce trial-and-error, we propose Precedent Informed Reasoning (PIR)<n>PIR transforms LRMs'reasoning paradigm from exhaustive self-exploration to guided learning from precedents.
arXiv Detail & Related papers (2026-02-16T04:17:46Z) - Constraint-Rectified Training for Efficient Chain-of-Thought [60.52883907721588]
Chain-of-Thought (CoT) has significantly enhanced the reasoning capabilities of Large Language Models (LLMs)<n>While longer reasoning traces can improve answer quality and unlock abilities such as self-correction, they also incur high inference costs and often introduce redundant steps, known as overthinking.<n>Recent research seeks to develop efficient reasoning strategies that balance reasoning length and accuracy.
arXiv Detail & Related papers (2026-02-13T02:13:45Z) - APR: Penalizing Structural Redundancy in Large Reasoning Models via Anchor-based Process Rewards [61.52322047892064]
Test-Time Scaling (TTS) has significantly enhanced the capabilities of Large Reasoning Models (LRMs)<n>We observe that LRMs frequently conduct repetitive self-verification without revision even after obtaining the final answer during the reasoning process.<n>We propose Anchor-based Process Reward (APR), a structure-aware reward shaping method that localizes the reasoning anchor and penalizes exclusively the post-anchor AST.
arXiv Detail & Related papers (2026-01-31T14:53:20Z) - PROMISE: Process Reward Models Unlock Test-Time Scaling Laws in Generative Recommendations [52.67948063133533]
Generative Recommendation has emerged as a promising paradigm, reformulating recommendation as a sequence-to-sequence generation task over hierarchical Semantic IDs.<n>Existing methods suffer from a critical issue we term Semantic Drift, where errors in early, high-level tokens irreversibly divert the generation trajectory into irrelevant semantic subspaces.<n>We propose Promise, a novel framework that integrates dense, step-by-step verification into generative models.
arXiv Detail & Related papers (2026-01-08T07:38:46Z) - DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching [54.98126916293868]
Large Reasoning Models (LRMs) produce excessively long chain-of-thought traces that degrade accuracy.<n>We propose a model-agnostic decoding framework that sketches the reasoning space by branching at high-entropy tokens and applies early stopping to select the shortest completed reasoning path.<n>This approach approximates the optimal solution that enhances both efficiency and accuracy, without requiring additional training or supervision.
arXiv Detail & Related papers (2025-11-01T17:41:28Z) - ResAD: Normalized Residual Trajectory Modeling for End-to-End Autonomous Driving [64.42138266293202]
ResAD is a Normalized Residual Trajectory Modeling framework.<n>It reframes the learning task to predict the residual deviation from an inertial reference.<n>On the NAVSIM benchmark, ResAD achieves a state-of-the-art PDMS of 88.6 using a vanilla diffusion policy.
arXiv Detail & Related papers (2025-10-09T17:59:36Z) - Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit [114.83867400179354]
Overthinking can degrade overall performance of large language models.<n>We categorize reasoning into three stages: insufficient exploration stage, compensatory reasoning stage, and reasoning convergence stage.<n>We develop a lightweight thresholding strategy based on rules to improve reasoning accuracy.
arXiv Detail & Related papers (2025-08-25T03:17:17Z) - ASCoT: An Adaptive Self-Correction Chain-of-Thought Method for Late-Stage Fragility in LLMs [21.409155842171497]
Chain-of-Thought (CoT) prompting has significantly advanced the reasoning capabilities of Large Language Models (LLMs)<n>Errors introduced in the later stages of a CoT chain are significantly more likely to corrupt the final answer than identical errors made at the beginning.<n>We introduce the Adaptive Self-Correction Chain-of-Thought (ASCoT) method to address this specific vulnerability.
arXiv Detail & Related papers (2025-08-07T11:26:40Z) - Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling.<n>We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z) - S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models [2.9925837108958864]
Test-Time Scaling emerges as an active research focus in the large language model community.<n>Recent studies reveal that reasoning models (even Qwen3) consistently exhibit excessive thought redundancy.<n>This paper introduces Serial-Group Decaying-Reward Policy Optimization (S-GRPO), a novel reinforcement learning paradigm.
arXiv Detail & Related papers (2025-05-12T15:50:44Z) - The First Few Tokens Are All You Need: An Efficient and Effective Unsupervised Prefix Fine-Tuning Method for Reasoning Models [69.798277882245]
We introduce Unsupervised Prefix Fine-Tuning (UPFT) to enhance large language models' reasoning efficiency.<n>UPFT removes the need for labeled data or exhaustive sampling.<n> Experiments show that UPFT matches the performance of supervised methods.
arXiv Detail & Related papers (2025-03-04T18:56:03Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.