Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- URL: http://arxiv.org/abs/2602.13517v1
- Date: Fri, 13 Feb 2026 23:07:37 GMT
- Title: Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens
- Authors: Wei-Lin Chen, Liqian Peng, Tian Tan, Chao Zhao, Blake JianHang Chen, Ziqian Lin, Alec Go, Yu Meng
- Abstract summary: We quantify inference-time effort by identifying deep-thinking tokens. Think@n is a test-time scaling strategy that prioritizes samples with high deep-thinking ratios.
- Score: 12.788799173865
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) have demonstrated impressive reasoning capabilities by scaling test-time compute via long Chain-of-Thought (CoT). However, recent findings suggest that raw token counts are unreliable proxies for reasoning quality: increased generation length does not consistently correlate with accuracy and may instead signal "overthinking," leading to performance degradation. In this work, we quantify inference-time effort by identifying deep-thinking tokens -- tokens where internal predictions undergo significant revisions in deeper model layers prior to convergence. Across four challenging mathematical and scientific benchmarks (AIME 24/25, HMMT 25, and GPQA-diamond) and a diverse set of reasoning-focused models (GPT-OSS, DeepSeek-R1, and Qwen3), we show that deep-thinking ratio (the proportion of deep-thinking tokens in a generated sequence) exhibits a robust and consistently positive correlation with accuracy, substantially outperforming both length-based and confidence-based baselines. Leveraging this insight, we introduce Think@n, a test-time scaling strategy that prioritizes samples with high deep-thinking ratios. We demonstrate that Think@n matches or exceeds standard self-consistency performance while significantly reducing inference costs by enabling the early rejection of unpromising generations based on short prefixes.
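The abstract defines deep-thinking tokens behaviorally (predictions still being revised in deeper layers) without giving the detection rule, so the following is only a minimal sketch: it assumes a logit-lens readout with Hugging Face transformers, a Llama/Qwen-style module layout (`model.model.norm`, tied LM head), and a simple mid-vs-final disagreement criterion; `think_at_n` is likewise an illustrative reading of the selection step, not the paper's exact procedure.

```python
import torch

@torch.no_grad()
def deep_thinking_ratio(model, tokenizer, text, shallow_frac=0.5):
    """Fraction of tokens whose prediction is still revised in deep layers.

    One plausible detection rule (the paper's exact criterion may differ):
    read out a mid-network prediction via the logit lens -- final norm plus
    LM head applied to intermediate hidden states -- and flag a token as
    "deep-thinking" when that prediction disagrees with the final one.
    """
    ids = tokenizer(text, return_tensors="pt").input_ids.to(model.device)
    out = model(ids, output_hidden_states=True)
    hidden = out.hidden_states                    # (n_layers + 1) tensors [1, T, d]
    mid = hidden[int((len(hidden) - 1) * shallow_frac)]
    norm = model.model.norm                       # Llama/Qwen-style naming (assumption)
    head = model.get_output_embeddings()
    mid_pred = head(norm(mid)).argmax(-1)         # logit-lens prediction per position
    final_pred = out.logits.argmax(-1)            # the model's actual prediction
    return (mid_pred != final_pred).float().mean().item()

def think_at_n(samples, ratios, keep_k=4):
    """Think@n (sketch): keep the keep_k sampled generations with the highest
    deep-thinking ratio, then answer by majority vote over the survivors."""
    ranked = sorted(zip(samples, ratios), key=lambda p: p[1], reverse=True)
    return [s for s, _ in ranked[:keep_k]]
```

Applying `deep_thinking_ratio` to a short prefix rather than the full generation yields the early-rejection variant the abstract describes: a cheap score on the first few hundred tokens decides which samples are worth completing.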
Related papers
- One-Token Verification for Reasoning Correctness Estimation [31.590898058475464]
One-Token Verification (OTV) estimates reasoning correctness in a single forward pass during generation. OTV consistently surpasses existing verifiers and reduces token usage by up to 90% through correctness-guided early termination.
arXiv Detail & Related papers (2026-03-01T10:09:58Z)
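The summary above leaves the verifier unspecified; as a rough illustration, the sketch below assumes a hypothetical linear probe over the last hidden state that scores correctness once per decoding step, terminating the trace when the score drops below a threshold. The probe, threshold, and greedy loop are assumptions, not OTV's actual design.

```python
import torch

@torch.no_grad()
def generate_with_otv(model, tokenizer, prompt, probe, threshold=0.2,
                      max_new_tokens=1024):
    """Correctness-guided early termination in the spirit of OTV (sketch).

    `probe` is a hypothetical torch.nn.Linear(d_model, 1) trained offline
    to predict correctness from the current hidden state; OTV's actual
    verifier may be built differently.
    """
    ids = tokenizer(prompt, return_tensors="pt").input_ids.to(model.device)
    for _ in range(max_new_tokens):
        out = model(ids, output_hidden_states=True)
        h_last = out.hidden_states[-1][:, -1]        # current step's hidden state
        p_correct = torch.sigmoid(probe(h_last)).item()
        if p_correct < threshold:                    # trace looks unpromising:
            return None, p_correct                   # terminate early, save tokens
        next_id = out.logits[:, -1].argmax(-1, keepdim=True)
        ids = torch.cat([ids, next_id], dim=-1)
        if next_id.item() == tokenizer.eos_token_id:
            break
    return tokenizer.decode(ids[0], skip_special_tokens=True), p_correct
```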
- Thinking by Subtraction: Confidence-Driven Contrastive Decoding for LLM Reasoning [58.331709210563616]
Thinking by Subtraction is a confidence-driven contrastive decoding approach. A small subset of low-confidence tokens disproportionately contributes to reasoning errors and unnecessary output expansion. The method detects these low-confidence tokens during decoding and intervenes at those positions.
arXiv Detail & Related papers (2026-02-20T14:13:22Z)
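Standard contrastive decoding subtracts a weaker model's logits from a stronger model's; a plausible reading of the confidence-driven variant above is to apply that subtraction only at positions where the expert is unconfident. The trigger threshold `conf_tau` and the mixing rule below are illustrative assumptions, not the paper's exact intervention.

```python
import torch
import torch.nn.functional as F

def contrastive_step(expert_logits, amateur_logits, alpha=0.5, conf_tau=0.6):
    """One step of confidence-driven contrastive decoding (sketch).

    When the expert's top-token probability is high, decode normally;
    otherwise subtract a scaled "amateur" distribution (classic contrastive
    decoding) to suppress the continuations both models find easy.
    """
    probs = F.softmax(expert_logits, dim=-1)
    if probs.max().item() >= conf_tau:            # confident token: no intervention
        return expert_logits.argmax(-1)
    mixed = (1 + alpha) * expert_logits - alpha * amateur_logits
    return mixed.argmax(-1)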
- Recurrent Confidence Chain: Temporal-Aware Uncertainty Quantification in Large Language Models [0.0]
Estimates of answer uncertainty can help protect users from misleading or serious hallucinations. Current methods analyze long reasoning sequences by filtering unrelated tokens and examining potential connections between nearby tokens or sentences. We propose a novel method that incorporates inter-step attention to analyze semantic correlations across steps.
arXiv Detail & Related papers (2026-01-19T20:04:34Z)
- ENTRA: Entropy-Based Redundancy Avoidance in Large Language Model Reasoning [30.786062954495403]
Large Reasoning Models (LRMs) often suffer from overthinking, generating unnecessarily long reasoning chains even for simple tasks. We propose ENTRA, an entropy-based training framework that suppresses redundant reasoning while preserving performance.
arXiv Detail & Related papers (2026-01-12T01:26:30Z)
- Efficient Thought Space Exploration through Strategic Intervention [54.35208611253168]
We propose a novel Hint-Practice Reasoning (HPR) framework that operationalizes this insight through two synergistic components. The framework's core innovation lies in Distributional Inconsistency Reduction (DIR), which dynamically identifies intervention points. Experiments across arithmetic and commonsense reasoning benchmarks demonstrate HPR's state-of-the-art efficiency-accuracy tradeoffs.
arXiv Detail & Related papers (2025-11-13T07:26:01Z)
- Trace Length is a Simple Uncertainty Signal in Reasoning Models [18.432200654999082]
We show that reasoning trace length is a useful confidence estimator in large reasoning models. Our work reveals that reasoning post-training fundamentally alters the relationship between trace length and accuracy. We identify high-entropy or "forking" tokens as playing a key role in the mechanism.
arXiv Detail & Related papers (2025-10-12T02:04:06Z)
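One direct way to exploit trace length as a confidence signal, if (as the paper reports for post-trained reasoning models) longer traces correlate with lower accuracy, is to down-weight long traces in self-consistency voting. The 1/length weighting below is an assumed scheme for illustration, not the paper's estimator.

```python
from collections import defaultdict

def length_weighted_vote(answers, trace_lengths):
    """Self-consistency with trace length as an inverse-confidence weight
    (sketch): each sampled answer votes with weight 1 / trace length, so
    short, confident traces count for more."""
    scores = defaultdict(float)
    for ans, n_tokens in zip(answers, trace_lengths):
        scores[ans] += 1.0 / max(n_tokens, 1)    # short traces vote louder
    return max(scores, key=scores.get)
```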
- Accuracy Law for the Future of Deep Time Series Forecasting [65.46625911002202]
Time series forecasting inherently faces a non-zero error lower bound due to its partially observable and uncertain nature. This paper focuses on a fundamental question: how to estimate the performance upper bound of deep time series forecasting. Based on rigorous statistical tests of over 2,800 newly trained deep forecasters, we discover a significant exponential relationship between the minimum forecasting error of deep models and the complexity of window-wise series patterns.
arXiv Detail & Related papers (2025-10-03T05:18:47Z)
- Stop Spinning Wheels: Mitigating LLM Overthinking via Mining Patterns for Early Reasoning Exit [114.83867400179354]
Overthinking can degrade the overall performance of large language models. We categorize reasoning into three stages: an insufficient exploration stage, a compensatory reasoning stage, and a reasoning convergence stage. We develop a lightweight, rule-based thresholding strategy to improve reasoning accuracy.
arXiv Detail & Related papers (2025-08-25T03:17:17Z)
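The stage decomposition above suggests stopping once reasoning has converged; a minimal rule-based proxy, assumed here purely for illustration, is to exit when the interim answer has been stable for several consecutive checks. The paper's actual mined patterns and thresholds may differ.

```python
def should_exit_early(interim_answers, stable_k=3):
    """Rule-based early-exit check (sketch): treat the same interim answer
    repeating stable_k times in a row as a sign that the "reasoning
    convergence stage" has been reached, and stop generating."""
    tail = interim_answers[-stable_k:]
    return len(tail) == stable_k and len(set(tail)) == 1
```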
- Deep Think with Confidence [33.167060610014715]
We introduce Deep Think with Confidence (DeepConf), a simple yet powerful method that enhances both reasoning efficiency and performance at test time. DeepConf leverages model-internal confidence signals to dynamically filter out low-quality reasoning traces during or after generation. We evaluate DeepConf across a variety of reasoning tasks and the latest open-source models, including Qwen 3 and GPT-OSS series.
arXiv Detail & Related papers (2025-08-21T05:48:38Z)
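A sketch of confidence-based trace filtering in the spirit of DeepConf: score each trace by its worst sliding-window mean token log-probability, so one unconfident stretch can disqualify it, and keep only the most confident traces before voting. The window size, aggregation, and `keep_frac` are illustrative; the paper's exact signals may differ.

```python
def trace_confidence(token_logprobs, window=128):
    """Worst sliding-window mean log-probability of a trace (sketch)."""
    n = len(token_logprobs)
    if n <= window:
        return sum(token_logprobs) / max(n, 1)
    return min(sum(token_logprobs[i:i + window]) / window
               for i in range(n - window + 1))

def deepconf_filter(traces, logprobs_per_trace, keep_frac=0.5):
    """Keep the most confident traces before majority voting."""
    scored = sorted(zip(traces, logprobs_per_trace),
                    key=lambda p: trace_confidence(p[1]), reverse=True)
    return [t for t, _ in scored[:max(1, int(len(scored) * keep_frac))]]
```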
- ConciseHint: Boosting Efficient Reasoning via Continuous Concise Hints during Generation [74.37307916314407]
We propose a framework dubbed ConciseHint, which continuously encourages the reasoning model to speak concisely. Experiments on the state-of-the-art LRMs, including DeepSeek-R1 and Qwen-3 series, demonstrate that our method can effectively produce concise reasoning.
arXiv Detail & Related papers (2025-06-23T16:20:44Z)
- Does Thinking More always Help? Mirage of Test-Time Scaling in Reasoning Models [130.5487886246353]
Extending thinking traces using prompts like "Wait" or "Let me rethink" can improve performance. This raises a natural question: does thinking more at test time truly lead to better reasoning? We show a consistent pattern of initial performance improvements from additional thinking, followed by a decline due to "overthinking."
arXiv Detail & Related papers (2025-06-04T17:55:09Z)
- Fractured Chain-of-Thought Reasoning [61.647243580650446]
We introduce Fractured Sampling, a unified inference-time strategy that interpolates between full CoT and solution-only sampling. We show that Fractured Sampling consistently achieves superior accuracy-cost trade-offs, yielding steep log-linear scaling gains in Pass@k versus token budget.
arXiv Detail & Related papers (2025-05-19T11:30:41Z)
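A minimal sketch of the interpolation idea: rather than sampling only complete CoT traces, truncate one trace at several intermediate fractions and sample cheap final answers from each prefix. Here `generate` is a hypothetical sampling callable, and the cut points and sample counts are illustrative, not the paper's configuration.

```python
def fractured_sampling(generate, prompt, cot_trace,
                       cut_fracs=(0.25, 0.5, 0.75, 1.0), samples_per_cut=2):
    """Fractured Sampling (sketch): sample final answers from several prefixes
    of one reasoning trace, interpolating between solution-only (frac ~ 0)
    and full-CoT (frac = 1) sampling."""
    answers = []
    for frac in cut_fracs:
        prefix = cot_trace[: int(len(cot_trace) * frac)]
        for _ in range(samples_per_cut):
            answers.append(generate(prompt + prefix + "\nFinal answer:"))
    return answers
```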
This list is automatically generated from the titles and abstracts of the papers on this site.