SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation
- URL: http://arxiv.org/abs/2601.03649v1
- Date: Wed, 07 Jan 2026 07:00:15 GMT
- Title: SyncThink: A Training-Free Strategy to Align Inference Termination with Reasoning Saturation
- Authors: Gengyang Li, Wang Cai, Yifeng Gao, Yunfang Wu,
- Abstract summary: We present SyncThink, a training-free and plug-and-play decoding method that reduces Chain-of-Thought overhead without modifying model weights.<n>We find that answer tokens attend weakly to early reasoning and instead focus on the special token "/think", indicating an information bottleneck.<n>Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy.
- Score: 11.021989271617835
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Chain-of-Thought (CoT) prompting improves reasoning but often produces long and redundant traces that substantially increase inference cost. We present SyncThink, a training-free and plug-and-play decoding method that reduces CoT overhead without modifying model weights. We find that answer tokens attend weakly to early reasoning and instead focus on the special token "/think", indicating an information bottleneck. Building on this observation, SyncThink monitors the model's own reasoning-transition signal and terminates reasoning. Experiments on GSM8K, MMLU, GPQA, and BBH across three DeepSeek-R1 distilled models show that SyncThink achieves 62.00 percent average Top-1 accuracy using 656 generated tokens and 28.68 s latency, compared to 61.22 percent, 2141 tokens, and 92.01 s for full CoT decoding. On long-horizon tasks such as GPQA, SyncThink can further yield up to +8.1 absolute accuracy by preventing over-thinking.
Related papers
- Think Deep, Not Just Long: Measuring LLM Reasoning Effort via Deep-Thinking Tokens [12.788799173865]
We quantify inference-time effort by identifying deep-thinking tokens.<n>Think@n is a test-time scaling strategy that prioritizes samples with high deep-thinking ratios.
arXiv Detail & Related papers (2026-02-13T23:07:37Z) - Thinking-Based Non-Thinking: Solving the Reward Hacking Problem in Training Hybrid Reasoning Models via Reinforcement Learning [57.57084309580296]
Thinking-Based Non-Thinking sets different maximum token usage for responses not using thinking across various queries.<n>Experiments on five mathematical benchmarks demonstrate that TNT reduces token usage by around 50%.<n>The probability of reward hacking problem in TNT's responses, which are classified as not using thinking, remains below 10%.
arXiv Detail & Related papers (2026-01-08T10:38:41Z) - Correct, Concise and Complete: Multi-stage Training For Adaptive Reasoning [11.179446105672461]
We propose a multi-stage efficient reasoning method that combines supervised fine-tuning and reinforcement learning.<n>Our approach reduces response length by an average of 28% for 8B models and 40% for 32B models.<n>It achieves a superior trade-off compared to more complex state-of-the-art efficient reasoning methods.
arXiv Detail & Related papers (2026-01-06T12:31:51Z) - LYNX: Learning Dynamic Exits for Confidence-Controlled Reasoning [15.597220136913258]
LYNX is an online early-exit mechanism that turns a model's own hidden-state awareness into confidence-controlled stopping decisions.<n>We train and calibrate this probe once on a generic mathematical corpus and reuse it unchanged across benchmarks, decoding temperatures, and even non-mathematical tasks.
arXiv Detail & Related papers (2025-12-05T00:04:42Z) - Logit-Entropy Adaptive Stopping Heuristic for Efficient Chain-of-Thought Reasoning [0.0]
Chain-of-Thought (CoT) prompting is a key technique for enabling complex reasoning in large language models.<n>We introduce LEASH: Logit-Entropy Adaptive Stopping Heuristic, a training-free decoding algorithm that adaptively halts rationale generation.
arXiv Detail & Related papers (2025-11-06T18:43:16Z) - DTS: Enhancing Large Reasoning Models via Decoding Tree Sketching [54.98126916293868]
Large Reasoning Models (LRMs) produce excessively long chain-of-thought traces that degrade accuracy.<n>We propose a model-agnostic decoding framework that sketches the reasoning space by branching at high-entropy tokens and applies early stopping to select the shortest completed reasoning path.<n>This approach approximates the optimal solution that enhances both efficiency and accuracy, without requiring additional training or supervision.
arXiv Detail & Related papers (2025-11-01T17:41:28Z) - DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning [134.03095505580276]
Doing Length pEnalty Right (DLER) is a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty.<n>DLER achieves state-of-the-art accuracy--efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy.
arXiv Detail & Related papers (2025-10-16T20:05:57Z) - What makes Reasoning Models Different? Follow the Reasoning Leader for Efficient Decoding [84.42056293290015]
We analyze the token-level misalignment between reasoning and non-reasoning models.<n>Motivated by the Local Misalignment Diminish, we propose FoReaL-Decoding.<n>On four popular math-reasoning benchmarks, FoReaL-Decoding reduces theoretical FLOPs by 30 to 50% and trims CoT length by up to 40%.
arXiv Detail & Related papers (2025-06-08T05:08:32Z) - The Price of a Second Thought: On the Evaluation of Reasoning Efficiency in Large Language Models [54.88805865447848]
We show that instruct models achieve higher efficiency overall, and problem difficulty affects efficiency.<n>We propose COTHINK, a simple two-stage pipeline: an instruct model drafts a brief outline, and a thinking model expands it.<n>On GSM8K, MATH500, and AIME24, COTHINK cuts token usage by 21.1% while keeping accuracy on four thinking models, and remains competitive with strong efficiency baselines.
arXiv Detail & Related papers (2025-05-28T06:24:45Z) - VeriThinker: Learning to Verify Makes Reasoning Model Efficient [52.74493506816969]
Large Reasoning Models excel at complex tasks using Chain-of-Thought (CoT) reasoning.<n>Their tendency to overthinking leads to unnecessarily lengthy reasoning chains.<n>We introduce VeriThinker, a novel approach for CoT compression.
arXiv Detail & Related papers (2025-05-23T14:17:56Z) - Not All Tokens Are What You Need In Thinking [34.767739567093656]
Conditional Token Selection (CTS) identifies and preserves only the most essential tokens in chains of thought.<n>CTS effectively compresses long CoT while maintaining strong reasoning performance.<n>Further reducing training tokens by 42% incurs only a marginal 5% accuracy drop while yielding a 75.8% reduction in reasoning tokens.
arXiv Detail & Related papers (2025-05-23T12:41:29Z) - CoT-Valve: Length-Compressible Chain-of-Thought Tuning [50.196317781229496]
We introduce a new tuning and inference strategy named CoT-Valve, designed to allow models to generate reasoning chains of varying lengths.<n>We show that CoT-Valve successfully enables controllability and compressibility of the chain and shows better performance than the prompt-based control.
arXiv Detail & Related papers (2025-02-13T18:52:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.