Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time
- URL: http://arxiv.org/abs/2504.12329v1
- Date: Sat, 12 Apr 2025 21:25:32 GMT
- Title: Speculative Thinking: Enhancing Small-Model Reasoning with Large Model Guidance at Inference Time
- Authors: Wang Yang, Xiang Yue, Vipin Chaudhary, Xiaotian Han
- Abstract summary: We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output.
- Score: 14.963731580129892
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advances leverage post-training to enhance model reasoning performance, which typically requires costly training pipelines and still suffers from inefficient, overly lengthy outputs. We introduce Speculative Thinking, a training-free framework that enables large reasoning models to guide smaller ones during inference at the reasoning level, distinct from speculative decoding, which operates at the token level. Our approach is based on two observations: (1) reasoning-supportive tokens such as "wait" frequently appear after structural delimiters like "\n\n", serving as signals for reflection or continuation; and (2) larger models exhibit stronger control over reflective behavior, reducing unnecessary backtracking while improving reasoning quality. By strategically delegating reflective steps to a more capable model, our method significantly boosts the reasoning accuracy of reasoning models while shortening their output. With the assistance of the 32B reasoning model, the 1.5B model's accuracy on MATH500 increases from 83.2% to 89.4%, an improvement of 6.2 percentage points. Simultaneously, the average output length is reduced from 5439 tokens to 4583 tokens, a 15.7% decrease. Moreover, when applied to a non-reasoning model (Qwen-2.5-7B-Instruct), our framework boosts its accuracy from 74.0% to 81.8% on the same benchmark, an absolute improvement of 7.8 percentage points.
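The delegation idea in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the segment-level interface (`SegmentGenerator`), the cue list `REFLECTIVE_CUES`, the function names, and the scripted stand-in "models" are all assumptions made for illustration. The sketch only shows the control flow the abstract describes: the small model writes reasoning steps separated by "\n\n", and when a step opens with a reflective cue such as "wait", that step is written by the larger model instead.

```python
from typing import Callable, Tuple

# Assumption: the abstract names only "wait" as an example cue.
REFLECTIVE_CUES = ("wait",)
DELIMITER = "\n\n"

# Hypothetical interface: a "model" maps the text so far to the next
# reasoning step (without the delimiter). An empty string means "stop".
SegmentGenerator = Callable[[str], str]


def is_reflective(segment: str) -> bool:
    """Return True if the step opens with a reflective cue such as 'wait'."""
    return segment.lstrip().lower().startswith(REFLECTIVE_CUES)


def speculative_thinking(
    prompt: str,
    small: SegmentGenerator,
    large: SegmentGenerator,
    max_segments: int = 32,
) -> Tuple[str, int]:
    """Generate step by step, handing reflective steps to the large model.

    Returns the full text and the number of delegated steps.
    """
    context = prompt
    delegated = 0
    for _ in range(max_segments):
        segment = small(context)
        if not segment:
            break
        if is_reflective(segment):
            # Discard the small model's reflective step and let the
            # larger model write that step from the same context.
            segment = large(context)
            delegated += 1
            if not segment:
                break
        context += DELIMITER + segment
    return context, delegated


# Scripted stand-ins for real models, used only to exercise the loop.
def toy_small(context: str) -> str:
    script = [
        "Compute 2 + 2 = 5.",
        "Wait, maybe I should restart.",
        "So the answer is 4.",
        "",
    ]
    step = context.count(DELIMITER)
    return script[min(step, len(script) - 1)]


def toy_large(context: str) -> str:
    return "Wait, 2 + 2 = 4, not 5."
```

Calling `speculative_thinking("Q: 2 + 2?", toy_small, toy_large)` runs the small model's script and replaces its single reflective step with the large model's output. With real models, each callable would wrap a generation call that stops at the next "\n\n"; how the paper actually decides when to hand control back is not specified in the abstract and is not modeled here.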
Related papers
- Phi-4-reasoning Technical Report [42.508165017775]
We introduce Phi-4-reasoning, a 14-billion parameter reasoning model that achieves strong performance on complex reasoning tasks.
We develop Phi-4-reasoning-plus, a variant enhanced through a short phase of outcome-based reinforcement learning.
Both models outperform significantly larger open-weight models such as DeepSeek-R1-Distill-Llama-70B and approach the performance of the full DeepSeek-R1 model.
arXiv Detail & Related papers (2025-04-30T05:05:09Z)
- Quantization Hurts Reasoning? An Empirical Study on Quantized Reasoning Models [48.98109982725689]
We conduct the first systematic study on quantized reasoning models, evaluating the open-sourced DeepSeek-R1-Distilled Qwen and LLaMA families.
Our investigation covers weight, KV cache, and activation quantization using state-of-the-art algorithms at varying bit-widths.
We identify model size, model origin, and task difficulty as critical determinants of performance.
arXiv Detail & Related papers (2025-04-07T08:22:45Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism.
Recent studies reveal substantial redundancy in the CoT reasoning traces, which negatively impacts model performance.
We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
- Retro-Search: Exploring Untaken Paths for Deeper and Efficient Reasoning [84.2749507577386]
We introduce Retro-Search, an MCTS-inspired search algorithm, for distilling higher quality reasoning paths from large models.
Retro-Search retrospectively revises reasoning paths to discover better, yet shorter traces, which can lead to student models with enhanced reasoning capabilities.
Our approach can enable two use cases: self-improvement, where models are fine-tuned on their own Retro-Search-ed traces, and weak-to-strong improvement.
arXiv Detail & Related papers (2025-04-06T06:23:27Z)
- Benchmarking Reasoning Robustness in Large Language Models [76.79744000300363]
We find significant performance degradation on novel or incomplete data.
These findings highlight the reliance on recall over rigorous logical inference.
This paper introduces a novel benchmark, termed Math-RoB, that exploits hallucinations triggered by missing information to expose reasoning gaps.
arXiv Detail & Related papers (2025-03-06T15:36:06Z)
- LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters! [53.84130385074551]
Large reasoning models (LRMs) tackle complex reasoning problems by following long chains of thought (Long CoT).
We find that a large language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA).
With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks.
arXiv Detail & Related papers (2025-02-11T08:48:48Z)
- Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence.
Proprietary companies have recently made remarkable progress on reasoning tasks, for example with OpenAI's o-series models.
This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through Outcome REwArd-based reinforcement Learning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z)
- Entropy Adaptive Decoding: Dynamic Model Switching for Efficient Inference [0.0]
We present Entropy Adaptive Decoding (EAD), a novel approach for efficient language model inference.
EAD switches between different-sized models based on prediction uncertainty.
We show remarkable efficiency gains across different model families.
arXiv Detail & Related papers (2025-02-05T22:15:21Z)
- Weak-to-Strong Preference Optimization: Stealing Reward from Weak Aligned Model [28.569089876442682]
This work is inspired by weak-to-strong generalization, where a strong LM fine-tuned on labels generated by a weaker model can consistently outperform its weak supervisor.
We propose Weak-to-Strong Preference Optimization (WSPO), which achieves strong model alignment by learning the distribution differences before and after the alignment of the weak model.
arXiv Detail & Related papers (2024-10-24T11:06:29Z)
This list is automatically generated from the titles and abstracts of the papers on this site.