Efficient Reasoning for LLMs through Speculative Chain-of-Thought
- URL: http://arxiv.org/abs/2504.19095v1
- Date: Sun, 27 Apr 2025 03:56:39 GMT
- Title: Efficient Reasoning for LLMs through Speculative Chain-of-Thought
- Authors: Jikai Wang, Juntao Li, Lijun Wu, Min Zhang
- Abstract summary: Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have attracted widespread attention due to their impressive task-solving abilities. Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length. We introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective: by accelerating the average reasoning speed.
- Score: 44.76494056102963
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large reasoning language models such as OpenAI-o1 and Deepseek-R1 have recently attracted widespread attention due to their impressive task-solving abilities. However, the enormous model size and the generation of lengthy thought chains introduce significant reasoning costs and response latency. Existing methods for efficient reasoning mainly focus on reducing the number of model parameters or shortening the chain-of-thought length. In this paper, we introduce Speculative Chain-of-Thought (SCoT), which reduces reasoning latency from another perspective: by accelerating the average reasoning speed through collaboration between a large and a small model. SCoT conducts thought-level drafting using a lightweight draft model. It then selects the best CoT draft and corrects erroneous cases with the target model. The proposed thinking behavior alignment improves the efficiency of drafting, and the draft selection strategy maintains prediction accuracy for complex problems. Experimental results on GSM8K, MATH, GaoKao, CollegeMath and Olympiad datasets show that SCoT reduces reasoning latency by 48% to 66% for Deepseek-R1-Distill-Qwen-32B while achieving near-target-model-level performance. Our code is available at https://github.com/Jikai0Wang/Speculative_CoT.
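The draft-select-correct collaboration described in the abstract lends itself to a short sketch. The following is a minimal illustration of that control flow, not the authors' implementation; the model interfaces (`draft_model`, `target_model`, `score`, `answer`), the scoring rule, and the acceptance threshold are all hypothetical stand-ins.

```python
# A minimal sketch of the SCoT control flow described in the abstract, not the
# authors' implementation; the model interfaces, scoring rule, and threshold
# below are assumptions.
from typing import List

def speculative_cot(question: str, draft_model, target_model,
                    n_drafts: int = 4, accept_threshold: float = 0.5) -> str:
    # Thought-level drafting: the lightweight model proposes several CoT drafts.
    drafts: List[str] = [draft_model.generate(question) for _ in range(n_drafts)]

    # Draft selection: the target model scores each draft, e.g. by the
    # likelihood it assigns to the draft given the question.
    scored = [(target_model.score(question, d), d) for d in drafts]
    best_score, best_draft = max(scored, key=lambda pair: pair[0])

    if best_score >= accept_threshold:
        # Accept: the target model only has to produce the final answer,
        # conditioned on the cheaply generated draft.
        return target_model.answer(question, cot=best_draft)
    # Error correction: for hard cases, fall back to the target model's own CoT.
    return target_model.generate(question)
```

The point of the pattern is that the expensive target model mostly verifies and finalizes rather than generating long thought chains itself, which is where the reported latency reduction would come from.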
Related papers
- AdaR1: From Long-CoT to Hybrid-CoT via Bi-Level Adaptive Reasoning Optimization [86.56120216550232]
We propose a novel two-stage framework for adaptive and efficient reasoning.
First, we construct a hybrid reasoning model by merging long and short CoT models.
Second, we apply bi-level preference training to guide the model to select suitable reasoning styles.
arXiv Detail & Related papers (2025-04-30T14:01:45Z)
- SEAL: Steerable Reasoning Calibration of Large Language Models for Free [58.190800043449336]
Large Language Models (LLMs) have demonstrated compelling capabilities for complex reasoning tasks via the extended chain-of-thought (CoT) reasoning mechanism. Recent studies reveal substantial redundancy in CoT reasoning traces, which negatively impacts model performance. We introduce SEAL, a training-free approach that seamlessly calibrates the CoT process, improving accuracy while demonstrating significant efficiency gains.
arXiv Detail & Related papers (2025-04-07T02:42:07Z)
- DAST: Difficulty-Adaptive Slow-Thinking for Large Reasoning Models [31.189242663680695]
This paper introduces Difficulty-Adaptive Slow-Thinking (DAST), a novel framework that enables models to autonomously adjust the length of the Chain-of-Thought (CoT) based on problem difficulty. Experiments on diverse datasets and model scales demonstrate that DAST effectively mitigates overthinking while preserving reasoning accuracy on complex problems.
arXiv Detail & Related papers (2025-03-06T14:23:06Z)
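The summary above does not spell out DAST's training mechanism, so the following is only an illustrative sketch of the general idea, mapping an estimated problem difficulty to a CoT token budget; the function and its constants are hypothetical, not taken from the paper.

```python
# An illustrative sketch of difficulty-conditioned CoT budgeting, NOT the
# DAST training procedure; the mapping and constants are assumptions.
def adaptive_cot_budget(difficulty: float,
                        min_tokens: int = 128,
                        max_tokens: int = 4096) -> int:
    """Map an estimated problem difficulty in [0, 1] to a CoT token budget."""
    difficulty = min(max(difficulty, 0.0), 1.0)
    return int(min_tokens + difficulty * (max_tokens - min_tokens))

# Easy arithmetic gets a short budget, olympiad problems a long one:
# adaptive_cot_budget(0.1) -> 524, adaptive_cot_budget(0.9) -> 3699
```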
- Chain of Draft: Thinking Faster by Writing Less [37.492654173517046]
Chain of Draft (CoD) is a novel paradigm inspired by human cognitive processes.
CoD generates minimalistic yet informative intermediate reasoning outputs while solving tasks.
arXiv Detail & Related papers (2025-02-25T19:36:06Z)
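A concrete way to picture the CoD entry above is through its prompt: instead of asking for fully elaborated steps, the prompt caps each intermediate step at a few words. The exact wording below is an assumption, not quoted from the paper, and `llm` is a hypothetical client.

```python
# Illustrative prompts contrasting standard CoT with the terser Chain-of-Draft
# style; the wording is an assumption, and `llm` is a hypothetical client.
COT_PROMPT = (
    "Think step by step to answer the following question. "
    "Explain each step in full sentences, then state the final answer."
)
COD_PROMPT = (
    "Think step by step, but keep only a minimal draft for each thinking "
    "step, a few words at most, then state the final answer."
)

question = ("A bat and a ball cost $1.10 in total. The bat costs $1.00 "
            "more than the ball. How much does the ball cost?")
# response = llm.generate(COD_PROMPT + "\n" + question)
```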
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chains of Thought (CoTs) enables it to gain significant improvements in complex reasoning tasks. However, we explore whether scaling with longer CoTs can in fact impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
- When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs). However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy? In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z)
- O1-Pruner: Length-Harmonizing Fine-Tuning for O1-Like Reasoning Pruning [98.3430004984531]
We propose Length-Harmonizing Fine-Tuning (O1-Pruner) to minimize reasoning overhead while maintaining accuracy. Our code is coming soon at https://github.com/StarDewXXX/O1-Pruner.
arXiv Detail & Related papers (2025-01-22T01:35:11Z)
- Cascade Speculative Drafting for Even Faster LLM Inference [25.642604897018852]
Speculative decoding improves the efficiency of large language model (LLM) inference.
We introduce Cascade Speculative Drafting (CS Drafting), a speculative execution algorithm that incorporates two types of cascades.
CS Drafting achieves up to an 81 percent additional speedup over speculative decoding in our experiments.
arXiv Detail & Related papers (2023-12-18T18:59:46Z)
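One of the "two types of cascades" in the CS Drafting entry above is a vertical cascade, where each draft model's output is itself drafted by a still-smaller model. Below is a simplified recursive sketch of that vertical cascade only; the `generate`/`verify_and_extend` interfaces are assumptions, and the standard speculative accept/reject loop is omitted.

```python
# A simplified sketch of a vertical cascade in the spirit of CS Drafting, not
# the paper's full algorithm: `models` is ordered from target (largest) to
# smallest drafter, and each model's draft is itself produced speculatively
# by the models below it.
def cascade_generate(prompt: str, models: list, k: int = 8) -> str:
    model, rest = models[0], models[1:]
    if not rest:
        # Base case: the smallest drafter generates autoregressively; in the
        # paper this lowest tier can even be a cheap statistical model.
        return model.generate(prompt, max_new_tokens=k)
    # Recursive case: obtain a draft from the smaller models in the cascade.
    draft = cascade_generate(prompt, rest, k)
    # One verification pass keeps the longest accepted prefix of the draft
    # (the usual speculative accept/reject step, details omitted here).
    return model.verify_and_extend(prompt, draft)
```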
- Online Speculative Decoding [34.987825705622555]
We introduce online speculative decoding to accelerate the inference of large language models.
The main idea is to continuously update the (multiple) draft model(s) on observed user query data.
We develop a prototype of online speculative decoding based on knowledge distillation and evaluate it using both synthetic and real query data.
arXiv Detail & Related papers (2023-10-11T04:03:42Z)
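The distillation-based update described in the Online Speculative Decoding entry can be sketched as follows, assuming a PyTorch-style interface; `draft`, `target`, `query_stream`, and the serving loop that logs queries are hypothetical stand-ins.

```python
# A minimal sketch of the online-distillation idea, assuming a PyTorch-style
# interface; the model handles and data stream are hypothetical.
import torch
import torch.nn.functional as F

def online_update(draft, target, query_stream, optimizer, temperature=2.0):
    """Continuously distill the draft model toward the target on user queries."""
    for token_ids in query_stream:         # token IDs logged from served queries
        with torch.no_grad():
            t_logits = target(token_ids)   # teacher logits (already available
                                           # from normal verification passes)
        d_logits = draft(token_ids)        # student logits
        # KL distillation loss: pull the draft distribution toward the target's.
        loss = F.kl_div(
            F.log_softmax(d_logits / temperature, dim=-1),
            F.softmax(t_logits / temperature, dim=-1),
            reduction="batchmean",
        ) * temperature ** 2
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```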