Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models
- URL: http://arxiv.org/abs/2505.03469v2
- Date: Wed, 21 May 2025 06:17:56 GMT
- Title: Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models
- Authors: Bin Yu, Hang Yuan, Haotian Li, Xueyin Xu, Yuliang Wei, Bailing Wang, Weizhen Qi, Kai Chen
- Abstract summary: We propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT). Our experiments demonstrate that models trained using LS-Mixture SFT, compared to those trained with direct SFT, achieved an average accuracy improvement of 2.3%. This work offers an approach to endow non-reasoning models with reasoning capabilities through supervised fine-tuning.
- Score: 23.34070841541423
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Recent advances in large language models have demonstrated that Supervised Fine-Tuning (SFT) with Chain-of-Thought (CoT) reasoning data distilled from large reasoning models (e.g., DeepSeek R1) can effectively transfer reasoning capabilities to non-reasoning models. However, models fine-tuned with this approach inherit the "overthinking" problem from teacher models, producing verbose and redundant reasoning chains during inference. To address this challenge, we propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT), which combines long CoT reasoning data with short counterparts obtained through structure-preserved rewriting. Our experiments demonstrate that models trained with the LS-Mixture SFT method, compared to those trained with direct SFT, achieved an average accuracy improvement of 2.3% across various benchmarks while reducing model response length by approximately 47.61%. This work offers an approach to endow non-reasoning models with reasoning capabilities through supervised fine-tuning while avoiding the overthinking problem inherited from teacher models, thereby enabling efficient reasoning in the fine-tuned models.
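At its core, LS-Mixture SFT is a data-construction recipe: each long CoT trace gets a structure-preserved short rewrite, and the two forms are mixed before ordinary supervised fine-tuning. Below is a minimal sketch of that mixing step; the `rewrite_short` stub, the record format, and the 50/50 mixing ratio are illustrative assumptions, since the abstract does not specify how the rewriting is performed or which ratio is used.

```python
import random

def rewrite_short(long_cot: str) -> str:
    """Stand-in for structure-preserved rewriting. Assumption: in the
    paper this is done by a model that compresses each reasoning step
    while keeping the step structure and final answer intact; here we
    naively keep only the first sentence of every step."""
    steps = [s.strip() for s in long_cot.split("\n") if s.strip()]
    return "\n".join(s.split(". ")[0] for s in steps)

def build_ls_mixture(long_cot_data, mix_ratio=0.5, seed=0):
    """Mix long CoT examples with their short rewrites.

    long_cot_data: list of dicts with 'question' and 'long_cot' keys
    (a hypothetical record format).
    mix_ratio: fraction of examples supervised on the short rewrite;
    the ratio actually used in the paper is not stated in the abstract.
    """
    rng = random.Random(seed)
    mixture = []
    for ex in long_cot_data:
        use_short = rng.random() < mix_ratio
        target = rewrite_short(ex["long_cot"]) if use_short else ex["long_cot"]
        mixture.append({"prompt": ex["question"], "response": target})
    rng.shuffle(mixture)  # interleave long and short targets for SFT
    return mixture
```

The resulting list can be fed to any standard SFT trainer; mixing is what lets the fine-tuned model retain long-CoT accuracy while learning that short chains are also acceptable.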
Related papers
- Don't Overthink It: A Survey of Efficient R1-style Large Reasoning Models [49.598776427454176]
Large Reasoning Models (LRMs) have become a research hotspot due to their outstanding performance on complex tasks. However, as these models see widespread use, the problem of overthinking has emerged. Various efficient reasoning methods have been proposed to reduce the length of reasoning paths without compromising model performance or reasoning capability.
arXiv Detail & Related papers (2025-08-04T06:54:31Z)
- Large Reasoning Models are not thinking straight: on the unreliability of thinking trajectories [0.0]
Large Language Models (LLMs) trained via Reinforcement Learning (RL) have recently achieved impressive results on reasoning benchmarks. Yet, growing evidence shows that these models often generate longer but ineffective chains of thought (CoTs). We present new evidence of overthinking, where models disregard correct solutions even when explicitly provided, instead continuing to generate unnecessary reasoning steps.
arXiv Detail & Related papers (2025-07-01T12:14:22Z)
- Lost at the Beginning of Reasoning [82.18834329384514]
We show that the first reasoning step exerts a disproportionately large influence on the final prediction. We propose an efficient sampling strategy that leverages a reward model to identify and retain high-quality first reasoning steps. We introduce a new benchmark specifically constructed with deliberately flawed first reasoning steps to systematically evaluate model self-correction capabilities.
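One way to read the proposed sampling strategy is as best-of-n applied only to the first reasoning step. The sketch below follows that reading; `sample_step` and `reward_model` are hypothetical callables standing in for the policy and reward models, which the summary does not specify.

```python
def select_first_step(prompt, sample_step, reward_model, n=8):
    """Best-of-n over *first* reasoning steps only.

    sample_step: callable(prompt) -> str, draws one candidate first
    step from the policy model.
    reward_model: callable(prompt, step) -> float, scores a step.
    Both are placeholders for model-backed implementations.
    """
    candidates = [sample_step(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n), key=lambda i: scores[i])
    # Decoding would then continue from the retained first step.
    return candidates[best]
```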
arXiv Detail & Related papers (2025-06-27T09:53:57Z)
- QFFT, Question-Free Fine-Tuning for Adaptive Reasoning [46.60300066127707]
Question-Free Fine-Tuning (QFFT) is a fine-tuning approach that removes the input question during training and learns exclusively from Long CoT responses. QFFT reduces average response length by more than 50% while achieving performance comparable to Supervised Fine-Tuning (SFT).
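Read literally, QFFT is a data transformation: drop the question and supervise on the Long CoT response alone. A minimal sketch of that preprocessing follows, assuming a simple (question, response) record format that the summary does not spell out.

```python
def to_qfft_examples(sft_data):
    """Convert standard SFT pairs into question-free examples.

    sft_data: iterable of dicts with 'question' and 'long_cot_response'
    keys (a hypothetical record format). The returned training text is
    the response alone, so the loss is computed only on the Long CoT;
    exact chat/formatting tokens are not specified in the summary.
    """
    return [{"text": ex["long_cot_response"]} for ex in sft_data]
```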
arXiv Detail & Related papers (2025-06-15T14:21:28Z)
- Through the Valley: Path to Effective Long CoT Training for Small Language Models [9.673301245621802]
Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models. We identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs) trained on limited long CoT data experience significant performance deterioration.
arXiv Detail & Related papers (2025-06-09T12:56:41Z)
- Accelerated Test-Time Scaling with Model-Free Speculative Sampling [58.69141724095398]
We introduce STAND (STochastic Adaptive N-gram Drafting), a novel model-free speculative decoding approach. We show that STAND reduces inference latency by 60-65% compared to standard autoregressive decoding. As a model-free approach, STAND can be applied to any existing language model without additional training.
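Model-free drafting of this kind can be illustrated with a plain n-gram table built from tokens the model has already produced: drafted continuations are accepted only while the target model agrees. The sketch below shows that verify-or-reject loop; STAND's stochastic and adaptive components are not reproduced here, and `target_next_token` is a placeholder for the actual LLM.

```python
from collections import defaultdict

class NGramDrafter:
    """Toy model-free drafter in the spirit of n-gram speculative
    decoding; it proposes continuations seen earlier in the context."""

    def __init__(self, n=3):
        self.n = n
        self.table = defaultdict(list)  # (n-1)-gram -> observed next tokens

    def observe(self, tokens):
        for i in range(len(tokens) - self.n + 1):
            ctx = tuple(tokens[i:i + self.n - 1])
            self.table[ctx].append(tokens[i + self.n - 1])

    def draft(self, tokens, k=4):
        out, proposed = list(tokens), []
        for _ in range(k):
            ctx = tuple(out[-(self.n - 1):])
            if ctx not in self.table:
                break
            nxt = self.table[ctx][-1]  # most recently observed continuation
            proposed.append(nxt)
            out.append(nxt)
        return proposed

def speculative_step(tokens, drafter, target_next_token):
    """Accept drafted tokens while the target model agrees; on the first
    mismatch, keep the target model's token. For clarity this calls the
    target once per token; the real speedup comes from verifying the
    whole draft in a single batched forward pass."""
    for tok in drafter.draft(tokens):
        real = target_next_token(tokens)
        if real != tok:
            tokens.append(real)
            return tokens
        tokens.append(tok)
    tokens.append(target_next_token(tokens))
    return tokens
```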
arXiv Detail & Related papers (2025-06-05T07:31:18Z)
- Towards Widening The Distillation Bottleneck for Reasoning Models [39.22557129190619]
Distillation, i.e., post-training on LRM-generated data, is a straightforward yet effective method for enhancing the reasoning abilities of smaller models. We found that distilled long CoT data poses learning difficulties for small models and leads to the inheritance of biases. We propose constructing tree-based CoT data from scratch via Monte Carlo Tree Search.
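As a rough illustration of tree-based CoT construction, the sketch below runs a generic Monte Carlo Tree Search over reasoning steps and extracts the most-visited path as a training chain. The `propose_steps` and `rollout_reward` callables are assumptions standing in for model-backed step generation and answer checking; the paper's actual search configuration is not given in the summary.

```python
import math
import random

class Node:
    def __init__(self, step, parent=None):
        self.step = step          # one reasoning step (a string)
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

def ucb(node, c=1.4):
    """Upper confidence bound for tree search; unvisited nodes first."""
    if node.visits == 0:
        return float("inf")
    return node.value / node.visits + c * math.sqrt(
        math.log(node.parent.visits) / node.visits)

def mcts_cot(question, propose_steps, rollout_reward, iters=100):
    """propose_steps: callable(path) -> list of candidate next steps.
    rollout_reward: callable(path) -> float in [0, 1], e.g. an answer
    check. Both are placeholders for model-backed components."""
    root = Node(question)
    for _ in range(iters):
        # Selection: descend by UCB until reaching a leaf.
        node, path = root, [root.step]
        while node.children:
            node = max(node.children, key=ucb)
            path.append(node.step)
        # Expansion: attach candidate next steps to the leaf.
        for step in propose_steps(path):
            node.children.append(Node(step, parent=node))
        if node.children:
            node = random.choice(node.children)
            path.append(node.step)
        # Simulation and backpropagation.
        reward = rollout_reward(path)
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Extract the most-visited path as one CoT training example.
    chain, node = [], root
    while node.children:
        node = max(node.children, key=lambda n: n.visits)
        chain.append(node.step)
    return chain
```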
arXiv Detail & Related papers (2025-03-03T12:17:36Z)
- Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [57.17826305464394]
o1-like models produce long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs). We introduce DeltaBench, which includes long CoTs generated by different o1-like models on different reasoning tasks. Based on DeltaBench, we first perform a fine-grained analysis of the generated long CoTs to assess the effectiveness and efficiency of different o1-like models.
arXiv Detail & Related papers (2025-02-26T17:59:27Z)
- Towards Thinking-Optimal Scaling of Test-Time Compute for LLM Reasoning [113.49074603075032]
Recent studies have shown that making a model spend more time thinking through longer Chains of Thought (CoTs) yields significant improvements on complex reasoning tasks. However, we explore whether scaling to longer CoTs can in fact impair the reasoning performance of Large Language Models (LLMs) in certain domains.
arXiv Detail & Related papers (2025-02-25T10:48:05Z)
- Small Models Struggle to Learn from Strong Reasoners [14.895026967556088]
Small models do not consistently benefit from long chain-of-thought reasoning or from distillation from larger models. We propose Mix Distillation, a strategy that balances reasoning complexity by combining long and short CoT examples, or reasoning from both larger and smaller models. Our experiments demonstrate that Mix Distillation significantly improves small-model reasoning performance compared to training on either type of data alone.
arXiv Detail & Related papers (2025-02-17T18:56:15Z)
- Fine-Tuning with Divergent Chains of Thought Boosts Reasoning Through Self-Correction in Language Models [63.36637269634553]
We present a novel method of further improving performance by requiring models to compare multiple reasoning chains.
We find that instruction tuning on DCoT datasets boosts the performance of even smaller, and therefore more accessible, language models.
arXiv Detail & Related papers (2024-07-03T15:01:18Z)
- ChainLM: Empowering Large Language Models with Improved Chain-of-Thought Prompting [124.69672273754144]
Chain-of-Thought (CoT) prompting can enhance the reasoning capabilities of large language models (LLMs).
Existing CoT approaches usually focus on simpler reasoning tasks and thus result in low-quality and inconsistent CoT prompts.
We introduce CoTGenius, a novel framework designed for the automatic generation of superior CoT prompts.
arXiv Detail & Related papers (2024-03-21T11:34:26Z)
- Diffusion of Thoughts: Chain-of-Thought Reasoning in Diffusion Language Models [100.53662473219806]
Diffusion-of-Thought (DoT) is a novel approach that integrates diffusion models with Chain-of-Thought. DoT allows reasoning steps to diffuse over time through a diffusion language model. Our results demonstrate the effectiveness of DoT in multi-digit multiplication, logic, and grade school math problems.
arXiv Detail & Related papers (2024-02-12T16:23:28Z)
This list is automatically generated from the titles and abstracts of the papers on this site.