LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- URL: http://arxiv.org/abs/2502.07374v1
- Date: Tue, 11 Feb 2025 08:48:48 GMT
- Title: LLMs Can Easily Learn to Reason from Demonstrations Structure, not content, is what matters!
- Authors: Dacheng Li, Shiyi Cao, Tyler Griggs, Shu Liu, Xiangxi Mo, Shishir G. Patil, Matei Zaharia, Joseph E. Gonzalez, Ion Stoica,
- Abstract summary: Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT)<n>We find that a Large Language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA)<n>With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks.
- Score: 56.75518291450102
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large reasoning models (LRMs) tackle complex reasoning problems by following long chain-of-thoughts (Long CoT) that incorporate reflection, backtracking, and self-validation. However, the training techniques and data requirements to elicit Long CoT remain poorly understood. In this work, we find that a Large Language model (LLM) can effectively learn Long CoT reasoning through data-efficient supervised fine-tuning (SFT) and parameter-efficient low-rank adaptation (LoRA). With just 17k long CoT training samples, the Qwen2.5-32B-Instruct model achieves significant improvements on a wide range of math and coding benchmarks, including 56.7% (+40.0%) on AIME 2024 and 57.0% (+8.1%) on LiveCodeBench, competitive to the proprietary o1-preview model's score of 44.6% and 59.1%. More importantly, we find that the structure of Long CoT is critical to the learning process, whereas the content of individual reasoning steps has minimal impact. Perturbations affecting content, such as training on incorrect samples or removing reasoning keywords, have little impact on performance. In contrast, structural modifications that disrupt logical consistency in the Long CoT, such as shuffling or deleting reasoning steps, significantly degrade accuracy. For example, a model trained on Long CoT samples with incorrect answers still achieves only 3.2% lower accuracy compared to training with fully correct samples. These insights deepen our understanding of how to elicit reasoning capabilities in LLMs and highlight key considerations for efficiently training the next generation of reasoning models. This is the academic paper of our previous released Sky-T1-32B-Preview model. Codes are available at https://github.com/NovaSky-AI/SkyThought.
Related papers
- Logit Arithmetic Elicits Long Reasoning Capabilities Without Training [14.015546463427732]
Large reasoning models (LRMs) can do complex reasoning via long chain-of-thought (CoT) involving cognitive strategies such as backtracking and self-correction.<n>Recent studies suggest that some models inherently possess these long reasoning abilities, which may be unlocked via extra training.<n>We propose a decoding-time approach, ThinkLogit, to tune a target large LM for long reasoning using a substantially smaller model as guider.
arXiv Detail & Related papers (2025-07-17T03:31:36Z) - The Challenge of Teaching Reasoning to LLMs Without RL or Distillation [31.973226821366325]
Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought traces.<n>We ask whether long CoT can be induced in a base model using only prompting or minimal tuning.<n>The resulting model outperforms the much larger textttQwen2.5-Math-72B-Instruct, showing that a handful of high-quality examples can unlock strong reasoning capabilities.
arXiv Detail & Related papers (2025-07-14T01:14:50Z) - Teaching LLM to Reason: Reinforcement Learning from Algorithmic Problems without Code [76.80306464249217]
We propose TeaR, which aims at teaching LLMs to reason better.<n>TeaR leverages careful data curation and reinforcement learning to guide models in discovering optimal reasoning paths through code-related tasks.<n>We conduct extensive experiments using two base models and three long-CoT distillation models, with model sizes ranging from 1.5 billion to 32 billion parameters, and across 17 benchmarks spanning Math, Knowledge, Code, and Logical Reasoning.
arXiv Detail & Related papers (2025-07-10T07:34:05Z) - Through the Valley: Path to Effective Long CoT Training for Small Language Models [9.673301245621802]
Long chain-of-thought (CoT) supervision has become a common strategy to enhance reasoning in language models.<n>We identify a phenomenon we call Long CoT Degradation, in which small language models (SLMs) trained on limited long CoT data experience significant performance deterioration.
arXiv Detail & Related papers (2025-06-09T12:56:41Z) - Unleashing the Reasoning Potential of Pre-trained LLMs by Critique Fine-Tuning on One Problem [53.3188041952701]
We show that Critique Fine-Tuning (CFT) on only one problem can effectively unleash the reasoning potential of LLMs.<n>With just 5 GPU hours of training, Qwen-Math-7B-CFT show an average improvement of 15% on six math benchmarks and 16% on three logic reasoning benchmarks.<n>Results are comparable to or even surpass the results from RL with 20x less compute.
arXiv Detail & Related papers (2025-06-03T18:35:52Z) - TL;DR: Too Long, Do Re-weighting for Efficient LLM Reasoning Compression [55.37723860832064]
We propose a dynamic ratio-based training pipeline that does not rely on sophisticated data annotations.<n>We validate our approach across models on DeepSeek-R1-Distill-7B and DeepSeek-R1-Distill-14B and on a diverse set of benchmarks with varying difficulty levels.
arXiv Detail & Related papers (2025-06-03T09:23:41Z) - CoThink: Token-Efficient Reasoning via Instruct Models Guiding Reasoning Models [56.40065909544213]
Large language models (LLMs) benefit from increased test-time compute, a phenomenon known as test-time scaling.<n>However, reasoning-optimized models often overthink even simple problems, producing excessively verbose outputs and leading to low token efficiency.<n>We identify two key causes of this verbosity: (1) reinforcement learning reduces the information density of forward reasoning, and (2) backward chain-of thought training encourages redundant and often unnecessary verification steps.
arXiv Detail & Related papers (2025-05-28T06:24:45Z) - Concise Reasoning, Big Gains: Pruning Long Reasoning Trace with Difficulty-Aware Prompting [28.537281448659634]
We propose a difficulty-aware prompting (DAP) method to dynamically shorten reasoning traces without performance loss.<n>In experiments, a student model fine-tuned on just 100K of these difficulty-pruned CoT samples outperforms a model distilled on 800K original Long CoT samples.<n>Our method also generalizes well: across 11 diverse benchmarks, the shorter difficulty-aware CoTs achieve equal or better accuracy than Long chains, using far fewer tokens.
arXiv Detail & Related papers (2025-05-26T09:04:44Z) - SplitReason: Learning To Offload Reasoning [7.016347390223799]
Reasoning in large language models (LLMs) tends to produce substantially longer token generation sequences than simpler language modeling tasks.
We leverage this by offloading only the most challenging parts of the reasoning process to a larger, more capable model.
This approach improves AIME24 reasoning accuracy by 24% and 28.3% while offloading 1.35% and 5% of the generated tokens respectively.
arXiv Detail & Related papers (2025-04-23T03:00:02Z) - Climbing the Ladder of Reasoning: What LLMs Can-and Still Can't-Solve after SFT? [59.418994222096885]
We conduct a detailed analysis of model performance on the AIME24 dataset.
We categorize questions into four tiers (Easy, Medium, Hard, and Extremely Hard)
We find that progression from Easy to Medium tier requires adopting an R1 reasoning style with minimal SFT-1K instances.
Exh-level questions present a fundamentally different challenge; they require unconventional problem-solving skills.
arXiv Detail & Related papers (2025-04-16T03:39:38Z) - Long Is More Important Than Difficult for Training Reasoning Models [21.369780872368143]
We show that reasoning length, rather than problem difficulty, primarily influences the performance of trained models.
We present our model, Long1K-32B, which achieves remarkable performance with only 1,000 training samples.
arXiv Detail & Related papers (2025-03-23T13:33:59Z) - S$^2$R: Teaching LLMs to Self-verify and Self-correct via Reinforcement Learning [51.84977135926156]
We introduce S$2$R, an efficient framework that enhances LLM reasoning by teaching models to self-verify and self-correct during inference.
Our results demonstrate that Qwen2.5-math-7B achieves an accuracy improvement from 51.0% to 81.6%, outperforming models trained on an equivalent amount of long-CoT distilled data.
arXiv Detail & Related papers (2025-02-18T13:40:22Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [53.77747102201451]
Chain-of-thought (CoT) reasoning enhances the multi-step reasoning capabilities of large language models (LLMs)<n>However, for most models and tasks, does an increase in CoT length consistently lead to improved reasoning accuracy?<n>In this paper, we observe a nuanced relationship: as the number of reasoning steps increases, performance initially improves but eventually decreases.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - Exploring the Limit of Outcome Reward for Learning Mathematical Reasoning [65.2421542320293]
Reasoning abilities are crucial components of general intelligence.<n>Recent advances by proprietary companies, such as o-series models of OpenAI, have made remarkable progress on reasoning tasks.<n>This paper proposes a new RL framework, termed OREAL, to pursue the performance limit that can be achieved through textbfOutcome textbfREwtextbfArd-based reinforcement textbfLearning for mathematical reasoning tasks.
arXiv Detail & Related papers (2025-02-10T18:57:29Z) - BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation [88.77999917897702]
o1 from OpenAI has demonstrated remarkable reasoning capabilities.<n>Many teams have attempted to replicate its LongCoT and reasoning capabilities.<n>This paper introduces a novel approach to enable LLM's LongCoT capacity without distillation from o1-like models or expensive human annotations.
arXiv Detail & Related papers (2025-02-06T08:19:59Z) - Language Models are Hidden Reasoners: Unlocking Latent Reasoning Capabilities via Self-Rewarding [74.31981011985681]
Large language models (LLMs) have shown impressive capabilities, but still struggle with complex reasoning tasks requiring multiple steps.
We introduce LaTent Reasoning Optimization (LaTRO), a principled framework that formulates reasoning as sampling from a latent distribution.
We validate LaTRO through experiments on GSM8K and ARC-Challenge datasets using multiple model architectures.
arXiv Detail & Related papers (2024-11-06T22:02:30Z) - Improve Vision Language Model Chain-of-thought Reasoning [86.83335752119741]
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness.
We show that training VLM on short answers does not generalize well to reasoning tasks that require more detailed responses.
arXiv Detail & Related papers (2024-10-21T17:00:06Z) - Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation [24.272384832200522]
We propose mistaktextbfE-textbfDriven key reasontextbfIng step distillatextbfTion (textbfEDIT)
We design prompts to generate dual CoTs data with similar reasoning paths but divergent conclusions.
Experiments validate the effectiveness of EDIT across both in-domain and out-of-domain benchmark reasoning datasets.
arXiv Detail & Related papers (2024-05-30T06:32:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.