Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection
- URL: http://arxiv.org/abs/2505.22131v2
- Date: Tue, 14 Oct 2025 07:56:38 GMT
- Title: Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection
- Authors: Zhuoyang Wu, Xinze Li, Zhenghao Liu, Yukun Yan, Zhiyuan Liu, Minghe Yu, Cheng Yang, Yu Gu, Ge Yu, Maosong Sun,
- Abstract summary: errOr-aware self-ReflectION (ORION) is a framework that refines teacher CoTs through an Error-Aware Reflection process.<n> Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines.
- Score: 64.73809794561305
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Large Language Models (LLMs) have exhibited strong reasoning capabilities and achieved remarkable performance in mathematical problem-solving tasks. Recently, distilling reasoning ability from long-form Chains-of-Thought (CoTs) has emerged as a promising approach for enhancing Small Language Models (SLMs). Existing studies typically treat SLMs as student models and use long-form CoTs as supervision signals for Supervised Fine-Tuning (SFT) to transfer reasoning ability. However, such long-form CoT teachers are usually unaware of the student model's capacity, which limits the effective utilization of the provided reasoning traces. To overcome this limitation, we propose errOr-aware self-ReflectION (ORION), a framework that refines teacher CoTs through an Error-Aware Reflection process. ORION enables the student model to construct more tailored teacher CoTs by refining teacher CoTs and incorporating its own reasoning errors. Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines. Further analysis reveals that the CoTs constructed by ORION exhibit higher coherence and logical consistency, thereby serving as more effective supervision signals for SFT. All codes are available at https://github.com/NEUIR/ORION.git.
Related papers
- Long-Chain Reasoning Distillation via Adaptive Prefix Alignment [57.130176131042965]
We propose a framework that exploits teacher CoTs for distillation through adaptive prefix alignment.<n>P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise.<n>Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%.
arXiv Detail & Related papers (2026-01-15T04:40:45Z) - Merge-of-Thought Distillation [23.53356244978525]
Merge-of-Thought Distillation (MoT) is a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging the resulting student variants.<n>On competition math benchmarks, applying MoT to a Qwen3-14B student surpasses strong models including Deepseek-R1, Qwen3-32B, and OpenAI-O1.<n>MoT consistently outperforms the best single-teacher distillation, improves general reasoning beyond mathematics, and shows robustness to distribution-shifted and peer-level teachers.
arXiv Detail & Related papers (2025-09-10T17:46:57Z) - Error Detection and Correction for Interpretable Mathematics in Large Language Models [5.258949636570995]
EDCIM (Error Detection and Correction for Interpretable Mathematics) is a method for detecting and correcting these errors in interpretable mathematics tasks.<n>It integrates lightweight, open-source LLMs with more powerful proprietary models, balancing cost and accuracy.<n> Experimental results show that EDCIM significantly reduces both computational and financial costs, while maintaining, and even improving, prediction accuracy.
arXiv Detail & Related papers (2025-08-05T14:30:35Z) - WarriorMath: Enhancing the Mathematical Ability of Large Language Models with a Defect-aware Framework [42.74246647841103]
WarriorMath is a defect-aware framework for mathematical problem solving.<n>We employ multiple expert LLMs in a collaborative process to generate, critique, and refine problems.<n>In the training stage, we introduce a progressive learning framework that iteratively fine-tunes the model using increasingly challenging data tailored to its weaknesses.
arXiv Detail & Related papers (2025-08-02T07:45:12Z) - The Challenge of Teaching Reasoning to LLMs Without RL or Distillation [31.973226821366325]
Reasoning-capable language models achieve state-of-the-art performance in diverse complex tasks by generating long, explicit Chain-of-Thought traces.<n>We ask whether long CoT can be induced in a base model using only prompting or minimal tuning.<n>The resulting model outperforms the much larger textttQwen2.5-Math-72B-Instruct, showing that a handful of high-quality examples can unlock strong reasoning capabilities.
arXiv Detail & Related papers (2025-07-14T01:14:50Z) - Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning [63.888013006686364]
Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of Large Language Models (LLMs)<n>We propose RLKD, a reinforcement learning-based distillation framework guided by a novel Generative Structure Reward Model (GSRM)<n>Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning.
arXiv Detail & Related papers (2025-05-22T02:36:36Z) - Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning Eliciting Efficient Reasoning in Large Language Models [23.34070841541423]
We propose Long-Short Chain-of-Thought Mixture Supervised Fine-Tuning (LS-Mixture SFT)<n>Our experiments demonstrate that models trained using LS-Mixture SFT, compared to those trained with direct SFT, achieved an average accuracy improvement of 2.3%.<n>This work offers an approach to endow non-reasoning models with reasoning capabilities through supervised fine-tuning.
arXiv Detail & Related papers (2025-05-06T12:18:11Z) - LEMMA: Learning from Errors for MatheMatical Advancement in LLMs [33.571479131705075]
We introduce Learning from Errors for Mathematical Advancement (LEMMA) to enhance large language models' reasoning ability.<n> LEMMA constructs data consisting of an incorrect solution with an erroneous step and a reflection connection to a correct solution for fine-tuning.<n> Experimental results demonstrate that LEMMA achieves significant performance improvements over other strong baselines.
arXiv Detail & Related papers (2025-03-21T17:59:10Z) - Can Large Language Models Detect Errors in Long Chain-of-Thought Reasoning? [57.17826305464394]
o1-like models produce the long Chain-of-Thought (CoT) reasoning steps to improve the reasoning abilities of existing Large Language Models (LLMs)<n>We introduce the DeltaBench, including the generated long CoTs from different o1-like models for different reasoning tasks.<n>Based on DeltaBench, we first perform fine-grained analysis of the generated long CoTs to discover the effectiveness and efficiency of different o1-like models.
arXiv Detail & Related papers (2025-02-26T17:59:27Z) - Unveiling the Key Factors for Distilling Chain-of-Thought Reasoning [33.02060729778806]
This study examines the factors influencing Chain-of-Thought (CoT) distillation in Small Language Models (SLMs)<n>We find that SLMs exhibit a non-monotonic relationship with granularity, with stronger models benefiting from finer-grained reasoning and weaker models performing better with simpler CoT supervision.<n>These findings emphasize the need to tailor CoT strategies to specific student model, offering actionable insights for optimizing CoT distillation in SLMs.
arXiv Detail & Related papers (2025-02-25T09:08:45Z) - When More is Less: Understanding Chain-of-Thought Length in LLMs [51.631483479081645]
Large Language Models (LLMs) employ Chain-of-Thought (CoT) reasoning to deconstruct complex problems.<n>This paper argues that longer CoTs are often presumed superior, arguing that longer is not always better.
arXiv Detail & Related papers (2025-02-11T05:28:59Z) - Error Classification of Large Language Models on Math Word Problems: A Dynamically Adaptive Framework [64.83955753606443]
Math Word Problems serve as a crucial benchmark for evaluating Large Language Models' reasoning abilities.<n>Current error classification methods rely on static and predefined categories.<n>We introduce MWPES-300K, a comprehensive dataset containing 304,865 error samples.
arXiv Detail & Related papers (2025-01-26T16:17:57Z) - Multi-Objective Large Language Model Unlearning [3.372396620898397]
Gradient Ascent (GA) is a proactive way to decrease the prediction probability of the model on the target data.<n>We propose Multi-Objective Large Language Model Unlearning (MOLLM) algorithm to overcome gradient explosion and catastrophic forgetting.<n>Our empirical results verify that MoLLM outperforms the SOTA GA-based LLM unlearning methods in terms of unlearning effect and model utility preservation.
arXiv Detail & Related papers (2024-12-29T09:35:56Z) - Subtle Errors in Reasoning: Preference Learning via Error-injected Self-editing [59.405145971637204]
We propose a novel preference learning framework called eRror-Injected Self-Editing (RISE)<n>RISE injects predefined subtle errors into pivotal tokens in reasoning or steps to construct hard pairs for error mitigation.<n>Experiments validate the effectiveness of RISE, with preference learning on Qwen2-7B-Instruct yielding notable improvements of 3.0% on GSM8K and 7.9% on MATH with only 4.5K training samples.
arXiv Detail & Related papers (2024-10-09T07:43:38Z) - S^3cMath: Spontaneous Step-level Self-correction Makes Large Language Models Better Mathematical Reasoners [23.713779973116733]
Self-correction is a method that can stimulate the potential reasoning abilities of large language models (LLMs)<n>We propose S$3$c-Math, which are able to perform Spontaneous Step-level Self-correction for Mathematical reasoning.
arXiv Detail & Related papers (2024-09-03T01:40:21Z) - CoMMIT: Coordinated Instruction Tuning for Multimodal Large Language Models [68.64605538559312]
In this paper, we analyze the MLLM instruction tuning from both theoretical and empirical perspectives.
Inspired by our findings, we propose a measurement to quantitatively evaluate the learning balance.
In addition, we introduce an auxiliary loss regularization method to promote updating of the generation distribution of MLLMs.
arXiv Detail & Related papers (2024-07-29T23:18:55Z) - LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement [93.38736019287224]
"LLMs-as-Instructors" framework autonomously enhances the training of smaller target models.
Inspired by the theory of "Learning from Errors", this framework employs an instructor LLM to meticulously analyze the specific errors within a target model.
Within this framework, we implement two strategies: "Learning from Error," which focuses solely on incorrect responses to tailor training data, and "Learning from Error by Contrast", which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors.
arXiv Detail & Related papers (2024-06-29T17:16:04Z) - Learning From Mistakes Makes LLM Better Reasoner [106.48571828587728]
Large language models (LLMs) recently exhibited remarkable reasoning capabilities on solving math problems.
This work explores whether LLMs can LEarn from MistAkes (LEMA), akin to the human learning process.
arXiv Detail & Related papers (2023-10-31T17:52:22Z) - Pareto Optimal Learning for Estimating Large Language Model Errors [12.21899680905672]
Large Language Models (LLMs) have shown impressive abilities in many applications.
We present a method that generates a risk score to estimate the probability of error in an LLM response by integrating multiple sources of information.
arXiv Detail & Related papers (2023-06-28T21:11:15Z) - SCOTT: Self-Consistent Chain-of-Thought Distillation [68.40232422158569]
Large language models (LMs) generate free-text rationales for their predictions via chain-of-thought prompting.
We propose a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger.
To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective.
arXiv Detail & Related papers (2023-05-03T03:47:00Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.