Long-Chain Reasoning Distillation via Adaptive Prefix Alignment
- URL: http://arxiv.org/abs/2601.10064v1
- Date: Thu, 15 Jan 2026 04:40:45 GMT
- Title: Long-Chain Reasoning Distillation via Adaptive Prefix Alignment
- Authors: Zhenghao Liu, Zhuoyang Wu, Xinze Li, Yukun Yan, Shuo Wang, Zulong Chen, Yu Gu, Ge Yu, Maosong Sun
- Abstract summary: We propose a framework that exploits teacher CoTs for distillation through adaptive prefix alignment. P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%.
- Score: 57.130176131042965
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) have demonstrated remarkable reasoning capabilities, particularly in solving complex mathematical problems. Recent studies show that distilling long reasoning trajectories can effectively enhance the reasoning performance of small-scale student models. However, teacher-generated reasoning trajectories are often excessively long and structurally complex, making them difficult for student models to learn. This mismatch leads to a gap between the provided supervision signal and the learning capacity of the student model. To address this challenge, we propose Prefix-ALIGNment distillation (P-ALIGN), a framework that fully exploits teacher CoTs for distillation through adaptive prefix alignment. Specifically, P-ALIGN adaptively truncates teacher-generated reasoning trajectories by determining whether the remaining suffix is concise and sufficient to guide the student model. Then, P-ALIGN leverages the teacher-generated prefix to supervise the student model, encouraging effective prefix alignment. Experiments on multiple mathematical reasoning benchmarks demonstrate that P-ALIGN outperforms all baselines by over 3%. Further analysis indicates that the prefixes constructed by P-ALIGN provide more effective supervision signals, while avoiding the negative impact of redundant and uncertain reasoning components. All code is available at https://github.com/NEUIR/P-ALIGN.
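As a rough illustration of the prefix-alignment idea, the sketch below (hypothetical helper names and truncation criterion; the paper's actual procedure may differ) scans candidate cut points in a teacher CoT and keeps the shortest prefix whose remaining suffix the student can already continue with high enough likelihood; that prefix is then used as the supervision target.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def suffix_logprob(student, question_ids, prefix_ids, suffix_ids):
    """Mean per-token log-probability the student assigns to the suffix,
    conditioned on the question plus the teacher prefix."""
    input_ids = torch.cat([question_ids, prefix_ids, suffix_ids]).unsqueeze(0)
    logits = student(input_ids=input_ids).logits[0]
    start = question_ids.numel() + prefix_ids.numel()
    logp = F.log_softmax(logits[start - 1 : -1], dim=-1)  # logits at t predict token t+1
    targets = input_ids[0, start:]
    return logp.gather(-1, targets.unsqueeze(-1)).mean().item()

def adaptive_prefix(student, question_ids, cot_steps, threshold=-1.0):
    """Return the shortest teacher prefix (in whole reasoning steps) whose
    remaining suffix the student can already continue with mean log-prob
    above `threshold`; fall back to the full trajectory otherwise."""
    empty = torch.empty(0, dtype=torch.long)
    for k in range(len(cot_steps)):
        prefix = torch.cat(cot_steps[:k]) if k > 0 else empty
        suffix = torch.cat(cot_steps[k:])
        if suffix_logprob(student, question_ids, prefix, suffix) >= threshold:
            return prefix  # suffix judged concise/easy enough for the student
    return torch.cat(cot_steps)
```

Sweeping `threshold` trades supervision length against how much of the teacher's reasoning the student must reproduce on its own.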
Related papers
- Which Reasoning Trajectories Teach Students to Reason Better? A Simple Metric of Informative Alignment [82.00769536768509]
Rank-Surprisal Ratio is a simple metric that captures both alignment and informativeness to assess the suitability of a reasoning trajectory. We demonstrate its practical utility in both trajectory selection and teacher selection.
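The summary above gives no formula, so the sketch below is only one plausible reading (all names and the combination of rank and surprisal are assumptions): a low mean token rank under the student suggests the trace is aligned with it, while high surprisal suggests it is informative.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_surprisal_ratio(student, input_ids, trace_start):
    """Mean token rank under the student (alignment; lower is more aligned)
    over mean surprisal (informativeness) for one tokenized trace."""
    logits = student(input_ids=input_ids.unsqueeze(0)).logits[0]
    logp = F.log_softmax(logits[trace_start - 1 : -1], dim=-1)
    targets = input_ids[trace_start:]
    tok_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    ranks = (logp > tok_logp.unsqueeze(-1)).sum(-1).float() + 1.0  # 1 = top token
    return ranks.mean().item() / (-tok_logp).mean().item()
```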
arXiv Detail & Related papers (2026-01-20T18:58:10Z)
- "The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework [16.96094045628127]
Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm for transferring reasoning prowess into compact student models (SLMs). We introduce COMPACT, a framework that adaptively fuses supervision from different teachers by dynamically weighting teacher gradients.
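A hedged sketch of what gradient-level fusion could look like (the cosine-similarity weighting and all names are assumptions, not COMPACT's published scheme): teachers whose gradients conflict with the consensus direction get down-weighted.

```python
import torch
import torch.nn.functional as F

def fused_update(student, batches_per_teacher, optimizer):
    """One step: collect a gradient per teacher, weight each by its cosine
    similarity to the mean gradient, write the fused gradient back, and step."""
    grads = []
    for batch in batches_per_teacher:      # one distillation batch per teacher
        student.zero_grad()
        loss = student(**batch).loss       # standard LM loss on that teacher's CoT
        loss.backward()
        grads.append(torch.cat([p.grad.flatten() for p in student.parameters()
                                if p.grad is not None]))
    mean_g = torch.stack(grads).mean(0)
    weights = torch.softmax(torch.stack(
        [F.cosine_similarity(g, mean_g, dim=0) for g in grads]), dim=0)
    fused = sum(w * g for w, g in zip(weights, grads))
    offset = 0
    for p in student.parameters():         # same order as when collecting grads
        if p.grad is None:
            continue
        n = p.grad.numel()
        p.grad.copy_(fused[offset:offset + n].view_as(p.grad))
        offset += n
    optimizer.step()
```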
arXiv Detail & Related papers (2026-01-20T14:05:19Z)
- Learning When to Stop: Adaptive Latent Reasoning via Reinforcement Learning [7.669927190506031]
We develop adaptive-length latent reasoning models and introduce a post-SFT reinforcement-learning methodology. Experiments on the Llama 3.2 1B model and the GSM8K-Aug dataset show a 52% drop in total reasoning length with no penalty to accuracy.
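A minimal sketch of a length-aware reward that could drive such a policy (the shaping below is an illustrative assumption, not the paper's exact reward):

```python
def reward(is_correct: bool, n_latent_steps: int, max_steps: int, beta: float = 0.1) -> float:
    """Reward correctness, minus a small per-step penalty so the policy
    learns to stop reasoning as early as accuracy allows."""
    return (1.0 if is_correct else 0.0) - beta * (n_latent_steps / max_steps)
```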
arXiv Detail & Related papers (2025-11-26T16:54:06Z)
- From Correction to Mastery: Reinforced Distillation of Large Language Model Agents [13.982204994247718]
Large Language Model agents excel at solving complex tasks through iterative reasoning and tool use. Existing distillation approaches train smaller students to imitate full teacher trajectories. We propose SCoRe, a student-centered framework in which the student generates training trajectories and the teacher corrects only the earliest error.
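A hedged sketch of that loop, with hypothetical `student`/`teacher` interfaces:

```python
def build_trajectory(student, teacher, task):
    """Student rolls out a trajectory; the teacher corrects only the
    earliest faulty step, then the student continues from the fix."""
    steps = student.rollout(task)
    for i, step in enumerate(steps):
        if not teacher.check_step(task, steps[:i], step):      # earliest error?
            fixed = teacher.correct_step(task, steps[:i], step)
            prefix = steps[:i] + [fixed]
            return prefix + student.rollout(task, prefix=prefix)
    return steps  # trajectory was already correct end-to-end
```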
arXiv Detail & Related papers (2025-09-12T15:34:07Z)
- Merge-of-Thought Distillation [23.53356244978525]
Merge-of-Thought Distillation (MoT) is a lightweight framework that alternates between teacher-specific supervised fine-tuning branches and weight-space merging of the resulting student variants. On competition math benchmarks, a Qwen3-14B student trained with MoT surpasses strong models including Deepseek-R1, Qwen3-32B, and OpenAI-O1. MoT consistently outperforms the best single-teacher distillation, improves general reasoning beyond mathematics, and shows robustness to distribution-shifted and peer-level teachers.
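The merge step itself can be as simple as parameter averaging; the sketch below assumes uniform weights (MoT may weight branches differently):

```python
import copy
import torch

def merge_students(students):
    """Uniformly average the weights of teacher-specific student branches
    into a single merged student."""
    merged = copy.deepcopy(students[0])
    branch_params = [dict(s.named_parameters()) for s in students]
    with torch.no_grad():
        for name, param in merged.named_parameters():
            param.copy_(torch.stack([bp[name] for bp in branch_params]).mean(dim=0))
    return merged
```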
arXiv Detail & Related papers (2025-09-10T17:46:57Z)
- NaturalThoughts: Selecting and Distilling Reasoning Traces for General Reasoning Tasks [65.70224757972068]
We select reasoning traces from a strong teacher model based on a large pool of questions from NaturalReasoning. We find that simply scaling up data size with random sampling is a strong baseline with steady performance gains. We also find that selecting difficult examples that require more diverse reasoning strategies is a more sample-efficient way to transfer the teacher model's reasoning skills.
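A minimal sketch of difficulty-based selection under one common proxy, the student's pass rate (an assumption; the paper also considers diversity of reasoning strategies):

```python
def select_hard(questions, student_solves, k=8, max_pass_rate=0.5):
    """Keep questions the current student solves at most half the time
    across k sampled attempts (harder examples, by this proxy)."""
    keep = []
    for q in questions:
        passes = sum(student_solves(q) for _ in range(k))  # student_solves -> bool
        if passes / k <= max_pass_rate:
            keep.append(q)
    return keep
```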
arXiv Detail & Related papers (2025-07-02T17:30:24Z)
- Enhancing Long-Chain Reasoning Distillation through Error-Aware Self-Reflection [64.73809794561305]
errOr-aware self-ReflectION (ORION) is a framework that refines teacher CoTs through an Error-Aware Reflection process. Experiments on multiple mathematical reasoning benchmarks demonstrate that ORION consistently improves performance by more than 2% over all baselines.
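An illustrative sketch of step-wise reflection with hypothetical interface names (ORION's actual procedure may differ): the teacher re-reads its own CoT, flags suspect steps, and rewrites only those.

```python
def refine_cot(teacher, question, cot_steps):
    """Walk the teacher's own CoT, flag suspect steps, and rewrite only those."""
    refined = []
    for step in cot_steps:
        if teacher.flags_error(question, refined, step):    # error-aware check
            step = teacher.rewrite_step(question, refined)  # self-reflection
        refined.append(step)
    return refined
```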
arXiv Detail & Related papers (2025-05-28T08:57:03Z)
- Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning [63.888013006686364]
Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of Large Language Models (LLMs). We propose RLKD, a reinforcement learning-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning.
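A hedged sketch of a structural-alignment reward (the decomposition into step labels and the overlap score are assumptions): compare the sequence of step *types* rather than the surface text.

```python
def structure_reward(decompose, student_path: str, teacher_path: str) -> float:
    """`decompose` maps a reasoning path to a list of step labels; the reward
    is the longest-common-subsequence ratio over those labels."""
    s, t = decompose(student_path), decompose(teacher_path)
    m, n = len(s), len(t)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m):
        for j in range(n):
            dp[i + 1][j + 1] = dp[i][j] + 1 if s[i] == t[j] else max(dp[i][j + 1], dp[i + 1][j])
    return dp[m][n] / max(m, n, 1)
```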
arXiv Detail & Related papers (2025-05-22T02:36:36Z)
- SCOTT: Self-Consistent Chain-of-Thought Distillation [68.40232422158569]
Large language models (LMs) generate free-text rationales for their predictions via chain-of-thought prompting.
We propose a faithful knowledge distillation method to learn a small, self-consistent CoT model from a teacher model that is orders of magnitude larger.
To ensure faithful distillation, we use the teacher-generated rationales to learn a student LM with a counterfactual reasoning objective.
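A minimal sketch of what a counterfactual training pair could look like (field names are assumptions): the student also sees a perturbed rationale paired with the answer it actually supports, so its prediction must depend on the rationale.

```python
def make_examples(question, rationale, gold_answer, cf_rationale, cf_answer):
    """Pair each question with both the faithful rationale/answer and a
    counterfactual rationale plus the answer that rationale supports."""
    return [
        {"input": f"{question}\nRationale: {rationale}", "target": gold_answer},
        {"input": f"{question}\nRationale: {cf_rationale}", "target": cf_answer},
    ]
```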
arXiv Detail & Related papers (2023-05-03T03:47:00Z)