Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
- URL: http://arxiv.org/abs/2601.09088v1
- Date: Wed, 14 Jan 2026 02:43:17 GMT
- Title: Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning
- Authors: Shaotian Yan, Kaiyuan Liu, Chen Shen, Bing Wang, Sinan Fan, Jun Zhang, Yue Wu, Zheng Wang, Jieping Ye
- Abstract summary: We introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In this report, we introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation -- even outperforming several larger models. We begin by critically reexamining a widely adopted distillation paradigm in the community: SFT on teacher-generated responses, also known as sequence-level distillation. Although a series of recent works following this scheme have demonstrated remarkable efficiency and strong empirical performance, they are primarily grounded in the SFT perspective. Consequently, these approaches focus predominantly on designing heuristic rules for SFT data filtering, while largely overlooking the core principle of distillation itself -- enabling the student model to learn the teacher's full output distribution so as to inherit its generalization capability. Specifically, we identify three critical limitations in current practice: i) Inadequate representation of the teacher's sequence-level distribution; ii) Misalignment between the teacher's output distribution and the student's learning capacity; and iii) Exposure bias arising from teacher-forced training versus autoregressive inference. In summary, these shortcomings reflect a systemic absence of explicit teacher-student interaction throughout the distillation process, leaving the essence of distillation underexploited. To address these issues, we propose several methodological innovations that collectively form an enhanced sequence-level distillation training pipeline. Remarkably, DASD-4B-Thinking obtains competitive results using only 448K training samples -- an order of magnitude fewer than those employed by most existing open-source efforts. To support community research, we publicly release our models and the training dataset.
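The abstract's core contrast, between SFT on a single teacher-sampled trace (sequence-level distillation) and matching the teacher's full output distribution, can be made concrete with a toy next-token example. Everything below (the three-token vocabulary, the probabilities, the function names) is illustrative, not taken from the paper:

```python
import math

# Toy next-token distributions over a three-token vocabulary
# (illustrative numbers, not from the paper).
teacher = {"a": 0.7, "b": 0.2, "c": 0.1}
student = {"a": 0.5, "b": 0.3, "c": 0.2}

def sequence_level_loss(student_p, sampled_token):
    """SFT on a teacher trace reduces, per step, to cross-entropy on the
    one token the teacher happened to emit -- the tail is never seen."""
    return -math.log(student_p[sampled_token])

def distribution_level_loss(teacher_p, student_p):
    """Matching the full distribution: KL(teacher || student) sums over
    every vocabulary entry, including low-probability tail tokens."""
    return sum(p * math.log(p / student_p[t]) for t, p in teacher_p.items())

seq_loss = sequence_level_loss(student, "a")           # supervision from one outcome
dist_loss = distribution_level_loss(teacher, student)  # supervision from all outcomes
```

The sequence-level loss gives the student gradient signal only through the sampled token, which is one way to see the first limitation the abstract identifies: a single trace under-represents the teacher's sequence-level distribution.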
Related papers
- Reinforcement-aware Knowledge Distillation for LLM Reasoning [63.53679456364683]
Reinforcement learning (RL) post-training has recently driven gains in long chain-of-thought reasoning large language models (LLMs). Most existing knowledge distillation methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. We propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update.
arXiv Detail & Related papers (2026-02-26T00:20:39Z) - Don't Ignore the Tail: Decoupling top-K Probabilities for Efficient Language Model Distillation [50.19746127327559]
We propose a new tail-aware divergence that decouples the contribution of the teacher model's top-K predicted probabilities from that of lower-probability predictions. Experimental results demonstrate that our modified distillation method yields competitive performance in both pre-training and supervised distillation of decoder models.
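Read one way, a tail-aware divergence like the one described above splits the KL sum into the teacher's top-K tokens (the head) and everything else (the tail), so the two parts can be weighted independently. The sketch below follows that reading; the `head_w`/`tail_w` weights and the function name are hypothetical, not the paper's exact formulation:

```python
import math

def decoupled_topk_divergence(teacher, student, k=2, head_w=1.0, tail_w=1.0):
    """Split KL(teacher || student) into head (teacher's top-k tokens) and
    tail (the rest), weighted separately. With head_w == tail_w == 1.0 this
    recovers the ordinary full-vocabulary KL divergence."""
    ranked = sorted(teacher, key=teacher.get, reverse=True)
    head, tail = ranked[:k], ranked[k:]

    def kl_part(tokens):
        return sum(teacher[t] * math.log(teacher[t] / student[t]) for t in tokens)

    return head_w * kl_part(head) + tail_w * kl_part(tail)

# Illustrative distributions over a three-token vocabulary.
teacher = {"a": 0.7, "b": 0.2, "c": 0.1}
student = {"a": 0.5, "b": 0.3, "c": 0.2}
full = decoupled_topk_divergence(teacher, student)                # plain KL
head_only = decoupled_topk_divergence(teacher, student, tail_w=0.0)
```

Setting `tail_w` to zero discards exactly the low-probability signal the paper's title warns against ignoring; the decoupling makes that trade-off an explicit knob rather than an implicit property of the loss.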
arXiv Detail & Related papers (2026-02-24T11:54:06Z) - Decoupled DMD: CFG Augmentation as the Spear, Distribution Matching as the Shield [54.328202401611264]
Diffusion model distillation has emerged as a powerful technique for creating efficient few-step and single-step generators. We show that the primary driver of few-step distillation is not distribution matching, but a previously overlooked component we identify as CFG Augmentation (CA). We propose principled modifications to the distillation process, such as decoupling the noise schedules for the engine and the regularizer, leading to further performance gains.
arXiv Detail & Related papers (2025-11-27T18:24:28Z) - Group Relative Knowledge Distillation: Learning from Teacher's Relational Inductive Bias [5.434571018755813]
Group Relative Knowledge Distillation (GRKD) is a novel framework that distills teacher knowledge by learning the relative ranking among classes. Experiments on classification benchmarks demonstrate that GRKD achieves superior generalization compared to existing methods.
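One simple instance of "learning the relative ranking among classes" is a pairwise hinge loss over adjacent classes in the teacher's ordering. The sketch below is that generic surrogate; the margin value and hinge form are illustrative assumptions, not GRKD's actual objective:

```python
def ranking_distillation_loss(teacher_logits, student_logits, margin=0.1):
    """Penalize the student whenever its logits violate the teacher's class
    ordering: each adjacently-ranked pair should stay separated by `margin`."""
    order = sorted(range(len(teacher_logits)), key=lambda i: -teacher_logits[i])
    loss = 0.0
    for hi, lo in zip(order, order[1:]):
        # Hinge: zero once the student keeps class `hi` above class `lo`
        # by at least `margin`.
        loss += max(0.0, margin - (student_logits[hi] - student_logits[lo]))
    return loss

aligned = ranking_distillation_loss([3.0, 2.0, 1.0], [3.0, 2.0, 1.0])   # 0.0
violated = ranking_distillation_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])  # > 0
```

Unlike a full KL match, a ranking objective only constrains the order of the student's logits, which is one plausible reason such relational targets can transfer more easily to a lower-capacity student.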
arXiv Detail & Related papers (2025-04-29T07:23:22Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations to distill from task-relevant representations only. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - TwT: Thinking without Tokens by Habitual Reasoning Distillation with Multi-Teachers' Guidance [32.6122298575412]
We propose TwT, a method that reduces inference-time costs through habitual reasoning distillation with multi-teacher guidance. Our approach internalizes explicit reasoning into the model's habitual behavior through a Teacher-Guided compression strategy. Experimental results demonstrate that TwT effectively reduces inference costs while preserving superior performance.
arXiv Detail & Related papers (2025-03-31T15:16:31Z) - Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? [58.80794196076336]
Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT). We propose a novel distillation pipeline that transfers both responses and rewards. Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses.
arXiv Detail & Related papers (2025-02-26T20:50:11Z) - Efficient Diffusion as Low Light Enhancer [63.789138528062225]
Reflectance-Aware Trajectory Refinement (RATR) is a simple yet effective module to refine the teacher trajectory using the reflectance component of images.
Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT) is an efficient and flexible distillation framework tailored for Low-Light Image Enhancement (LLIE).
arXiv Detail & Related papers (2024-10-16T08:07:18Z) - Linear Projections of Teacher Embeddings for Few-Class Distillation [14.99228980898161]
Knowledge Distillation (KD) has emerged as a promising approach for transferring knowledge from a larger, more complex teacher model to a smaller student model.
We introduce a novel method for distilling knowledge from the teacher model's representations, which we term Learning Embedding Linear Projections (LELP).
Our experimental evaluation on large-scale NLP benchmarks like Amazon Reviews and Sentiment140 demonstrates that LELP is consistently competitive with, and typically superior to, existing state-of-the-art distillation algorithms for binary and few-class problems.
arXiv Detail & Related papers (2024-09-30T16:07:34Z) - Towards Effective Collaborative Learning in Long-Tailed Recognition [16.202524991074416]
Real-world data usually suffers from severe class imbalance and long-tailed distributions, where minority classes are significantly underrepresented.
Recent research prefers to utilize multi-expert architectures to mitigate the model uncertainty on the minority.
In this paper, we observe that the knowledge transfer between experts is imbalanced in terms of class distribution, which results in limited performance improvement of the minority classes.
arXiv Detail & Related papers (2023-05-05T09:16:06Z) - Mind the Gap in Distilling StyleGANs [100.58444291751015]
StyleGAN family is one of the most popular Generative Adversarial Networks (GANs) for unconditional generation.
This paper provides a comprehensive study of distilling from the popular StyleGAN-like architecture.
arXiv Detail & Related papers (2022-08-18T14:18:29Z) - Anomaly Detection via Reverse Distillation from One-Class Embedding [2.715884199292287]
We propose a novel T-S model consisting of a teacher encoder and a student decoder.
Instead of receiving raw images directly, the student network takes the teacher model's one-class embedding as input.
In addition, we introduce a trainable one-class bottleneck embedding module in our T-S model.
arXiv Detail & Related papers (2022-01-26T01:48:37Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences.