Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
- URL: http://arxiv.org/abs/2602.12125v1
- Date: Thu, 12 Feb 2026 16:14:29 GMT
- Title: Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation
- Authors: Wenkai Yang, Weijie Liu, Ruobing Xie, Kai Yang, Saiyong Yang, Yankai Lin,
- Abstract summary: On-policy distillation (OPD) has demonstrated strong empirical gains in improving student performance. This work introduces a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. In particular, in the setting where knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, is merged back into the original student, ExOPD enables the student to surpass even the teacher's performance boundary.
- Score: 57.524909883706556
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: On-policy distillation (OPD), which aligns the student with the teacher's logit distribution on student-generated trajectories, has demonstrated strong empirical gains in improving student performance and often outperforms off-policy distillation and reinforcement learning (RL) paradigms. In this work, we first theoretically show that OPD is a special case of dense KL-constrained RL where the reward function and the KL regularization are always weighted equally and the reference model can be any model. Then, we propose the Generalized On-Policy Distillation (G-OPD) framework, which extends the standard OPD objective by introducing a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. Through comprehensive experiments on math reasoning and code generation tasks, we derive two novel insights: (1) Setting the reward scaling factor to be greater than 1 (i.e., reward extrapolation), which we term ExOPD, consistently improves over standard OPD across a range of teacher-student size pairings. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, back into the original student, ExOPD enables the student to surpass even the teacher's performance boundary and outperform the domain teachers. (2) Building on ExOPD, we further find that in the strong-to-weak distillation setting (i.e., distilling a smaller student from a larger teacher), performing reward correction by choosing the reference model as the teacher's base model before RL yields a more accurate reward signal and further improves distillation performance. However, this choice assumes access to the teacher's pre-RL variant and incurs more computational overhead. We hope our work offers new insights for future research on OPD.
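The abstract's construction can be sketched concretely. What follows is one plausible reading of the objective described above, under assumed notation (student, teacher, reference model, reward scale alpha), not the paper's exact equations: the student is trained on its own trajectories with a dense, teacher-derived token-level reward, traded off against a per-token KL penalty toward the reference model.

```latex
% Notation assumed for this sketch (not verbatim from the paper):
%   student \pi_\theta, teacher \pi_T, reference \pi_{\mathrm{ref}},
%   reward scaling factor \alpha, prompt x, student-sampled response y.
% Dense per-token reward as a teacher-vs-reference log-likelihood ratio:
r_t \;=\; \log \pi_T(y_t \mid x, y_{<t}) \;-\; \log \pi_{\mathrm{ref}}(y_t \mid x, y_{<t})
% Generalized on-policy distillation (G-OPD) objective:
J_{\text{G-OPD}}(\theta) \;=\;
  \mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi_\theta(\cdot \mid x)}
  \Bigg[ \sum_t \Big( \alpha\, r_t
    \;-\; D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid x, y_{<t}) \,\big\|\, \pi_{\mathrm{ref}}(\cdot \mid x, y_{<t})\big) \Big) \Bigg]
% With \alpha = 1 the \pi_{\mathrm{ref}} terms cancel in expectation and each summand
% reduces to -D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_T): standard OPD, for any reference model.
% \alpha > 1 is reward extrapolation (ExOPD); taking \pi_{\mathrm{ref}} as the teacher's
% pre-RL base model is the reward-correction option for strong-to-weak distillation.
```

Because the student, teacher, and reference next-token distributions are all available at every position of a student-generated sequence, the per-token objective can be computed in expectation over the vocabulary rather than only at sampled tokens. The PyTorch function below is an illustrative sketch under that assumption; the function name and tensor shapes are hypothetical, not from the paper.

```python
import torch
import torch.nn.functional as F

def g_opd_token_loss(student_logits, teacher_logits, ref_logits, alpha=1.0):
    """Illustrative per-token G-OPD-style loss on a student-generated sequence.

    student_logits, teacher_logits, ref_logits: (batch, seq_len, vocab) logits,
    all evaluated at the same positions of the student-sampled trajectory.
    alpha: reward scaling factor; alpha = 1 is assumed to recover standard OPD,
    alpha > 1 is reward extrapolation (ExOPD).
    """
    student_logp = F.log_softmax(student_logits, dim=-1)
    teacher_logp = F.log_softmax(teacher_logits, dim=-1)
    ref_logp = F.log_softmax(ref_logits, dim=-1)
    p_student = student_logp.exp()

    # Expected dense reward under the student's next-token distribution:
    # E_{pi_theta}[log pi_T - log pi_ref] at each position.
    expected_reward = (p_student * (teacher_logp - ref_logp)).sum(dim=-1)

    # Per-token KL(student || reference) over the full vocabulary.
    kl_to_ref = (p_student * (student_logp - ref_logp)).sum(dim=-1)

    # Objective is to maximize alpha * reward - KL; return the negative as a loss.
    # With alpha = 1 this reduces algebraically to KL(student || teacher) per token.
    return (kl_to_ref - alpha * expected_reward).mean()

# Tiny smoke test with random logits (batch=2, seq_len=4, vocab=8).
s = torch.randn(2, 4, 8, requires_grad=True)
t, r = torch.randn(2, 4, 8), torch.randn(2, 4, 8)
g_opd_token_loss(s, t, r, alpha=1.5).backward()
```

Setting alpha = 1 makes the reference-model terms cancel, which matches the abstract's claim that in standard OPD the reference model can be any model; alpha > 1 and a teacher-base-model reference correspond to the ExOPD and reward-correction variants, respectively.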
Related papers
- Reinforcement-aware Knowledge Distillation for LLM Reasoning [63.53679456364683]
Reinforcement learning (RL) post-training has recently driven gains in long chain-of-thought reasoning large language models (LLMs).
Most existing knowledge distillation methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization.
We propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update.
arXiv Detail & Related papers (2026-02-26T00:20:39Z)
- Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning [48.041170200238206]
We introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model.
It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation.
arXiv Detail & Related papers (2026-01-14T02:43:17Z)
- More Than One Teacher: Adaptive Multi-Guidance Policy Optimization for Diverse Exploration [103.1589018460702]
"guidance-on-demand" approach expands exploration while preserving the value of self-discovery.<n>Experiments show AMPO substantially outperforms a strong baseline.<n>Using four peer-sized teachers, our method achieves comparable results to approaches that leverage a single, more powerful teacher.
arXiv Detail & Related papers (2025-10-02T17:14:00Z)
- Preference Distillation via Value based Reinforcement Learning [16.165599808093408]
We propose Teacher Value-based Knowledge Distillation (TVKD), which introduces an auxiliary reward from the value function of the teacher model to provide a soft guide.
TVKD can be integrated into the standard DPO training framework and does not require additional rollouts.
Our experimental results show that TVKD consistently improves performance across various benchmarks and model sizes.
arXiv Detail & Related papers (2025-09-21T07:52:28Z)
- Enhancing Reasoning Capabilities in SLMs with Reward Guided Dataset Distillation [0.0]
We propose AdvDistill, a reward-guided dataset distillation framework.
We utilise multiple generations (responses) from a teacher for each prompt and assign rewards based on rule-based verifiers.
These varying and normally distributed rewards serve as weights when training student models.
arXiv Detail & Related papers (2025-06-25T20:07:47Z)
- Biased Teacher, Balanced Student [0.0]
Long-Tailed Knowledge Distillation (LTKD) is a novel framework tailored for class-imbalanced scenarios.
Experiments on CIFAR-100-LT, TinyImageNet-LT, and ImageNet-LT show that LTKD consistently outperforms existing KD methods.
arXiv Detail & Related papers (2025-06-23T10:46:44Z)
- Distill Not Only Data but Also Rewards: Can Smaller Language Models Surpass Larger Ones? [58.80794196076336]
Distilling large language models (LLMs) typically involves transferring the teacher model's responses through supervised fine-tuning (SFT).
We propose a novel distillation pipeline that transfers both responses and rewards.
Our method generates pseudo-rewards through a self-supervised mechanism that leverages the inherent structure of both teacher and student responses.
arXiv Detail & Related papers (2025-02-26T20:50:11Z)
- Exploring and Enhancing the Transfer of Distribution in Knowledge Distillation for Autoregressive Language Models [62.5501109475725]
Knowledge distillation (KD) is a technique that compresses large teacher models by training smaller student models to mimic them.
This paper introduces Online Knowledge Distillation (OKD), where the teacher network integrates small online modules to concurrently train with the student model.
OKD achieves or exceeds the performance of leading methods in various model architectures and sizes, reducing training time by up to fourfold.
arXiv Detail & Related papers (2024-09-19T07:05:26Z)
- Better Teacher Better Student: Dynamic Prior Knowledge for Knowledge Distillation [70.92135839545314]
We propose the dynamic prior knowledge (DPK), which integrates part of teacher's features as the prior knowledge before the feature distillation.
Our DPK makes the performance of the student model positively correlated with that of the teacher model, which means that we can further boost the accuracy of students by applying larger teachers.
arXiv Detail & Related papers (2022-06-13T11:52:13Z)