Reinforcement-aware Knowledge Distillation for LLM Reasoning
- URL: http://arxiv.org/abs/2602.22495v1
- Date: Thu, 26 Feb 2026 00:20:39 GMT
- Title: Reinforcement-aware Knowledge Distillation for LLM Reasoning
- Authors: Zhaoyang Zhang, Shuli Jiang, Yantao Shen, Yuting Zhang, Dhananjay Ram, Shuo Yang, Zhuowen Tu, Wei Xia, Stefano Soatto
- Abstract summary: Reinforcement learning (RL) post-training has recently driven gains in long chain-of-thought reasoning large language models (LLMs). Most existing knowledge distillation methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. We propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update.
- Score: 63.53679456364683
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Reinforcement learning (RL) post-training has recently driven major gains in long chain-of-thought reasoning large language models (LLMs), but the high inference cost of such models motivates distillation into smaller students. Most existing knowledge distillation (KD) methods are designed for supervised fine-tuning (SFT), relying on fixed teacher traces or teacher-student Kullback-Leibler (KL) divergence-based regularization. When combined with RL, these approaches often suffer from distribution mismatch and objective interference: teacher supervision may not align with the student's evolving rollout distribution, and the KL regularizer can compete with reward maximization and require careful loss balancing. To address these issues, we propose RL-aware distillation (RLAD), which performs selective imitation during RL -- guiding the student toward the teacher only when it improves the current policy update. Our core component, Trust Region Ratio Distillation (TRRD), replaces the teacher-student KL regularizer with a PPO/GRPO-style likelihood-ratio objective anchored to a teacher--old-policy mixture, yielding advantage-aware, trust-region-bounded distillation on student rollouts and naturally balancing exploration, exploitation, and imitation. Across diverse logic reasoning and math benchmarks, RLAD consistently outperforms offline distillation, standard GRPO, and KL-based on-policy teacher-student knowledge distillation.
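The TRRD objective described in the abstract can be sketched as a per-token surrogate. The form below is a paraphrase of the abstract only: the mixture coefficient `alpha`, the clipping threshold `clip_eps`, and the exact way the teacher--old-policy mixture anchors the ratio are all assumptions, not the paper's actual implementation.

```python
import math

def trrd_token_loss(logp_new, logp_old, logp_teacher, advantage,
                    alpha=0.5, clip_eps=0.2):
    """Hypothetical per-token TRRD surrogate (paraphrased from the abstract).

    The PPO/GRPO likelihood ratio is anchored to a mixture of the teacher
    and the old policy instead of the old policy alone, so the teacher only
    pulls the update when doing so raises the advantage-weighted objective.
    """
    # Mixture anchor: alpha blends teacher and old-policy probabilities
    # (alpha is a hypothetical knob, not a parameter named in the paper).
    anchor = alpha * math.exp(logp_teacher) + (1.0 - alpha) * math.exp(logp_old)
    ratio = math.exp(logp_new) / anchor
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    # PPO-style pessimistic surrogate; negated so it is a loss to minimize.
    return -min(ratio * advantage, clipped * advantage)
```

With identical policies the anchor equals the current policy, the ratio is 1, and the loss reduces to the negated advantage; when the new policy drifts far from the mixture anchor, the clip bounds the update, which is the trust-region behavior the abstract claims.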
Related papers
- Learning beyond Teacher: Generalized On-Policy Distillation with Reward Extrapolation [57.524909883706556]
On-policy distillation (OPD) has demonstrated strong empirical gains in improving student performance. This work introduces a flexible reference model and a reward scaling factor that controls the relative weight of the reward term against the KL regularization. In particular, in the setting where we merge the knowledge from different domain experts, obtained by applying domain-specific RL to the same student model, ExOPD enables the student to even surpass the teacher's performance boundary.
arXiv Detail & Related papers (2026-02-12T16:14:29Z)
- REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency [0.0]
We introduce REDistill, a principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power-divergence loss, a generalization of KL divergence. Experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy across diverse teacher-student architectures.
arXiv Detail & Related papers (2026-02-04T15:50:53Z)
- Distribution-Aligned Sequence Distillation for Superior Long-CoT Reasoning [48.041170200238206]
We introduce DASD-4B-Thinking, a lightweight yet highly capable, fully open-source reasoning model. It achieves SOTA performance among open-source models of comparable scale across challenging benchmarks in mathematics, scientific reasoning, and code generation.
arXiv Detail & Related papers (2026-01-14T02:43:17Z)
- Reinforcement Learning Teachers of Test Time Scaling [21.551446057221185]
A key use case of reasoning LMs is to act as teachers for distilling new students and cold-starting future RL iterations. We introduce a new framework that avoids RL's exploration challenge by training a new class of Reinforcement-Learned Teachers (RLTs). RLTs are prompted with both the question and the solution to each problem, and tasked to simply "connect the dots" with detailed explanations tailored for their students.
arXiv Detail & Related papers (2025-06-10T02:53:24Z)
- KDRL: Post-Training Reasoning LLMs via Unified Knowledge Distillation and Reinforcement Learning [72.53466291156604]
We present KDRL, a unified post-training framework that jointly optimizes a reasoning model through teacher supervision (KD) and self-exploration (RL). We first formulate a unified objective that integrates GRPO and KD, and systematically explore how different KL approximations, KL coefficients, and reward-guided KD strategies affect the overall post-training dynamics and performance.
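A unified objective of the kind the KDRL summary describes can be sketched per token. The specific combination below (a GRPO clipped surrogate plus a sampled reverse-KL distillation term, weighted by `kd_coeff`) is a hedged guess at the general form; the coefficient value and KL estimator are assumptions, not details from the paper.

```python
import math

def kdrl_token_loss(logp_student, logp_old, logp_teacher, advantage,
                    kd_coeff=0.1, clip_eps=0.2):
    """Hedged sketch of a unified KD + RL per-token loss (form assumed).

    A GRPO-style clipped policy-gradient surrogate is combined with a
    single-sample reverse-KL distillation estimate on the student's own
    rollout token; kd_coeff trades off exploration against imitation.
    """
    ratio = math.exp(logp_student - logp_old)
    clipped = max(min(ratio, 1.0 + clip_eps), 1.0 - clip_eps)
    pg_loss = -min(ratio * advantage, clipped * advantage)
    # log q(token) - log p_teacher(token): a sampled reverse-KL estimate.
    kd_loss = logp_student - logp_teacher
    return pg_loss + kd_coeff * kd_loss
```

When the student, old policy, and teacher agree on a token, the KD term vanishes and the loss reduces to the plain clipped surrogate, which illustrates how the two objectives can coexist without the KD term dominating.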
arXiv Detail & Related papers (2025-06-02T19:46:41Z)
- ToDi: Token-wise Distillation via Fine-Grained Divergence Control [9.958797874295355]
Token-wise Distillation (ToDi) is a novel method that adaptively combines forward KL and reverse KL per token using a sigmoid-based weighting function. ToDi consistently outperforms recent distillation baselines that use uniform or less granular strategies.
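The sigmoid-gated combination of forward and reverse KL that the ToDi summary mentions can be sketched as follows. Gating on the teacher/student log-probability ratio is an assumption about the weighting function's input; the paper's exact formulation may differ.

```python
import math

def todi_style_loss(p_teacher, p_student, eps=1e-8):
    """Hedged sketch of a ToDi-style divergence mix (details assumed).

    A sigmoid of the teacher/student log-probability ratio gates between
    a forward-KL-like term and a reverse-KL-like term for each entry:
    where the teacher is more confident than the student, the forward
    term dominates; where the student overshoots, the reverse term does.
    """
    loss = 0.0
    for p, q in zip(p_teacher, p_student):
        log_ratio = math.log((p + eps) / (q + eps))
        w = 1.0 / (1.0 + math.exp(-log_ratio))      # sigmoid gate in [0, 1]
        fkl = p * log_ratio                          # forward-KL contribution
        rkl = q * -log_ratio                         # reverse-KL contribution
        loss += w * fkl + (1.0 - w) * rkl
    return loss
```

On identical distributions the gate sits at 0.5 and both terms vanish, so the loss is zero; any disagreement makes it positive, matching the intuition of an adaptive divergence.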
arXiv Detail & Related papers (2025-05-22T06:51:16Z)
- Distilling the Implicit Multi-Branch Structure in LLMs' Reasoning via Reinforcement Learning [63.888013006686364]
Distilling reasoning paths from teacher to student models via supervised fine-tuning (SFT) provides a shortcut for improving the reasoning ability of Large Language Models (LLMs). We propose RLKD, a reinforcement learning-based distillation framework guided by a novel Generative Structure Reward Model (GSRM). Our GSRM converts reasoning paths into multiple meta-reasoning-solving steps and computes rewards to measure structural alignment between student and teacher reasoning.
arXiv Detail & Related papers (2025-05-22T02:36:36Z)
- Multi-Granularity Semantic Revision for Large Language Model Distillation [66.03746866578274]
We propose a multi-granularity semantic revision method for LLM distillation.
At the sequence level, we propose a sequence correction and re-generation strategy.
At the token level, we design a distribution adaptive clipping Kullback-Leibler loss as the distillation objective function.
At the span level, we leverage the span priors of a sequence to compute the probability correlations within spans, and constrain the teacher and student's probability correlations to be consistent.
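The token-level component above, a "distribution adaptive clipping Kullback-Leibler loss", might look like the sketch below, where each log-ratio contribution is bounded before being summed. The clip threshold `clip_c` and the idea of clipping the log-ratio directly are assumptions for illustration; the paper's adaptive scheme is not specified in this summary.

```python
import math

def clipped_kl(p_teacher, p_student, clip_c=5.0, eps=1e-8):
    """Sketch of a clipped token-level KL objective (exact form assumed).

    Each per-vocabulary log-ratio is clipped so that entries where teacher
    and student disagree wildly cannot dominate the distillation gradient.
    """
    total = 0.0
    for p, q in zip(p_teacher, p_student):
        log_ratio = math.log((p + eps) / (q + eps))
        # Bound the log-ratio to stabilize the loss on outlier tokens.
        log_ratio = max(min(log_ratio, clip_c), -clip_c)
        total += p * log_ratio
    return total
```

Without the clip this reduces to the ordinary forward KL; the bound only activates on extreme mismatches, which is one plausible reading of "distribution adaptive clipping".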
arXiv Detail & Related papers (2024-07-14T03:51:49Z)
This list is automatically generated from the titles and abstracts of the papers on this site. The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.