Related papers: Stable On-Policy Distillation through Adaptive Target Reformulation

URL: http://arxiv.org/abs/2601.07155v1
Date: Mon, 12 Jan 2026 02:57:39 GMT
Title: Stable On-Policy Distillation through Adaptive Target Reformulation
Authors: Ijun Jang, Jewon Yeom, Juan Yeo, Hyunggu Lim, Taesup Kim
Abstract summary: Veto is an objective-level reformulation that constructs a geometric bridge in the logit space. Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.
Score: 7.361248172930405
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Knowledge distillation (KD) is a widely adopted technique for transferring knowledge from large language models to smaller student models; however, conventional supervised KD often suffers from a distribution mismatch between training and inference. While on-policy KD approaches attempt to mitigate this issue by learning directly from student-generated outputs, they frequently encounter training instabilities because the distributional gap between the novice student and the expert teacher is often too wide to bridge directly. These challenges manifest as pathological gradients in forward KL objectives or diversity collapse in reverse KL regimes. To address these limitations, we propose Veto, an objective-level reformulation that constructs a geometric bridge in the logit space. Unlike prior methods that mix data samples, Veto creates an intermediate target distribution that promotes alignment between the teacher and the student. By introducing a tunable parameter beta, Veto serves as an Adaptive Gradient Veto that stabilizes optimization by suppressing harmful gradients on low-confidence tokens, while simultaneously acting as a Decisiveness Knob to balance reward-driven performance with output diversity. Extensive experiments across various reasoning and generation tasks demonstrate that Veto consistently outperforms supervised fine-tuning and existing on-policy baselines.

Related papers

REDistill: Robust Estimator Distillation for Balancing Robustness and Efficiency [0.0]
We introduce REDistill, a principled framework grounded in robust statistics. REDistill replaces the standard KD objective with a power divergence loss, a generalization of KL divergence. Experiments on CIFAR-100 and ImageNet-1k demonstrate that REDistill consistently improves student accuracy in diverse teacher-student architectures.
arXiv Detail & Related papers (2026-02-04T15:50:53Z)
"The Whole Is Greater Than the Sum of Its Parts": A Compatibility-Aware Multi-Teacher CoT Distillation Framework [16.96094045628127]
Chain-of-Thought (CoT) reasoning empowers Large Language Models (LLMs) with remarkable capabilities but typically requires prohibitive parameter scales. CoT distillation has emerged as a promising paradigm to transfer reasoning prowess into compact Student Models (SLMs). We introduce COMPACT, a framework that adaptively fuses supervisions from different teachers by dynamically weighting teacher gradients.
arXiv Detail & Related papers (2026-01-20T14:05:19Z)
Dual-level Modality Debiasing Learning for Unsupervised Visible-Infrared Person Re-Identification [59.59359638389348]
We propose a Dual-level Modality Debiasing Learning framework that implements debiasing at both the model and optimization levels. Experiments on benchmark datasets demonstrate that DMDL could enable modality-invariant feature learning and a more generalized model.
arXiv Detail & Related papers (2025-12-03T12:43:16Z)
Orthogonal Projection Subspace to Aggregate Online Prior-knowledge for Continual Test-time Adaptation [67.80294336559574]
Continual Test Time Adaptation (CTTA) is a task that requires a source pre-trained model to continually adapt to new scenarios. We propose a novel pipeline, Orthogonal Projection Subspace to aggregate online Prior-knowledge, dubbed OoPk.
arXiv Detail & Related papers (2025-06-23T18:17:39Z)
ToDi: Token-wise Distillation via Fine-Grained Divergence Control [9.958797874295355]
Token-wise Distillation (ToDi) is a novel method that adaptively combines Forward KL and Reverse KL per token using a sigmoid-based weighting function. ToDi consistently outperforms recent distillation baselines using uniform or less granular strategies.
arXiv Detail & Related papers (2025-05-22T06:51:16Z)
LoRanPAC: Low-rank Random Features and Pre-trained Models for Bridging Theory and Practice in Continual Learning [103.45785408116146]
Continual learning (CL) aims to train a model that can solve multiple tasks presented sequentially. Recent CL approaches have achieved strong performance by leveraging large pre-trained models that generalize well to downstream tasks. However, such methods lack theoretical guarantees, making them prone to unexpected failures. We aim to bridge this gap by designing a simple CL method that is theoretically sound and highly performant.
arXiv Detail & Related papers (2024-10-01T12:58:37Z)
Fairness-Aware Meta-Learning via Nash Bargaining [63.44846095241147]
We introduce a two-stage meta-learning framework to address issues of group-level fairness in machine learning. The first stage involves the use of a Nash Bargaining Solution (NBS) to resolve hypergradient conflicts and steer the model. We show empirical effects across various fairness objectives in six key fairness datasets and two image classification tasks.
arXiv Detail & Related papers (2024-06-11T07:34:15Z)
Visual Prompt Tuning in Null Space for Continual Learning [51.96411454304625]
Existing prompt-tuning methods have demonstrated impressive performance in continual learning (CL). This paper aims to learn each task by tuning the prompts in the direction orthogonal to the subspace spanned by previous tasks' features. In practice, an effective null-space-based approximation solution has been proposed to implement the prompt gradient projection.
arXiv Detail & Related papers (2024-06-09T05:57:40Z)
TransFusion: Covariate-Shift Robust Transfer Learning for High-Dimensional Regression [11.040033344386366]
We propose a two-step method with a novel fused regularizer to improve the learning performance on a target task with limited samples. A nonasymptotic bound is provided for the estimation error of the target model. We extend the method to a distributed setting, allowing for a pretraining-finetuning strategy.
arXiv Detail & Related papers (2024-04-01T14:58:16Z)
Selective Learning: Towards Robust Calibration with Dynamic Regularization [79.92633587914659]
Miscalibration in deep learning refers to a discrepancy between predicted confidence and actual performance. We introduce Dynamic Regularization (DReg), which aims to learn what should be learned during training, thereby circumventing the confidence-adjusting trade-off.
arXiv Detail & Related papers (2024-02-13T11:25:20Z)
Model-Aware Contrastive Learning: Towards Escaping the Dilemmas [11.27589489269041]
Contrastive learning (CL) continuously achieves significant breakthroughs across multiple domains. InfoNCE-based methods suffer from some dilemmas, such as the uniformity-tolerance dilemma (UTD) and gradient reduction. We present a Model-Aware Contrastive Learning (MACL) strategy, whose temperature is adaptive to the magnitude of alignment that reflects the basic confidence of the instance discrimination task.
arXiv Detail & Related papers (2022-07-16T08:21:55Z)
Alleviating Robust Overfitting of Adversarial Training With Consistency Regularization [9.686724616328874]
Adversarial training (AT) has proven to be one of the most effective ways to defend Deep Neural Networks (DNNs) against adversarial attacks. However, robust overfitting, in which robustness drops sharply at a certain stage, always exists during AT. Consistency regularization, a popular technique in semi-supervised learning, has a similar goal to AT and can be used to alleviate robust overfitting.
arXiv Detail & Related papers (2022-05-24T03:18:43Z)
Distributionally Robust Models with Parametric Likelihood Ratios [123.05074253513935]
Three simple ideas allow us to train models with DRO using a broader class of parametric likelihood ratios. We find that models trained with the resulting parametric adversaries are consistently more robust to subpopulation shifts when compared to other DRO approaches.
arXiv Detail & Related papers (2022-04-13T12:43:12Z)

This list is automatically generated from the titles and abstracts of the papers on this site.