Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity
- URL: http://arxiv.org/abs/2510.22480v1
- Date: Sun, 26 Oct 2025 01:41:08 GMT
- Title: Single-Teacher View Augmentation: Boosting Knowledge Distillation via Angular Diversity
- Authors: Seonghoon Yu, Dongjun Nam, Dina Katabi, Jeany Son
- Abstract summary: Knowledge Distillation (KD) aims to train a lightweight student model by transferring knowledge from a large, high-capacity teacher. Recent studies have shown that leveraging diverse teacher perspectives can significantly improve distillation performance. We propose a novel cost-efficient knowledge augmentation method for KD that generates diverse multi-views by attaching multiple branches to a single teacher.
- Score: 20.479130509494272
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge Distillation (KD) aims to train a lightweight student model by transferring knowledge from a large, high-capacity teacher. Recent studies have shown that leveraging diverse teacher perspectives can significantly improve distillation performance; however, achieving such diversity typically requires multiple teacher networks, leading to high computational costs. In this work, we propose a novel cost-efficient knowledge augmentation method for KD that generates diverse multi-views by attaching multiple branches to a single teacher. To ensure meaningful semantic variation across multi-views, we introduce two angular diversity objectives: 1) constrained inter-angle diversify loss, which maximizes angles between augmented views while preserving proximity to the original teacher output, and 2) intra-angle diversify loss, which encourages an even distribution of views around the original output. The ensembled knowledge from these angularly diverse views, along with the original teacher, is distilled into the student. We further theoretically demonstrate that our objectives increase the diversity among ensemble members and thereby reduce the upper bound of the ensemble's expected loss, leading to more effective distillation. Experimental results show that our method surpasses an existing knowledge augmentation method across diverse configurations. Moreover, the proposed method is compatible with other KD frameworks in a plug-and-play fashion, providing consistent improvements in generalization performance.
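The two objectives are concrete enough to sketch. The following is a minimal PyTorch illustration, not the paper's implementation: it assumes each branch emits logits, that an augmented view's "angle" is the direction of its offset from the original teacher output, and that the proximity constraint is a hinge on the offset norm; all names (angular_diversity_losses, margin, etc.) are hypothetical.

```python
import torch
import torch.nn.functional as F

def angular_diversity_losses(view_logits: torch.Tensor,
                             teacher_logits: torch.Tensor,
                             margin: float = 1.0):
    """Illustrative sketch of the two angular objectives (names are mine).

    view_logits:    (M, B, C) logits from M >= 2 augmentation branches
    teacher_logits: (B, C)    logits from the original teacher head
    """
    # Offset of each view from the original teacher output; its direction
    # defines that view's "angle" relative to the teacher.
    offsets = view_logits - teacher_logits.unsqueeze(0)          # (M, B, C)
    dirs = F.normalize(offsets, dim=-1)                          # unit directions

    # Inter-angle term: maximizing pairwise angles between views is the
    # same as minimizing pairwise cosine similarity of their directions.
    cos = torch.einsum('mbc,nbc->mnb', dirs, dirs)               # (M, M, B)
    m = dirs.shape[0]
    off_diag = ~torch.eye(m, dtype=torch.bool, device=cos.device)
    inter = cos[off_diag].mean()

    # Proximity constraint: a hinge on the offset norm keeps each view
    # close to the teacher output (the "constrained" part of the loss).
    proximity = (offsets.norm(dim=-1) - margin).clamp(min=0).mean()

    # Intra-angle term: if views are spread evenly around the teacher
    # output, their mean direction vanishes, so we penalize its norm.
    intra = dirs.mean(dim=0).norm(dim=-1).mean()

    return inter + proximity, intra
```

In this reading, the inter-angle term drives pairwise cosines between view directions down (angles up) while the hinge keeps views near the teacher, and the intra-angle term vanishes exactly when the view directions cancel out, i.e. are evenly distributed around the teacher output. How the angularly diverse views are then ensembled and weighted against the original teacher is left open by the abstract.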
Related papers
- Distilling Invariant Representations with Dual Augmentation [6.24302896438145]
We introduce a dual augmentation strategy to promote invariant feature learning in both teacher and student models. Our approach leverages different augmentations applied to both models during distillation, pushing the student to capture robust, transferable features.
arXiv Detail & Related papers (2024-10-12T10:27:23Z) - Hybrid Distillation: Connecting Masked Autoencoders with Contrastive Learners [102.20090188997301]
We explore how to obtain a model that combines Contrastive Learning (CL) and Masked Image Modeling (MIM) strengths.
In order to better obtain both discrimination and diversity, we propose a simple but effective Hybrid Distillation strategy.
Experimental results show that Hybrid Distill achieves superior performance on different benchmarks.
arXiv Detail & Related papers (2023-06-28T02:19:35Z) - Channel Self-Supervision for Online Knowledge Distillation [14.033675223173933]
We propose a novel online knowledge distillation method, Channel Self-Supervision for Online Knowledge Distillation (CSS).
We construct a dual-network multi-branch structure and enhance inter-branch diversity through self-supervised learning.
Our method provides greater diversity than OKDDip and also yields notable performance improvements, even over state-of-the-art methods such as PCL.
arXiv Detail & Related papers (2022-03-22T12:35:20Z) - Exploring Inter-Channel Correlation for Diversity-preserved Knowledge Distillation [91.56643684860062]
Inter-Channel Correlation for Knowledge Distillation (ICKD) is developed.
ICKD captures the intrinsic distribution of the feature space and the sufficient diversity properties of features in the teacher network.
ICKD is the first knowledge-distillation-based method to boost ResNet18 beyond 72% Top-1 accuracy on ImageNet classification.
arXiv Detail & Related papers (2022-02-08T07:01:56Z) - Fixing the Teacher-Student Knowledge Discrepancy in Distillation [72.4354883997316]
We propose a novel student-dependent distillation method, knowledge consistent distillation, which makes the teacher's knowledge more consistent with the student.
Our method is flexible and can be easily combined with other state-of-the-art approaches.
arXiv Detail & Related papers (2021-03-31T06:52:20Z) - Adaptive Multi-Teacher Multi-level Knowledge Distillation [11.722728148523366]
We propose a novel adaptive multi-teacher multi-level knowledge distillation learning framework (AMTML-KD).
It consists of two novel insights, including (i) associating each teacher with a latent representation to adaptively learn instance-level teacher importance weights.
As such, a student model can learn multi-level knowledge from multiple teachers through AMTML-KD.
arXiv Detail & Related papers (2021-03-06T08:18:16Z) - Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
arXiv Detail & Related papers (2020-12-15T23:43:28Z) - Differentiable Feature Aggregation Search for Knowledge Distillation [47.94874193183427]
We introduce feature aggregation to imitate multi-teacher distillation within a single-teacher distillation framework.
DFA is a two-stage Differentiable Feature Aggregation search method motivated by DARTS in neural architecture search.
Experimental results show that DFA outperforms existing methods on CIFAR-100 and CINIC-10 datasets.
arXiv Detail & Related papers (2020-08-02T15:42:29Z) - Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly deemed an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or an ensemble of models (teacher); the standard form of this objective is sketched after this list.
In this study, we provide an extensive study of nine different KD methods covering a broad spectrum of approaches to capturing and transferring knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
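Several entries above build on the classic soft-target objective of Hinton et al.; as referenced in the "Knowledge Distillation Beyond Model Compression" entry, here is a minimal sketch of that base loss. The temperature T and mixing weight alpha are conventional defaults, not values taken from any listed paper.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Classic distillation loss: soft-target KL plus hard-label cross-entropy."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction='batchmean',
    ) * (T * T)  # rescale so soft-target gradients match the CE term's scale
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```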