Enriching Knowledge Distillation with Cross-Modal Teacher Fusion
- URL: http://arxiv.org/abs/2511.09286v1
- Date: Thu, 13 Nov 2025 01:44:29 GMT
- Title: Enriching Knowledge Distillation with Cross-Modal Teacher Fusion
- Authors: Amir M. Mansourian, Amir Mohammad Babaei, Shohreh Kasaei
- Abstract summary: Multi-teacher knowledge distillation (KD) transfers knowledge from expert teachers to a compact student model using logit or feature matching. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. Analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones.
- Score: 4.704107417683679
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Multi-teacher knowledge distillation (KD), a more effective technique than traditional single-teacher methods, transfers knowledge from expert teachers to a compact student model using logit or feature matching. However, most existing approaches lack knowledge diversity, as they rely solely on unimodal visual information, overlooking the potential of cross-modal representations. In this work, we explore the use of CLIP's vision-language knowledge as a complementary source of supervision for KD, an area that remains largely underexplored. We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP. By incorporating CLIP's multi-prompt textual guidance, the fused supervision captures both dataset-specific and semantically enriched visual cues. Beyond accuracy, analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones. Moreover, fusion with CLIP refines the entire logit distribution, producing semantically meaningful probabilities for non-target classes, thereby improving inter-class consistency and distillation quality. Despite its simplicity, the proposed method, Enriching Knowledge Distillation (RichKD), consistently outperforms most existing baselines across multiple benchmarks and exhibits stronger robustness under distribution shifts and input corruptions.
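The logit-fusion part of the idea can be pictured with a short sketch. The following PyTorch snippet is only an illustration under stated assumptions (precomputed CLIP image embeddings and multi-prompt text embeddings; a blending weight `alpha` and temperature `T` chosen as placeholders), not the authors' released code.

```python
# Hypothetical sketch of fusing a conventional teacher's logits with CLIP zero-shot logits.
import torch
import torch.nn.functional as F

def clip_zero_shot_logits(clip_img, clip_txt, scale=100.0):
    """clip_img: [B, D] image embeddings; clip_txt: [P, C, D] text embeddings for P prompts per class."""
    txt = F.normalize(clip_txt.mean(dim=0), dim=-1)   # prompt-averaged class embeddings [C, D]
    img = F.normalize(clip_img, dim=-1)               # [B, D]
    return scale * img @ txt.t()                      # CLIP logits [B, C]

def fused_teacher_logits(teacher_logits, clip_img, clip_txt, alpha=0.5):
    """Convex combination of the dataset-specific teacher and CLIP zero-shot logits."""
    return alpha * teacher_logits + (1.0 - alpha) * clip_zero_shot_logits(clip_img, clip_txt)

def kd_loss(student_logits, fused_logits, T=4.0):
    """Standard temperature-scaled KL distillation against the fused teacher."""
    p_t = F.softmax(fused_logits / T, dim=-1)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="batchmean") * T * T
```

Averaging the text embeddings across prompts before scoring is one simple way to realise the multi-prompt textual guidance mentioned in the abstract; the feature-level fusion is omitted from this sketch.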
Related papers
- WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging [1.9316515057518757]
We present a Weakly-supervised Chain-based KD network that redefines knowledge transfer through a structured sequence of interconnected models. Each model in the chain is trained on only a fraction of the dataset and shows that effective learning can be achieved with minimal supervision. The proposed distillation chain resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data.
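A minimal sketch of the chained idea (hypothetical, not the paper's code): each model in the chain sees only its own data shard and is supervised jointly by the available labels and by the previous model's soft predictions. The loss weights, temperature, and optimizer settings below are assumptions.

```python
# Hypothetical distillation chain: model k learns from shard k and from model k-1's soft outputs.
import torch
import torch.nn.functional as F

def train_chain(models, shards, epochs=10, T=4.0, kd_weight=0.5, lr=1e-3):
    prev = None
    for model, loader in zip(models, shards):            # one model per data shard
        opt = torch.optim.Adam(model.parameters(), lr=lr)
        for _ in range(epochs):
            for x, y in loader:
                logits = model(x)
                loss = F.cross_entropy(logits, y)         # (weak) label supervision
                if prev is not None:                      # distill from the previous link
                    with torch.no_grad():
                        soft = F.softmax(prev(x) / T, dim=-1)
                    loss = loss + kd_weight * F.kl_div(
                        F.log_softmax(logits / T, dim=-1), soft,
                        reduction="batchmean") * T * T
                opt.zero_grad()
                loss.backward()
                opt.step()
        prev = model.eval()
    return models
```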
arXiv Detail & Related papers (2025-10-16T13:22:51Z)
- MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation [8.68486556125022]
MST-Distill is a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network.
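A rough sketch of instance-level routing over a mixture of teachers (an assumed form, not the released implementation): a small gating network produces per-sample weights over the teachers, and the student is distilled against the weighted mixture of their logits.

```python
# Hypothetical instance-level routing over several teachers.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TeacherRouter(nn.Module):
    def __init__(self, feat_dim, num_teachers):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_teachers)     # per-sample mixing weights

    def forward(self, student_feat, teacher_logits):
        # student_feat: [B, feat_dim]; teacher_logits: [num_teachers, B, C]
        w = F.softmax(self.gate(student_feat), dim=-1)    # [B, num_teachers]
        return torch.einsum("bt,tbc->bc", w, teacher_logits)  # fused logits per instance

def routed_kd_loss(student_logits, mixed_teacher_logits, T=4.0):
    p_t = F.softmax(mixed_teacher_logits / T, dim=-1)
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1), p_t,
                    reduction="batchmean") * T * T
```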
arXiv Detail & Related papers (2025-07-09T16:45:28Z)
- Group Relative Knowledge Distillation: Learning from Teacher's Relational Inductive Bias [5.434571018755813]
Group Relative Knowledge Distillation (GRKD) is a novel framework that distills teacher knowledge by learning the relative ranking among classes. Experiments on classification benchmarks demonstrate that GRKD achieves superior generalization compared to existing methods.
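One way to picture "learning the relative ranking among classes" is a pairwise margin loss that asks the student to preserve the teacher's class ordering; the sketch below follows that reading and is not the paper's exact objective.

```python
# Hypothetical pairwise ranking distillation: preserve the teacher's class ordering in the student.
import torch
import torch.nn.functional as F

def rank_distill_loss(student_logits, teacher_logits, margin=0.1):
    # Order classes by the teacher's logits, per sample.
    order = teacher_logits.argsort(dim=-1, descending=True)   # [B, C]
    s_sorted = torch.gather(student_logits, -1, order)        # student logits in teacher order
    # Adjacent pairs: the higher-ranked class should keep a margin over the next one.
    higher, lower = s_sorted[:, :-1], s_sorted[:, 1:]
    return F.relu(margin - (higher - lower)).mean()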
arXiv Detail & Related papers (2025-04-29T07:23:22Z)
- Cross-View Consistency Regularisation for Knowledge Distillation [13.918476599394603]
This work is inspired by the success of cross-view learning in fields such as semi-supervised learning. We introduce within-view and cross-view regularisations to standard logit-based distillation frameworks. We also perform confidence-based soft label mining to improve the quality of distilling signals from the teacher.
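A compact sketch of within-view and cross-view logit regularisation with confidence-based mining (an assumed reading of the summary; weights and threshold are placeholders): each image is seen under two augmented views, the student matches the teacher both on the same view and across views, and low-confidence teacher predictions are masked out.

```python
# Hypothetical within-view + cross-view distillation with confidence masking.
import torch
import torch.nn.functional as F

def kl(s, t, T=4.0):
    # Per-sample temperature-scaled KL divergence.
    return F.kl_div(F.log_softmax(s / T, -1), F.softmax(t / T, -1),
                    reduction="none").sum(-1) * T * T

def cross_view_kd(s_v1, s_v2, t_v1, t_v2, conf_thresh=0.5, T=4.0):
    # Mine confident teacher predictions (max softmax above a threshold).
    conf = torch.maximum(F.softmax(t_v1, -1).amax(-1), F.softmax(t_v2, -1).amax(-1))
    mask = (conf > conf_thresh).float()
    within = kl(s_v1, t_v1, T) + kl(s_v2, t_v2, T)   # same-view matching
    cross = kl(s_v1, t_v2, T) + kl(s_v2, t_v1, T)    # cross-view matching
    return ((within + cross) * mask).sum() / mask.sum().clamp(min=1.0)
```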
arXiv Detail & Related papers (2024-12-21T05:41:47Z)
- Knowledge Distillation Meets Open-Set Semi-Supervised Learning [69.21139647218456]
We propose a novel method dedicated to distilling representational knowledge semantically from a pretrained teacher to a target student.
At the problem level, this establishes an interesting connection between knowledge distillation and open-set semi-supervised learning (SSL).
Our method significantly outperforms previous state-of-the-art knowledge distillation methods on both coarse object classification and fine face recognition tasks.
arXiv Detail & Related papers (2022-05-13T15:15:27Z)
- Nested Collaborative Learning for Long-Tailed Visual Recognition [71.6074806468641]
NCL consists of two core components, namely Nested Individual Learning (NIL) and Nested Balanced Online Distillation (NBOD).
To learn representations more thoroughly, both NIL and NBOD are formulated in a nested way, in which learning is conducted not just on all categories from a full perspective but also on some hard categories from a partial perspective.
In NCL, the learning from the two perspectives is nested, highly related, and complementary, helping the network capture not only global and robust features but also a meticulous distinguishing ability.
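The "full versus partial perspective" idea can be sketched as distilling once over all classes and once over a per-sample subset of hard categories (here simply the teacher's top-k classes); this is an assumed simplification, not the NCL implementation.

```python
# Hypothetical full-perspective + hard-category (partial-perspective) distillation.
import torch
import torch.nn.functional as F

def nested_kd_loss(student_logits, teacher_logits, k=30, T=2.0):
    def kd(s, t):
        return F.kl_div(F.log_softmax(s / T, -1), F.softmax(t / T, -1),
                        reduction="batchmean") * T * T
    full = kd(student_logits, teacher_logits)                  # all categories
    hard_idx = teacher_logits.topk(k, dim=-1).indices          # per-sample hard categories
    partial = kd(torch.gather(student_logits, -1, hard_idx),   # re-normalised over the subset
                 torch.gather(teacher_logits, -1, hard_idx))
    return full + partial
```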
arXiv Detail & Related papers (2022-03-29T08:55:39Z)
- Exploring Inter-Channel Correlation for Diversity-preserved Knowledge Distillation [91.56643684860062]
Inter-Channel Correlation for Knowledge Distillation (ICKD) is developed.
ICKD captures the intrinsic distribution of the feature space and sufficient diversity properties of features in the teacher network.
Ours is the first knowledge-distillation-based method to boost ResNet18 beyond 72% Top-1 accuracy on ImageNet classification.
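Inter-channel correlation can be sketched as matching the channel-by-channel Gram matrices of teacher and student feature maps. This is a minimal reading: a projection would be needed when channel counts differ, which is omitted here.

```python
# Hypothetical inter-channel correlation matching between feature maps of equal channel count.
import torch
import torch.nn.functional as F

def channel_correlation(feat):
    # feat: [B, C, H, W] -> per-sample C x C correlation (Gram) matrix over spatial positions.
    b, c, h, w = feat.shape
    f = F.normalize(feat.reshape(b, c, h * w), dim=-1)   # scale-invariant channel descriptors
    return f @ f.transpose(1, 2)                          # [B, C, C]

def ickd_style_loss(student_feat, teacher_feat):
    return F.mse_loss(channel_correlation(student_feat),
                      channel_correlation(teacher_feat))
```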
arXiv Detail & Related papers (2022-02-08T07:01:56Z)
- Wasserstein Contrastive Representation Distillation [114.24609306495456]
We propose Wasserstein Contrastive Representation Distillation (WCoRD), which leverages both primal and dual forms of Wasserstein distance for knowledge distillation.
The dual form is used for global knowledge transfer, yielding a contrastive learning objective that maximizes the lower bound of mutual information between the teacher and the student networks.
Experiments demonstrate that the proposed WCoRD method outperforms state-of-the-art approaches on privileged information distillation, model compression and cross-modal transfer.
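The dual (contrastive) part can be sketched as an InfoNCE-style objective between student and teacher embeddings, which lower-bounds their mutual information; the primal (optimal-transport) part is omitted, and the shared embedding space and temperature below are assumptions.

```python
# Hypothetical InfoNCE-style contrastive distillation between teacher and student embeddings.
import torch
import torch.nn.functional as F

def contrastive_kd_loss(student_emb, teacher_emb, temperature=0.1):
    # student_emb, teacher_emb: [B, D] (e.g. after a learned projection to a common space).
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    logits = s @ t.t() / temperature              # [B, B]: positives on the diagonal
    targets = torch.arange(s.size(0), device=s.device)
    # Maximising the diagonal terms tightens a lower bound on I(student; teacher).
    return F.cross_entropy(logits, targets)
```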
arXiv Detail & Related papers (2020-12-15T23:43:28Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
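A minimal reading of "exploiting the similarity between self-supervision signals": compute pairwise similarity matrices over augmented-view representations for teacher and student, and make the student's matrix mimic the teacher's. The KL form and temperature are assumptions, not the paper's exact formulation.

```python
# Hypothetical transfer of self-supervised similarity structure from teacher to student.
import torch
import torch.nn.functional as F

def similarity_matrix(emb, T=0.5):
    # emb: [N, D] representations of augmented views; rows -> softmax-normalised similarities.
    z = F.normalize(emb, dim=-1)
    sim = z @ z.t() / T
    sim.fill_diagonal_(float("-inf"))             # ignore self-similarity
    return F.softmax(sim, dim=-1)

def sskd_style_loss(student_emb, teacher_emb):
    p_t = similarity_matrix(teacher_emb)
    p_s = similarity_matrix(student_emb)
    return F.kl_div(p_s.clamp_min(1e-8).log(), p_t, reduction="batchmean")
```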
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
- Learning From Multiple Experts: Self-paced Knowledge Distillation for Long-tailed Classification [106.08067870620218]
We propose a self-paced knowledge distillation framework, termed Learning From Multiple Experts (LFME).
We refer to these models as 'Experts', and the proposed LFME framework aggregates the knowledge from multiple 'Experts' to learn a unified student model.
We conduct extensive experiments and demonstrate that our method is able to achieve superior performances compared to state-of-the-art methods.
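The aggregation step can be pictured as distilling the student from a weighted average of the experts' soft predictions; the uniform weighting and fixed temperature below are placeholders for the paper's self-paced schedules.

```python
# Hypothetical aggregation of several 'Expert' models into one student via soft-label averaging.
import torch
import torch.nn.functional as F

def multi_expert_kd_loss(student_logits, expert_logits_list, weights=None, T=2.0):
    # expert_logits_list: list of [B, C] tensors, one per expert.
    if weights is None:
        weights = [1.0 / len(expert_logits_list)] * len(expert_logits_list)
    soft = sum(w * F.softmax(e / T, dim=-1) for w, e in zip(weights, expert_logits_list))
    return F.kl_div(F.log_softmax(student_logits / T, dim=-1), soft,
                    reduction="batchmean") * T * T
```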
arXiv Detail & Related papers (2020-01-06T12:57:36Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.