WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
- URL: http://arxiv.org/abs/2510.14668v2
- Date: Tue, 04 Nov 2025 07:51:22 GMT
- Title: WeCKD: Weakly-supervised Chained Distillation Network for Efficient Multimodal Medical Imaging
- Authors: Md. Abdur Rahman, Mohaimenul Azam Khan Raiaan, Sami Azam, Asif Karim, Jemima Beissbarth, Amanda Leach
- Abstract summary: We present a Weakly-supervised Chain-based KD network that redefines knowledge transfer through a structured sequence of interconnected models. Each model in the chain is trained on only a fraction of the dataset and shows that effective learning can be achieved with minimal supervision. The proposed distillation chain resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data.
- Score: 1.9316515057518757
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Knowledge distillation (KD) has traditionally relied on a static teacher-student framework, where a large, well-trained teacher transfers knowledge to a single student model. However, these approaches often suffer from knowledge degradation, inefficient supervision, and reliance on either a very strong teacher model or large labeled datasets. To address these, we present the first-ever Weakly-supervised Chain-based KD network (WeCKD) that redefines knowledge transfer through a structured sequence of interconnected models. Unlike conventional KD, it forms a progressive distillation chain, where each model not only learns from its predecessor but also refines the knowledge before passing it forward. This structured knowledge transfer further enhances feature learning and addresses the limitations of one-step KD. Each model in the chain is trained on only a fraction of the dataset and shows that effective learning can be achieved with minimal supervision. Extensive evaluation on six imaging datasets across otoscopic, microscopic, and magnetic resonance imaging modalities shows that it generalizes and outperforms existing methods. Furthermore, the proposed distillation chain resulted in cumulative accuracy gains of up to +23% over a single backbone trained on the same limited data, which highlights its potential for real-world adoption.
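The chain described in the abstract, where each model learns from its predecessor on only a fraction of the data, can be illustrated with a minimal sketch. The abstract does not specify the loss, the number of models, or how the data is partitioned, so everything below is an assumption: the loss is the standard temperature-scaled KD objective (cross-entropy plus KL divergence to the teacher), the splits are disjoint contiguous slices, and the names `kd_loss`, `chain_splits`, and `chain_schedule` are hypothetical helpers rather than the authors' code.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """Standard KD objective (an assumption, not the paper's stated loss):
    alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher || student).
    """
    # Hard-label cross-entropy at temperature 1.
    ce = -math.log(softmax(student_logits)[label])
    # Softened distributions for the distillation term.
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(q_teacher, q_student) if t > 0)
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl


def chain_splits(n_samples, n_models, fraction):
    """Give each model in the chain its own slice of the sample indices.
    Disjoint contiguous slices are an assumption made for illustration.
    """
    size = int(n_samples * fraction)
    return [
        list(range(i * size, min((i + 1) * size, n_samples)))
        for i in range(n_models)
    ]


def chain_schedule(n_models):
    """(teacher_index, student_index) pairs; the first model has no teacher."""
    return [(None, 0)] + [(i - 1, i) for i in range(1, n_models)]


if __name__ == "__main__":
    for teacher, student in chain_schedule(3):
        print(f"model {student} distills from {teacher}")
    print(kd_loss([1.0, 2.0], [2.0, 1.0], label=1))
```

In this sketch the model at position `i` would be trained with `kd_loss` against the frozen model at position `i - 1`, using only the indices in `chain_splits(...)[i]`. How the paper actually refines knowledge between links is not recoverable from the abstract.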
Related papers
- Enriching Knowledge Distillation with Cross-Modal Teacher Fusion [4.704107417683679]
Multi-teacher knowledge distillation (KD) transfers knowledge from expert teachers to a compact student model using logit or feature matching.
We propose a simple yet effective framework that fuses the logits and features of a conventional teacher with those from CLIP.
Analysis shows that the fused teacher yields more confident and reliable predictions, significantly increasing confident-correct cases while reducing confidently wrong ones.
arXiv Detail & Related papers (2025-11-12T12:50:15Z)
- Cross-View Consistency Regularisation for Knowledge Distillation [13.918476599394603]
This work is inspired by the success of cross-view learning in fields such as semi-supervised learning.
We introduce within-view and cross-view regularisations to standard logit-based distillation frameworks.
We also perform confidence-based soft label mining to improve the quality of distilling signals from the teacher.
arXiv Detail & Related papers (2024-12-21T05:41:47Z)
- Active Data Curation Effectively Distills Large-Scale Multimodal Models [66.23057263509027]
Knowledge distillation (KD) is the de facto standard for compressing large-scale models into smaller ones.
In this work we explore an alternative, yet simple approach -- active data curation as effective distillation for contrastive multimodal pretraining.
Our simple online batch selection method, ACID, outperforms strong KD baselines across various model-, data- and compute-configurations.
arXiv Detail & Related papers (2024-11-27T18:50:15Z)
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
- Multi Teacher Privileged Knowledge Distillation for Multimodal Expression Recognition [58.41784639847413]
Human emotion is a complex phenomenon conveyed and perceived through facial expressions, vocal tones, body language, and physiological signals.
In this paper, a multi-teacher PKD (MT-PKDOT) method with self-distillation is introduced to align diverse teacher representations before distilling them to the student.
Results indicate that our proposed method can outperform SOTA PKD methods.
arXiv Detail & Related papers (2024-08-16T22:11:01Z)
- DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning [3.763772992906958]
Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch.
DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation) explicitly models different types of per-modality information.
arXiv Detail & Related papers (2024-08-05T13:44:15Z)
- Multi-Task Multi-Scale Contrastive Knowledge Distillation for Efficient Medical Image Segmentation [0.0]
This thesis aims to investigate the feasibility of knowledge transfer between neural networks for medical image segmentation tasks.
In the context of medical imaging, where the data volumes are often limited, leveraging knowledge from a larger pre-trained network could be useful.
arXiv Detail & Related papers (2024-06-05T12:06:04Z)
- Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport [46.91791643660991]
Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments.
These models struggle in the wild because of the unavailability and quality of modalities used for training.
In practice, only a subset of the training-time modalities may be available at test time.
Learning with privileged information enables models to exploit data from additional modalities that are only available during training.
arXiv Detail & Related papers (2024-01-27T19:44:15Z)
- Robustness-Reinforced Knowledge Distillation with Correlation Distance and Network Pruning [3.1423836318272773]
Knowledge distillation (KD) improves the performance of efficient and lightweight models.
Most existing KD techniques rely on Kullback-Leibler (KL) divergence.
We propose a Robustness-Reinforced Knowledge Distillation (R2KD) that leverages correlation distance and network pruning.
arXiv Detail & Related papers (2023-11-23T11:34:48Z)
- Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference.
We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples.
CKD consistently outperforms state-of-the-art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
- EvDistill: Asynchronous Events to End-task Learning via Bidirectional Reconstruction-guided Cross-modal Knowledge Distillation [61.33010904301476]
Event cameras sense per-pixel intensity changes and produce asynchronous event streams with high dynamic range and less motion blur.
We propose a novel approach, called EvDistill, to learn a student network on the unlabeled and unpaired event data.
We show that EvDistill achieves significantly better results than the prior works and KD with only events and APS frames.
arXiv Detail & Related papers (2021-11-24T08:48:16Z)
- How and When Adversarial Robustness Transfers in Knowledge Distillation? [137.11016173468457]
This paper studies how and when adversarial robustness can be transferred from a teacher model to a student model in knowledge distillation (KD).
We show that standard KD training fails to preserve adversarial robustness, and we propose KD with input gradient alignment (KDIGA) as a remedy.
Under certain assumptions, we prove that the student model using our proposed KDIGA can achieve at least the same certified robustness as the teacher model.
arXiv Detail & Related papers (2021-10-22T21:30:53Z)
- Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different self-supervision task can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)
- Heterogeneous Knowledge Distillation using Information Flow Modeling [82.83891707250926]
We propose a novel KD method that works by modeling the information flow through the various layers of the teacher model.
The proposed method is capable of overcoming the aforementioned limitations by using an appropriate supervision scheme during the different phases of the training process.
arXiv Detail & Related papers (2020-05-02T06:56:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site.