MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation
- URL: http://arxiv.org/abs/2507.07015v1
- Date: Wed, 09 Jul 2025 16:45:28 GMT
- Title: MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation
- Authors: Hui Li, Pengfei Yang, Juanyang Chen, Le Dong, Yanxin Chen, Quan Wang
- Abstract summary: MST-Distill is a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network.
- Score: 8.68486556125022
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Knowledge distillation, as an efficient knowledge transfer technique, has achieved remarkable success in unimodal scenarios. However, in cross-modal settings, conventional distillation methods encounter significant challenges due to data and statistical heterogeneities, failing to leverage the complementary prior knowledge embedded in cross-modal teacher models. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. To address these limitations, we propose MST-Distill, a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network that facilitates adaptive and dynamic distillation. This architecture overcomes the constraints of traditional methods that rely on a single, static teacher model. Additionally, we introduce a plug-in masking module, trained independently to suppress modality-specific discrepancies and reconstruct teacher representations, thereby mitigating knowledge drift and enhancing transfer effectiveness. Extensive experiments on five diverse multimodal datasets spanning visual, audio, and textual modalities demonstrate that our method significantly outperforms existing state-of-the-art knowledge distillation methods on cross-modal distillation tasks. The source code is available at https://github.com/Gray-OREO/MST-Distill.
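To make the routing idea concrete, the following is a minimal PyTorch sketch of instance-level routing over a teacher ensemble for distillation. It is not the authors' released implementation (see the linked repository for that): the class names, gating architecture, and loss weighting below are illustrative assumptions, and the plug-in masking module is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstanceRouter(nn.Module):
    """Tiny gating network (hypothetical design): maps a per-instance
    feature vector to a probability distribution over the teacher ensemble."""

    def __init__(self, feat_dim: int, num_teachers: int):
        super().__init__()
        self.gate = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, num_teachers),
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # One routing weight per teacher, per instance.
        return F.softmax(self.gate(feats), dim=-1)


def routed_distillation_loss(student_logits, teacher_logits_list,
                             routing_weights, temperature=4.0):
    """KL-based distillation loss in which each instance is softly assigned
    to teachers via routing_weights of shape [batch, num_teachers]."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    loss = 0.0
    for t_idx, t_logits in enumerate(teacher_logits_list):
        p_teacher = F.softmax(t_logits / temperature, dim=-1)
        # Per-instance KL divergence between teacher and student soft targets.
        kl = F.kl_div(log_p_student, p_teacher, reduction="none").sum(dim=-1)
        loss = loss + (routing_weights[:, t_idx] * kl).mean()
    return loss * temperature ** 2


if __name__ == "__main__":
    batch, feat_dim, num_classes, num_teachers = 8, 64, 10, 3
    router = InstanceRouter(feat_dim, num_teachers)
    student_feats = torch.randn(batch, feat_dim)
    student_logits = torch.randn(batch, num_classes)
    teacher_logits = [torch.randn(batch, num_classes) for _ in range(num_teachers)]
    weights = router(student_feats)
    print(routed_distillation_loss(student_logits, teacher_logits, weights).item())
```

In the paper's framing, the routing weights would additionally interact with teacher representations reconstructed by the masking module; this sketch only illustrates per-instance weighting of teacher soft targets.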
Related papers
- Cross-Modal Distillation For Widely Differing Modalities [31.049823782188437]
We conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. This knowledge transfer via distillation is not trivial because the large domain gap between the widely differing modalities can easily lead to overfitting. We propose two soft-constrained knowledge distillation strategies at the feature level and a quality-based adaptive weighting module to weight input samples.
arXiv Detail & Related papers (2025-07-22T07:34:00Z) - JointDistill: Adaptive Multi-Task Distillation for Joint Depth Estimation and Scene Segmentation [31.89422375115854]
This work explores how multi-task distillation can be used to improve unified modeling. We propose a self-adaptive distillation method that adjusts the amount of knowledge transferred from each teacher according to the student's current learning ability. We evaluate our method on multiple benchmark datasets, including Cityscapes and NYU-v2.
arXiv Detail & Related papers (2025-05-15T08:00:48Z) - Learning from Stochastic Teacher Representations Using Student-Guided Knowledge Distillation [64.15918654558816]
A self-distillation (SSD) training strategy is introduced for filtering and weighting teacher representations so that only task-relevant representations are distilled. Experimental results on real-world affective computing, wearable/biosignal datasets from the UCR Archive, the HAR dataset, and image classification datasets show that the proposed SSD method can outperform state-of-the-art methods.
arXiv Detail & Related papers (2025-04-19T14:08:56Z) - Is Contrastive Distillation Enough for Learning Comprehensive 3D Representations? [55.99654128127689]
Cross-modal contrastive distillation has recently been explored for learning effective 3D representations. Existing methods focus primarily on modality-shared features, neglecting the modality-specific features during the pre-training process. We propose a new framework, namely CMCR, to address these shortcomings.
arXiv Detail & Related papers (2024-12-12T06:09:49Z) - Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages: Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z) - DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning [3.763772992906958]
Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch.
DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation) explicitly models different types of per-modality information.
arXiv Detail & Related papers (2024-08-05T13:44:15Z) - CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities.
We propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs.
Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z) - Collaborative Teacher-Student Learning via Multiple Knowledge Transfer [79.45526596053728]
We propose a collaborative teacher-student learning framework via multiple knowledge transfer (CTSL-MKT).
It allows multiple students to learn knowledge from both individual instances and instance relations in a collaborative way.
The experiments and ablation studies on four image datasets demonstrate that the proposed CTSL-MKT significantly outperforms the state-of-the-art KD methods.
arXiv Detail & Related papers (2021-01-21T07:17:04Z) - Knowledge Distillation Meets Self-Supervision [109.6400639148393]
Knowledge distillation involves extracting "dark knowledge" from a teacher network to guide the learning of a student network.
We show that the seemingly different task of self-supervision can serve as a simple yet powerful solution.
By exploiting the similarity between those self-supervision signals as an auxiliary task, one can effectively transfer the hidden information from the teacher to the student.
arXiv Detail & Related papers (2020-06-12T12:18:52Z)