Cross-Modal Distillation For Widely Differing Modalities
- URL: http://arxiv.org/abs/2507.16296v1
- Date: Tue, 22 Jul 2025 07:34:00 GMT
- Title: Cross-Modal Distillation For Widely Differing Modalities
- Authors: Cairong Zhao, Yufeng Jin, Zifan Song, Haonan Chen, Duoqian Miao, Guosheng Hu
- Abstract summary: We conduct multi-modal learning by introducing a teacher model to transfer discriminative knowledge to a student model during training. This knowledge transfer via distillation is not trivial because the large domain gap between the widely differing modalities can easily lead to overfitting. We propose two soft constrained knowledge distillation strategies, at the feature level and the classifier level, together with a quality-based adaptive weights module to weigh input samples.
- Score: 31.049823782188437
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Deep learning has achieved great progress recently; however, it is not easy or efficient to further improve performance simply by increasing model size. Multi-modal learning can mitigate this challenge by introducing richer and more discriminative information as input. To address the problem of limited access to multi-modal data at the time of use, we conduct multi-modal learning by introducing a teacher model that transfers discriminative knowledge to a student model during training. However, this knowledge transfer via distillation is not trivial because the large domain gap between the widely differing modalities can easily lead to overfitting. In this work, we introduce a cross-modal distillation framework. Specifically, we find that a hard constrained loss, e.g., an l2 loss forcing the student to be exactly the same as the teacher, can easily lead to overfitting in cross-modality distillation. To address this, we propose two soft constrained knowledge distillation strategies, at the feature level and the classifier level respectively. In addition, we propose a quality-based adaptive weights module that weighs input samples via quantified data quality, leading to robust model training. We conducted experiments on speaker recognition and image classification tasks, and the results show that our approach effectively achieves knowledge transfer between the commonly used yet widely differing modalities of image, text, and speech.
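The abstract describes the two soft constraints and the quality-based weighting only at a high level. Below is a minimal PyTorch-style sketch of how such terms could be combined, assuming a cosine-similarity feature constraint, a temperature-softened KL classifier constraint, and precomputed per-sample quality scores; the function names and hyperparameters (alpha, beta, T) are hypothetical illustrations, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def soft_feature_loss(student_feat, teacher_feat, eps=1e-8):
    """Feature-level soft constraint: match directions (cosine similarity)
    rather than exact values, so the student is not forced to reproduce the
    teacher embedding coordinate-by-coordinate as a hard l2 loss would."""
    s = F.normalize(student_feat, dim=-1, eps=eps)
    t = F.normalize(teacher_feat, dim=-1, eps=eps)
    return 1.0 - (s * t).sum(dim=-1)                     # shape: (batch,)

def soft_classifier_loss(student_logits, teacher_logits, T=4.0):
    """Classifier-level soft constraint: KL divergence between
    temperature-softened class distributions (soft-label distillation)."""
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1) * (T * T)

def quality_weights(quality_scores):
    """Quality-based adaptive weights: turn per-sample quality scores
    (quantified upstream, e.g. speech SNR or image sharpness) into weights
    that down-weight low-quality inputs during training."""
    return torch.softmax(quality_scores, dim=0) * quality_scores.numel()

def distillation_objective(student_feat, teacher_feat,
                           student_logits, teacher_logits,
                           labels, quality_scores,
                           alpha=1.0, beta=1.0):
    """Combine the task loss with the two soft-constrained distillation
    terms, each weighted per sample by data quality."""
    w = quality_weights(quality_scores)                  # (batch,)
    task = F.cross_entropy(student_logits, labels, reduction="none")
    feat = soft_feature_loss(student_feat, teacher_feat)
    clf = soft_classifier_loss(student_logits, teacher_logits)
    return (w * (task + alpha * feat + beta * clf)).mean()
```

In this sketch the per-sample weights have a mean of roughly one, so low-quality inputs are down-weighted without changing the overall loss scale.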
Related papers
- MST-Distill: Mixture of Specialized Teachers for Cross-Modal Knowledge Distillation [8.68486556125022]
MST-Distill is a novel cross-modal knowledge distillation framework featuring a mixture of specialized teachers. This paper empirically reveals two critical issues in existing approaches: distillation path selection and knowledge drift. Our approach employs a diverse ensemble of teacher models across both cross-modal and multimodal configurations, integrated with an instance-level routing network (see the routing sketch after this list).
arXiv Detail & Related papers (2025-07-09T16:45:28Z)
- Unlocking the Potential of Difficulty Prior in RL-based Multimodal Reasoning [69.64809103333839]
We investigate how explicitly modeling a problem's difficulty prior shapes the effectiveness of reinforcement-learning-based fine-tuning for multimodal reasoning. Our approach demonstrates significant performance across various multi-modal mathematical reasoning benchmarks with only 2K+0.6K two-stage training data.
arXiv Detail & Related papers (2025-05-19T15:43:10Z)
- JointDistill: Adaptive Multi-Task Distillation for Joint Depth Estimation and Scene Segmentation [31.89422375115854]
This work explores how multi-task distillation can be used to improve unified modeling. We propose a self-adaptive distillation method that adjusts the amount of knowledge taken from each teacher according to the student's current learning ability. We evaluate our method on multiple benchmark datasets, including Cityscapes and NYU-v2.
arXiv Detail & Related papers (2025-05-15T08:00:48Z)
- Knowledge Distillation for Multimodal Egocentric Action Recognition Robust to Missing Modalities [43.15852057358654]
We introduce an efficient multimodal knowledge distillation approach for egocentric action recognition. Our method focuses on resource-efficient development by leveraging pre-trained models as unimodal feature extractors in our teacher model.
arXiv Detail & Related papers (2025-04-11T14:30:42Z)
- Sample-level Adaptive Knowledge Distillation for Action Recognition [43.35357057084902]
Knowledge Distillation (KD) compresses neural networks by training a small network (student) with knowledge transferred from a pre-trained large network (teacher). We propose a Sample-level Adaptive Knowledge Distillation framework for action recognition. Experimental results on two video benchmarks and one image benchmark demonstrate the superiority of the proposed method.
arXiv Detail & Related papers (2025-04-01T10:04:20Z)
- Multi-Stage Knowledge Integration of Vision-Language Models for Continual Learning [79.46570165281084]
We propose a Multi-Stage Knowledge Integration network (MulKI) to emulate the human learning process in distillation methods.
MulKI achieves this through four stages, including Eliciting Ideas, Adding New Ideas, Distinguishing Ideas, and Making Connections.
Our method demonstrates significant improvements in maintaining zero-shot capabilities while supporting continual learning across diverse downstream tasks.
arXiv Detail & Related papers (2024-11-11T07:36:19Z)
- Modality-Balanced Learning for Multimedia Recommendation [21.772064939915214]
We propose a Counterfactual Knowledge Distillation method to solve the imbalance problem and make the best use of all modalities.
We also design a novel generic-and-specific distillation loss to guide the multimodal student to learn wider-and-deeper knowledge from teachers.
Our method could serve as a plug-and-play module for both late-fusion and early-fusion backbones.
arXiv Detail & Related papers (2024-07-26T07:53:01Z)
- Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences.
We pose the problem of unseen modality interaction and introduce a first solution.
It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
- Learning Transferable Adversarial Robust Representations via Multi-view Consistency [57.73073964318167]
We propose a novel meta-adversarial multi-view representation learning framework with dual encoders.
We demonstrate the effectiveness of our framework on few-shot learning tasks from unseen domains.
arXiv Detail & Related papers (2022-10-19T11:48:01Z)
- Dynamic Contrastive Distillation for Image-Text Retrieval [90.05345397400144]
We present a novel plug-in dynamic contrastive distillation (DCD) framework to compress image-text retrieval models.
We successfully apply our proposed DCD strategy to two state-of-the-art vision-language pretrained models, i.e. ViLT and METER.
Experiments on MS-COCO and Flickr30K benchmarks show the effectiveness and efficiency of our DCD framework.
arXiv Detail & Related papers (2022-07-04T14:08:59Z)
- On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification.
We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned.
Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
- Multi-Scale Aligned Distillation for Low-Resolution Detection [68.96325141432078]
This paper focuses on boosting the performance of low-resolution models by distilling knowledge from a high- or multi-resolution model.
On several instance-level detection tasks and datasets, the low-resolution models trained via our approach perform competitively with high-resolution models trained via conventional multi-scale training.
arXiv Detail & Related papers (2021-09-14T12:53:35Z)
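Several of the entries above (MST-Distill's instance-level routing over specialized teachers, JointDistill's self-adaptive per-teacher weighting, and the sample-level adaptive framework for action recognition) share the idea of weighting distillation signals per input sample. The sketch below is a generic, hypothetical PyTorch illustration of that shared idea, not a reproduction of any of the cited methods: a small gating network routes each sample's distillation loss across several teachers.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class InstanceRouter(nn.Module):
    """Tiny gating network mapping a student feature to per-sample weights
    over K teachers, so each instance chooses which teachers to distill from."""
    def __init__(self, feat_dim: int, num_teachers: int):
        super().__init__()
        self.gate = nn.Linear(feat_dim, num_teachers)

    def forward(self, student_feat: torch.Tensor) -> torch.Tensor:
        # (batch, K) routing weights, each row summing to 1
        return F.softmax(self.gate(student_feat), dim=-1)

def routed_distillation_loss(student_logits, teacher_logits_list,
                             student_feat, router, T=4.0):
    """Per-sample weighted sum of KL distillation losses from K teachers."""
    weights = router(student_feat)                        # (batch, K)
    log_p_s = F.log_softmax(student_logits / T, dim=-1)
    losses = []
    for t_logits in teacher_logits_list:                  # K tensors, (batch, C)
        p_t = F.softmax(t_logits / T, dim=-1)
        kl = F.kl_div(log_p_s, p_t, reduction="none").sum(dim=-1) * (T * T)
        losses.append(kl)                                 # (batch,)
    losses = torch.stack(losses, dim=-1)                  # (batch, K)
    return (weights * losses).sum(dim=-1).mean()
```

In practice the routing weights would likely need extra supervision or regularization (e.g., to avoid collapsing onto a single teacher); the cited papers differ in exactly how this is handled.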