Related papers: Information-Theoretic Criteria for Knowledge Distillation in Multimodal Learning

URL: http://arxiv.org/abs/2510.13182v1
Date: Wed, 15 Oct 2025 06:10:10 GMT
Title: Information-Theoretic Criteria for Knowledge Distillation in Multimodal Learning
Authors: Rongrong Xie, Yizhou Xu, Guido Sanguinetti
Abstract summary: Cross-modal knowledge distillation (KD) is a technique where "teacher" modalities transfer information to weaker "student" modalities during model training to improve performance. Despite successes across various applications, cross-modal KD does not always result in improved outcomes, primarily due to a limited theoretical understanding that could inform practice. We propose that cross-modal KD is effective when the mutual information between teacher and student representations exceeds the mutual information between the student representation and the labels. Our study establishes a novel theoretical framework for understanding cross-modal KD and offers practical guidelines based on the CCH criterion to select optimal teacher modalities for improving the performance of weaker modalities.
Score: 7.255275023242901
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The rapid increase in multimodal data availability has sparked significant interest in cross-modal knowledge distillation (KD) techniques, where richer "teacher" modalities transfer information to weaker "student" modalities during model training to improve performance. However, despite successes across various applications, cross-modal KD does not always result in improved outcomes, primarily due to a limited theoretical understanding that could inform practice. To address this gap, we introduce the Cross-modal Complementarity Hypothesis (CCH): we propose that cross-modal KD is effective when the mutual information between teacher and student representations exceeds the mutual information between the student representation and the labels. We theoretically validate the CCH in a joint Gaussian model and further confirm it empirically across diverse multimodal datasets, including image, text, video, audio, and cancer-related omics data. Our study establishes a novel theoretical framework for understanding cross-modal KD and offers practical guidelines based on the CCH criterion to select optimal teacher modalities for improving the performance of weaker modalities.
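
The CCH criterion stated in the abstract compares two mutual-information terms: I(teacher; student) and I(student; labels). The following is a minimal sketch of how such a comparison could be computed, not the authors' implementation. It assumes that all variables are jointly Gaussian, so that mutual information has the closed form 0.5 * (log det Cov(X) + log det Cov(Y) - log det Cov(X, Y)), and it treats labels as continuous values. The function names `gaussian_mi` and `cch_holds`, the `ridge` regularizer, and the synthetic data are all hypothetical choices made for illustration.

```python
import numpy as np


def gaussian_mi(x, y, ridge=1e-6):
    """Estimate I(X; Y) in nats, assuming X and Y are jointly Gaussian.

    x, y: arrays of shape (n_samples,) or (n_samples, n_features).
    ridge: small diagonal term (an assumption) that keeps the sample
           covariance well conditioned.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    dx = x.shape[1]
    joint = np.hstack([x, y])
    cov = np.cov(joint, rowvar=False) + ridge * np.eye(joint.shape[1])
    # slogdet returns (sign, log|det|); only the log-determinant is needed.
    _, logdet_x = np.linalg.slogdet(cov[:dx, :dx])
    _, logdet_y = np.linalg.slogdet(cov[dx:, dx:])
    _, logdet_joint = np.linalg.slogdet(cov)
    return 0.5 * (logdet_x + logdet_y - logdet_joint)


def cch_holds(teacher, student, labels):
    """Return True when I(teacher; student) > I(student; labels)."""
    return bool(gaussian_mi(teacher, student) > gaussian_mi(student, labels))


# Synthetic illustration (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
n = 5000
student = rng.normal(size=n)
# Teacher closely tracks the student representation.
teacher = student + 0.1 * rng.normal(size=n)
# Labels are only weakly related to the student representation.
labels = student + 2.0 * rng.normal(size=n)

print("I(teacher; student):", gaussian_mi(teacher, student))
print("I(student; labels): ", gaussian_mi(student, labels))
print("CCH satisfied:", cch_holds(teacher, student, labels))
```

In this toy setup the teacher shares far more information with the student than the student shares with the labels, so the check reports that the criterion is satisfied. For real representations that are not Gaussian, a different mutual-information estimator would be needed.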

Related papers

On the Comparison between Multi-modal and Single-modal Contrastive Learning [50.74988548106031]
We introduce a theoretical foundation for understanding the differences between multi-modal and single-modal contrastive learning. We identify the critical factor, which is the signal-to-noise ratio (SNR), that impacts the generalizability in downstream tasks of both multi-modal and single-modal contrastive learning. Our analysis provides a unified framework that can characterize the optimization and generalization of both single-modal and multi-modal contrastive learning.
arXiv Detail & Related papers (2024-11-05T06:21:17Z)
Speculative Knowledge Distillation: Bridging the Teacher-Student Gap Through Interleaved Sampling [81.00825302340984]
We introduce Speculative Knowledge Distillation (SKD) to generate high-quality training data on-the-fly. In SKD, the student proposes tokens, and the teacher replaces poorly ranked ones based on its own distribution. We evaluate SKD on various text generation tasks, including translation, summarization, math, and instruction following.
arXiv Detail & Related papers (2024-10-15T06:51:25Z)
DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning [3.763772992906958]
Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch. DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation) explicitly models different types of per-modality information.
arXiv Detail & Related papers (2024-08-05T13:44:15Z)
Leveraging Weak Cross-Modal Guidance for Coherence Modelling via Iterative Learning [66.28872204574648]
Cross-modal coherence modeling is essential for intelligent systems to help them organize and structure information. Previous work on cross-modal coherence modeling attempted to leverage the order information from another modality to assist the coherence recovering of the target modality. This paper explores a new way to take advantage of cross-modal guidance without gold labels on coherency.
arXiv Detail & Related papers (2024-08-01T06:04:44Z)
Distilling Privileged Multimodal Information for Expression Recognition using Optimal Transport [46.91791643660991]
Deep learning models for multimodal expression recognition have reached remarkable performance in controlled laboratory environments. These models struggle in the wild because of the unavailability and quality of modalities used for training. In practice, only a subset of the training-time modalities may be available at test time. Learning with privileged information enables models to exploit data from additional modalities that are only available during training.
arXiv Detail & Related papers (2024-01-27T19:44:15Z)
Comparative Knowledge Distillation [102.35425896967791]
Traditional Knowledge Distillation (KD) assumes readily available access to teacher models for frequent inference. We propose Comparative Knowledge Distillation (CKD), which encourages student models to understand the nuanced differences in a teacher model's interpretations of samples. CKD consistently outperforms state of the art data augmentation and KD techniques.
arXiv Detail & Related papers (2023-11-03T21:55:33Z)
Enhanced Multimodal Representation Learning with Cross-modal KD [14.14709952127258]
This paper explores leveraging auxiliary modalities which are only available at training to enhance multimodal representation learning through cross-modal Knowledge Distillation (KD). The widely adopted mutual information-based objective leads to a short-cut solution of the weak teacher, i.e., achieving the maximum mutual information by simply making the teacher model as weak as the student model. To prevent such a weak solution, we introduce an additional objective term, i.e., the mutual information between the teacher and the auxiliary modality model.
arXiv Detail & Related papers (2023-06-13T09:35:37Z)
CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities. We propose a new Cross-modal Mutual Distillation (CMD) framework with the following designs. Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z)
The Modality Focusing Hypothesis: On the Blink of Multimodal Knowledge Distillation [16.399589194973814]
Multimodal knowledge distillation extends traditional knowledge distillation to the area of multimodal learning. One common practice is to adopt a well-performed multimodal network as the teacher in the hope that it can transfer its full knowledge to a unimodal student for performance improvement.
arXiv Detail & Related papers (2022-06-13T21:34:21Z)
Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH). We learn informative representations that can preserve both intra- and inter-modal similarities. The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
arXiv Detail & Related papers (2021-12-13T08:58:03Z)
Modality-specific Distillation [30.190082262375395]
We propose modality-specific distillation (MSD) to effectively transfer knowledge from a teacher on multimodal datasets. Our idea aims at mimicking a teacher's modality-specific predictions by introducing an auxiliary loss term for each modality. Because each modality has different importance for predictions, we also propose weighting approaches for the auxiliary losses.
arXiv Detail & Related papers (2021-01-06T05:45:07Z)
Knowledge Distillation Beyond Model Compression [13.041607703862724]
Knowledge distillation (KD) is commonly deemed as an effective model compression technique in which a compact model (student) is trained under the supervision of a larger pretrained model or ensemble of models (teacher). In this study, we provide an extensive study on nine different KD methods which covers a broad spectrum of approaches to capture and transfer knowledge.
arXiv Detail & Related papers (2020-07-03T19:54:04Z)

This list is automatically generated from the titles and abstracts of the papers on this site.