Related papers: Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency

URL: http://arxiv.org/abs/2511.08901v1
Date: Thu, 13 Nov 2025 01:15:58 GMT
Title: Asymmetric Cross-Modal Knowledge Distillation: Bridging Modalities with Weak Semantic Consistency
Authors: Riling Wei, Kelu Yao, Chuanguang Yang, Jin Wang, Zhuoyan Gao, Chao Li,
Abstract summary: Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections. We investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD). We propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module.
Score: 16.550957851406014
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Cross-modal Knowledge Distillation has demonstrated promising performance on paired modalities with strong semantic connections, referred to as Symmetric Cross-modal Knowledge Distillation (SCKD). However, implementing SCKD becomes exceedingly constrained in real-world scenarios due to the limited availability of paired modalities. To this end, we investigate a general and effective knowledge learning concept under weak semantic consistency, dubbed Asymmetric Cross-modal Knowledge Distillation (ACKD), aiming to bridge modalities with limited semantic overlap. Nevertheless, the shift from strong to weak semantic consistency improves flexibility but exacerbates challenges in knowledge transmission costs, which we rigorously verified based on optimal transport theory. To mitigate the issue, we further propose a framework, namely SemBridge, integrating a Student-Friendly Matching module and a Semantic-aware Knowledge Alignment module. The former leverages self-supervised learning to acquire semantic-based knowledge and provide personalized instruction for each student sample by dynamically selecting the relevant teacher samples. The latter seeks the optimal transport path by employing Lagrangian optimization. To facilitate the research, we curate a benchmark dataset derived from two modalities, namely Multi-Spectral (MS) and asymmetric RGB images, tailored for remote sensing scene classification. Comprehensive experiments exhibit that our framework achieves state-of-the-art performance compared with 7 existing approaches on 6 different model architectures across various datasets.
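The abstract frames knowledge transfer between unpaired teacher and student samples as an optimal transport problem. As a rough illustration only, the sketch below computes an entropy-regularized transport plan between two unpaired feature sets using Sinkhorn iterations. This is a swapped-in standard technique, not the paper's Lagrangian solver or the SemBridge implementation; the function names `pairwise_cost` and `sinkhorn_plan`, the cosine-based cost, and the values of `eps` and `n_iters` are all assumptions made for the example.

```python
import numpy as np


def pairwise_cost(student, teacher):
    """Squared Euclidean distance between L2-normalized feature rows.

    Hypothetical cost choice for illustration; values lie in [0, 4].
    """
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    # For unit vectors, ||s - t||^2 = 2 - 2 * <s, t>.
    return np.maximum(2.0 - 2.0 * (s @ t.T), 0.0)


def sinkhorn_plan(cost, a, b, eps=1.0, n_iters=500):
    """Entropy-regularized optimal transport plan via Sinkhorn iterations.

    cost: (n, m) cost matrix; a: (n,) source weights; b: (m,) target weights.
    Returns an (n, m) plan whose rows sum to a and columns sum to b.
    """
    K = np.exp(-cost / eps)          # Gibbs kernel
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)              # rescale to match row marginals
        v = b / (K.T @ u)            # rescale to match column marginals
    return u[:, None] * K * v[None, :]


# Unpaired sets: 5 student samples and 7 teacher samples, 8-dim features.
rng = np.random.default_rng(0)
student_feats = rng.normal(size=(5, 8))
teacher_feats = rng.normal(size=(7, 8))

cost = pairwise_cost(student_feats, teacher_feats)
a = np.full(5, 1.0 / 5)              # uniform mass on student samples
b = np.full(7, 1.0 / 7)              # uniform mass on teacher samples
plan = sinkhorn_plan(cost, a, b)

# Total cost of moving teacher knowledge to students under this plan.
transport_cost = float((plan * cost).sum())
print(plan.shape, transport_cost)
```

Each row of `plan` can be read as a soft assignment of one student sample to the teacher samples, which loosely mirrors the idea of selecting relevant teacher samples per student. A lower `transport_cost` indicates feature sets that are cheaper to align under the assumed cost.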

Related papers

From Sparse Decisions to Dense Reasoning: A Multi-attribute Trajectory Paradigm for Multimodal Moderation [59.27094165576015]
We propose a novel learning paradigm (UniMod) that transitions from sparse decision-making to dense reasoning traces. By constructing structured trajectories encompassing evidence grounding, modality assessment, risk mapping, policy decision, and response generation, we reformulate monolithic decision tasks into a multi-dimensional boundary learning process. We introduce specialized optimization strategies to decouple task-specific parameters and rebalance training dynamics, effectively resolving interference between diverse objectives in multi-task learning.
arXiv Detail & Related papers (2026-01-28T09:29:40Z)
DIS2: Disentanglement Meets Distillation with Classwise Attention for Robust Remote Sensing Segmentation under Missing Modalities [28.992992584085787]
DIS2 is a new paradigm shifting from modality-shared feature dependence to active, guided missing features compensation. Compensatory features are explicitly captured which, when fused with the features of the available modality, approximate the ideal fused representation of the full-modality case. Our proposed approach significantly outperforms state-of-the-art methods across benchmarks.
arXiv Detail & Related papers (2026-01-20T01:33:54Z)
Multi-Level Aware Preference Learning: Enhancing RLHF for Complex Multi-Instruction Tasks [81.44256822500257]
RLHF has emerged as a predominant approach for aligning artificial intelligence systems with human preferences. RLHF exhibits insufficient compliance capabilities when confronted with complex multi-instruction tasks. We propose a novel Multi-level Aware Preference Learning (MAPL) framework, capable of enhancing multi-instruction capabilities.
arXiv Detail & Related papers (2025-05-19T08:33:11Z)
GSSF: Generalized Structural Sparse Function for Deep Cross-modal Metric Learning [51.677086019209554]
We propose a Generalized Structural Sparse Function (GSSF) to capture powerful relationships across modalities for pair-wise similarity learning. The distance metric delicately encapsulates two formats of diagonal and block-diagonal terms. Experiments on cross-modal and two extra uni-modal retrieval tasks have validated its superiority and flexibility.
arXiv Detail & Related papers (2024-10-20T03:45:50Z)
M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture [6.928469290518152]
We introduce the Joint-Embedding Predictive Architecture (JEPA) on the multimodal tasks. It converts the input embedding into the output embedding space by a predictor and then conducts the cross-modal alignment on the latent space. We show that M3-JEPA can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in both training and inference.
arXiv Detail & Related papers (2024-09-09T10:40:50Z)
DisCoM-KD: Cross-Modal Knowledge Distillation via Disentanglement Representation and Adversarial Learning [3.763772992906958]
Cross-modal knowledge distillation (CMKD) refers to the scenario in which a learning framework must handle training and test data that exhibit a modality mismatch. DisCoM-KD (Disentanglement-learning based Cross-Modal Knowledge Distillation) explicitly models different types of per-modality information.
arXiv Detail & Related papers (2024-08-05T13:44:15Z)
Deep Boosting Learning: A Brand-new Cooperative Approach for Image-Text Matching [53.05954114863596]
We propose a brand-new Deep Boosting Learning (DBL) algorithm for image-text matching. An anchor branch is first trained to provide insights into the data properties. A target branch is concurrently tasked with more adaptive margin constraints to further enlarge the relative distance between matched and unmatched samples.
arXiv Detail & Related papers (2024-04-28T08:44:28Z)
Correlation-Decoupled Knowledge Distillation for Multimodal Sentiment Analysis with Incomplete Modalities [16.69453837626083]
We propose a Correlation-decoupled Knowledge Distillation (CorrKD) framework for the Multimodal Sentiment Analysis (MSA) task under uncertain missing modalities. We present a sample-level contrastive distillation mechanism that transfers comprehensive knowledge containing cross-sample correlations to reconstruct missing semantics. We design a response-disentangled consistency distillation strategy to optimize the sentiment decision boundaries of the student network.
arXiv Detail & Related papers (2024-04-25T09:35:09Z)
Unleashing Network Potentials for Semantic Scene Completion [50.95486458217653]
This paper proposes a novel SSC framework, the Adversarial Modality Modulation Network (AMMNet). AMMNet introduces two core modules: a cross-modal modulation enabling the interdependence of gradient flows between modalities, and a customized adversarial training scheme leveraging dynamic gradient competition. Extensive experimental results demonstrate that AMMNet outperforms state-of-the-art SSC methods by a large margin.
arXiv Detail & Related papers (2024-03-12T11:48:49Z)
A Dimensional Structure based Knowledge Distillation Method for Cross-Modal Learning [15.544134849816528]
We discover the correlation between feature discriminability and dimensional structure (DS) by analyzing and observing features extracted from simple and hard tasks. We propose a novel cross-modal knowledge distillation (CMKD) method for better supervised cross-modal learning (CML) performance. The proposed method enforces output features to be channel-wise independent and intermediate ones to be uniformly distributed, thereby learning semantically irrelevant features from the hard task to boost its accuracy.
arXiv Detail & Related papers (2023-06-28T07:29:26Z)
CMD: Self-supervised 3D Action Representation Learning with Cross-modal Mutual Distillation [130.08432609780374]
In 3D action recognition, there exists rich complementary information between skeleton modalities. We propose a new Cross-modal Mutual Distillation (CMD) framework. Our approach outperforms existing self-supervised methods and sets a series of new records.
arXiv Detail & Related papers (2022-08-26T06:06:09Z)

This list is automatically generated from the titles and abstracts of the papers on this site.