Related papers: Distilling Cross-Modal Knowledge via Feature Disentanglement

Distilling Cross-Modal Knowledge via Feature Disentanglement

URL: http://arxiv.org/abs/2511.19887v1
Date: Tue, 25 Nov 2025 03:45:37 GMT
Title: Distilling Cross-Modal Knowledge via Feature Disentanglement
Authors: Junhong Liu, Yuan Zhang, Tao Huang, Wenchao Xu, Renyu Yang,
Abstract summary: We propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities.<n>We observed that low-frequency features exhibit high consistency across different modalities, whereas high-frequency features demonstrate extremely low cross-modal similarity.<n>Our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches.
Score: 19.981536371167852
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Knowledge distillation (KD) has proven highly effective for compressing large models and enhancing the performance of smaller ones. However, its effectiveness diminishes in cross-modal scenarios, such as vision-to-language distillation, where inconsistencies in representation across modalities lead to difficult knowledge transfer. To address this challenge, we propose frequency-decoupled cross-modal knowledge distillation, a method designed to decouple and balance knowledge transfer across modalities by leveraging frequency-domain features. We observed that low-frequency features exhibit high consistency across different modalities, whereas high-frequency features demonstrate extremely low cross-modal similarity. Accordingly, we apply distinct losses to these features: enforcing strong alignment in the low-frequency domain and introducing relaxed alignment for high-frequency features. We also propose a scale consistency loss to address distributional shifts between modalities, and employ a shared classifier to unify feature spaces. Extensive experiments across multiple benchmark datasets show our method substantially outperforms traditional KD and state-of-the-art cross-modal KD approaches. Code is available at https://github.com/Johumliu/FD-CMKD.

Related papers

Test-time Adaptive Hierarchical Co-enhanced Denoising Network for Reliable Multimodal Classification [55.56234913868664]
We propose Test-time Adaptive Hierarchical Co-enhanced Denoising Network (TAHCD) for reliable learning on multimodal data.<n>The proposed method achieves superior classification performance, robustness, and generalization compared with state-of-the-art reliable multimodal learning approaches.
arXiv Detail & Related papers (2026-01-12T03:14:12Z)
Overcoming Joint Intractability with Lossless Hierarchical Speculative Decoding [58.92526489742584]
We propose provably lossless.<n> verification method that significantly boosts the expected number of accepted tokens.<n>We show that HSD yields consistent improvements in acceptance rates across diverse model families and benchmarks.
arXiv Detail & Related papers (2026-01-09T11:10:29Z)
Decoupled Audio-Visual Dataset Distillation [44.63243875072762]
We propose DAVDD, a pretraining-based decoupled audio-visual distillation framework.<n>To address these challenges, we propose DAVDD, a pretraining-based decoupled audio-visual distillation framework.
arXiv Detail & Related papers (2025-11-22T02:36:50Z)
Modest-Align: Data-Efficient Alignment for Vision-Language Models [67.48633659305592]
Cross-modal alignment models often suffer from overconfidence and degraded performance when operating in resource-constrained settings.<n>We propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency.<n>Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
arXiv Detail & Related papers (2025-10-24T16:11:10Z)
Align Your Tangent: Training Better Consistency Models via Manifold-Aligned Tangents [55.43139356528315]
Consistency Models (CMs) are trained to be consistent on flow ordinary differential equation trajectories.<n>CMs typically require prolonged training with large batch sizes to obtain competitive sample quality.<n>We propose a new loss function, called the manifold feature distance (MFD), which provides manifold-aligned tangents that point toward the data manifold.
arXiv Detail & Related papers (2025-10-01T08:35:18Z)
FiGKD: Fine-Grained Knowledge Distillation via High-Frequency Detail Transfer [0.0]
Fine-Grained Knowledge Distillation (FiGKD) is a frequency-aware framework that decomposes a model's logits into low-frequency (content) and high-frequency (detail) components.<n>FiGKD consistently outperforms state-of-the-art logit-based and feature-based distillation methods across a variety of teacher-student configurations.
arXiv Detail & Related papers (2025-05-17T08:27:02Z)
On Distilling the Displacement Knowledge for Few-Shot Class-Incremental Learning [17.819582979803286]
Few-shot Class-Incremental Learning (FSCIL) addresses the challenges of evolving data distributions and the difficulty of data acquisition in real-world scenarios.<n>To counteract the catastrophic forgetting typically encountered in FSCIL, knowledge distillation is employed as a way to maintain the knowledge from learned data distribution.
arXiv Detail & Related papers (2024-12-15T02:10:18Z)
Disentangled Noisy Correspondence Learning [56.06801962154915]
Cross-modal retrieval is crucial in understanding latent correspondences across modalities. DisNCL is a novel information-theoretic framework for feature Disentanglement in Noisy Correspondence Learning.
arXiv Detail & Related papers (2024-08-10T09:49:55Z)
Multi-Dimensional Refinement Graph Convolutional Network with Robust Decouple Loss for Fine-Grained Skeleton-Based Action Recognition [19.031036881780107]
We propose a flexible attention block called Channel-Variable Spatial-Temporal Attention (CVSTA) to enhance the discriminative power of spatial-temporal joints. Based on CVSTA, we construct a Multi-Dimensional Refinement Graph Convolutional Network (MDR-GCN), which can improve the discrimination among channel-, joint- and frame-level features. Furthermore, we propose a Robust Decouple Loss (RDL), which significantly boosts the effect of the CVSTA and reduces the impact of noise.
arXiv Detail & Related papers (2023-06-27T09:23:36Z)
Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model. HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
Deep Multimodal Fusion by Channel Exchanging [87.40768169300898]
This paper proposes a parameter-free multimodal fusion framework that dynamically exchanges channels between sub-networks of different modalities. The validity of such exchanging process is also guaranteed by sharing convolutional filters yet keeping separate BN layers across modalities, which, as an add-on benefit, allows our multimodal architecture to be almost as compact as a unimodal network.
arXiv Detail & Related papers (2020-11-10T09:53:20Z)

This list is automatically generated from the titles and abstracts of the papers in this site.