Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
- URL: http://arxiv.org/abs/2505.14705v1
- Date: Fri, 16 May 2025 03:00:56 GMT
- Title: Beyond Modality Collapse: Representations Blending for Multimodal Dataset Distillation
- Authors: Xin Zhang, Ziruo Zhang, Jiawei Du, Zuozhu Liu, Joey Tianyi Zhou
- Abstract summary: We introduce RepBlend, a novel MDD framework that weakens overdominant cross-modal supervision via representation blending. Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods.
- Score: 41.00894254469267
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Multimodal Dataset Distillation (MDD) seeks to condense large-scale image-text datasets into compact surrogates while retaining their effectiveness for cross-modal learning. Despite recent progress, existing MDD approaches often suffer from Modality Collapse, characterized by over-concentrated intra-modal representations and an enlarged distributional gap across modalities. In this paper, for the first time, we identify this issue as stemming from a fundamental conflict between the over-compression behavior inherent in dataset distillation and the cross-modal supervision imposed by contrastive objectives. To alleviate modality collapse, we introduce RepBlend, a novel MDD framework that weakens overdominant cross-modal supervision via representation blending, thereby significantly enhancing intra-modal diversity. Additionally, we observe that current MDD methods impose asymmetric supervision across modalities, resulting in biased optimization. To address this, we propose symmetric projection trajectory matching, which synchronizes the optimization dynamics using modality-specific projection heads, thereby promoting balanced supervision and enhancing cross-modal alignment. Experiments on Flickr-30K and MS-COCO show that RepBlend consistently outperforms prior state-of-the-art MDD methods, achieving significant gains in retrieval performance (e.g., +9.4 IR@10, +6.3 TR@10 under the 100-pair setting) and offering up to 6.7× distillation speedup.
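To make the abstract's central mechanism concrete, here is a minimal sketch of how representation blending could soften cross-modal supervision in a contrastive objective. It is an illustration inferred from the abstract alone, not the authors' code; the function name `blended_contrastive_loss`, the blending coefficient `alpha`, and the InfoNCE-style loss are all assumptions.

```python
import torch
import torch.nn.functional as F

def blended_contrastive_loss(img_emb, txt_emb, alpha=0.5, tau=0.07):
    """Contrastive loss with blended targets (hypothetical sketch).

    Rather than pulling each image embedding only toward its paired text
    embedding (pure cross-modal supervision), each anchor is matched
    against a blend of the two modalities, weakening the cross-modal term
    and leaving room for intra-modal diversity.
    """
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)

    # Blend paired representations; alpha=1 recovers pure cross-modal targets.
    blend = F.normalize(alpha * txt_emb + (1.0 - alpha) * img_emb, dim=-1)

    logits = img_emb @ blend.t() / tau  # (B, B) similarity matrix
    labels = torch.arange(img_emb.size(0), device=img_emb.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

# Toy usage on random embeddings
print(blended_contrastive_loss(torch.randn(8, 256), torch.randn(8, 256)).item())
```

With alpha = 1 this reduces to a standard CLIP-style loss, so the blend can be read as interpolating between cross-modal and intra-modal supervision; the paper's symmetric projection trajectory matching is a separate component and is not sketched here.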
Related papers
- Plug, Play, and Fortify: A Low-Cost Module for Robust Multimodal Image Understanding Models [6.350443894942629]
The Multimodal Weight Allocation Module (MWAM) is a plug-and-play component that dynamically re-balances the contribution of each branch during training.
MWAM delivers consistent performance gains across a wide range of tasks and modality combinations.
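As a rough sketch of what such a weight-allocation component could look like (a hedged illustration only; the class name, gating design, and dimensions below are assumptions, not the paper's implementation):

```python
import torch
import torch.nn as nn

class WeightAllocation(nn.Module):
    """Hypothetical plug-and-play module that learns per-sample softmax
    weights to re-balance the contribution of each modality branch."""

    def __init__(self, dim, num_branches):
        super().__init__()
        self.scorer = nn.Linear(dim * num_branches, num_branches)

    def forward(self, branch_feats):  # list of (B, dim) tensors, one per branch
        concat = torch.cat(branch_feats, dim=-1)
        weights = torch.softmax(self.scorer(concat), dim=-1)  # (B, K)
        stacked = torch.stack(branch_feats, dim=1)            # (B, K, dim)
        return (weights.unsqueeze(-1) * stacked).sum(dim=1)   # (B, dim)

# Toy usage: fuse three modality branches of dimension 128
fused = WeightAllocation(dim=128, num_branches=3)(
    [torch.randn(4, 128) for _ in range(3)])
print(fused.shape)  # torch.Size([4, 128])
```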
arXiv Detail & Related papers (2026-02-26T05:51:41Z)
- Modality-Specific Enhancement and Complementary Fusion for Semi-Supervised Multi-Modal Brain Tumor Segmentation [6.302779966909783]
We propose a novel semi-supervised multi-modal framework for medical image segmentation.
We introduce a Modality-specific Enhancing Module (MEM) to strengthen semantic cues unique to each modality.
We also introduce a learnable Complementary Information Fusion (CIF) module to adaptively exchange complementary knowledge between modalities.
arXiv Detail & Related papers (2025-12-10T16:15:17Z)
- Modality-Balanced Collaborative Distillation for Multi-Modal Domain Generalization [72.83292830785336]
Weight Averaging (WA) has emerged as a powerful technique for enhancing generalization by promoting convergence to a flat loss landscape.
We propose MBCD, a unified collaborative distillation framework that retains WA's flatness-inducing advantages while overcoming its shortcomings in multi-modal contexts.
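For context, plain Weight Averaging over checkpoints, the baseline this summary refers to (MBCD itself is not reproduced here), can be written as:

```python
import copy
import torch

def average_weights(models):
    """Uniformly average the parameters of same-architecture models
    (plain Weight Averaging; MBCD builds on this idea but differs)."""
    avg = copy.deepcopy(models[0])
    with torch.no_grad():
        for name, param in avg.named_parameters():
            stacked = torch.stack([dict(m.named_parameters())[name].data
                                   for m in models])
            param.copy_(stacked.mean(dim=0))
    return avg

# Toy usage: average three randomly initialized linear layers
models = [torch.nn.Linear(16, 4) for _ in range(3)]
averaged = average_weights(models)
```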
arXiv Detail & Related papers (2025-11-25T12:38:28Z)
- DMAF-Net: An Effective Modality Rebalancing Framework for Incomplete Multi-Modal Medical Image Segmentation [7.441945494253697]
We propose a novel model, named Dynamic Modality-Aware Fusion Network (DMAF-Net).
First, it introduces a Dynamic Modality-Aware Fusion (DMAF) module to suppress missing-modality interference.
Second, it designs a synergistic Relation Distillation and Prototype Distillation framework to enforce global-local feature alignment.
Third, it presents a Dynamic Training Monitoring (DTM) strategy to stabilize optimization under imbalanced missing rates.
arXiv Detail & Related papers (2025-06-13T11:38:18Z)
- Mitigating Modality Bias in Multi-modal Entity Alignment from a Causal Perspective [15.239882327601016]
We propose a counterfactual debiasing framework for MMEA, termed CDMEA, which investigates visual modality bias from a causal perspective.
Our approach aims to leverage both visual and graph modalities to enhance MMEA while suppressing the direct causal effect of the visual modality on model predictions.
arXiv Detail & Related papers (2025-04-28T03:48:23Z)
- Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations [2.992602379681373]
We show that multi-modal fine-tuning can achieve notable OoDD performance.
We propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data.
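A regularizer of that flavor might simply add a pairwise-distance penalty on top of a CLIP-style contrastive term; the sketch below is an assumption based on the summary (the weighting `lam` and temperature `tau` are illustrative, not the paper's exact objective):

```python
import torch
import torch.nn.functional as F

def alignment_regularized_loss(img_emb, txt_emb, lam=0.1, tau=0.07):
    """CLIP-style contrastive loss plus a term that explicitly shrinks the
    distance between paired image/text embeddings of in-distribution data."""
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau
    labels = torch.arange(img.size(0), device=img.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels)
                         + F.cross_entropy(logits.t(), labels))
    align = (img - txt).pow(2).sum(dim=-1).mean()  # squared distance of pairs
    return contrastive + lam * align

# Toy usage on random embeddings
loss = alignment_regularized_loss(torch.randn(8, 512), torch.randn(8, 512))
```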
arXiv Detail & Related papers (2025-03-24T16:00:21Z)
- Taming Flow Matching with Unbalanced Optimal Transport into Fast Pansharpening [10.23957420290553]
We propose the Optimal Transport Flow Matching (OTFM) framework to achieve one-step, high-quality pansharpening.
The OTFM framework enables simulation-free training and single-step inference while maintaining strict adherence to pansharpening constraints.
arXiv Detail & Related papers (2025-03-19T08:10:49Z)
- Adaptive Illumination-Invariant Synergistic Feature Integration in a Stratified Granular Framework for Visible-Infrared Re-Identification [18.221111822542024]
Visible-Infrared Person Re-Identification (VI-ReID) plays a crucial role in applications such as search and rescue, infrastructure protection, and nighttime surveillance.
We propose AMINet, an Adaptive Modality Interaction Network.
AMINet employs multi-granularity feature extraction to capture comprehensive identity attributes from both full-body and upper-body images.
arXiv Detail & Related papers (2025-02-28T15:42:58Z)
- MITA: Bridging the Gap between Model and Data for Test-time Adaptation [68.62509948690698]
Test-Time Adaptation (TTA) has emerged as a promising paradigm for enhancing the generalizability of models.
We propose MITA, a Meet-In-The-Middle approach that introduces energy-based optimization to encourage mutual adaptation of the model and data from opposing directions.
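Read literally, "mutual adaptation from opposing directions" suggests descending a shared energy with respect to both the test input and the model parameters. The sketch below is a loose interpretation, not MITA itself; the choice of prediction entropy as the energy and both step sizes are assumptions.

```python
import torch
import torch.nn.functional as F

def mutual_adapt_step(model, x, lr_x=0.01, lr_theta=1e-4):
    """One illustrative 'meet-in-the-middle' step: nudge both the test
    input and the model parameters down a shared energy (here, the
    entropy of the model's prediction). Loose sketch, not MITA."""
    x = x.clone().requires_grad_(True)
    probs = F.softmax(model(x), dim=-1)
    energy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1).mean()
    grads = torch.autograd.grad(energy, [x, *model.parameters()])
    with torch.no_grad():
        x_adapted = x - lr_x * grads[0]          # adapt the data toward the model
        for p, g in zip(model.parameters(), grads[1:]):
            p -= lr_theta * g                    # adapt the model toward the data
    return x_adapted.detach()

# Toy usage with a linear classifier
model = torch.nn.Linear(32, 10)
x_new = mutual_adapt_step(model, torch.randn(4, 32))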
arXiv Detail & Related papers (2024-10-12T07:02:33Z)
- BlindDiff: Empowering Degradation Modelling in Diffusion Models for Blind Image Super-Resolution [52.47005445345593]
BlindDiff is a DM-based blind SR method that tackles blind degradation settings in SISR.
BlindDiff seamlessly integrates MAP-based optimization into DMs.
Experiments on both synthetic and real-world datasets show that BlindDiff achieves state-of-the-art performance.
arXiv Detail & Related papers (2024-03-15T11:21:34Z)
- Ensemble Modeling for Multimodal Visual Action Recognition [50.38638300332429]
We propose an ensemble modeling approach for multimodal action recognition.
We independently train individual modality models using a variant of focal loss tailored to handle the long-tailed distribution of the MECCANO [21] dataset.
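The focal loss they adapt is shown here in its standard multi-class form (their MECCANO-specific variant is not reproduced); it down-weights easy examples so the long tail contributes more gradient:

```python
import torch
import torch.nn.functional as F

def focal_loss(logits, targets, gamma=2.0):
    """Standard multi-class focal loss: (1 - p_t)^gamma * cross-entropy."""
    log_p = F.log_softmax(logits, dim=-1)
    log_pt = log_p.gather(1, targets.unsqueeze(1)).squeeze(1)  # log-prob of true class
    pt = log_pt.exp()
    return (-(1.0 - pt) ** gamma * log_pt).mean()

# Toy usage: 4 samples, 10 classes
loss = focal_loss(torch.randn(4, 10), torch.randint(0, 10, (4,)))
```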
arXiv Detail & Related papers (2023-08-10T08:43:20Z)
- Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition [69.32305810128994]
Incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition.
We propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model.
HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.
arXiv Detail & Related papers (2023-05-23T01:24:15Z)
- Multi-Modal Mutual Information Maximization: A Novel Approach for Unsupervised Deep Cross-Modal Hashing [73.29587731448345]
We propose a novel method, dubbed Cross-Modal Info-Max Hashing (CMIMH).
We learn informative representations that can preserve both intra- and inter-modal similarities.
The proposed method consistently outperforms other state-of-the-art cross-modal retrieval methods.
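Preserving intra- and inter-modal similarities typically amounts to making code similarities track a pairwise affinity matrix. The sketch below is a generic similarity-preservation term, not CMIMH's mutual-information objective; all names and the +/-1 affinity convention are assumptions.

```python
import torch

def similarity_preserving_loss(img_codes, txt_codes, affinity):
    """Generic similarity-preservation objective for cross-modal hashing.

    img_codes, txt_codes: (B, K) relaxed codes in [-1, 1] (e.g., tanh outputs)
    affinity: (B, B) matrix with +1 for similar pairs, -1 for dissimilar pairs
    """
    k = img_codes.size(1)
    inter = img_codes @ txt_codes.t() / k    # inter-modal code similarity
    intra_img = img_codes @ img_codes.t() / k
    intra_txt = txt_codes @ txt_codes.t() / k
    return sum(((s - affinity) ** 2).mean()
               for s in (inter, intra_img, intra_txt))

# Toy usage with random relaxed codes and a random +/-1 affinity matrix
b, k = 6, 32
img = torch.tanh(torch.randn(b, k))
txt = torch.tanh(torch.randn(b, k))
aff = (torch.rand(b, b) > 0.5).float() * 2 - 1
print(similarity_preserving_loss(img, txt, aff).item())
```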
arXiv Detail & Related papers (2021-12-13T08:58:03Z)