Buffer replay enhances the robustness of multimodal learning under missing-modality
- URL: http://arxiv.org/abs/2511.23070v1
- Date: Fri, 28 Nov 2025 10:55:31 GMT
- Title: Buffer replay enhances the robustness of multimodal learning under missing-modality
- Authors: Hongye Zhu, Xuan Liu, Yanwen Ba, Jingye Xue, Shigeng Zhang
- Abstract summary: We introduce REplay Prompting (REP), which builds modality-wise buffers and replays them in deeper layers to mitigate information loss as network depth increases. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
- Score: 9.512378886218395
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Missing modalities consistently lead to significant performance degradation in multimodal models. Existing approaches either synthesize missing modalities at high computational cost or apply prompt-based fine-tuning that relies only on adjacent-layer features and overlooks long-distance contextual information, which may offer additional tolerance to errors when one or more modalities are missing. To address this, we introduce REplay Prompting (REP): (1) construct modality-wise feature buffers via a residual bypass to cache early-layer representations and replay them in deeper layers, mitigating information loss as network depth increases; (2) employ a private-shared feature decoupling strategy, where private buffers preserve modality-specific signals and shared buffers encode cross-modal semantics; and (3) design a task-aware dynamic initialization mechanism to configure these buffers differently, improving stability and generalization under diverse missing-modality conditions. Experiments on vision-language, vision-language-audio, and temporal multimodal benchmarks demonstrate that REP consistently outperforms prior methods under both single- and multi-modality missing scenarios, while introducing only negligible parameter overhead. These results establish REP as a lightweight and effective paradigm for robust multimodal learning in challenging missing-modality environments.
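The buffer-replay idea in point (1) of the abstract can be illustrated with a toy forward pass. This is a minimal sketch under stated assumptions, not the authors' implementation: the names `make_layer`, `forward_with_replay`, `cache_at`, `replay_from`, and `alpha` are hypothetical, the layers are simple elementwise maps rather than transformer blocks, and the additive replay rule is an assumption. The private-shared decoupling and task-aware initialization from points (2) and (3) are not modeled.

```python
from typing import Callable, List, Optional

Vector = List[float]
Layer = Callable[[Vector], Vector]


def make_layer(scale: float, shift: float) -> Layer:
    """Return a toy layer: elementwise affine map followed by ReLU."""
    def layer(x: Vector) -> Vector:
        return [max(0.0, scale * v + shift) for v in x]
    return layer


def forward_with_replay(
    x: Vector,
    layers: List[Layer],
    cache_at: int,      # hypothetical: index of the early layer whose output is cached
    replay_from: int,   # hypothetical: first layer index that receives the replayed buffer
    alpha: float,       # hypothetical: weight of the replayed features
) -> Vector:
    """Run the layer stack, caching one early output and re-injecting it later."""
    buffer: Optional[Vector] = None
    h = x
    for depth, layer in enumerate(layers):
        # Replay: add the cached early-layer features to the input of deeper layers.
        if buffer is not None and depth >= replay_from:
            h = [hv + alpha * bv for hv, bv in zip(h, buffer)]
        h = layer(h)
        # Cache: store a copy of the early representation (the "residual bypass").
        if depth == cache_at:
            buffer = list(h)
    return h


# Three pass-through layers (identity on non-negative inputs) for illustration.
layers = [make_layer(1.0, 0.0) for _ in range(3)]
with_replay = forward_with_replay([1.0, 2.0], layers, cache_at=0, replay_from=2, alpha=0.5)
without_replay = forward_with_replay([1.0, 2.0], layers, cache_at=0, replay_from=2, alpha=0.0)
```

Setting `alpha` to zero disables the replay path, which gives a simple way to compare the stack with and without the cached early-layer signal.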
Related papers
- Enhancing Foundation VLM Robustness to Missing Modality: Scalable Diffusion for Bi-directional Feature Restoration [40.720288165545476]
We introduce an enhanced diffusion model as a pluggable mid-stage training module to effectively restore missing features. Our strategy introduces two key innovations: (I) Dynamic Modality Gating, which adaptively leverages conditional features to steer the generation of semantically consistent features; (II) Cross-Modal Mutual Learning mechanism, which bridges the semantic spaces of dual encoders to achieve bidirectional alignment.
arXiv Detail & Related papers (2026-02-03T06:06:35Z)
- Dual-Stream Cross-Modal Representation Learning via Residual Semantic Decorrelation [5.272868130772015]
Cross-modal representations often suffer from modality dominance, redundant information coupling, and spurious cross-modal correlations. We propose a Dual-Stream Residual Semantic Decorrelation Network (DSRSD-Net) to disentangle modality-specific and modality-shared information.
arXiv Detail & Related papers (2025-12-08T14:01:16Z)
- UMCL: Unimodal-generated Multimodal Contrastive Learning for Cross-compression-rate Deepfake Detection [37.37926854174864]
In deepfake detection, the varying degrees of compression employed by social media platforms pose significant challenges for model generalization and reliability. We propose a novel Unimodal-generated Multimodal Contrastive Learning framework for cross-compression-rate deepfake detection. Our method achieves superior performance across various compression rates and manipulation types, establishing a new benchmark for robust deepfake detection.
arXiv Detail & Related papers (2025-11-24T10:56:22Z)
- I$^3$-MRec: Invariant Learning with Information Bottleneck for Incomplete Modality Recommendation [56.55935146424585]
We introduce I$^3$-MRec, which learns with Information bottleneck principle for Incomplete Modality Recommendation. By treating each modality as a distinct semantic environment, I$^3$-MRec employs invariant risk minimization (IRM) to learn preference-oriented representations. I$^3$-MRec consistently outperforms existing state-of-the-art MRS methods across various modality-missing scenarios.
arXiv Detail & Related papers (2025-08-06T09:29:50Z)
- Advancing Reliable Test-Time Adaptation of Vision-Language Models under Visual Variations [67.35596444651037]
Vision-language models (VLMs) exhibit remarkable zero-shot capabilities but struggle with distribution shifts in downstream tasks when labeled data is unavailable. We propose a Reliable Test-time Adaptation (ReTA) method that enhances reliability from two perspectives.
arXiv Detail & Related papers (2025-07-13T05:37:33Z)
- Synergistic Prompting for Robust Visual Recognition with Missing Modalities [13.821274074204082]
Large-scale multi-modal models have demonstrated remarkable performance across various visual recognition tasks. The presence of missing or incomplete modality inputs often leads to significant performance degradation. We propose a novel Synergistic Prompting framework for robust visual recognition with missing modalities.
arXiv Detail & Related papers (2025-07-10T14:28:12Z)
- FindRec: Stein-Guided Entropic Flow for Multi-Modal Sequential Recommendation [57.577843653775]
We propose FindRec (Flexible unified information disentanglement for multi-modal sequential Recommendation). A Stein kernel-based Integrated Information Coordination Module (IICM) theoretically guarantees distribution consistency between multimodal features and ID streams. A cross-modal expert routing mechanism adaptively filters and combines multimodal features based on their contextual relevance.
arXiv Detail & Related papers (2025-07-07T04:09:45Z)
- MoCa: Modality-aware Continual Pre-training Makes Better Bidirectional Multimodal Embeddings [75.0617088717528]
MoCa is a framework for transforming pre-trained VLM backbones into effective bidirectional embedding models. MoCa consistently improves performance across MMEB and ViDoRe-v2 benchmarks, achieving new state-of-the-art results.
arXiv Detail & Related papers (2025-06-29T06:41:00Z)
- Rethinking Explainability in the Era of Multimodal AI [9.57008593971486]
Multimodal AI systems have become ubiquitous and achieved remarkable performance across high-stakes applications. Most existing explainability techniques remain unimodal, generating modality-specific feature attributions, concepts, or circuit traces in isolation. This paper argues that such unimodal explanations systematically misrepresent and fail to capture the cross-modal influence that drives multimodal model decisions.
arXiv Detail & Related papers (2025-06-16T03:08:29Z)
- Enhancing Unimodal Latent Representations in Multimodal VAEs through Iterative Amortized Inference [20.761803725098005]
Multimodal variational autoencoders (VAEs) aim to capture shared latent representations by integrating information from different data modalities.
A significant challenge is accurately inferring representations from any subset of modalities without training an impractical number of inference networks for all possible modality combinations.
We introduce multimodal iterative amortized inference, an iterative refinement mechanism within the multimodal VAE framework.
arXiv Detail & Related papers (2024-10-15T08:49:38Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- Exploiting modality-invariant feature for robust multimodal emotion recognition with missing modalities [76.08541852988536]
We propose to use invariant features for a missing modality imagination network (IF-MMIN).
We show that the proposed model outperforms all baselines and invariantly improves the overall emotion recognition performance under uncertain missing-modality conditions.
arXiv Detail & Related papers (2022-10-27T12:16:25Z)