Related papers: Deep Correlated Prompting for Visual Recognition with Missing Modalities

Deep Correlated Prompting for Visual Recognition with Missing Modalities

URL: http://arxiv.org/abs/2410.06558v4
Date: Mon, 21 Oct 2024 14:11:54 GMT
Title: Deep Correlated Prompting for Visual Recognition with Missing Modalities
Authors: Lianyu Hu, Tongkai Shi, Wei Feng, Fanhua Shang, Liang Wan,
Abstract summary: Large-scale multimodal models have shown excellent performance over a series of tasks powered by the large corpus of paired multimodal training data. However, this simple assumption may not always hold in the real world due to privacy constraints or collection difficulty. We refer to prompt learning to adapt large pretrained multimodal models to handle missing-modality scenarios by regarding different missing cases as different types of input.
Score: 22.40271366031256
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Large-scale multimodal models have shown excellent performance over a series of tasks powered by the large corpus of paired multimodal training data. Generally, they are always assumed to receive modality-complete inputs. However, this simple assumption may not always hold in the real world due to privacy constraints or collection difficulty, where models pretrained on modality-complete data easily demonstrate degraded performance on missing-modality cases. To handle this issue, we refer to prompt learning to adapt large pretrained multimodal models to handle missing-modality scenarios by regarding different missing cases as different types of input. Instead of only prepending independent prompts to the intermediate layers, we present to leverage the correlations between prompts and input features and excavate the relationships between different layers of prompts to carefully design the instructions. We also incorporate the complementary semantics of different modalities to guide the prompting design for each modality. Extensive experiments on three commonly-used datasets consistently demonstrate the superiority of our method compared to the previous approaches upon different missing scenarios. Plentiful ablations are further given to show the generalizability and reliability of our method upon different modality-missing ratios and types.

Related papers

Calibrated Multimodal Representation Learning with Missing Modalities [100.55774771852468]
Multimodal representation learning harmonizes distinct modalities by aligning them into a unified latent space.<n>Recent research generalizes traditional cross-modal alignment to produce enhanced multimodal synergy but requires all modalities to be present for a common instance.<n>We provide theoretical insights into this issue from an anchor shift perspective.<n>We propose CalMRL for multimodal representation learning to calibrate incomplete alignments caused by missing modalities.
arXiv Detail & Related papers (2025-11-15T05:01:43Z)
Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models [63.032359320629105]
We introduce: Unpaired Multimodal, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them.<n>We show that using unpaired data from auxiliary modalities consistently improves downstream performance across diverse unimodal targets such as image and audio.
arXiv Detail & Related papers (2025-10-09T17:32:23Z)
Decoupled Multimodal Prototypes for Visual Recognition with Missing Modalities [3.88369051454137]
Multimodal learning enhances deep learning models by enabling them to perceive and understand information from multiple data modalities.<n>Most existing approaches assume the availability of all modalities, an assumption that often fails in real-world applications.<n>Recent works have introduced learnable missing-case-aware prompts to mitigate performance degradation caused by missing modalities.<n>We propose a novel decoupled prototype-based output head, which leverages missing-case-aware class-wise prototypes tailored for each individual modality.
arXiv Detail & Related papers (2025-05-13T06:53:37Z)
Efficient Prompting for Continual Adaptation to Missing Modalities [7.782217188939437]
We formulate the dynamic missing modality problem as a continual learning task. We introduce three types of prompts: modality-specific, task-aware, and task-specific prompts. These prompts enable the model to learn intra-modality, inter-modality, intra-task, and inter-task features.
arXiv Detail & Related papers (2025-03-01T15:09:37Z)
Missing Modality Prediction for Unpaired Multimodal Learning via Joint Embedding of Unimodal Models [6.610033827647869]
In real-world scenarios, consistently acquiring complete multimodal data presents significant challenges. This often leads to the issue of missing modalities, where data for certain modalities are absent. We propose a novel framework integrating parameter-efficient fine-tuning of unimodal pretrained models with a self-supervised joint-embedding learning method.
arXiv Detail & Related papers (2024-07-17T14:44:25Z)
Dealing with All-stage Missing Modality: Towards A Universal Model with Robust Reconstruction and Personalization [14.606035444283984]
Current approaches focus on developing models that handle modality-incomplete inputs during inference. We propose a robust universal model with modality reconstruction and model personalization. Our method has been extensively validated on two brain tumor segmentation benchmarks.
arXiv Detail & Related papers (2024-06-04T06:07:24Z)
Combating Missing Modalities in Egocentric Videos at Test Time [92.38662956154256]
Real-world applications often face challenges with incomplete modalities due to privacy concerns, efficiency needs, or hardware issues. We propose a novel approach to address this issue at test time without requiring retraining. MiDl represents the first self-supervised, online solution for handling missing modalities exclusively at test time.
arXiv Detail & Related papers (2024-04-23T16:01:33Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
Correlation Information Bottleneck: Towards Adapting Pretrained Multimodal Models for Robust Visual Question Answering [63.87200781247364]
Correlation Information Bottleneck (CIB) seeks a tradeoff between compression and redundancy in representations. We derive a tight theoretical upper bound for the mutual information between multimodal inputs and representations.
arXiv Detail & Related papers (2022-09-14T22:04:10Z)
On Modality Bias Recognition and Reduction [70.69194431713825]
We study the modality bias problem in the context of multi-modal classification. We propose a plug-and-play loss function method, whereby the feature space for each label is adaptively learned. Our method yields remarkable performance improvements compared with the baselines.
arXiv Detail & Related papers (2022-02-25T13:47:09Z)
Discriminative Multimodal Learning via Conditional Priors in Generative Models [21.166519800652047]
This research studies the realistic scenario in which all modalities and class labels are available for model training. We show, in this scenario, that the variational lower bound limits mutual information between joint representations and missing modalities.
arXiv Detail & Related papers (2021-10-09T17:22:24Z)
Relating by Contrasting: A Data-efficient Framework for Multimodal Generative Models [86.9292779620645]
We develop a contrastive framework for generative model learning, allowing us to train the model not just by the commonality between modalities, but by the distinction between "related" and "unrelated" multimodal data. Under our proposed framework, the generative model can accurately identify related samples from unrelated ones, making it possible to make use of the plentiful unlabeled, unpaired multimodal data.
arXiv Detail & Related papers (2020-07-02T15:08:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.