Related papers: Robust Multimodal Learning via Cross-Modal Proxy Tokens

URL: http://arxiv.org/abs/2501.17823v2
Date: Mon, 10 Mar 2025 01:34:24 GMT
Title: Robust Multimodal Learning via Cross-Modal Proxy Tokens
Authors: Md Kaykobad Reza, Ameya Patil, Mashhour Solh, M. Salman Asif
Abstract summary: Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. We propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality.
Score: 11.704477276235847
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal models often experience a significant performance drop when one or more modalities are missing during inference. To address this challenge, we propose a simple yet effective approach that enhances robustness to missing modalities while maintaining strong performance when all modalities are available. Our method introduces cross-modal proxy tokens (CMPTs), which approximate the class token of a missing modality by attending only to the tokens of the available modality. To efficiently learn the approximation for the missing modality via CMPTs with minimal computational overhead, we employ low-rank adapters in frozen unimodal encoders and jointly optimize an alignment loss with a task-specific loss. Extensive experiments on five multimodal datasets show that our method outperforms state-of-the-art baselines across various missing rates while achieving competitive results in complete-modality settings. Overall, our method offers a flexible and efficient solution for robust multimodal learning. The code and pretrained models will be released on GitHub.
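The proxy-token idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single attention head with no learned projections, a mean-squared-error alignment loss, and a weighting factor `lam`, none of which are specified in the abstract. The names `proxy_token`, `alignment_loss`, and `total_loss` are hypothetical. A query vector attends only to the tokens of the available modality, and the result is pulled toward the class token of the other modality during training.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def proxy_token(query, tokens):
    """Attend with one query over the available modality's tokens.

    Assumption: plain scaled dot-product attention, no learned projections.
    """
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, t) / scale for t in tokens])
    dim = len(tokens[0])
    return [sum(w * t[i] for w, t in zip(weights, tokens)) for i in range(dim)]

def alignment_loss(proxy, target_cls):
    """Assumed form: mean squared error between the proxy and the real class token."""
    return sum((p - t) ** 2 for p, t in zip(proxy, target_cls)) / len(proxy)

def total_loss(task_loss, align_loss, lam=0.5):
    """Joint objective: task loss plus weighted alignment loss (lam is an assumption)."""
    return task_loss + lam * align_loss

# Toy example: two image tokens are available; the text class token is the target.
image_tokens = [[1.0, 0.0], [0.0, 1.0]]
text_cls = [1.0, 0.0]
proxy = proxy_token([0.0, 0.0], image_tokens)  # zero query gives uniform attention
print(proxy, alignment_loss(proxy, text_cls))
```

At inference, when the text modality is missing, the proxy would stand in for the text class token. In this sketch only the query would need to be learned, which is loosely in keeping with the abstract's emphasis on minimal computational overhead.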

Related papers

LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities. PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection [10.909746391230206]
Multimodal learning seeks to combine data from multiple input sources to enhance the performance of downstream tasks. Existing methods that can handle missing modalities involve custom training or adaptation steps for each input modality combination. We propose Masked Modality Projection (MMP), a method designed to train a single model that is robust to any missing modality scenario.
arXiv Detail & Related papers (2024-10-03T21:41:12Z)
Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach [29.428067329993173]
We propose a modality invariant multimodal learning method, which is less susceptible to the impact of missing modalities. It consists of a single-branch network sharing weights across multiple modalities to learn inter-modality representations to maximize performance. Our proposed method achieves superior performance when all modalities are present as well as in the case of missing modalities during training or testing compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-14T10:32:16Z)
Multi-modal Crowd Counting via a Broker Modality [64.5356816448361]
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. We propose a novel approach that introduces an auxiliary broker modality and frames the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models.
arXiv Detail & Related papers (2024-07-10T10:13:11Z)
Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept, the Missing Modality Token (MMT), to maintain performance even when modalities are absent. Our method mitigates the performance loss, reducing the drop from the original $\sim 30\%$ to only $\sim 10\%$ when half of the test set is modal-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z)
Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast [23.936677199734213]
In this paper, we introduce a prototype library into the FedAvg-based Federated Learning framework. The proposed method utilizes prototypes as masks representing missing modalities to formulate a task-calibrated training loss and a model-agnostic uni-modality inference strategy. Compared to the baselines, our method improved inference accuracy by 3.7% with 50% modality missing during training and by 23.8% during uni-modality inference.
arXiv Detail & Related papers (2023-12-21T00:55:12Z)
Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning. MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process. It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities. Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation [27.23513712371972]
We propose a simple yet efficient multi-modal fusion mechanism, Linear Fusion. We also propose M3L: Multi-modal Teacher for Masked Modality Learning. Our proposal shows an absolute improvement of up to 10% on robust mIoU above the most competitive baselines.
arXiv Detail & Related papers (2023-04-21T05:52:50Z)
Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era. We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z)
Towards Good Practices for Missing Modality Robust Action Recognition [20.26021126604409]
This paper seeks a set of good practices for multi-modal action recognition. First, we study how to effectively regularize the model during training. Second, we investigate fusion methods for robustness to missing modalities. Third, we propose a simple modular network, ActionMAE, which learns missing modality predictive coding.
arXiv Detail & Related papers (2022-11-25T06:10:57Z)

This list is automatically generated from the titles and abstracts of the papers on this site.