Related papers: U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning

U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning

URL: http://arxiv.org/abs/2501.17823v1
Date: Wed, 29 Jan 2025 18:15:49 GMT
Title: U2A: Unified Unimodal Adaptation for Robust and Efficient Multimodal Learning
Authors: Md Kaykobad Reza, Niki Nezakati, Ameya Patil, Mashhour Solh, M. Salman Asif,
Abstract summary: We present Unified Unimodal Adaptation (U2A), which jointly fine-tunes unimodal encoders using low-rank adaptation (LoRA) for various multimodal tasks.<n>Our method significantly reduces the number of learnable parameters and eliminates the need for complex training strategies.<n>Mask Tokens (MT) generate missing modality features from available modalities using a single token per modality.
Score: 10.909746391230206
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multimodal learning often relies on designing new models and complex training strategies to achieve optimal performance. We present Unified Unimodal Adaptation (U2A), which jointly fine-tunes pretrained unimodal encoders using low-rank adaptation (LoRA) for various multimodal tasks. Our method significantly reduces the number of learnable parameters and eliminates the need for complex training strategies, such as alternating training, gradient modifications, or unimodal fine-tuning. To address missing modalities during both training and testing, we introduce Mask Tokens (MT), which generate missing modality features from available modalities using a single token per modality. This simplifies the process, removing the need for specialized feature estimation or prompt-tuning methods. Our evaluation demonstrates that U2A matches or outperforms state-of-the-art methods in both complete and missing modality settings, showcasing strong performance and robustness across various modalities, tasks, and datasets. We also analyze and report the effectiveness of Mask Tokens in different missing modality scenarios. Overall, our method provides a robust, flexible, and efficient solution for multimodal learning, with minimal computational overhead.

Related papers

LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities. We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities. PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
MMP: Towards Robust Multi-Modal Learning with Masked Modality Projection [10.909746391230206]
Multimodal learning seeks to combine data from multiple input sources to enhance the performance of downstream tasks. Existing methods that can handle missing modalities involve custom training or adaptation steps for each input modality combination. We propose Masked Modality Projection (MMP), a method designed to train a single model that is robust to any missing modality scenario.
arXiv Detail & Related papers (2024-10-03T21:41:12Z)
Modality Invariant Multimodal Learning to Handle Missing Modalities: A Single-Branch Approach [29.428067329993173]
We propose a modality invariant multimodal learning method, which is less susceptible to the impact of missing modalities. It consists of a single-branch network sharing weights across multiple modalities to learn inter-modality representations to maximize performance. Our proposed method achieves superior performance when all modalities are present as well as in the case of missing modalities during training or testing compared to the existing state-of-the-art methods.
arXiv Detail & Related papers (2024-08-14T10:32:16Z)
Multi-modal Crowd Counting via a Broker Modality [64.5356816448361]
Multi-modal crowd counting involves estimating crowd density from both visual and thermal/depth images. We propose a novel approach by introducing an auxiliary broker modality and frame the task as a triple-modal learning problem. We devise a fusion-based method to generate this broker modality, leveraging a non-diffusion, lightweight counterpart of modern denoising diffusion-based fusion models.
arXiv Detail & Related papers (2024-07-10T10:13:11Z)
Exploring Missing Modality in Multimodal Egocentric Datasets [89.76463983679058]
We introduce a novel concept -Missing Modality Token (MMT)-to maintain performance even when modalities are absent. Our method mitigates the performance loss, reducing it from its original $sim 30%$ drop to only $sim 10%$ when half of the test set is modal-incomplete.
arXiv Detail & Related papers (2024-01-21T11:55:42Z)
Multimodal Federated Learning with Missing Modality via Prototype Mask and Contrast [23.936677199734213]
In this paper, we introduce a prototype library into the FedAvg-based Federated Learning framework. The proposed method utilizes prototypes as masks representing missing modalities to formulate a task-calibrated training loss and a model-agnostic uni-modality inference strategy. Compared to the baselines, our method improved inference accuracy by 3.7% with 50% modality missing during training and by 23.8% during uni-modality inference.
arXiv Detail & Related papers (2023-12-21T00:55:12Z)
Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning. MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process. It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities. Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z)
Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding. We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL. UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
Learning Unseen Modality Interaction [54.23533023883659]
Multimodal learning assumes all modality combinations of interest are available during training to learn cross-modal correspondences. We pose the problem of unseen modality interaction and introduce a first solution. It exploits a module that projects the multidimensional features of different modalities into a common space with rich information preserved.
arXiv Detail & Related papers (2023-06-22T10:53:10Z)
Missing Modality Robustness in Semi-Supervised Multi-Modal Semantic Segmentation [27.23513712371972]
We propose a simple yet efficient multi-modal fusion mechanism Linear Fusion. We also propose M3L: Multi-modal Teacher for Masked Modality Learning. Our proposal shows an absolute improvement of up to 10% on robust mIoU above the most competitive baselines.
arXiv Detail & Related papers (2023-04-21T05:52:50Z)
Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era. We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z)
Towards Good Practices for Missing Modality Robust Action Recognition [20.26021126604409]
This paper seeks a set of good practices for multi-modal action recognition. We study how to effectively regularize the model during training. Second, we investigate on fusion methods for robustness to missing modalities. Third, we propose a simple modular network, ActionMAE, which learns missing modality predictive coding.
arXiv Detail & Related papers (2022-11-25T06:10:57Z)

This list is automatically generated from the titles and abstracts of the papers in this site.