Improving Discriminative Multi-Modal Learning with Large-Scale
Pre-Trained Models
- URL: http://arxiv.org/abs/2310.05193v1
- Date: Sun, 8 Oct 2023 15:01:54 GMT
- Title: Improving Discriminative Multi-Modal Learning with Large-Scale
Pre-Trained Models
- Authors: Chenzhuang Du, Yue Zhao, Chonghua Liao, Jiacheng You, Jie Fu, Hang
Zhao
- Abstract summary: This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA).
- Score: 51.5543321122664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates how to better leverage large-scale pre-trained
uni-modal models to further enhance discriminative multi-modal learning. Even
when fine-tuned with only uni-modal data, these models can outperform previous
multi-modal models in certain tasks, so incorporating them into multi-modal
learning should significantly improve performance. However,
multi-modal learning with these models still suffers from insufficient learning
of uni-modal features, which weakens the resulting multi-modal model's
generalization ability. While fine-tuning uni-modal models separately and then
aggregating their predictions is straightforward, it does not allow adequate
adaptation between modalities and therefore yields sub-optimal results. To this
end, we introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA). By
freezing the weights of uni-modal fine-tuned models, adding extra trainable
rank decomposition matrices to them, and subsequently performing multi-modal
joint training, our method enhances adaptation between modalities and boosts
overall performance. We demonstrate the effectiveness of MMLoRA on three
dataset categories: audio-visual (e.g., AVE, Kinetics-Sound, CREMA-D),
vision-language (e.g., MM-IMDB, UPMC Food101), and RGB-Optical Flow (UCF101).
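
The abstract describes the MMLoRA recipe only at a high level. The snippet below is a minimal, illustrative PyTorch sketch of that recipe, not the authors' implementation: each uni-modal fine-tuned encoder is frozen, trainable low-rank decomposition matrices are attached to its linear layers, and only those matrices (plus an assumed late-fusion head) are updated during multi-modal joint training. The encoder architectures, dimensions, rank, and fusion head used here are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                      # keep the fine-tuned weights frozen
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: no change at start
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

def add_lora(module: nn.Module, r: int = 8):
    """Recursively wrap every nn.Linear in a module with a LoRA adapter."""
    for name, child in module.named_children():
        if isinstance(child, nn.Linear):
            setattr(module, name, LoRALinear(child, r=r))
        else:
            add_lora(child, r=r)

# Placeholder uni-modal encoders standing in for models already fine-tuned on their own modality.
audio_encoder = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 256))
visual_encoder = nn.Sequential(nn.Linear(512, 256), nn.ReLU(), nn.Linear(256, 256))
add_lora(audio_encoder)
add_lora(visual_encoder)

# An assumed late-fusion classifier over concatenated modality features.
fusion_head = nn.Linear(256 + 256, 10)

# Only the LoRA matrices and the fusion head are trainable during multi-modal joint training.
trainable = [p for m in (audio_encoder, visual_encoder, fusion_head)
             for p in m.parameters() if p.requires_grad]
optimizer = torch.optim.AdamW(trainable, lr=1e-4)

# One toy joint-training step on a random "paired" batch.
audio = torch.randn(4, 128)
video = torch.randn(4, 512)
labels = torch.randint(0, 10, (4,))
logits = fusion_head(torch.cat([audio_encoder(audio), visual_encoder(video)], dim=-1))
loss = nn.functional.cross_entropy(logits, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

In this setup only the rank-decomposition matrices and the fusion head receive gradients, so the joint stage touches a small fraction of the parameters while still letting the modalities adapt to each other, which is the stated motivation for MMLoRA.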
Related papers
- LLMs Can Evolve Continually on Modality for X-Modal Reasoning [62.2874638875554]
Existing methods rely heavily on modal-specific pretraining and joint-modal tuning, leading to significant computational burdens when expanding to new modalities.
We propose PathWeave, a flexible and scalable framework with modal-Path sWitching and ExpAnsion abilities.
PathWeave performs comparably to state-of-the-art MLLMs while concurrently reducing parameter training burdens by 98.73%.
arXiv Detail & Related papers (2024-10-26T13:19:57Z)
- On-the-fly Modulation for Balanced Multimodal Learning [53.616094855778954]
Multimodal learning is expected to boost model performance by integrating information from different modalities.
The widely-used joint training strategy leads to imbalanced and under-optimized uni-modal representations.
We propose On-the-fly Prediction Modulation (OPM) and On-the-fly Gradient Modulation (OGM) strategies to modulate the optimization of each modality.
arXiv Detail & Related papers (2024-10-15T13:15:50Z)
- MM-Lego: Modular Biomedical Multimodal Models with Minimal Fine-Tuning [10.774128925670183]
Multimodal Lego (MM-Lego) is a modular and general-purpose fusion and model merging framework.
We show that MM-Lego can be used as a model merging method with end-to-end fusion models without any fine-tuning.
It achieves state-of-the-art results on six benchmarked multimodal biomedical tasks.
arXiv Detail & Related papers (2024-05-30T11:14:01Z)
- Unified Multi-modal Unsupervised Representation Learning for Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z)
- What Makes for Robust Multi-Modal Models in the Face of Missing Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA).
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities.
arXiv Detail & Related papers (2023-10-10T07:47:57Z)
- On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z)
- Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models by augmenting Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning (a minimal illustrative sketch of this recipe appears after this list).
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
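
The eP-ALM entry above describes a concrete recipe: freeze the pre-trained backbones, train a single linear projection that maps visual features into the language model's embedding space, and prepend one trainable token. The snippet below is a rough, illustrative sketch of that recipe only. The backbones, names (vision_encoder, language_model, visual_projection, soft_token), dimensions, and the placeholder objective are all assumptions for illustration, not the eP-ALM codebase; the actual method uses a pre-trained vision encoder and a causal language model rather than the stand-ins here.

```python
import torch
import torch.nn as nn

# Placeholder backbones standing in for a pre-trained vision encoder and language model.
vision_encoder = nn.Sequential(nn.Linear(768, 768), nn.GELU())
language_model = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True), num_layers=2)
for p in list(vision_encoder.parameters()) + list(language_model.parameters()):
    p.requires_grad = False                      # the vast majority of parameters stay frozen

# The only trainable pieces: one linear projection and one learnable prepended token.
visual_projection = nn.Linear(768, 512)
soft_token = nn.Parameter(torch.zeros(1, 1, 512))
optimizer = torch.optim.AdamW(list(visual_projection.parameters()) + [soft_token], lr=1e-3)

# Toy forward pass: project visual features into the LM space and prepend the trainable token.
img_feats = torch.randn(2, 49, 768)              # stand-in for image patch features
text_embeds = torch.randn(2, 16, 512)            # stand-in for the LM's token embeddings
visual_tokens = visual_projection(vision_encoder(img_feats))        # (B, 49, 512)
prefix = soft_token.expand(img_feats.size(0), -1, -1)               # (B, 1, 512)
sequence = torch.cat([prefix, visual_tokens, text_embeds], dim=1)   # (B, 66, 512)
output = language_model(sequence)

loss = output.mean()   # placeholder objective; a real setup would use a captioning or VQA loss
optimizer.zero_grad()
loss.backward()
optimizer.step()
```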
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the generated summaries (including all information) and is not responsible for any consequences of their use.