Improving Discriminative Multi-Modal Learning with Large-Scale
Pre-Trained Models
- URL: http://arxiv.org/abs/2310.05193v1
- Date: Sun, 8 Oct 2023 15:01:54 GMT
- Title: Improving Discriminative Multi-Modal Learning with Large-Scale
Pre-Trained Models
- Authors: Chenzhuang Du, Yue Zhao, Chonghua Liao, Jiacheng You, Jie Fu, Hang
Zhao
- Abstract summary: This paper investigates how to better leverage large-scale pre-trained uni-modal models to enhance discriminative multi-modal learning.
We introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA)
- Score: 51.5543321122664
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper investigates how to better leverage large-scale pre-trained
uni-modal models to further enhance discriminative multi-modal learning. Even
when fine-tuned with only uni-modal data, these models can outperform previous
multi-modal models in certain tasks. It's clear that their incorporation into
multi-modal learning would significantly improve performance. However,
multi-modal learning with these models still suffers from insufficient learning
of uni-modal features, which weakens the resulting multi-modal model's
generalization ability. While fine-tuning uni-modal models separately and then
aggregating their predictions is straightforward, it doesn't allow for adequate
adaptation between modalities, also leading to sub-optimal results. To this
end, we introduce Multi-Modal Low-Rank Adaptation learning (MMLoRA). By
freezing the weights of uni-modal fine-tuned models, adding extra trainable
rank decomposition matrices to them, and subsequently performing multi-modal
joint training, our method enhances adaptation between modalities and boosts
overall performance. We demonstrate the effectiveness of MMLoRA on three
dataset categories: audio-visual (e.g., AVE, Kinetics-Sound, CREMA-D),
vision-language (e.g., MM-IMDB, UPMC Food101), and RGB-Optical Flow (UCF101).
Related papers
- MM-Lego: Modular Biomedical Multimodal Models with Minimal Fine-Tuning [10.774128925670183]
Multimodal Lego (MM-Lego) is a modular and general-purpose fusion and model merging framework.
We show that MM-Lego can be used as a model merging method with end-to-end fusion models without any fine-tuning.
It achieves state-of-the-art results on six benchmarked multimodal biomedical tasks.
arXiv Detail & Related papers (2024-05-30T11:14:01Z) - EMR-Merging: Tuning-Free High-Performance Model Merging [55.03509900949149]
We show that Elect, Mask & Rescale-Merging (EMR-Merging) shows outstanding performance compared to existing merging methods.
EMR-Merging is tuning-free, thus requiring no data availability or any additional training while showing impressive performance.
arXiv Detail & Related papers (2024-05-23T05:25:45Z) - Multimodal Representation Learning by Alternating Unimodal Adaptation [73.15829571740866]
We propose MLA (Multimodal Learning with Alternating Unimodal Adaptation) to overcome challenges where some modalities appear more dominant than others during multimodal learning.
MLA reframes the conventional joint multimodal learning process by transforming it into an alternating unimodal learning process.
It captures cross-modal interactions through a shared head, which undergoes continuous optimization across different modalities.
Experiments are conducted on five diverse datasets, encompassing scenarios with complete modalities and scenarios with missing modalities.
arXiv Detail & Related papers (2023-11-17T18:57:40Z) - Unlock the Power: Competitive Distillation for Multi-Modal Large
Language Models [17.25135606956287]
Competitive Multi-modal Distillation framework (CoMD) captures bidirectional feedback between teacher and student models.
Our experimental analysis of diverse datasets shows that our knowledge transfer method consistently improves the capabilities of the student model.
arXiv Detail & Related papers (2023-11-14T14:49:46Z) - Unified Multi-modal Unsupervised Representation Learning for
Skeleton-based Action Understanding [62.70450216120704]
Unsupervised pre-training has shown great success in skeleton-based action understanding.
We propose a Unified Multimodal Unsupervised Representation Learning framework, called UmURL.
UmURL exploits an efficient early-fusion strategy to jointly encode the multi-modal features in a single-stream manner.
arXiv Detail & Related papers (2023-11-06T13:56:57Z) - What Makes for Robust Multi-Modal Models in the Face of Missing
Modalities? [35.19295402483624]
We model the scenarios of multi-modal models encountering missing modalities from an information-theoretic perspective.
We introduce Uni-Modal Ensemble with Missing Modality Adaptation (UME-MMA)
UME-MMA employs uni-modal pre-trained weights for the multi-modal model to enhance feature extraction and utilizes missing modality data augmentation techniques to better adapt to situations with missing modalities.
arXiv Detail & Related papers (2023-10-10T07:47:57Z) - On Uni-Modal Feature Learning in Supervised Multi-Modal Learning [21.822251958013737]
We abstract the features (i.e. learned representations) of multi-modal data into 1) uni-modal features, which can be learned from uni-modal training, and 2) paired features, which can only be learned from cross-modal interactions.
We demonstrate that, under a simple guiding strategy, we can achieve comparable results to other complex late-fusion or intermediate-fusion methods on various multi-modal datasets.
arXiv Detail & Related papers (2023-05-02T07:15:10Z) - Efficient Multimodal Fusion via Interactive Prompting [62.08292938484994]
Large-scale pre-training has brought unimodal fields such as computer vision and natural language processing to a new era.
We propose an efficient and flexible multimodal fusion method, namely PMF, tailored for fusing unimodally pre-trained transformers.
arXiv Detail & Related papers (2023-04-13T07:31:51Z) - eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort to efficient adaptations of existing models, and propose to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.