Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models
- URL: http://arxiv.org/abs/2404.12588v1
- Date: Fri, 19 Apr 2024 02:33:23 GMT
- Title: Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models
- Authors: Juncheng Yang, Zuchao Li, Shuai Xie, Weiping Zhu, Wei Yu, Shijun Li
- Abstract summary: This work introduces a cross-modal parameter-efficient approach named XMAdapter.
XMAdapter establishes cache models for both text and image modalities.
It then leverages retrieval over bimodal vision-language information to gather cues for inference.
- Score: 38.751158173278796
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Adapter-based parameter-efficient transfer learning has achieved exciting results in vision-language models. Traditional adapter methods often require training or fine-tuning, and face challenges such as insufficient samples or resource limitations. While some methods avoid training by leveraging an image-modality cache and retrieval, they overlook the importance of the text modality and of cross-modal cues for the efficient adaptation of parameters in vision-language models. This work introduces a cross-modal parameter-efficient approach named XMAdapter. XMAdapter establishes cache models for both the text and image modalities, then leverages retrieval over bimodal vision-language information to gather cues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling the similarities of the different modalities to assess their respective contributions. It also mines hard samples based on differences in cross-modal affinity and improves model performance by adaptively adjusting the learning intensity of those samples. Extensive experiments on benchmark datasets demonstrate that XMAdapter significantly outperforms previous adapter-based methods in accuracy, generalization, and efficiency.
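To make the cache-and-retrieval idea concrete, here is a minimal sketch in PyTorch. It assumes L2-normalized features, one-hot cache values, and an exponential affinity kernel in the style of cache-based methods such as Tip-Adapter; `alpha`, `beta`, and all tensor names are illustrative assumptions rather than the paper's exact formulation.

```python
import torch

def xmadapter_logits(query_img, img_keys, txt_keys, cache_values,
                     alpha=0.5, beta=5.5):
    """Fuse image- and text-modality cache retrievals into classification logits.

    query_img:    (B, D) L2-normalized image features of test samples
    img_keys:     (N, D) cached image features of the few-shot training set
    txt_keys:     (N, D) cached text features aligned with the same labels
    cache_values: (N, C) one-hot labels of the cached samples
    alpha:        affinity ratio balancing the two modalities
    beta:         sharpness of the exponential affinity kernel
    """
    sim_img = query_img @ img_keys.t()            # (B, N) image-image similarity
    sim_txt = query_img @ txt_keys.t()            # (B, N) image-text similarity
    aff_img = torch.exp(-beta * (1.0 - sim_img))  # exponential affinities
    aff_txt = torch.exp(-beta * (1.0 - sim_txt))
    affinity = alpha * aff_img + (1.0 - alpha) * aff_txt  # cross-modal fusion
    return affinity @ cache_values                # (B, C) cache logits
```

In the paper the affinity ratio is adjusted dynamically rather than fixed, and hard samples identified by cross-modal affinity gaps additionally reweight learning; both refinements are omitted here for brevity.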
Related papers
- Boosting Continual Learning of Vision-Language Models via Mixture-of-Experts Adapters [65.15700861265432]
We present a parameter-efficient continual learning framework to alleviate long-term forgetting in incremental learning with vision-language models.
Our approach involves the dynamic expansion of a pre-trained CLIP model, through the integration of Mixture-of-Experts (MoE) adapters.
To preserve the zero-shot recognition capability of vision-language models, we introduce a Distribution Discriminative Auto-Selector.
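As a rough illustration of the MoE-adapter idea, the sketch below routes frozen CLIP features through a small pool of bottleneck experts with a learned top-k router; the routing scheme, expert shapes, and all hyperparameters are assumptions for illustration, not the paper's design.

```python
import torch
import torch.nn as nn

class MoEAdapter(nn.Module):
    """Illustrative Mixture-of-Experts adapter over frozen backbone features."""
    def __init__(self, dim=512, n_experts=4, bottleneck=64, top_k=2):
        super().__init__()
        self.router = nn.Linear(dim, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, bottleneck), nn.GELU(),
                          nn.Linear(bottleneck, dim))
            for _ in range(n_experts)
        )
        self.top_k = top_k

    def forward(self, x):                         # x: (B, D) frozen CLIP features
        gates = self.router(x).softmax(dim=-1)    # (B, E) routing weights
        weights, idx = gates.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for k in range(self.top_k):               # dispatch to the selected experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        return x + out                            # residual adapter output
```

New experts can be appended as tasks arrive, which is what makes the expansion dynamic; a selector like the Distribution Discriminative Auto-Selector described above would then decide when to bypass the adapters to retain zero-shot behavior.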
arXiv Detail & Related papers (2024-03-18T08:00:23Z)
- Multimodal Pathway: Improve Transformers with Irrelevant Data from Other Modalities [56.666806962214565]
We propose to improve transformers of a specific modality with irrelevant data from other modalities.
We use an auxiliary transformer trained with data of another modality and construct pathways to connect components of the two models.
We observe significant and consistent performance improvements with irrelevant data from other modalities.
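A hedged sketch of what such a pathway could look like for a single linear layer: the target layer's output is mixed with that of a frozen counterpart taken from an auxiliary transformer trained on another modality. The learnable mixing scalar and the merge rule are assumptions, not necessarily the paper's construction.

```python
import torch
import torch.nn as nn

class CrossModalLinear(nn.Module):
    """Mixes a target linear layer with its auxiliary-modality counterpart."""
    def __init__(self, target: nn.Linear, auxiliary: nn.Linear, lam=0.5):
        super().__init__()
        self.target = target                        # trained on the target modality
        self.auxiliary = auxiliary                  # trained on an unrelated modality
        self.auxiliary.requires_grad_(False)        # keep the auxiliary weights frozen
        self.lam = nn.Parameter(torch.tensor(lam))  # learnable mixing scalar

    def forward(self, x):
        # Equivalent to a linear layer with weight W_target + lam * W_aux
        return self.target(x) + self.lam * self.auxiliary(x)
```

Because the mix is linear, the two weight matrices can be folded into one after training, so a pathway of this form adds no inference cost.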
arXiv Detail & Related papers (2024-01-25T18:59:58Z)
- p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models [10.713680139939354]
Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks.
Parameter-efficient transfer learning (PETL) has garnered attention as a viable alternative to full fine-tuning.
We propose a new adapter architecture, $p$-adapter, which employs $p$-Laplacian message passing in Graph Neural Networks (GNNs).
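For intuition, here is a minimal sketch of one p-Laplacian smoothing step over node features; the graph construction, step size, and update rule are illustrative assumptions rather than the paper's exact adapter.

```python
import torch

def p_laplacian_step(x, adj, p=1.5, step=0.1, eps=1e-6):
    """One gradient-descent-style p-Laplacian smoothing step.

    x:   (N, D) node features (e.g., token embeddings treated as graph nodes)
    adj: (N, N) adjacency matrix with nonnegative edge weights
    p:   order of the p-Laplacian; p=2 recovers ordinary Laplacian smoothing
    """
    diff = x.unsqueeze(1) - x.unsqueeze(0)                # (N, N, D) pairwise f_i - f_j
    norm = diff.norm(dim=-1, keepdim=True).clamp_min(eps)
    # |f_i - f_j|^(p-2) reweights messages; p < 2 emphasizes similar neighbors
    msg = adj.unsqueeze(-1) * norm.pow(p - 2) * diff      # (N, N, D) weighted messages
    return x - step * msg.sum(dim=1)                      # descend the p-Dirichlet energy
```

The appeal for adapters is that p acts as a knob on how aggressively features are smoothed across the attention-induced graph.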
arXiv Detail & Related papers (2023-12-17T05:30:35Z)
- Prompt Tuning based Adapter for Vision-Language Model Adaption [38.576215369504446]
We introduce a new model, termed Prompt-Adapter, that combines pre-trained prompt tuning with an efficient adaptation network.
Our approach beats state-of-the-art methods in few-shot image classification on 11 public datasets.
Our proposed method demonstrates the promise of combining prompt tuning and parameter-efficient networks for efficient vision-language model adaptation.
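A toy sketch of the general recipe, combining learnable prompt tokens with a bottleneck adapter: prompts are prepended to the token sequence and the adapter refines the encoder output residually. The wiring and the `encoder` callable are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class PromptAdapterHead(nn.Module):
    """Learnable prompts plus a bottleneck adapter around a frozen encoder."""
    def __init__(self, dim=512, n_prompts=8, bottleneck=64):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        self.adapter = nn.Sequential(nn.Linear(dim, bottleneck), nn.ReLU(),
                                     nn.Linear(bottleneck, dim))

    def forward(self, tokens, encoder):              # tokens: (B, T, D)
        b = tokens.size(0)
        prompts = self.prompts.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([prompts, tokens], dim=1)      # prepend prompt tokens
        feats = encoder(x)                           # frozen pre-trained encoder
        return feats + self.adapter(feats)           # residual adapter refinement
```

Only the prompts and the adapter are trained, which keeps the tunable parameter count small.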
arXiv Detail & Related papers (2023-03-24T15:05:17Z)
- Cross-modal Audio-visual Co-learning for Text-independent Speaker Verification [55.624946113550195]
This paper proposes a cross-modal speech co-learning paradigm.
Two cross-modal boosters are introduced based on an audio-visual pseudo-siamese structure to learn the modality-transformed correlation.
Experimental results on the LRSLip3, GridLip, LomGridLip, and VoxLip datasets demonstrate that the proposed method achieves average relative performance improvements of 60% and 20%.
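A minimal sketch of a pseudo-siamese cross-modal booster: two structurally identical branches with separate weights map each modality into the other's space, and each embedding is boosted by its transformed counterpart. The wiring is illustrative only.

```python
import torch.nn as nn

class CrossModalBooster(nn.Module):
    """Pseudo-siamese boosting between audio and visual (lip) embeddings."""
    def __init__(self, dim=256):
        super().__init__()
        self.audio_to_visual = nn.Linear(dim, dim)
        self.visual_to_audio = nn.Linear(dim, dim)   # same structure, separate weights

    def forward(self, audio, visual):                # (B, D) embeddings per modality
        boosted_audio = audio + self.visual_to_audio(visual)
        boosted_visual = visual + self.audio_to_visual(audio)
        return boosted_audio, boosted_visual
```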
arXiv Detail & Related papers (2023-02-22T10:06:37Z)
- UniAdapter: Unified Parameter-Efficient Transfer Learning for Cross-modal Modeling [49.134517040512414]
This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on vision-language models.
Experiments show that UniAdapter not only outperforms the state of the art, but even beats the full fine-tuning strategy.
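One plausible reading of a "unified" adapter, sketched below: a single down-projection shared across modalities feeds modality-specific up-projections, so unimodal and cross-modal branches share most of their added parameters. Whether UniAdapter shares exactly this component is an assumption for illustration.

```python
import torch.nn as nn

class UniAdapterBlock(nn.Module):
    """Shared down-projection with per-modality up-projections."""
    def __init__(self, dim=768, bottleneck=96):
        super().__init__()
        self.shared_down = nn.Linear(dim, bottleneck)   # shared across modalities
        self.up = nn.ModuleDict({
            m: nn.Linear(bottleneck, dim) for m in ("vision", "text", "cross")
        })
        self.act = nn.GELU()

    def forward(self, x, modality="vision"):
        return x + self.up[modality](self.act(self.shared_down(x)))
```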
arXiv Detail & Related papers (2023-02-13T18:59:10Z)
- Parameter-efficient Model Adaptation for Vision Transformers [45.3460867776953]
We study parameter-efficient model adaptation strategies for vision transformers on the image classification task.
We propose a parameter-efficient model adaptation framework, which first selects submodules by measuring local intrinsic dimensions.
Our method performs the best in terms of the tradeoff between accuracy and parameter efficiency across 20 image classification datasets.
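As a rough illustration of dimension-based submodule selection, the sketch below uses a PCA-style proxy for local intrinsic dimension; the paper may use a different estimator, and the selection heuristic in the trailing comment is hypothetical.

```python
import torch

def local_intrinsic_dim(features, var_threshold=0.9):
    """Number of principal components needed to explain var_threshold variance.

    features: (N, D) activations collected from one candidate submodule
    """
    x = features - features.mean(dim=0, keepdim=True)
    s = torch.linalg.svdvals(x)                  # spectrum of centered activations
    var = s.pow(2) / s.pow(2).sum()              # variance explained per component
    return int((var.cumsum(0) < var_threshold).sum().item()) + 1

# Hypothetical usage: rank submodules and adapt only the k lowest-dimensional ones
# dims = {name: local_intrinsic_dim(acts) for name, acts in submodule_acts.items()}
# selected = sorted(dims, key=dims.get)[:k]
```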
arXiv Detail & Related papers (2022-03-29T05:30:09Z)
- Vision-and-Language or Vision-for-Language? On Cross-Modal Influence in Multimodal Transformers [15.826109118064716]
Pretrained vision-and-language BERTs aim to learn representations that combine information from both modalities.
We propose a diagnostic method based on cross-modal input ablation to assess the extent to which these models actually integrate cross-modal information.
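A hedged sketch of the diagnostic: score the model's text-side loss with and without the visual input, and read the gap as the degree of cross-modal integration. The `model(image=..., text=...)` interface and the zeroed-image ablation are hypothetical stand-ins; the paper ablates cross-modal inputs and measures how predictions change.

```python
import torch

def cross_modal_ablation_gap(model, image, text_ids):
    """Difference in text-side loss when the visual input is ablated."""
    with torch.no_grad():
        loss_full = model(image=image, text=text_ids)        # both modalities present
        loss_ablated = model(image=torch.zeros_like(image),  # visual input blanked out
                             text=text_ids)
    # A large positive gap means the model genuinely relied on the image
    return (loss_ablated - loss_full).item()
```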
arXiv Detail & Related papers (2021-09-09T17:47:50Z)
- Modality Compensation Network: Cross-Modal Adaptation for Action Recognition [77.24983234113957]
We propose a Modality Compensation Network (MCN) to explore the relationships of different modalities.
Our model bridges data from source and auxiliary modalities by a modality adaptation block to achieve adaptive representation learning.
Experimental results reveal that MCN outperforms state-of-the-art approaches on four widely-used action recognition benchmarks.
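An illustrative adaptation block in this spirit: source-modality features are projected toward the auxiliary modality, with an alignment loss available only when auxiliary data is present during training. Names, the projection shape, and the loss choice are assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F

class ModalityAdaptationBlock(nn.Module):
    """Bridges source features (e.g., RGB) toward an auxiliary modality (e.g., skeleton)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                  nn.Linear(dim, dim))

    def forward(self, src_feat, aux_feat=None):
        adapted = self.proj(src_feat)
        if aux_feat is not None:                  # training: auxiliary modality available
            return adapted, F.mse_loss(adapted, aux_feat)
        return adapted, None                      # inference: source modality only
```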
arXiv Detail & Related papers (2020-01-31T04:51:55Z)