UniAdapter: Unified Parameter-Efficient Transfer Learning for
Cross-modal Modeling
- URL: http://arxiv.org/abs/2302.06605v2
- Date: Sun, 21 May 2023 17:50:30 GMT
- Title: UniAdapter: Unified Parameter-Efficient Transfer Learning for
Cross-modal Modeling
- Authors: Haoyu Lu, Yuqi Huo, Guoxing Yang, Zhiwu Lu, Wei Zhan, Masayoshi
Tomizuka, Mingyu Ding
- Abstract summary: This paper proposes UniAdapter, which unifies unimodal and multimodal adapters for parameter-efficient cross-modal adaptation on vision-language models.
Experiments show that UniAdapter not only outperforms state-of-the-art methods but even beats the full fine-tuning strategy.
- Score: 49.134517040512414
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large-scale vision-language pre-trained models have shown promising
transferability to various downstream tasks. As the size of these foundation
models and the number of downstream tasks grow, the standard full fine-tuning
paradigm becomes unsustainable due to heavy computational and storage costs.
This paper proposes UniAdapter, which unifies unimodal and multimodal adapters
for parameter-efficient cross-modal adaptation on pre-trained vision-language
models. Specifically, adapters are distributed to different modalities and
their interactions, with the total number of tunable parameters reduced by
partial weight sharing. The unified and knowledge-sharing design enables
powerful cross-modal representations that can benefit various downstream tasks,
requiring only 1.0%-2.0% tunable parameters of the pre-trained model. Extensive
experiments on 6 cross-modal downstream benchmarks (including video-text
retrieval, image-text retrieval, VideoQA, and VQA) show that in most cases,
UniAdapter not only outperforms state-of-the-art methods but even beats the full
fine-tuning strategy. In particular, on the MSRVTT retrieval task, UniAdapter
achieves 49.7% recall@1 with only 2.2% of the model parameters, outperforming the latest
competitors by 2.0%. The code and models are available at
https://github.com/RERV/UniAdapter.
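The core design, adapters distributed across modalities with partially shared weights, can be pictured with a short PyTorch sketch. This is a minimal illustration assuming a bottleneck adapter whose down-projection is shared across the visual, textual, and cross-modal branches while the up-projections stay branch-specific; module names and dimensions are hypothetical and not taken from the released code.

```python
# Minimal sketch of a UniAdapter-style adapter with partial weight sharing.
# The choice of sharing the down-projection is an illustrative assumption.
import torch
import torch.nn as nn


class SharedBottleneckAdapter(nn.Module):
    """Bottleneck adapter whose down-projection is shared across branches."""

    def __init__(self, shared_down: nn.Linear, hidden_dim: int, bottleneck_dim: int):
        super().__init__()
        self.down = shared_down                          # shared across branches
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck_dim, hidden_dim)  # branch-specific
        nn.init.zeros_(self.up.weight)                   # start as an identity update
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck update on top of the frozen backbone features.
        return x + self.up(self.act(self.down(x)))


hidden_dim, bottleneck_dim = 768, 128
shared_down = nn.Linear(hidden_dim, bottleneck_dim)      # the partially shared weights

# One adapter per branch: visual encoder, text encoder, and cross-modal fusion.
adapters = nn.ModuleDict({
    "visual": SharedBottleneckAdapter(shared_down, hidden_dim, bottleneck_dim),
    "text": SharedBottleneckAdapter(shared_down, hidden_dim, bottleneck_dim),
    "cross": SharedBottleneckAdapter(shared_down, hidden_dim, bottleneck_dim),
})

# Only the adapters would be trained; the pre-trained backbone stays frozen.
print(sum(p.numel() for p in adapters.parameters()))     # shared weights counted once
```

Sharing the down-projection is one plausible way to realize the paper's partial weight sharing; consult the released repository for the exact placement and sharing scheme.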
Related papers
- Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models [38.751158173278796]
This work introduces a cross-modal parameter-efficient approach named XMAdapter.
XMAdapter establishes cache models for both text and image modalities.
It then leverages retrieval through visual-language bimodal information to gather clues for inference.
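The cache-model idea can be pictured with a rough sketch in the spirit of key-value feature caches: support features from a frozen encoder act as keys, their labels as values, and a query retrieves weighted clues that are blended with zero-shot predictions. The function names, blending scheme, and hyper-parameters below are assumptions for illustration, not XMAdapter's actual formulation.

```python
# Rough sketch of a key-value cache model for retrieval-based inference clues,
# assuming features come from a frozen vision-language encoder.
import torch
import torch.nn.functional as F


def build_cache(features: torch.Tensor, labels: torch.Tensor, num_classes: int):
    """Keys are L2-normalized support features; values are one-hot labels."""
    keys = F.normalize(features, dim=-1)                  # (N, D)
    values = F.one_hot(labels, num_classes).float()       # (N, C)
    return keys, values


def cache_logits(query: torch.Tensor, keys: torch.Tensor, values: torch.Tensor,
                 zero_shot_logits: torch.Tensor, alpha: float = 1.0, beta: float = 5.0):
    """Retrieve clues from the cache and blend them with zero-shot predictions."""
    q = F.normalize(query, dim=-1)                        # (B, D)
    affinity = torch.exp(-beta * (1.0 - q @ keys.T))      # (B, N) similarity weights
    retrieved = affinity @ values                         # (B, C) cached clues
    return zero_shot_logits + alpha * retrieved           # illustrative blending rule
```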
arXiv Detail & Related papers (2024-04-19T02:33:23Z)
- Prototype-based HyperAdapter for Sample-Efficient Multi-task Tuning [30.251155072822055]
Prototype-based HyperAdapter (PHA) is a novel framework built on adapter tuning and hypernetworks.
It introduces an instance-dense retriever and prototypical hypernetwork to generate conditional modules in a sample-efficient manner.
We show that PHA strikes a better trade-off between trainable parameters, accuracy on stream tasks, and sample efficiency.
arXiv Detail & Related papers (2023-10-18T02:42:17Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models and to augment Language Models with perception.
Existing approaches for adapting pretrained models for vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of the total parameters, training only a single linear projection layer, and prepending a single trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and captioning.
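This recipe is concrete enough to sketch. The class below is a hedged illustration, assuming features from a frozen perceptual encoder and a language model that accepts pre-computed embeddings; the class name, dimensions, and call signature are placeholders, not the authors' implementation.

```python
# Sketch of the eP-ALM recipe summarized above: freeze the language model,
# train one linear projection into its embedding space, prepend one trainable token.
import torch
import torch.nn as nn


class FrozenLMWithPerception(nn.Module):
    """Frozen LM + one trainable projection + one trainable soft token."""

    def __init__(self, language_model: nn.Module, vis_dim: int, lm_dim: int):
        super().__init__()
        self.lm = language_model
        for p in self.lm.parameters():                 # freeze >99% of parameters
            p.requires_grad = False
        self.proj = nn.Linear(vis_dim, lm_dim)         # the only trained layer
        self.soft_token = nn.Parameter(torch.zeros(1, 1, lm_dim))  # one trainable token

    def forward(self, vis_feats: torch.Tensor, text_embeds: torch.Tensor):
        # vis_feats: (B, N, vis_dim) from a frozen perceptual encoder
        # text_embeds: (B, T, lm_dim) token embeddings of the text prompt
        batch = text_embeds.size(0)
        prefix = torch.cat(
            [self.soft_token.expand(batch, -1, -1), self.proj(vis_feats)], dim=1
        )
        # Assumes an LM that accepts pre-computed embeddings, e.g. the
        # `inputs_embeds` keyword of Hugging Face causal LMs.
        return self.lm(inputs_embeds=torch.cat([prefix, text_embeds], dim=1))
```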
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- FAME-ViL: Multi-Tasking Vision-Language Model for Heterogeneous Fashion Tasks [129.49630356651454]
We propose a novel FAshion-focused Multi-task Efficient learning method for Vision-and-Language tasks (FAME-ViL).
Our FAME-ViL can save 61.5% of parameters over alternatives, while significantly outperforming the conventional independently trained single-task models.
arXiv Detail & Related papers (2023-03-04T19:07:48Z)
- Towards Efficient Visual Adaption via Structural Re-parameterization [76.57083043547296]
We propose a parameter-efficient and computationally friendly adapter for giant vision models, called RepAdapter.
RepAdapter outperforms full tuning by +7.2% on average and saves up to 25% training time, 20% GPU memory, and 94.6% storage cost of ViT-B/16 on VTAB-1k.
arXiv Detail & Related papers (2023-02-16T06:14:15Z)
- Uni-Perceiver v2: A Generalist Model for Large-Scale Vision and Vision-Language Tasks [86.66733026149892]
We propose Uni-Perceiver v2, which is the first generalist model capable of handling major large-scale vision and vision-language tasks.
Specifically, images are encoded as general region proposals, while texts are encoded via a Transformer-based language model.
Uni-Perceiver v2 achieves competitive performance on a broad range of vision and vision-language tasks.
arXiv Detail & Related papers (2022-11-17T18:59:52Z)
- AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models for downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost of storing a large copy of the model weights for every task, but also leads to instability during few-shot task adaptation.
We introduce a new mechanism that improves adapter capacity without increasing parameters or computational cost, based on two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z)
- Efficient Adapter Transfer of Self-Supervised Speech Models for Automatic Speech Recognition [0.1909808926064466]
Transformer-based models such as wav2vec 2.0 and HuBERT are leading the field in the speech domain.
We propose applying adapters to wav2vec 2.0 to reduce the number of parameters required for downstream ASR tasks.
arXiv Detail & Related papers (2022-02-07T14:20:54Z)