VL-Adapter: Parameter-Efficient Transfer Learning for
Vision-and-Language Tasks
- URL: http://arxiv.org/abs/2112.06825v1
- Date: Mon, 13 Dec 2021 17:35:26 GMT
- Title: VL-Adapter: Parameter-Efficient Transfer Learning for
Vision-and-Language Tasks
- Authors: Yi-Lin Sung, Jaemin Cho, Mohit Bansal
- Abstract summary: Recently, fine-tuning language models pre-trained on large text corpora has provided huge improvements on vision-and-language (V&L) tasks.
We introduce adapter-based parameter-efficient transfer learning techniques to V&L models such as VL-BART and VL-T5.
Our results demonstrate that training the adapter with the weight-sharing technique can match the performance of fine-tuning the entire model.
- Score: 71.40656211497162
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently, fine-tuning language models pre-trained on large text corpora has
provided huge improvements on vision-and-language (V&L) tasks as well as on
pure language tasks. However, fine-tuning the entire parameter set of
pre-trained models becomes impractical since model sizes are growing rapidly.
Hence, in this paper, we introduce adapter-based parameter-efficient transfer
learning techniques to V&L models such as VL-BART and VL-T5. We evaluate our
methods in a unified multi-task setup on four diverse V&L tasks: VQAv2, GQA,
NLVR2, and MSCOCO image captioning. With careful training and thorough
experiments, we benchmark three popular adapter-based methods (Adapter,
Hyperformer, Compacter) against the standard full fine-tuning and the recently
proposed prompt-tuning approach. We also enhance the efficiency and performance
of adapters by sharing their weights to attain knowledge across tasks. Our
results demonstrate that training the adapter with the weight-sharing technique
(4.4% of total parameters) can match the performance of fine-tuning the entire
model. Lastly, we present a comprehensive analysis including the combination of
adapter and task-specific prompts and the impact of V&L pre-training on
adapters. Our code is available at: https://github.com/ylsung/VL_adapter.
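To make the weight-sharing idea concrete, below is a minimal PyTorch-style sketch of a bottleneck adapter whose weights are shared across the four tasks, with only per-task layer norms kept separate. This is an illustrative assumption of how cross-task sharing can be wired up; the module names, dimensions, and the per-task layer-norm choice are not taken from the VL_adapter codebase.

```python
# Minimal sketch: bottleneck adapter with cross-task weight sharing.
# Assumes a PyTorch environment; names and sizes are illustrative only.
import torch
import torch.nn as nn


class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual."""

    def __init__(self, d_model: int = 768, reduction: int = 16):
        super().__init__()
        d_bottleneck = d_model // reduction
        self.down = nn.Linear(d_model, d_bottleneck)
        self.up = nn.Linear(d_bottleneck, d_model)
        self.act = nn.ReLU()

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        return hidden + self.up(self.act(self.down(hidden)))


class SharedAdapterBank(nn.Module):
    """One adapter shared by all tasks; only task-specific layer norms differ.

    Sharing the adapter weights across tasks is what keeps the number of
    trainable parameters at a few percent of the full model in a multi-task
    setup such as VQA / GQA / NLVR2 / captioning.
    """

    def __init__(self, tasks, d_model: int = 768, reduction: int = 16):
        super().__init__()
        self.shared_adapter = Adapter(d_model, reduction)
        self.task_layernorm = nn.ModuleDict(
            {task: nn.LayerNorm(d_model) for task in tasks}
        )

    def forward(self, hidden: torch.Tensor, task: str) -> torch.Tensor:
        return self.task_layernorm[task](self.shared_adapter(hidden))


if __name__ == "__main__":
    bank = SharedAdapterBank(tasks=["vqa", "gqa", "nlvr2", "caption"])
    x = torch.randn(2, 20, 768)  # (batch, sequence, hidden)
    out = bank(x, task="vqa")
    trainable = sum(p.numel() for p in bank.parameters() if p.requires_grad)
    print(out.shape, trainable)  # adapter params are a small fraction of a full VL-BART/VL-T5
```

In practice such a shared adapter would be inserted after the attention and feed-forward sublayers of a frozen backbone; only the adapter (and any task-specific norms or prompts) would receive gradients.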
Related papers
- Negative Yields Positive: Unified Dual-Path Adapter for Vision-Language Models [11.545127156146368]
We introduce the concept of dual learning into fine-tuning Vision-Language Models (VLMs).
We introduce a novel DualAdapter approach to enable dual-path adaptation of VLMs from both positive and negative perspectives.
Our experimental results validate that the proposed DualAdapter outperforms existing state-of-the-art methods on both few-shot learning and domain generalization tasks.
arXiv Detail & Related papers (2024-03-19T17:59:39Z) - APoLLo: Unified Adapter and Prompt Learning for Vision Language Models [58.9772868980283]
We present APoLLo, a unified multi-modal approach that combines Adapter and Prompt learning for Vision-Language models.
APoLLo achieves a relative gain of up to 6.03% over MaPLe (SOTA) on novel classes across 10 diverse image recognition datasets.
arXiv Detail & Related papers (2023-12-04T01:42:09Z) - Mini but Mighty: Finetuning ViTs with Mini Adapters [7.175668563148084]
Adapters perform poorly when their dimension is small.
We propose MiMi, a training framework that addresses this issue.
Our method outperforms existing methods in finding the best trade-off between accuracy and trained parameters.
arXiv Detail & Related papers (2023-11-07T10:41:27Z) - MerA: Merging Pretrained Adapters For Few-Shot Learning [71.44422347502409]
We propose Merging Pretrained Adapters (MerA), which efficiently incorporates pretrained adapters into a single model through model fusion.
Experiments on two PLMs demonstrate that MerA achieves substantial improvements compared to both single adapters and AdapterFusion.
arXiv Detail & Related papers (2023-08-30T12:10:17Z) - A Comprehensive Analysis of Adapter Efficiency [20.63580880344425]
We show that for Natural Language Understanding (NLU) tasks, the parameter efficiency in adapters does not translate to efficiency gains compared to full fine-tuning of models.
We recommend that for moderately sized models for NLU tasks, practitioners should rely on full fine-tuning or multi-task training rather than using adapters.
arXiv Detail & Related papers (2023-05-12T14:05:45Z) - AdaMix: Mixture-of-Adapter for Parameter-efficient Tuning of Large
Language Models [119.7093605087114]
Fine-tuning large-scale pre-trained language models on downstream tasks requires updating hundreds of millions of parameters.
This not only increases the serving cost of storing a large copy of the model weights for every task, but also causes instability during few-shot task adaptation.
We introduce a new mechanism that improves adapter capacity without increasing parameters or computational cost, through two key techniques.
arXiv Detail & Related papers (2022-05-24T23:41:22Z) - HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both
Language and Vision-and-Language Tasks [38.43269863509866]
Parameter-efficient fine-tuning has become increasingly important for quick transfer learning and deployment.
We design a novel unified parameter-efficient transfer learning framework that works effectively on both pure language and V&L tasks.
Our proposed framework adds fewer trainable parameters in multi-task learning while achieving superior performance and transfer ability compared to state-of-the-art methods.
arXiv Detail & Related papers (2022-03-08T06:51:33Z) - An Empirical Study of Training End-to-End Vision-and-Language
Transformers [50.23532518166621]
We present METER (Multimodal End-to-end TransformER), through which we investigate how to design and pre-train a fully transformer-based VL model.
Specifically, we dissect the model designs along multiple dimensions: vision encoders (e.g., CLIP-ViT, Swin Transformer), text encoders (e.g., RoBERTa, DeBERTa), and multimodal fusion (e.g., merged attention vs. co-attention).
arXiv Detail & Related papers (2021-11-03T17:55:36Z) - Exploiting Adapters for Cross-lingual Low-resource Speech Recognition [52.40623653290499]
Cross-lingual speech adaptation aims to solve the problem of leveraging multiple rich-resource languages to build models for a low-resource target language.
We investigate the performance of multiple adapter variants for parameter-efficient cross-lingual speech adaptation.
arXiv Detail & Related papers (2021-05-18T08:30:37Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences of its use.