Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models
- URL: http://arxiv.org/abs/2309.01479v3
- Date: Sun, 29 Oct 2023 14:20:43 GMT
- Title: Parameter and Computation Efficient Transfer Learning for
Vision-Language Pre-trained Models
- Authors: Qiong Wu, Wei Yu, Yiyi Zhou, Shubin Huang, Xiaoshuai Sun, Rongrong Ji
- Abstract summary: In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
- Score: 79.34513906324727
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: With ever-increasing parameters and computation, vision-language pre-trained
(VLP) models exhibit prohibitive expenditure in downstream task adaptation.
Recent endeavors mainly focus on parameter efficient transfer learning (PETL)
for VLP models by only updating a small number of parameters. However,
excessive computational overhead still plagues the application of VLPs. In this
paper, we aim at parameter and computation efficient transfer learning (PCETL)
for VLP models. In particular, PCETL not only needs to limit the number of
trainable parameters in VLP models, but also to reduce the computational
redundancy during inference, thus enabling a more efficient transfer. To
approach this target, we propose a novel dynamic architecture skipping (DAS)
approach towards effective PCETL. Instead of directly optimizing the intrinsic
architectures of VLP models, DAS first observes the significance of their
modules for downstream tasks via a reinforcement learning (RL)-based process,
and then skips the redundant ones with lightweight networks, i.e., adapters,
according to the obtained rewards. In this case, the VLP model can well
maintain the scale of trainable parameters while speeding up its inference on
downstream tasks. To validate DAS, we apply it to two representative VLP
models, namely ViLT and METER, and conduct extensive experiments on a range of
VL tasks. The experimental results not only show the clear advantages of DAS in
reducing computational complexity, e.g., -11.97% FLOPs of METER on VQA2.0, but
also confirm its competitiveness against existing PETL methods in terms of
parameter scale and performance. Our source code is given in our appendix.
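To make the DAS idea above concrete, the following is a minimal PyTorch sketch, not the authors' released implementation: each pre-trained block is wrapped so that, once the RL-based search has marked it redundant, inference routes through a lightweight adapter instead of the full block. The names SkippableBlock and Adapter, the bottleneck width, and the hard-coded skip set are illustrative assumptions.
```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Lightweight bottleneck that stands in for a skipped layer."""
    def __init__(self, dim: int, bottleneck_dim: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck_dim)
        self.up = nn.Linear(bottleneck_dim, dim)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual bottleneck: a cheap approximation of the bypassed block.
        return x + self.up(self.act(self.down(x)))

class SkippableBlock(nn.Module):
    """Wraps a frozen pre-trained block; if marked redundant, only the adapter runs."""
    def __init__(self, block: nn.Module, dim: int, skip: bool = False):
        super().__init__()
        self.block = block            # frozen VLP layer (kept untrained)
        self.adapter = Adapter(dim)   # small trainable module
        self.skip = skip              # decision taken from the RL-based search

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.adapter(x) if self.skip else self.block(x)

# Toy 4-layer encoder; layers 1 and 3 are treated as redundant and skipped.
dim = 768
encoder = nn.ModuleList([
    SkippableBlock(
        nn.TransformerEncoderLayer(d_model=dim, nhead=12, batch_first=True),
        dim,
        skip=(i in {1, 3}),           # in DAS this set comes from the observed rewards
    )
    for i in range(4)
])

x = torch.randn(2, 16, dim)           # (batch, sequence, hidden)
for layer in encoder:
    x = layer(x)
print(x.shape)                        # torch.Size([2, 16, 768])
```
In the actual method the skip decisions come from rewards collected during the RL search and only the adapters are trained, so the trainable-parameter count stays small while skipped blocks no longer contribute to inference FLOPs.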
Related papers
- Scaling Laws for Predicting Downstream Performance in LLMs [75.28559015477137]
This work focuses on pre-training loss as a more efficient metric for performance estimation.
We extend the power law analytical function to predict domain-specific pre-training loss based on FLOPs across data sources.
We employ a two-layer neural network to model the non-linear relationship between multiple domain-specific losses and downstream performance (a minimal sketch of this two-stage pipeline is given after this list).
arXiv Detail & Related papers (2024-10-11T04:57:48Z)
- Reference Trustable Decoding: A Training-Free Augmentation Paradigm for Large Language Models [79.41139393080736]
Large language models (LLMs) have rapidly advanced and demonstrated impressive capabilities.
We propose Reference Trustable Decoding (RTD), a paradigm that allows models to quickly adapt to new tasks without fine-tuning.
arXiv Detail & Related papers (2024-09-30T10:48:20Z)
- Unsupervised Domain Adaption Harnessing Vision-Language Pre-training [4.327763441385371]
This paper focuses on harnessing the power of Vision-Language Pre-training models in Unsupervised Domain Adaptation (UDA).
We propose a novel method called Cross-Modal Knowledge Distillation (CMKD).
Our proposed method outperforms existing techniques on standard benchmarks.
arXiv Detail & Related papers (2024-08-05T02:37:59Z)
- Parameter-Efficient Fine-Tuning With Adapters [5.948206235442328]
This research introduces a novel adaptation method utilizing the UniPELT framework as a base.
Our method employs adapters, which enable efficient transfer of pretrained models to new tasks with minimal retraining of the base model parameters.
arXiv Detail & Related papers (2024-05-09T01:40:38Z)
- VLN-PETL: Parameter-Efficient Transfer Learning for Vision-and-Language Navigation [23.22586831122625]
We present the first study to explore PETL methods for VLN tasks and propose a VLN-specific PETL method named VLN-PETL.
VLN-PETL achieves comparable or even better performance to full fine-tuning and outperforms other PETL methods with promising margins.
arXiv Detail & Related papers (2023-08-20T05:55:30Z)
- Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z)
- Towards a Unified View on Visual Parameter-Efficient Transfer Learning [96.99924127527002]
We propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off.
An effective scheme Swin-BAPAT derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
arXiv Detail & Related papers (2022-10-03T09:54:39Z)
- Few-Shot Parameter-Efficient Fine-Tuning is Better and Cheaper than In-Context Learning [81.3514358542452]
Few-shot in-context learning (ICL) incurs substantial computational, memory, and storage costs because it involves processing all of the training examples every time a prediction is made.
Parameter-efficient fine-tuning offers an alternative paradigm in which a small set of parameters is trained to enable a model to perform the new task.
In this paper, we rigorously compare few-shot ICL and parameter-efficient fine-tuning and demonstrate that the latter offers better accuracy as well as dramatically lower computational costs.
arXiv Detail & Related papers (2022-05-11T17:10:41Z)
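For the "Scaling Laws for Predicting Downstream Performance in LLMs" entry above, the sketch below illustrates the two-stage idea as summarized there: fit a power law that predicts domain-specific pre-training loss from FLOPs, then map a vector of predicted domain losses to downstream performance with a two-layer network. The toy data, the FLOPs units, the exact functional form L(C) = a * C^(-b) + c, and the layer sizes are assumptions for illustration, not the paper's reported setup.
```python
import numpy as np
import torch
import torch.nn as nn
from scipy.optimize import curve_fit

# Stage 1: per-domain power law L(C) = a * C**(-b) + c, predicting pre-training
# loss from compute C (measured here in units of 1e18 FLOPs for numerical stability).
def power_law(C, a, b, c):
    return a * np.power(C, -b) + c

# Toy (FLOPs, loss) observations for one data source; purely illustrative numbers.
flops = np.array([1.0, 10.0, 100.0, 1000.0])   # x 1e18 FLOPs
loss = np.array([3.2, 2.8, 2.5, 2.3])
(a, b, c), _ = curve_fit(power_law, flops, loss, p0=[1.0, 0.2, 2.0])
predicted_loss = power_law(1e4, a, b, c)       # extrapolate to a larger compute budget

# Stage 2: two-layer network mapping several domain-specific losses to a
# downstream score (e.g., benchmark accuracy).
n_domains = 4
predictor = nn.Sequential(
    nn.Linear(n_domains, 16),
    nn.ReLU(),
    nn.Linear(16, 1),
)
domain_losses = torch.randn(8, n_domains)      # stand-in for predicted per-domain losses
downstream_score = predictor(domain_losses)
print(round(float(predicted_loss), 3), downstream_score.shape)
```
In practice the second-stage network would be fit on (predicted losses, measured downstream score) pairs rather than random tensors; the random input here only demonstrates the shapes involved.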