Towards a Unified View on Visual Parameter-Efficient Transfer Learning
- URL: http://arxiv.org/abs/2210.00788v1
- Date: Mon, 3 Oct 2022 09:54:39 GMT
- Title: Towards a Unified View on Visual Parameter-Efficient Transfer Learning
- Authors: Bruce X.B. Yu, Jianlong Chang, Lingbo Liu, Qi Tian, Chang Wen Chen
- Abstract summary: We propose a framework with a unified view called visual-PETL (V-PETL) to investigate the different aspects affecting the trade-off.
An effective scheme Swin-BAPAT derived from the proposed V-PETL framework achieves significantly better performance than the state-of-the-art AdaptFormer-Swin.
- Score: 96.99924127527002
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since the release of various large-scale natural language processing (NLP)
pre-trained models, parameter-efficient transfer learning (PETL) has become a
popular paradigm capable of achieving impressive performance on various
downstream tasks. PETL aims at making good use of the representation knowledge
in the pre-trained large models by fine-tuning a small number of parameters.
Recently, developing various PETL techniques for vision tasks has also attracted
increasing attention. Popular PETL techniques such as Prompt-tuning and
Adapter have been proposed for high-level visual downstream tasks such as image
classification and video recognition. However, Prefix-tuning remains
under-explored for vision tasks. In this work, we intend to adapt large
video-based models to downstream tasks with a good parameter-accuracy
trade-off. Towards this goal, we propose a framework with a unified view called
visual-PETL (V-PETL) to investigate the different aspects affecting the
trade-off. Specifically, we analyze the positional importance of trainable
parameters and differences between NLP and vision tasks in terms of data
structures and pre-training mechanisms while implementing various PETL
techniques, especially for the under-explored prefix-tuning technique. Based on
a comprehensive understanding of differences between NLP and video data, we
propose a new variation of the prefix-tuning module called parallel attention
(PATT) for video-based downstream tasks. An extensive empirical analysis on two
video datasets via different frozen backbones has been carried out, and the
findings show that the proposed PATT can effectively contribute to other PETL
techniques. An effective scheme Swin-BAPAT derived from the proposed V-PETL
framework achieves significantly better performance than the state-of-the-art
AdaptFormer-Swin with slightly more parameters and outperforms full-tuning with
far fewer parameters.
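To make the prefix-tuning discussion above concrete, the following is a minimal PyTorch sketch of a frozen self-attention layer augmented with a parallel branch that attends to trainable key/value prefixes, in the spirit of the parallel attention (PATT) idea described in the abstract. The class name, prefix length, gating scale, and initialization are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (assumptions, not the authors' code): input tokens attend to
# a small set of trainable key/value prefixes in a branch that runs in
# parallel with a frozen pre-trained self-attention layer. Only the prefixes
# and a gating scalar are trained.
import torch
import torch.nn as nn


class ParallelPrefixAttention(nn.Module):
    def __init__(self, frozen_attn: nn.MultiheadAttention, embed_dim: int,
                 prefix_len: int = 10):
        super().__init__()
        self.frozen_attn = frozen_attn
        for p in self.frozen_attn.parameters():      # keep the backbone frozen
            p.requires_grad = False
        # Trainable prefix keys/values: the only new parameters besides the gate.
        self.prefix_k = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)
        self.prefix_v = nn.Parameter(torch.randn(prefix_len, embed_dim) * 0.02)
        self.gate = nn.Parameter(torch.tensor(0.1))  # balances the two branches
        self.scale = embed_dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, embed_dim); frozen_attn is assumed batch_first=True.
        frozen_out, _ = self.frozen_attn(x, x, x, need_weights=False)
        k = self.prefix_k.unsqueeze(0).expand(x.size(0), -1, -1)
        v = self.prefix_v.unsqueeze(0).expand(x.size(0), -1, -1)
        attn = torch.softmax(x @ k.transpose(-2, -1) * self.scale, dim=-1)
        return frozen_out + self.gate * (attn @ v)


# Hypothetical usage: wrap one attention layer of a pre-trained backbone.
attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
block = ParallelPrefixAttention(attn, embed_dim=768, prefix_len=10)
out = block(torch.randn(2, 196, 768))                # (batch, tokens, dim)
print(sum(p.numel() for p in block.parameters() if p.requires_grad))
```

Because the pre-trained weights are never updated, only the prefix and gate parameters (about fifteen thousand per layer with these settings) need to be stored per downstream task, which is the parameter-accuracy trade-off the abstract targets.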
Related papers
- Preserving Pre-trained Representation Space: On Effectiveness of Prefix-tuning for Large Multi-modal Models [24.62337386603331]
Large Multi-modal Models (LMMs) are revolutionizing the way machines interact with the world.
To adapt LMMs for downstream tasks, parameter-efficient fine-tuning (PEFT) has gained popularity.
This paper examines the strengths and weaknesses of each tuning strategy, shifting attention away from the efficiency typically associated with these approaches.
arXiv Detail & Related papers (2024-10-29T07:55:50Z)
- Parameter-Efficient and Memory-Efficient Tuning for Vision Transformer: A Disentangled Approach [87.8330887605381]
We show how to adapt a pre-trained Vision Transformer to downstream recognition tasks with only a few learnable parameters.
We synthesize a task-specific query with a learnable and lightweight module, which is independent of the pre-trained model.
Our method achieves state-of-the-art performance under memory constraints, showcasing its applicability in real-world situations.
arXiv Detail & Related papers (2024-07-09T15:45:04Z)
- Gradient Projection For Continual Parameter-Efficient Tuning [42.800411328615894]
We reformulate Adapter, LoRA, Prefix-tuning, and Prompt-tuning from the perspective of gradient projection.
We show that a condition imposed on the gradient can effectively resist forgetting, even for large-scale models.
We extensively evaluate our method with different backbones, including ViT and CLIP, on diverse datasets.
arXiv Detail & Related papers (2024-05-22T06:33:48Z)
- Dynamic Tuning Towards Parameter and Inference Efficiency for ViT Adaptation [67.13876021157887]
Dynamic Tuning (DyT) is a novel approach to improve both parameter and inference efficiency for ViT adaptation.
DyT achieves superior performance compared to existing PEFT methods while evoking only 71% of their FLOPs on the VTAB-1K benchmark.
arXiv Detail & Related papers (2024-03-18T14:05:52Z)
- Learning Semantic Proxies from Visual Prompts for Parameter-Efficient Fine-Tuning in Deep Metric Learning [13.964106147449051]
Existing solutions concentrate on fine-tuning the pre-trained models on conventional image datasets.
We propose a novel and effective framework based on learning Visual Prompts (VPT) in pre-trained Vision Transformers (ViT).
We demonstrate that our new approximations with semantic information offer superior representative capabilities.
arXiv Detail & Related papers (2024-02-04T04:42:05Z)
- p-Laplacian Adaptation for Generative Pre-trained Vision-Language Models [10.713680139939354]
Vision-Language models (VLMs) pre-trained on large corpora have demonstrated notable success across a range of downstream tasks.
PETL has garnered attention as a viable alternative to full fine-tuning.
We propose a new adapter architecture, $p$-adapter, which employs $p$-Laplacian message passing in Graph Neural Networks (GNNs).
arXiv Detail & Related papers (2023-12-17T05:30:35Z)
- Parameter and Computation Efficient Transfer Learning for Vision-Language Pre-trained Models [79.34513906324727]
In this paper, we aim at parameter and computation efficient transfer learning (PCETL) for vision-language pre-trained models.
We propose a novel dynamic architecture skipping (DAS) approach towards effective PCETL.
arXiv Detail & Related papers (2023-09-04T09:34:33Z)
- Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de facto approach to leveraging pre-trained vision models for downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
- Parameter-Efficient Image-to-Video Transfer Learning [66.82811235484607]
Large pre-trained models for various downstream tasks of interest have recently emerged with promising performance.
Due to the ever-growing model size, the standard full fine-tuning based task adaptation strategy becomes costly in terms of model training and storage.
We propose a new spatio-temporal adapter for parameter-efficient fine-tuning per video task (see the generic adapter sketch after this list).
arXiv Detail & Related papers (2022-06-27T18:02:29Z)
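As a companion to the adapter-style methods above (the Adapter technique mentioned in the abstract and the image-to-video adapter in the last entry), here is a minimal sketch of a bottleneck adapter that adds a lightweight depthwise temporal convolution between its down- and up-projections, so a frozen image backbone can pick up temporal cues with few trainable parameters. The module name, shapes, and placement of the temporal convolution are illustrative assumptions, not the implementation of any of the cited papers.

```python
# Minimal sketch (assumptions, not any cited paper's module): a residual
# bottleneck adapter for video features with a depthwise temporal convolution.
# The up-projection is zero-initialized so the adapter starts as an identity.
import torch
import torch.nn as nn


class VideoBottleneckAdapter(nn.Module):
    def __init__(self, dim: int = 768, bottleneck: int = 64, kernel_t: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)
        self.temporal = nn.Conv1d(bottleneck, bottleneck, kernel_t,
                                  padding=kernel_t // 2, groups=bottleneck)
        self.up = nn.Linear(bottleneck, dim)
        nn.init.zeros_(self.up.weight)
        nn.init.zeros_(self.up.bias)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, tokens, dim) features from a frozen image backbone
        b, t, n, _ = x.shape
        h = self.down(x)                                    # (b, t, n, c)
        h = h.permute(0, 2, 3, 1).reshape(b * n, -1, t)     # (b*n, c, t)
        h = self.temporal(h)                                # mix along time only
        h = h.reshape(b, n, -1, t).permute(0, 3, 1, 2)      # back to (b, t, n, c)
        return x + self.up(torch.relu(h))                   # residual update


# Hypothetical usage on frozen backbone features.
adapter = VideoBottleneckAdapter(dim=768, bottleneck=64)
feats = torch.randn(2, 8, 196, 768)   # (batch, frames, tokens, dim)
print(adapter(feats).shape)           # same shape; adapter starts as identity
```

Only the adapter's two projections and the depthwise convolution are trained, so the storage cost per downstream video task stays small relative to the frozen backbone.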