Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models
- URL: http://arxiv.org/abs/2203.17274v1
- Date: Thu, 31 Mar 2022 17:59:30 GMT
- Title: Visual Prompting: Modifying Pixel Space to Adapt Pre-trained Models
- Authors: Hyojin Bahng, Ali Jahanian, Swami Sankaranarayanan, Phillip Isola
- Abstract summary: We introduce visual prompting, which learns a task-specific image perturbation such that a frozen pre-trained model prompted with this perturbation performs a new task.
We discover that changing only a few pixels is enough to adapt models to new tasks and datasets, and performs on par with linear probing.
- Score: 29.413887954758053
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Prompting has recently become a popular paradigm for adapting language models
to downstream tasks. Rather than fine-tuning model parameters or adding
task-specific heads, this approach steers a model to perform a new task simply
by adding a text prompt to the model's inputs. In this paper, we explore the
question: can we create prompts with pixels instead? In other words, can
pre-trained vision models be adapted to a new task solely by adding pixels to
their inputs? We introduce visual prompting, which learns a task-specific image
perturbation such that a frozen pre-trained model prompted with this
perturbation performs a new task. We discover that changing only a few pixels
is enough to adapt models to new tasks and datasets, and performs on par with
linear probing, the current de facto approach to lightweight adaptation. The
surprising effectiveness of visual prompting provides a new perspective on how
to adapt pre-trained models in vision, and opens up the possibility of adapting
models solely through their inputs, which, unlike model parameters or outputs,
are typically under an end-user's control. Code is available at
http://hjbahng.github.io/visual_prompting .
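The core mechanism described in the abstract lends itself to a short sketch. Below is a minimal PyTorch illustration (not the authors' released implementation): a single learnable padding-style pixel perturbation, shared across all images, is added to each input and optimized with the downstream loss while the pre-trained backbone stays frozen. The `PadPrompter` class, the ResNet-50 backbone, the hyperparameters, and the assumption that downstream labels are already mapped into the frozen model's output classes are all illustrative choices.

```python
import torch
import torch.nn as nn
import torchvision

class PadPrompter(nn.Module):
    """Learnable pixel 'frame' of width `pad` that is added to every input image."""
    def __init__(self, image_size=224, pad=30):
        super().__init__()
        self.base = image_size - 2 * pad
        self.top = nn.Parameter(torch.zeros(1, 3, pad, image_size))
        self.bottom = nn.Parameter(torch.zeros(1, 3, pad, image_size))
        self.left = nn.Parameter(torch.zeros(1, 3, self.base, pad))
        self.right = nn.Parameter(torch.zeros(1, 3, self.base, pad))

    def forward(self, x):
        # Assemble the full-size prompt from its four learnable strips and add it to x.
        center = torch.zeros(1, 3, self.base, self.base, device=x.device)
        middle = torch.cat([self.left, center, self.right], dim=3)
        prompt = torch.cat([self.top, middle, self.bottom], dim=2)
        return x + prompt  # broadcasts over the batch dimension

# Frozen pre-trained backbone (illustrative choice; the paper also uses CLIP).
model = torchvision.models.resnet50(weights="IMAGENET1K_V2").eval()
for p in model.parameters():
    p.requires_grad_(False)  # the pre-trained model stays frozen

prompter = PadPrompter()
optimizer = torch.optim.SGD(prompter.parameters(), lr=1.0)  # only prompt pixels are optimized
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    # `labels` are assumed to already index into the frozen model's output space
    # (e.g. via a fixed class mapping for a vision-only backbone).
    logits = model(prompter(images))
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

At test time the same learned prompt is added to every image; with a CLIP backbone the same idea applies, except the prompted image is scored against text embeddings of the downstream class names rather than a fixed class mapping.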
Related papers
- Making the Most of What You Have: Adapting Pre-trained Visual Language
Models in the Low-data Regime [23.255873641249263]
We look into task adaptation in the low-data regime, and provide a study of the existing adaptation methods for generative Visual Language Models.
We show important benefits of self-labelling, i.e. using the model's own predictions to self-improve when a larger number of unlabelled images is available.
arXiv Detail & Related papers (2023-05-03T17:42:54Z)
- $\Delta$-Patching: A Framework for Rapid Adaptation of Pre-trained Convolutional Networks without Base Performance Loss [71.46601663956521]
Models pre-trained on large-scale datasets are often fine-tuned to support newer tasks and datasets that arrive over time.
We propose $\Delta$-Patching for fine-tuning neural network models in an efficient manner, without the need to store model copies.
Our experiments show that $\Delta$-Networks outperform earlier model patching work while only requiring a fraction of parameters to be trained.
arXiv Detail & Related papers (2023-03-26T16:39:44Z)
- Prompt Tuning based Adapter for Vision-Language Model Adaption [38.576215369504446]
We introduce a new model, termed Prompt-Adapter, that combines pre-trained prompt tuning with an efficient adaptation network.
Our approach beats the state-of-the-art methods in few-shot image classification on 11 public datasets.
Our proposed method demonstrates the promise of combining prompt tuning and parameter-efficient networks for efficient vision-language model adaptation.
arXiv Detail & Related papers (2023-03-24T15:05:17Z)
- Contrastive Alignment of Vision to Language Through Parameter-Efficient Transfer Learning [60.26952378997713]
Contrastive vision-language models (e.g. CLIP) are created by updating all the parameters of a vision model and language model through contrastive training.
We show that a minimal set of parameter updates (less than 7% of parameters) can achieve the same performance as full-model training.
We describe a series of experiments: we show that existing knowledge is conserved more strongly in parameter-efficient training.
arXiv Detail & Related papers (2023-03-21T14:12:08Z)
- eP-ALM: Efficient Perceptual Augmentation of Language Models [70.47962271121389]
We propose to direct effort toward efficient adaptation of existing models, augmenting Language Models with perception.
Existing approaches for adapting pre-trained models to vision-language tasks still rely on several key components that hinder their efficiency.
We show that by freezing more than 99% of total parameters, training only one linear projection layer, and prepending only one trainable token, our approach (dubbed eP-ALM) significantly outperforms other baselines on VQA and Captioning.
arXiv Detail & Related papers (2023-03-20T19:20:34Z)
- Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training [36.5936227129021]
It is possible to expand pretrained Masked Language Models to new languages by learning a new set of embeddings, while keeping the transformer body frozen.
We propose mini-model adaptation, a compute-efficient alternative that builds a shallow mini-model from a fraction of a large model's parameters.
New language-specific embeddings can then be efficiently trained over the mini-model and plugged into the aligned large model for rapid cross-lingual transfer.
arXiv Detail & Related papers (2022-12-20T18:17:28Z)
- Multi-Modal Few-Shot Temporal Action Detection [157.96194484236483]
Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection to new classes.
We introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered as a marriage of FS-TAD and ZS-TAD.
arXiv Detail & Related papers (2022-11-27T18:13:05Z)
- Visual Prompting via Image Inpainting [104.98602202198668]
Inspired by prompting in NLP, this paper investigates visual prompting: given input-output image example(s) of a new task at test time and a new input image, the goal is to automatically produce the output image consistent with the given examples.
We apply visual prompting to pretrained models and demonstrate results on various downstream image-to-image tasks.
arXiv Detail & Related papers (2022-09-01T17:59:33Z)
- Pro-tuning: Unified Prompt Tuning for Vision Tasks [133.12978197265596]
Fine-tuning is the de-facto approach to leverage pre-trained vision models to perform downstream tasks.
In this work, we propose parameter-efficient Prompt tuning (Pro-tuning) to adapt frozen vision models to various downstream vision tasks.
arXiv Detail & Related papers (2022-07-28T21:09:31Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the content (including all information) and is not responsible for any consequences of its use.