Visual Prompt Tuning for Generative Transfer Learning
- URL: http://arxiv.org/abs/2210.00990v1
- Date: Mon, 3 Oct 2022 14:56:05 GMT
- Title: Visual Prompt Tuning for Generative Transfer Learning
- Authors: Kihyuk Sohn, Yuan Hao, José Lezama, Luisa Polania, Huiwen Chang, Han Zhang, Irfan Essa, Lu Jiang
- Abstract summary: We present a recipe for learning vision transformers by generative knowledge transfer.
We base our framework on state-of-the-art generative vision transformers that represent an image as a sequence of visual tokens fed to an autoregressive or non-autoregressive transformer.
To adapt to a new domain, we employ prompt tuning, which prepends learnable tokens, called prompts, to the image token sequence.
- Score: 26.895321693202284
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transferring knowledge from an image synthesis model trained on a large
dataset is a promising direction for efficiently learning generative image models
for various domains. While previous works have studied GAN models, we present a
recipe for learning vision transformers by generative knowledge transfer. We base
our framework on state-of-the-art generative vision transformers that represent an
image as a sequence of visual tokens fed to an autoregressive or non-autoregressive
transformer. To adapt to a new domain, we employ prompt tuning, which prepends
learnable tokens, called prompts, to the image token sequence, and introduce a new
prompt design for our task. We study a variety of visual domains, including the
Visual Task Adaptation Benchmark~\cite{zhai2019large}, with varying amounts of
training images, and show the effectiveness of knowledge transfer and significantly
better image generation quality than existing works.
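To make the transfer mechanism concrete, here is a minimal sketch of prompt tuning for a token-based generative transformer in PyTorch. This is an illustration under assumptions, not the authors' released code: the backbone module, `d_model`, and `n_prompts` are placeholder names.

```python
import torch
import torch.nn as nn

class PromptTunedGenerator(nn.Module):
    """Prompt tuning sketch: prepend learnable prompt tokens to the
    image-token sequence of a frozen, pre-trained generative transformer."""

    def __init__(self, backbone: nn.Module, d_model: int, n_prompts: int = 16):
        super().__init__()
        self.backbone = backbone              # pre-trained generative transformer (assumed)
        for p in self.backbone.parameters():  # freeze all pre-trained weights
            p.requires_grad = False
        # The only trainable parameters: a small set of prompt embeddings.
        self.prompts = nn.Parameter(0.02 * torch.randn(n_prompts, d_model))

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (batch, seq_len, d_model) embedded visual tokens
        batch = token_embeds.size(0)
        prompt = self.prompts.unsqueeze(0).expand(batch, -1, -1)
        # Prepend prompts so the frozen transformer conditions on them.
        return self.backbone(torch.cat([prompt, token_embeds], dim=1))
```

Training on the new domain would update only `self.prompts` under the usual token-prediction loss; the paper's contribution includes a more elaborate prompt design on top of this basic prepend-and-freeze scheme.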
Related papers
- Is Synthetic Image Useful for Transfer Learning? An Investigation into Data Generation, Volume, and Utilization [62.157627519792946]
We introduce a novel framework called bridged transfer, which initially employs synthetic images for fine-tuning a pre-trained model to improve its transferability.
We propose a dataset style inversion strategy to improve the stylistic alignment between synthetic and real images.
Our proposed methods are evaluated across 10 different datasets and 5 distinct models, demonstrating consistent improvements.
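As a rough illustration of the two-stage idea (fine-tune on synthetic images first, then on real data), here is a hedged sketch; the datasets, model, and schedule are placeholder assumptions, not the paper's actual recipe.

```python
import torch
from torch.utils.data import DataLoader

def finetune(model, loader, epochs, lr):
    """One standard supervised fine-tuning stage."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for images, labels in loader:
            opt.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            opt.step()
    return model

# Bridged usage (synthetic_set / real_set are assumed torch Datasets):
#   model = finetune(model, DataLoader(synthetic_set, batch_size=64, shuffle=True), epochs=5, lr=1e-4)
#   model = finetune(model, DataLoader(real_set, batch_size=64, shuffle=True), epochs=5, lr=1e-5)
```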
arXiv Detail & Related papers (2024-03-28T22:25:05Z) - A Comprehensive Study of Vision Transformers in Image Classification Tasks [0.46040036610482665]
We conduct a comprehensive survey of existing papers on Vision Transformers for image classification.
We first introduce the popular image classification datasets that influenced the design of models.
We present Vision Transformer models in chronological order, starting with early attempts at adapting the attention mechanism to vision tasks.
arXiv Detail & Related papers (2023-12-02T21:38:16Z) - Instruction-ViT: Multi-Modal Prompts for Instruction Learning in ViT [58.70209492842953]
In this paper, we focus on adapting prompt design based on instruction tuning into a visual transformer model for image classification.
The key idea is to implement multi-modal prompts related to category information to guide the fine-tuning of the model.
Experiments on several image captioning tasks show improved performance and domain adaptability.
arXiv Detail & Related papers (2023-04-29T08:59:12Z) - Plug-and-Play Diffusion Features for Text-Driven Image-to-Image Translation [10.39028769374367]
We present a new framework that takes text-to-image synthesis to the realm of image-to-image translation.
Our method harnesses the power of a pre-trained text-to-image diffusion model to generate a new image that complies with the target text.
arXiv Detail & Related papers (2022-11-22T20:39:18Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Transformers in Vision: A Survey [101.07348618962111]
Transformers enable modeling long dependencies between input sequence elements and support parallel processing of sequences.
Transformers require minimal inductive biases for their design and are naturally suited as set-functions.
This survey aims to provide a comprehensive overview of the Transformer models in the computer vision discipline.
arXiv Detail & Related papers (2021-01-04T18:57:24Z) - Pre-Trained Image Processing Transformer [95.93031793337613]
We develop a new pre-trained model, namely the image processing transformer (IPT).
We utilize the well-known ImageNet benchmark to generate a large number of corrupted image pairs.
The IPT model is trained on these images with multi-heads and multi-tails.
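To illustrate the data-generation step, below is a small sketch of producing (corrupted, clean) training pairs from source images; the specific degradations and parameters here are illustrative assumptions, while IPT's actual corruption set covers tasks such as denoising, deraining, and super-resolution.

```python
import torch
import torch.nn.functional as F

def make_corrupted_pairs(clean: torch.Tensor, sigma: float = 0.1, scale: int = 2):
    """Build (corrupted, clean) pairs from clean images in [0, 1].
    clean: (batch, 3, H, W). Two example degradations are shown."""
    noisy = (clean + sigma * torch.randn_like(clean)).clamp(0.0, 1.0)  # denoising input
    low_res = F.interpolate(clean, scale_factor=1.0 / scale,
                            mode="bicubic", align_corners=False)       # super-resolution input
    return [(noisy, clean), (low_res, clean)]

# Example with random tensors standing in for ImageNet crops:
pairs = make_corrupted_pairs(torch.rand(4, 3, 64, 64))
```

In IPT, each corruption type is handled by a dedicated head and tail around a shared transformer body, which is what the multi-head, multi-tail phrasing refers to.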
arXiv Detail & Related papers (2020-12-01T09:42:46Z) - Fine-grained Image-to-Image Transformation towards Visual Recognition [102.51124181873101]
We aim to transform an image of a fine-grained category to synthesize new images that preserve the identity of the input image.
We adopt a model based on generative adversarial networks to disentangle the identity related and unrelated factors of an image.
Experiments on the CompCars and Multi-PIE datasets demonstrate that our model preserves the identity of the generated images much better than the state-of-the-art image-to-image transformation models.
arXiv Detail & Related papers (2020-01-12T05:26:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.