Learning Expressive Prompting With Residuals for Vision Transformers
- URL: http://arxiv.org/abs/2303.15591v1
- Date: Mon, 27 Mar 2023 20:47:01 GMT
- Title: Learning Expressive Prompting With Residuals for Vision Transformers
- Authors: Rajshekhar Das, Yonatan Dukler, Avinash Ravichandran, Ashwin
Swaminathan
- Abstract summary: We present Expressive Prompts with Residuals (EXPRES), which modifies the prompt learning paradigm specifically for effective adaptation of vision transformers (ViT).
We apply EXPRES to image classification, few-shot learning, and semantic segmentation, and show that our method achieves state-of-the-art prompt tuning on 3/3 categories of the VTAB benchmark.
- Score: 11.342913284654706
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Prompt learning is an efficient approach to adapting transformers by
inserting a learnable set of parameters into the input and intermediate
representations of a pre-trained model. In this work, we present Expressive
Prompts with Residuals (EXPRES), which modifies the prompt learning paradigm
specifically for effective adaptation of vision transformers (ViT). Our method
constructs downstream representations via learnable "output" tokens that are
akin to the learned class tokens of the ViT. Further, for better steering of
the downstream representation processed by the frozen transformer, we introduce
residual learnable tokens that are added to the outputs of various
computations. We apply EXPRES to image classification, few-shot learning, and
semantic segmentation, and show that our method achieves state-of-the-art
prompt tuning on 3/3 categories of the VTAB benchmark. In addition to strong
performance, we observe that our approach is an order of magnitude more
prompt-efficient than existing visual prompting baselines. We analytically show
the computational benefits of our approach over weight-space adaptation
techniques such as finetuning. Lastly, we systematically corroborate the
architectural design of our method via a series of ablation experiments.
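The abstract gives enough structural detail for a rough illustration. The PyTorch sketch below shows one plausible reading of the idea: learnable "output" tokens appended to the patch sequence of a frozen ViT, plus learnable residual tokens added to each frozen block's output at those prompt positions, with only the prompts, residuals, and a linear head being trained. Class names, shapes, the residual placement, and the mean-pooled head are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of residual prompting for a frozen ViT, based only on the
# abstract above. All design details below are assumptions for illustration.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualPromptedBlock(nn.Module):
    """Wraps a frozen transformer block and adds learnable residual tokens
    to the block output at the prompt positions (hypothetical placement)."""

    def __init__(self, block: nn.Module, num_prompts: int, dim: int):
        super().__init__()
        self.block = block
        for p in self.block.parameters():   # the backbone stays frozen
            p.requires_grad_(False)
        self.residual = nn.Parameter(torch.zeros(1, num_prompts, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out = self.block(x)
        # Zero-pad the residuals over the patch positions so that only the
        # trailing prompt ("output") tokens receive a learned correction.
        num_patches = out.shape[1] - self.residual.shape[1]
        res = F.pad(self.residual, (0, 0, num_patches, 0))
        return out + res


class ExpressivePromptedViT(nn.Module):
    """Appends learnable "output" tokens (akin to extra class tokens) to the
    patch sequence and classifies from their final representations."""

    def __init__(self, blocks, dim: int, num_prompts: int, num_classes: int):
        super().__init__()
        self.prompts = nn.Parameter(torch.randn(1, num_prompts, dim) * 0.02)
        self.blocks = nn.ModuleList(
            [ResidualPromptedBlock(b, num_prompts, dim) for b in blocks]
        )
        self.head = nn.Linear(dim, num_classes)

    def forward(self, patch_tokens: torch.Tensor) -> torch.Tensor:
        b = patch_tokens.shape[0]
        x = torch.cat([patch_tokens, self.prompts.expand(b, -1, -1)], dim=1)
        for blk in self.blocks:
            x = blk(x)
        # Pool the final "output" tokens; only prompts, residuals, and this
        # head are trainable, which keeps the adaptation lightweight.
        return self.head(x[:, -self.prompts.shape[1]:, :].mean(dim=1))


# Toy usage with generic encoder layers standing in for pre-trained ViT blocks.
dim, num_prompts, num_classes = 64, 4, 10
blocks = [nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
          for _ in range(2)]
model = ExpressivePromptedViT(blocks, dim, num_prompts, num_classes)
logits = model(torch.randn(8, 16, dim))   # 8 images, 16 patch tokens each
print(logits.shape)                       # torch.Size([8, 10])
```

Because the backbone parameters stay frozen, the number of trainable parameters in this sketch scales with the number of prompt tokens rather than with the backbone size, which is consistent with the prompt-efficiency claim in the abstract.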
Related papers
- DETAIL: Task DEmonsTration Attribution for Interpretable In-context Learning [75.68193159293425]
In-context learning (ICL) allows transformer-based language models to learn a specific task with a few "task demonstrations" without updating their parameters.
We propose an influence function-based attribution technique, DETAIL, that addresses the specific characteristics of ICL.
We experimentally prove the wide applicability of DETAIL by showing our attribution scores obtained on white-box models are transferable to black-box models in improving model performance.
arXiv Detail & Related papers (2024-05-22T15:52:52Z) - Keypoint Action Tokens Enable In-Context Imitation Learning in Robotics [11.88216611522207]
We show that off-the-shelf text-based Transformers, with no additional training, can perform few-shot in-context visual imitation learning.
We achieve this by transforming visual observations into sequences of tokens that a text-pretrained Transformer can ingest and generate.
We show that, despite being trained only on language, these Transformers excel at translating tokenised visual keypoint observations into action trajectories.
arXiv Detail & Related papers (2024-03-28T17:04:00Z) - BUS: Efficient and Effective Vision-language Pre-training with Bottom-Up
Patch Summarization [89.52943129132217]
We propose a Bottom-Up Patch Summarization approach named BUS to learn a concise summary of lengthy visual token sequences efficiently.
We incorporate a Text-Semantics-Aware Patch Selector (TSPS) into the ViT backbone to perform a coarse-grained visual token extraction.
This bottom-up collaboration enables our BUS to yield high training efficiency while maintaining or even improving effectiveness.
arXiv Detail & Related papers (2023-07-17T14:08:17Z) - Approximated Prompt Tuning for Vision-Language Pre-trained Models [54.326232586461614]
In vision-language pre-trained models, prompt tuning often requires a large number of learnable tokens to bridge the gap between the pre-training and downstream tasks.
We propose a novel Approximated Prompt Tuning (APT) approach towards efficient VL transfer learning.
arXiv Detail & Related papers (2023-06-27T05:43:47Z) - ExpPoint-MAE: Better interpretability and performance for self-supervised point cloud transformers [7.725095281624494]
We evaluate the effectiveness of Masked Autoencoding as a pretraining scheme, and explore Momentum Contrast as an alternative.
We observe that the transformer learns to attend to semantically meaningful regions, indicating that pretraining leads to a better understanding of the underlying geometry.
arXiv Detail & Related papers (2023-06-19T09:38:21Z) - Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction.
VPTM is the first visual prompting method for generative pre-trained visual models, and it achieves consistency between pre-training and downstream visual classification through task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z) - ProFormer: Learning Data-efficient Representations of Body Movement with
Prototype-based Feature Augmentation and Visual Transformers [31.908276711898548]
Methods for data-efficient recognition from body poses increasingly leverage skeleton sequences structured as image-like arrays.
We look at this paradigm from the perspective of transformer networks, for the first time exploring visual transformers as data-efficient encoders of skeleton movement.
In our pipeline, body pose sequences cast as image-like representations are converted into patch embeddings and then passed to a visual transformer backbone optimized with deep metric learning.
arXiv Detail & Related papers (2022-02-23T11:11:54Z) - Few-shot Sequence Learning with Transformers [79.87875859408955]
Few-shot algorithms aim at learning new tasks provided only a handful of training examples.
In this work we investigate few-shot learning in the setting where the data points are sequences of tokens.
We propose an efficient learning algorithm based on Transformers.
arXiv Detail & Related papers (2020-12-17T12:30:38Z) - Improving Transformation Invariance in Contrastive Representation
Learning [31.223892428863238]
We introduce a training objective for contrastive learning that uses a novel regularizer to control how the representation changes under transformation.
Second, we propose a change to how test time representations are generated by introducing a feature averaging approach that combines encodings from multiple transformations of the original input.
Third, we introduce the novel Spirograph dataset to explore our ideas in the context of a differentiable generative process with multiple downstream tasks.
arXiv Detail & Related papers (2020-10-19T13:49:29Z) - Addressing Some Limitations of Transformers with Feedback Memory [51.94640029417114]
Transformers have been successfully applied to sequential, auto-regressive tasks despite being feedforward networks.
We propose the Feedback Transformer architecture that exposes all previous representations to all future representations.
We demonstrate on a variety of benchmarks in language modeling, machine translation, and reinforcement learning that the increased representation capacity can create small, shallow models with much stronger performance than comparable Transformers.
arXiv Detail & Related papers (2020-02-21T16:37:57Z)