On the Surprising Effectiveness of Attention Transfer for Vision Transformers
- URL: http://arxiv.org/abs/2411.09702v1
- Date: Thu, 14 Nov 2024 18:59:40 GMT
- Title: On the Surprising Effectiveness of Attention Transfer for Vision Transformers
- Authors: Alexander C. Li, Yuandong Tian, Beidi Chen, Deepak Pathak, Xinlei Chen
- Abstract summary: Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations.
We investigate this question and find that the features and representations learned during pre-training are not essential.
- Score: 118.83572030360843
- License:
- Abstract: Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, where only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
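To make the two variants described in the abstract concrete, here is a minimal PyTorch-style sketch of attention copy and attention distillation. The tensor shapes, the KL-divergence form of the distillation loss, and all function names are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: assumes teacher and student both expose per-layer
# post-softmax attention maps of shape [batch, heads, tokens, tokens].
import torch
import torch.nn.functional as F

def attention_distillation_loss(student_attns, teacher_attns, eps=1e-8):
    """KL(teacher || student), averaged over layers; a plausible loss form,
    not necessarily the one used in the paper."""
    loss = 0.0
    for s, t in zip(student_attns, teacher_attns):
        # s, t: [batch, heads, tokens, tokens]; rows sum to 1 (post-softmax).
        loss = loss + F.kl_div((s + eps).log(), t, reduction="batchmean")
    return loss / len(student_attns)

def attend_with_teacher(teacher_attn, student_values):
    """'Attention copy' variant: the teacher's attention map routes the
    student's own values, so the student learns its features from scratch
    while the flow of information between tokens is fixed by the teacher."""
    # teacher_attn: [batch, heads, tokens, tokens]
    # student_values: [batch, heads, tokens, head_dim]
    return teacher_attn @ student_values
```

In training, a distillation term of this kind would typically be added to the usual supervised loss with a weighting coefficient; the ensembling mentioned in the abstract combines the student's predictions with those of a fine-tuned teacher.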
Related papers
- Learning Visual Prompts for Guiding the Attention of Vision Transformers [29.023448448047805]
This work proposes learning visual prompts that guide the attention of vision transformers.
The learned visual prompt, added to any input image, redirects the attention of the pre-trained vision transformer to the prompt's spatial location in the image.
arXiv Detail & Related papers (2024-06-05T14:13:38Z) - TOAST: Transfer Learning via Attention Steering [77.83191769502763]
Current transfer learning methods often fail to focus on task-relevant features.
We introduce Top-Down Attention Steering (TOAST), a novel transfer learning algorithm that steers the attention to task-specific features.
TOAST substantially improves performance across a range of fine-grained visual classification datasets.
arXiv Detail & Related papers (2023-05-24T20:03:04Z) - Preserving Locality in Vision Transformers for Class Incremental Learning [54.696808348218426]
We find that when the ViT is incrementally trained, the attention layers gradually lose concentration on local features.
We devise a Locality-Preserved Attention layer to emphasize the importance of local features.
The improved model gets consistently better performance on CIFAR100 and ImageNet100.
arXiv Detail & Related papers (2023-04-14T07:42:21Z) - Rethinking Visual Prompt Learning as Masked Visual Token Modeling [106.71983630652323]
We propose Visual Prompt learning as masked visual Token Modeling (VPTM) to transform the downstream visual classification into the pre-trained masked visual token prediction.
VPTM is the first visual prompt method on the generative pre-trained visual model, which achieves consistency between pre-training and downstream visual classification by task reformulation.
arXiv Detail & Related papers (2023-03-09T02:43:10Z) - Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks that use ViTs, and it is the first to outperform CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z) - Improving Transferability of Representations via Augmentation-Aware Self-Supervision [117.15012005163322]
AugSelf is an auxiliary self-supervised loss that learns to predict the difference in augmentation parameters between two randomly augmented samples.
Our intuition is that AugSelf encourages the model to preserve augmentation-aware information in its learned representations, which could benefit their transferability (a rough sketch of such a loss appears after this list).
AugSelf can easily be incorporated into recent state-of-the-art representation learning methods with a negligible additional training cost.
arXiv Detail & Related papers (2021-11-18T10:43:50Z) - Self-supervised visual feature learning with curriculum [0.24366811507669126]
This paper takes inspiration from curriculum learning to progressively remove low-level signals.
It shows that doing so significantly increases the convergence speed of the downstream task.
arXiv Detail & Related papers (2020-01-16T03:28:58Z)
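As referenced in the AugSelf entry above, an auxiliary loss of that flavor can be sketched as follows. The MLP head, the use of a small vector of augmentation parameters (e.g., crop coordinates or color-jitter strengths), and the MSE objective are assumptions for illustration, not the paper's exact design.

```python
# Rough sketch of an AugSelf-style auxiliary loss: from the features of two
# augmented views, predict the difference of their augmentation parameters.
import torch
import torch.nn as nn

class AugSelfHead(nn.Module):
    def __init__(self, feat_dim, aug_dim):
        super().__init__()
        # Small MLP over the concatenated view features (illustrative choice).
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
            nn.Linear(feat_dim, aug_dim),
        )

    def forward(self, feat1, feat2, aug_params1, aug_params2):
        # feat1, feat2: [batch, feat_dim]; aug_params*: [batch, aug_dim]
        pred = self.mlp(torch.cat([feat1, feat2], dim=-1))
        target = aug_params1 - aug_params2
        return nn.functional.mse_loss(pred, target)
```

Such a term would be added to the main self-supervised objective during pre-training with a small weight.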