Preserving Locality in Vision Transformers for Class Incremental
Learning
- URL: http://arxiv.org/abs/2304.06971v1
- Date: Fri, 14 Apr 2023 07:42:21 GMT
- Title: Preserving Locality in Vision Transformers for Class Incremental
Learning
- Authors: Bowen Zheng, Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan
- Abstract summary: We find that when the ViT is incrementally trained, the attention layers gradually lose concentration on local features.
We devise a Locality-Preserved Attention layer to emphasize the importance of local features.
The improved model gets consistently better performance on CIFAR100 and ImageNet100.
- Score: 54.696808348218426
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Learning new classes without forgetting is crucial for
classification models in real-world applications. Vision Transformers (ViT)
have recently achieved remarkable performance in Class Incremental Learning (CIL). Previous
works mainly focus on block design and model expansion for ViTs. However, in
this paper, we find that when the ViT is incrementally trained, the attention
layers gradually lose concentration on local features. We call this interesting
phenomenon \emph{Locality Degradation} in ViTs for CIL. Since the low-level
local information is crucial to the transferability of the representation, it
is beneficial to preserve the locality in attention layers. In this paper, we
encourage the model to preserve more local information as the training
procedure goes on and devise a Locality-Preserved Attention (LPA) layer to
emphasize the importance of local features. Specifically, we incorporate the
local information directly into the vanilla attention and control the initial
gradients of the vanilla attention by weighting it with a small initial value.
Extensive experiments show that the representations facilitated by LPA capture
more low-level general information which is easier to transfer to follow-up
tasks. The improved model gets consistently better performance on CIFAR100 and
ImageNet100.
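As a rough illustration of the mechanism described in the abstract, the sketch below adds a locality branch to vanilla multi-head attention and down-weights the vanilla branch with a small learnable scalar so that its initial gradients are damped. The depthwise 1-D convolution used as the locality branch, the 0.1 initial weight, and all class and variable names are illustrative assumptions, not the authors' exact LPA design.
```python
# Hypothetical sketch of a Locality-Preserved-Attention-style layer (PyTorch).
# The local branch (a depthwise Conv1d over the token sequence) and the 0.1
# initial weight are assumptions for illustration only.
import torch
import torch.nn as nn


class LocalityPreservedAttentionSketch(nn.Module):
    def __init__(self, dim: int, num_heads: int = 8, init_weight: float = 0.1):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Assumed locality branch: a depthwise convolution that mixes each
        # token only with its immediate neighbours in the sequence.
        self.local = nn.Conv1d(dim, dim, kernel_size=3, padding=1, groups=dim)
        # Small learnable weight on the vanilla attention branch; starting
        # near zero damps its initial gradients, as the abstract describes.
        self.attn_weight = nn.Parameter(torch.tensor(init_weight))

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, tokens, dim)
        b, n, d = x.shape
        qkv = self.qkv(x).reshape(b, n, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                # each: (b, heads, n, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale
        global_out = (attn.softmax(dim=-1) @ v).transpose(1, 2).reshape(b, n, d)
        local_out = self.local(x.transpose(1, 2)).transpose(1, 2)
        # Local information is added directly to the (down-weighted) vanilla attention.
        return self.proj(local_out + self.attn_weight * global_out)


x = torch.randn(2, 197, 384)                                # e.g. a ViT with 197 tokens
layer = LocalityPreservedAttentionSketch(dim=384, num_heads=6)
print(layer(x).shape)                                       # torch.Size([2, 197, 384])
```
Because attn_weight starts small, early updates are dominated by the local term, which matches the stated goal of preserving low-level local information as training proceeds.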
Related papers
- Locality Alignment Improves Vision-Language Models [55.275235524659905]
Vision language models (VLMs) have seen growing adoption in recent years, but many still make basic spatial reasoning errors.
We propose a new efficient post-training stage for ViTs called locality alignment.
We show that locality-aligned backbones improve performance across a range of benchmarks.
arXiv Detail & Related papers (2024-10-14T21:01:01Z)
- GTA: Guided Transfer of Spatial Attention from Object-Centric Representations [3.187381965457262]
We propose a novel and simple regularization method for ViTs called Guided Transfer of spatial Attention (GTA).
Our experimental results show that GTA consistently improves accuracy across five benchmark datasets, especially when the amount of training data is small.
arXiv Detail & Related papers (2024-01-05T06:24:41Z)
- Rethinking Local Perception in Lightweight Vision Transformer [63.65115590184169]
This paper introduces CloFormer, a lightweight vision transformer that leverages context-aware local enhancement.
CloFormer explores the relationship between globally shared weights often used in vanilla convolutional operators and token-specific context-aware weights appearing in attention.
The proposed AttnConv uses shared weights to aggregate local information and deploys carefully designed context-aware weights to enhance local features.
arXiv Detail & Related papers (2023-03-31T05:25:32Z)
- ViTOL: Vision Transformer for Weakly Supervised Object Localization [0.735996217853436]
Weakly supervised object localization (WSOL) aims at predicting object locations in an image using only image-level category labels.
Common challenges that image classification models encounter when localizing objects are: (a) they tend to focus on the most discriminative features in an image, which confines the localization map to a very small region, and (b) the localization maps are class-agnostic, so the models highlight objects of multiple classes in the same image.
arXiv Detail & Related papers (2022-04-14T06:16:34Z)
- Self-Promoted Supervision for Few-Shot Transformer [178.52948452353834]
Self-promoted sUpervisioN (SUN) is a few-shot learning framework for vision transformers (ViTs).
SUN pretrains the ViT on the few-shot learning dataset and then uses it to generate individual location-specific supervision for guiding each patch token.
Experiments show that SUN with ViTs significantly surpasses other few-shot learning frameworks for ViTs and is the first to achieve higher performance than CNN-based state-of-the-art methods.
arXiv Detail & Related papers (2022-03-14T12:53:27Z)
- Boosting Crowd Counting via Multifaceted Attention [109.89185492364386]
Large-scale variations often exist within crowd images.
Neither the fixed-size convolution kernels of CNNs nor the fixed-size attention of recent vision transformers can handle this kind of variation.
We propose a Multifaceted Attention Network (MAN) to improve transformer models in local spatial relation encoding.
arXiv Detail & Related papers (2022-03-05T01:36:43Z)
- Refiner: Refining Self-attention for Vision Transformers [85.80887884154427]
Vision Transformers (ViTs) have shown competitive accuracy in image classification tasks compared with CNNs.
We introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs.
Refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are aggregated locally with learnable kernels and then globally aggregated with self-attention (see the sketch below, after this list).
arXiv Detail & Related papers (2021-06-07T15:24:54Z)
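The Refiner entry above is concrete enough to sketch: convolve the multi-head attention maps so each attention score is augmented by its local neighbourhood. The fragment below is a minimal sketch under stated assumptions (heads treated as convolution channels, a 3x3 kernel, refinement applied before the softmax); it is not the paper's exact architecture.
```python
# Minimal sketch of convolving attention maps, assuming heads act as channels
# and refinement happens before the softmax; details are illustrative only.
import torch
import torch.nn as nn


def refine_attention(attn_logits: torch.Tensor, conv: nn.Conv2d) -> torch.Tensor:
    """attn_logits: (batch, heads, tokens, tokens) pre-softmax attention scores."""
    # Convolving the token-token map lets each score borrow from its local
    # neighbourhood (learnable local aggregation) before global normalisation.
    return conv(attn_logits).softmax(dim=-1)


heads = 8
conv = nn.Conv2d(heads, heads, kernel_size=3, padding=1)   # learnable local kernel
attn_logits = torch.randn(2, heads, 197, 197)              # e.g. a ViT with 197 tokens
attn = refine_attention(attn_logits, conv)                 # rows still sum to 1
```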