BViT: Broad Attention based Vision Transformer
- URL: http://arxiv.org/abs/2202.06268v2
- Date: Fri, 9 Jun 2023 06:08:37 GMT
- Title: BViT: Broad Attention based Vision Transformer
- Authors: Nannan Li, Yaran Chen, Weifan Li, Zixiang Ding, Dongbin Zhao
- Abstract summary: We propose broad attention, which improves performance by incorporating the attention relationships of different layers in a vision transformer; the resulting model is called BViT.
Experiments on image classification tasks demonstrate that BViT delivers state-of-the-art top-1 accuracy of 74.8%/81.6% on ImageNet with 5M/22M parameters.
- Score: 13.994231768182907
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent works have demonstrated that transformers can achieve promising
performance in computer vision by exploiting the relationships among image
patches with self-attention. However, they only consider the attention in a
single feature layer and ignore the complementarity of attention at different
levels. In this paper, we propose broad attention, which improves performance
by incorporating the attention relationships of different layers in a vision
transformer; the resulting model is called BViT. Broad attention is implemented
by broad connection and parameter-free attention. The broad connection of each
transformer layer promotes the transmission and integration of information in
BViT. Without introducing additional trainable parameters, parameter-free
attention jointly focuses on the attention information already available in
different layers to extract useful information and build relationships across
layers. Experiments on image classification tasks demonstrate that BViT
delivers state-of-the-art top-1 accuracy of 74.8%/81.6% on ImageNet with
5M/22M parameters. Moreover, we transfer BViT to downstream object recognition
benchmarks and achieve 98.9% and 89.9% on CIFAR10 and CIFAR100 respectively,
exceeding ViT with fewer parameters. As a generalization test, broad attention
in Swin Transformer and T2T-ViT also brings an improvement of more than 1%. In
summary, broad attention is a promising way to improve the performance of
attention-based models. Code and pre-trained models are available at
https://github.com/DRL-CASIA/Broad_ViT.
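To make the two components named in the abstract concrete, below is a minimal PyTorch sketch of how a parameter-free attention over per-layer attention maps and a broad connection could look. It is an illustration under assumptions, not the authors' released implementation: the helper names `parameter_free_attention` and `broad_connection`, the averaging aggregation, and the concatenation-based connection are hypothetical choices.
```python
# Minimal sketch of a broad-attention-style aggregation in PyTorch.
# NOT the released BViT code (https://github.com/DRL-CASIA/Broad_ViT);
# the function names and the simple averaging choice are assumptions.
import torch


def parameter_free_attention(attn_maps, values):
    """Reuse attention maps already computed by each transformer layer.

    attn_maps: list of tensors, each (B, heads, N, N), one per layer
    values:    list of tensors, each (B, heads, N, head_dim), one per layer
    Returns a (B, N, heads * head_dim) feature with no new trainable weights.
    """
    # Jointly attend with every layer's attention (here: a plain average,
    # one possible parameter-free aggregation).
    attn = torch.stack(attn_maps, dim=0).mean(dim=0)   # (B, heads, N, N)
    vals = torch.stack(values, dim=0).mean(dim=0)      # (B, heads, N, d)
    out = attn @ vals                                  # (B, heads, N, d)
    b, h, n, d = out.shape
    return out.transpose(1, 2).reshape(b, n, h * d)


def broad_connection(layer_features, broad_feature):
    """Concatenate per-layer token features with the broad-attention feature."""
    return torch.cat(layer_features + [broad_feature], dim=-1)
```
The property this sketch preserves is that the cross-layer aggregation only reuses attention maps the backbone already produces, so it introduces no additional trainable parameters.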
Related papers
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z) - SpectFormer: Frequency and Attention is what you need in a Vision Transformer [28.01996628113975]
Vision transformers have been applied successfully for image recognition tasks.
We hypothesize that both spectral and multi-headed attention play a major role.
We propose the novel SpectFormer architecture for transformers that combines spectral and multi-headed attention layers.
arXiv Detail & Related papers (2023-04-13T12:27:17Z) - A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related work ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z) - Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Visualizing and Understanding Patch Interactions in Vision Transformer [96.70401478061076]
Vision Transformer (ViT) has become a leading tool in various computer vision tasks.
We propose a novel explainable visualization approach to analyze and interpret the crucial attention interactions among patches for vision transformer.
arXiv Detail & Related papers (2022-03-11T13:48:11Z) - ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% top-1 accuracy on the ImageNet validation set and the best 91.2% top-1 accuracy on the ImageNet Real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z) - Vision Transformers are Robust Learners [65.91359312429147]
We study the robustness of the Vision Transformer (ViT) against common corruptions and perturbations, distribution shifts, and natural adversarial examples.
We present analyses that provide both quantitative and qualitative indications to explain why ViTs are indeed more robust learners.
arXiv Detail & Related papers (2021-05-17T02:39:22Z)