SpectFormer: Frequency and Attention is what you need in a Vision Transformer
- URL: http://arxiv.org/abs/2304.06446v2
- Date: Fri, 14 Apr 2023 22:20:46 GMT
- Title: SpectFormer: Frequency and Attention is what you need in a Vision Transformer
- Authors: Badri N. Patro, Vinay P. Namboodiri, Vijay Srinivas Agneeswaran
- Abstract summary: Vision transformers have been applied successfully for image recognition tasks.
We hypothesize that both spectral and multi-headed attention play a major role.
We propose the novel SpectFormer architecture for transformers that combines spectral and multi-headed attention layers.
- Score: 28.01996628113975
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision transformers have been applied successfully for image recognition
tasks. These have been either based on multi-headed self-attention (ViT
\cite{dosovitskiy2020image}, DeiT \cite{touvron2021training}), similar to the
original work in textual models, or more recently based on spectral layers
(FNet \cite{lee2021fnet}, GFNet \cite{rao2021global},
AFNO \cite{guibas2021efficient}). We hypothesize that both spectral and
multi-headed attention layers play a major role. We investigate this hypothesis
in this work and observe that combining spectral and multi-headed attention
layers indeed provides a better transformer architecture. We thus propose
the novel SpectFormer architecture for transformers that combines spectral and
multi-headed attention layers. We believe that the resulting representation
allows the transformer to capture the features appropriately and yields
improved performance over other transformer representations. For
instance, it improves the top-1 accuracy by 2\% on ImageNet compared to both
GFNet-H and LiT. SpectFormer-S reaches 84.25\% top-1 accuracy on ImageNet-1K
(state of the art for the small version). Further, SpectFormer-L achieves 85.7\%,
which is the state of the art among comparable base versions of
transformers. We further verify that we obtain reasonable results in other
scenarios, such as transfer learning on standard datasets including CIFAR-10,
CIFAR-100, Oxford-IIIT-flower, and Stanford Cars. We then investigate
its use in downstream tasks such as object detection and instance segmentation
on the MS-COCO dataset and observe that SpectFormer shows consistent
performance comparable to the best backbones and can be further
optimized and improved. Hence, we believe that combined spectral and attention
layers are what vision transformers need.
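To make the described architecture concrete, below is a minimal sketch of a SpectFormer-style block stack in PyTorch: the first blocks mix tokens with a GFNet-style learnable spectral filter (FFT, elementwise filter, inverse FFT), and the remaining blocks use standard multi-headed self-attention. The class names, the 4-spectral/8-attention split, the spectral-first ordering, and all hyperparameters are illustrative assumptions for this sketch, not the authors' released implementation.

```python
# A minimal sketch of a SpectFormer-style block stack, assuming PyTorch.
# The spectral layer follows a GFNet-style learnable global filter
# (2D FFT -> learnable complex filter -> inverse FFT); names, the
# spectral/attention split, and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class SpectralGatingLayer(nn.Module):
    """Token mixing in the frequency domain (assumed GFNet-style filter)."""

    def __init__(self, h, w, dim):
        super().__init__()
        # Learnable complex filter over the rFFT2 grid: (h, w//2 + 1, dim, 2)
        self.filter = nn.Parameter(torch.randn(h, w // 2 + 1, dim, 2) * 0.02)
        self.h, self.w = h, w

    def forward(self, x):                      # x: (B, N, dim) with N = h * w
        B, N, C = x.shape
        x = x.reshape(B, self.h, self.w, C)
        x_freq = torch.fft.rfft2(x, dim=(1, 2), norm="ortho")
        x_freq = x_freq * torch.view_as_complex(self.filter)
        x = torch.fft.irfft2(x_freq, s=(self.h, self.w), dim=(1, 2), norm="ortho")
        return x.reshape(B, N, C)


class SelfAttention(nn.Module):
    """Thin wrapper so nn.MultiheadAttention fits the token-mixer interface."""

    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):
        return self.attn(x, x, x, need_weights=False)[0]


class MixerBlock(nn.Module):
    """Pre-norm block: (spectral | attention) token mixer followed by an MLP."""

    def __init__(self, dim, mixer):
        super().__init__()
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mixer = mixer
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        x = x + self.mixer(self.norm1(x))
        return x + self.mlp(self.norm2(x))


def spectformer_style_stack(dim=384, h=14, w=14, n_spectral=4, n_attn=8):
    """Spectral blocks first, multi-headed attention blocks after (assumed split)."""
    blocks = [MixerBlock(dim, SpectralGatingLayer(h, w, dim)) for _ in range(n_spectral)]
    blocks += [MixerBlock(dim, SelfAttention(dim)) for _ in range(n_attn)]
    return nn.Sequential(*blocks)


tokens = torch.randn(2, 14 * 14, 384)           # (batch, patch tokens, embed dim)
print(spectformer_style_stack()(tokens).shape)  # torch.Size([2, 196, 384])
```

The spectral-first, attention-later ordering used here is this sketch's assumption about how the two layer types are combined; the abstract only states that both kinds of layers are used together.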
Related papers
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- Multimodal Fusion Transformer for Remote Sensing Image Classification [35.57881383390397]
Vision transformers (ViTs) have been trending in image classification tasks due to their promising performance when compared to convolutional neural networks (CNNs).
To achieve satisfactory performance, close to that of CNNs, transformers need fewer parameters.
We introduce a new multimodal fusion transformer (MFT) network which comprises a multihead cross patch attention (mCrossPA) for HSI land-cover classification.
arXiv Detail & Related papers (2022-03-31T11:18:41Z)
- Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning [50.95116994162883]
Vision transformers have been thought of as a promising alternative to convolutional neural networks for visual recognition.
This paper presents hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling.
HCTransformers surpass the DINO baseline by a large margin of 9.7% 5-way 1-shot accuracy and 9.17% 5-way 5-shot accuracy on miniImageNet.
arXiv Detail & Related papers (2022-03-17T03:49:58Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic IB from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain the state-of-the-art classification performance, i.e., 88.5% Top-1 classification accuracy on ImageNet validation set and the best 91.2% Top-1 accuracy on ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- BViT: Broad Attention based Vision Transformer [13.994231768182907]
We propose broad attention, which improves performance by incorporating the attention relationships of different layers of a vision transformer; the resulting model is called BViT.
Experiments on image classification tasks demonstrate that BViT delivers state-of-the-art top-1 accuracy of 74.8%/81.6% on ImageNet with 5M/22M parameters.
arXiv Detail & Related papers (2022-02-13T09:23:29Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder that drops the full softmax-weighted attention and keeps only the query-key similarity (see the toy sketch after this list).
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- Going deeper with Image Transformers [102.61950708108022]
We build and optimize deeper transformer networks for image classification.
We make two transformer architecture changes that significantly improve the accuracy of deep transformers.
Our best model establishes the new state of the art on ImageNet with Reassessed labels and ImageNet-V2 / match frequency.
arXiv Detail & Related papers (2021-03-31T17:37:32Z)
- CvT: Introducing Convolutions to Vision Transformers [44.74550305869089]
Convolutional vision Transformer (CvT) improves Vision Transformer (ViT) in performance and efficiency.
The new architecture introduces convolutions into ViT to yield the best of both designs.
arXiv Detail & Related papers (2021-03-29T17:58:22Z)
- Vision Transformers for Dense Prediction [77.34726150561087]
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks.
Our experiments show that this architecture yields substantial improvements on dense prediction tasks.
arXiv Detail & Related papers (2021-03-24T18:01:17Z)
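As a toy illustration of the "query-key similarity without softmax weighting" idea summarized in the image-matching entry above, the following sketch (assuming PyTorch; the tensor shapes, pooling, and final scoring rule are illustrative assumptions, not the paper's decoder) computes raw dot-product similarities between the token features of two images:

```python
# Toy sketch: raw query-key similarity between two images' token features,
# with no softmax weighting applied. Shapes and the scoring rule are
# illustrative assumptions, not the paper's actual decoder.
import torch


def similarity_scores(query_feats, gallery_feats):
    """Dot-product similarities between token features of two images.

    query_feats:   (N_q, D) tokens from the query image
    gallery_feats: (N_g, D) tokens from the gallery image
    Returns an (N_q, N_g) similarity map (no softmax applied).
    """
    return query_feats @ gallery_feats.t()


# Illustrative matching score: best gallery match per query token, averaged.
q, g = torch.randn(196, 384), torch.randn(196, 384)
score = similarity_scores(q, g).max(dim=1).values.mean()
print(score)
```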
This list is automatically generated from the titles and abstracts of the papers in this site.