A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer
for Fine-grained Visual Recognition
- URL: http://arxiv.org/abs/2110.01240v1
- Date: Mon, 4 Oct 2021 08:11:21 GMT
- Title: A free lunch from ViT: Adaptive Attention Multi-scale Fusion Transformer
for Fine-grained Visual Recognition
- Authors: Yuan Zhang, Jian Cao, Ling Zhang, Xiangcheng Liu, Zhiyi Wang, Feng
Ling, Weiqian Chen
- Abstract summary: Learning subtle representations of object parts plays a vital role in the fine-grained visual recognition (FGVR) field.
With the fixed patch size in ViT, the class token in deep layers focuses on the global receptive field and cannot generate multi-granularity features for FGVR.
We propose a novel method named Adaptive attention multi-scale Fusion Transformer (AFTrans) to capture region attention without box annotations.
- Score: 10.045205311757028
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: Learning subtle representations of object parts plays a vital role in the
fine-grained visual recognition (FGVR) field. The vision transformer (ViT)
achieves promising results on computer vision tasks due to its attention
mechanism. Nonetheless, with the fixed patch size in ViT, the class token in
deep layers focuses on the global receptive field and cannot generate
multi-granularity features for FGVR. To capture region attention without box
annotations and compensate for ViT's shortcomings in FGVR, we propose a novel
method named Adaptive attention multi-scale Fusion Transformer (AFTrans). The
Selective Attention Collection Module (SACM) in our approach leverages
attention weights in ViT and filters them adaptively to correspond with the
relative importance of input patches. The multi-scale (global and local)
pipeline is supervised by our weight-sharing encoder and can be easily trained
end-to-end. Comprehensive experiments demonstrate that AFTrans can achieve SOTA
performance on three published fine-grained benchmarks: CUB-200-2011, Stanford
Dogs and iNat2017.
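The core mechanism described in the abstract (filtering ViT attention weights to rank input patches, then re-encoding the most important region at a second scale) can be illustrated with a minimal PyTorch-style sketch. This is not the authors' implementation: the head/layer averaging rule, the keep_ratio parameter, and the bounding-box crop are simplifying assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def select_salient_region(attn_maps, image, patch_size=16, keep_ratio=0.3):
    """Illustrative sketch of attention-guided local-region selection.

    attn_maps: list of per-layer ViT attention tensors, each of shape
               (B, heads, 1 + N, 1 + N), with the class token at index 0.
    image:     input images of shape (B, 3, H, W), H and W divisible by patch_size.
    Returns a cropped-and-resized "local" view covering the patches the
    class token attends to most strongly.
    """
    B, _, H, W = image.shape
    grid = W // patch_size  # patches per row (row-major patch order assumed)

    # Average class-token-to-patch attention over heads and layers; a simple
    # stand-in for the adaptive filtering the abstract attributes to SACM.
    cls_attn = torch.stack(
        [a.mean(dim=1)[:, 0, 1:] for a in attn_maps], dim=0
    ).mean(dim=0)                                        # (B, N)

    k = max(1, int(keep_ratio * cls_attn.shape[1]))
    top_idx = cls_attn.topk(k, dim=1).indices            # k most-attended patches per image

    local_views = []
    for b in range(B):
        rows, cols = top_idx[b] // grid, top_idx[b] % grid
        y0, y1 = rows.min() * patch_size, (rows.max() + 1) * patch_size
        x0, x1 = cols.min() * patch_size, (cols.max() + 1) * patch_size
        crop = image[b:b + 1, :, y0:y1, x0:x1]
        # Upsample the local region back to full resolution so the same
        # weight-shared encoder can consume both global and local views.
        local_views.append(F.interpolate(crop, size=(H, W), mode="bilinear",
                                         align_corners=False))
    return torch.cat(local_views, dim=0)
```

The returned local view would then be passed through the same encoder as the full image, which corresponds to the weight-sharing arrangement the abstract describes.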
Related papers
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z)
- A Close Look at Spatial Modeling: From Attention to Convolution [70.5571582194057]
Vision Transformers have shown great promise recently for many vision tasks due to the insightful architecture design and attention mechanism.
We generalize the self-attention formulation to abstract a query-irrelevant global context directly and integrate the global context into convolutions.
With less than 14M parameters, our FCViT-S12 outperforms the related ResT-Lite by 3.7% top-1 accuracy on ImageNet-1K.
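As a rough illustration of what abstracting a query-irrelevant global context and integrating it into convolutions can look like, the hypothetical module below pools the feature map into a single context vector with attention weights that do not depend on any query, then adds that shared context before a depth-wise convolution. All names and layer sizes are invented for this sketch and are not FCViT's code.

```python
import torch
import torch.nn as nn

class GlobalContextConvSketch(nn.Module):
    """Toy module: a query-irrelevant global context folded into a convolution."""
    def __init__(self, dim):
        super().__init__()
        self.pool_logits = nn.Conv2d(dim, 1, kernel_size=1)   # scores for attention pooling
        self.proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)

    def forward(self, x):
        # x: (B, C, H, W)
        B, C, H, W = x.shape
        # One attention-pooled vector shared by all positions (no per-query attention).
        w = self.pool_logits(x).flatten(2).softmax(dim=-1)     # (B, 1, H*W)
        context = torch.bmm(x.flatten(2), w.transpose(1, 2))   # (B, C, 1)
        context = self.proj(context.view(B, C, 1, 1))          # shared global context
        return self.conv(x + context)                          # inject context into the convolution
```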
arXiv Detail & Related papers (2022-12-23T19:13:43Z)
- ViT-FOD: A Vision Transformer based Fine-grained Object Discriminator [21.351034332423374]
We propose a novel ViT-based fine-grained object discriminator for Fine-Grained Visual Classification (FGVC) tasks.
Besides a ViT backbone, it introduces three novel components, i.e., Attention Patch Combination (APC), Critical Regions Filter (CRF) and Complementary Tokens Integration (CTI).
We conduct comprehensive experiments on widely used datasets and the results demonstrate that ViT-FOD is able to achieve state-of-the-art performance.
arXiv Detail & Related papers (2022-03-24T02:34:57Z)
- Coarse-to-Fine Vision Transformer [83.45020063642235]
We propose a coarse-to-fine vision transformer (CF-ViT) to relieve computational burden while retaining performance.
Our proposed CF-ViT is motivated by two important observations in modern ViT models.
Our CF-ViT reduces the FLOPs of LV-ViT by 53% while achieving 2.01x throughput.
arXiv Detail & Related papers (2022-03-08T02:57:49Z)
- Vision Transformer with Deformable Attention [29.935891419574602]
A large, sometimes even global, receptive field endows Transformer models with higher representation power than their CNN counterparts.
We propose a novel deformable self-attention module, where the positions of key and value pairs in self-attention are selected in a data-dependent way.
We present Deformable Attention Transformer, a general backbone model with deformable attention for both image classification and dense prediction tasks.
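The data-dependent selection of key/value positions can be sketched as offset prediction plus bilinear sampling. The toy module below shows the general mechanism only; the offset network, the 0.1 scaling, and the sampling grid are illustrative assumptions, not the paper's deformable attention module.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSamplingSketch(nn.Module):
    """Toy sketch of data-dependent key/value sampling."""
    def __init__(self, dim):
        super().__init__()
        self.offset_net = nn.Conv2d(dim, 2, kernel_size=3, padding=1)  # predicts (x, y) offsets

    def forward(self, x):
        # x: (B, C, H, W) feature map
        B, C, H, W = x.shape
        # Uniform reference grid in [-1, 1] x [-1, 1]
        ys = torch.linspace(-1, 1, H, device=x.device)
        xs = torch.linspace(-1, 1, W, device=x.device)
        ref = torch.stack(torch.meshgrid(ys, xs, indexing="ij"), dim=-1)  # (H, W, 2) as (y, x)
        ref = ref.flip(-1)                                  # grid_sample expects (x, y) order
        # Data-dependent offsets, scaled with tanh to stay small
        offsets = self.offset_net(x).permute(0, 2, 3, 1)    # (B, H, W, 2)
        grid = ref.unsqueeze(0) + 0.1 * torch.tanh(offsets)
        # Sample the deformed key/value features from the map
        return F.grid_sample(x, grid, mode="bilinear", align_corners=True)  # (B, C, H, W)
```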
arXiv Detail & Related papers (2022-01-03T08:29:01Z)
- RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained Image Recognition [26.090419694326823]
Localization and amplification of region attention is an important factor, which has been explored extensively by convolutional neural network (CNN) based approaches.
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
arXiv Detail & Related papers (2021-07-17T06:22:20Z)
- Feature Fusion Vision Transformer for Fine-Grained Visual Categorization [22.91753200323264]
We propose a novel pure transformer-based framework, Feature Fusion Vision Transformer (FFVT).
We aggregate the important tokens from each transformer layer to compensate for the local, low-level and middle-level information.
We design a novel token selection module called mutual attention weight selection (MAWS) to guide the network effectively and efficiently towards selecting discriminative tokens.
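A hedged sketch of what mutual-attention-style token selection can look like for one layer: class-to-token and token-to-class attention are multiplied to score patch tokens, and the top-scoring tokens are kept. The exact scoring rule used by MAWS may differ; this only illustrates the idea.

```python
import torch

def mutual_attention_token_select(attn, tokens, k=12):
    """Select k patch tokens by mutual class/token attention (illustrative only).

    attn:   (B, heads, 1 + N, 1 + N) attention matrix of one ViT layer,
            class token at index 0.
    tokens: (B, 1 + N, D) token embeddings output by the same layer.
    """
    a = attn.mean(dim=1)              # average over heads: (B, 1+N, 1+N)
    cls_to_tok = a[:, 0, 1:]          # how much the class token attends to each patch
    tok_to_cls = a[:, 1:, 0]          # how much each patch attends back to the class token
    score = cls_to_tok * tok_to_cls   # mutual importance score, (B, N)
    idx = score.topk(k, dim=1).indices                                   # (B, k)
    batch = torch.arange(tokens.shape[0], device=tokens.device).unsqueeze(1)  # (B, 1)
    return tokens[batch, idx + 1]     # (B, k, D) selected patch tokens
```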
arXiv Detail & Related papers (2021-07-06T01:48:43Z)
- Focal Self-attention for Local-Global Interactions in Vision Transformers [90.9169644436091]
We present focal self-attention, a new mechanism that incorporates both fine-grained local and coarse-grained global interactions.
With focal self-attention, we propose a new variant of Vision Transformer models, called Focal Transformer, which achieves superior performance over the state-of-the-art vision Transformers.
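A deliberately simplified sketch of mixing fine-grained and coarse-grained keys: every query token attends to all fine tokens plus a few pooled summary tokens. Real focal self-attention restricts the fine-grained part to a local window around each query, which this illustration omits; the pooling size and head count are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocalAttentionSketch(nn.Module):
    """Toy mix of fine tokens and coarse pooled tokens as the key/value set."""
    def __init__(self, dim, num_heads=4, pool_size=4):
        super().__init__()
        self.pool_size = pool_size
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, grid):
        # x: (B, N, D) patch tokens arranged on a grid x grid map (N == grid * grid)
        B, N, D = x.shape
        fmap = x.transpose(1, 2).reshape(B, D, grid, grid)
        # Coarse-grained summaries: pool the map to pool_size x pool_size tokens
        coarse = F.adaptive_avg_pool2d(fmap, self.pool_size).flatten(2).transpose(1, 2)
        kv = torch.cat([x, coarse], dim=1)        # fine tokens + coarse summaries
        out, _ = self.attn(x, kv, kv)             # queries see both granularities
        return out
```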
arXiv Detail & Related papers (2021-07-01T17:56:09Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic inductive bias (IB) from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
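The parallel convolution-and-attention layout can be sketched as follows; the depth-wise convolution, head count, token grid size, and fusion by addition are illustrative choices, not ViTAE's exact block.

```python
import torch
import torch.nn as nn

class ParallelConvAttnBlock(nn.Module):
    """Simplified transformer block with a convolution branch beside self-attention."""
    def __init__(self, dim, num_heads=4, grid=14):
        super().__init__()
        self.grid = grid                                  # assumes grid x grid patch tokens
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = nn.Conv2d(dim, dim, kernel_size=3, padding=1, groups=dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, x):
        # x: (B, N, D) patch tokens with N == grid * grid (no class token here)
        B, N, D = x.shape
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h)                  # global self-attention branch
        conv_in = h.transpose(1, 2).reshape(B, D, self.grid, self.grid)
        conv_out = self.conv(conv_in).flatten(2).transpose(1, 2)  # local convolution branch
        x = x + attn_out + conv_out                       # fuse the two branches
        return x + self.ffn(self.norm2(x))                # feed the fusion to the FFN
```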
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
- OmniNet: Omnidirectional Representations from Transformers [49.23834374054286]
This paper proposes Omnidirectional Representations from Transformers (OmniNet).
In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network.
Experiments are conducted on autoregressive language modeling, Machine Translation, Long Range Arena (LRA), and Image Recognition.
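A toy illustration of letting tokens attend beyond a single layer: hidden states from every layer are kept, concatenated, and read by a final attention step. OmniNet's actual formulation is more efficient; this only conveys the idea of attending to tokens from the entire network.

```python
import torch
import torch.nn as nn

class OmniAttendSketch(nn.Module):
    """Toy model whose final attention reads tokens from every layer."""
    def __init__(self, dim, depth=4, num_heads=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, num_heads, dim_feedforward=4 * dim,
                                       batch_first=True)
            for _ in range(depth)
        ])
        self.omni_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):
        # x: (B, N, D)
        states = []
        for layer in self.layers:
            x = layer(x)
            states.append(x)                        # keep every layer's tokens
        memory = torch.cat(states, dim=1)           # (B, depth * N, D)
        out, _ = self.omni_attn(x, memory, memory)  # tokens attend across all layers
        return out
```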
arXiv Detail & Related papers (2021-03-01T15:31:54Z)