RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained
Image Recognition
- URL: http://arxiv.org/abs/2107.08192v1
- Date: Sat, 17 Jul 2021 06:22:20 GMT
- Title: RAMS-Trans: Recurrent Attention Multi-scale Transformer for Fine-grained
Image Recognition
- Authors: Yunqing Hu, Xuan Jin, Yin Zhang, Haiwen Hong, Jingfeng Zhang, Yuan He,
Hui Xue
- Abstract summary: The localization and amplification of region attention is an important factor, which has been explored extensively by convolutional neural network (CNN) based approaches.
We propose the recurrent attention multi-scale transformer (RAMS-Trans) which uses the transformer's self-attention to learn discriminative region attention.
- Score: 26.090419694326823
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In fine-grained image recognition (FGIR), the localization and amplification
of region attention is an important factor, which has been explored extensively
by convolutional neural network (CNN) based approaches. The recently developed
vision transformer (ViT) has achieved promising results on computer vision
tasks. Compared with CNNs, ViT's image sequentialization is a brand-new way of processing images.
However, ViT is limited in its receptive field size and thus lacks local
attention like CNNs due to the fixed size of its patches, and is unable to
generate multi-scale features to learn discriminative region attention. To
facilitate the learning of discriminative region attention without box/part
annotations, we use the strength of the attention weights to measure the
importance of the patch tokens corresponding to the raw images. We propose the
recurrent attention multi-scale transformer (RAMS-Trans), which uses the
transformer's self-attention to recursively learn discriminative region
attention in a multi-scale manner. Specifically, at the core of our approach
lies the dynamic patch proposal module (DPPM) guided region amplification to
complete the integration of multi-scale image patches. The DPPM starts with the
full-size image patches and iteratively scales up the region attention to
generate new patches from global to local by the intensity of the attention
weights generated at each scale as an indicator. Our approach requires only the
attention weights that come with ViT itself and can be easily trained
end-to-end. Extensive experiments demonstrate that RAMS-Trans performs better
than concurrent works as well as efficient CNN models, achieving
state-of-the-art results on three benchmark datasets.
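
The abstract's description of the DPPM suggests a simple loop: aggregate the ViT's own attention weights, threshold them to find the strongly attended patch tokens, crop the corresponding image region, and amplify it as the input for the next scale. The sketch below illustrates that idea under assumptions of our own; it is not the authors' released code, and names such as propose_region, attn_per_layer, and the rollout-style averaging and mean-based threshold are illustrative choices.

```python
# Minimal sketch (not the authors' implementation) of attention-guided
# region amplification in the spirit of the DPPM described in the abstract.
import torch
import torch.nn.functional as F

def propose_region(attn_per_layer, grid_size, image, threshold=1.2):
    """attn_per_layer: list of [batch, heads, tokens, tokens] ViT attention
    maps (CLS token at index 0); image: [batch, 3, H, W] raw input."""
    # Aggregate attention across heads and layers (simple rollout-style mean).
    rollout = torch.stack([a.mean(dim=1) for a in attn_per_layer]).mean(dim=0)
    cls_to_patch = rollout[:, 0, 1:]                       # [batch, num_patches]
    maps = cls_to_patch.reshape(-1, grid_size, grid_size)  # patch-grid heatmap

    crops = []
    H, W = image.shape[-2:]
    sy, sx = H // grid_size, W // grid_size                # pixels per patch
    for b in range(maps.size(0)):
        m = maps[b]
        mask = m > threshold * m.mean()          # keep strongly attended patches
        ys, xs = torch.nonzero(mask, as_tuple=True)
        if ys.numel() == 0:                      # fall back to the full image
            crops.append(image[b:b + 1])
            continue
        y0, y1 = int(ys.min()) * sy, (int(ys.max()) + 1) * sy
        x0, x1 = int(xs.min()) * sx, (int(xs.max()) + 1) * sx
        crop = image[b:b + 1, :, y0:y1, x0:x1]
        # Amplify the attended region back to full resolution for the next scale.
        crops.append(F.interpolate(crop, size=(H, W), mode="bilinear",
                                   align_corners=False))
    return torch.cat(crops, dim=0)
```

The amplified batch returned here would be fed through the ViT again at the next scale; the threshold multiplier is a hypothetical hyperparameter standing in for whatever indicator the paper derives from the attention-weight intensity at each scale.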
Related papers
- TransY-Net: Learning Fully Transformer Networks for Change Detection of
Remote Sensing Images [64.63004710817239]
We propose a novel Transformer-based learning framework named TransY-Net for remote sensing image CD.
It improves the feature extraction from a global view and combines multi-level visual features in a pyramid manner.
Our proposed method achieves a new state-of-the-art performance on four optical and two SAR image CD benchmarks.
arXiv Detail & Related papers (2023-10-22T07:42:19Z)
- Affine-Consistent Transformer for Multi-Class Cell Nuclei Detection [76.11864242047074]
We propose a novel Affine-Consistent Transformer (AC-Former), which directly yields a sequence of nucleus positions.
We introduce an Adaptive Affine Transformer (AAT) module, which can automatically learn the key spatial transformations to warp original images for local network training.
Experimental results demonstrate that the proposed method significantly outperforms existing state-of-the-art algorithms on various benchmarks.
arXiv Detail & Related papers (2023-10-22T02:27:02Z)
- Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection [3.784298636620067]
Vision Transformer (ViT) models have demonstrated a breakthrough in a wide range of computer vision tasks.
These models struggle to capture high-frequency components of images, which can limit their ability to detect local textures and edge information.
We propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid.
arXiv Detail & Related papers (2023-08-31T19:56:14Z)
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
- Boosting Crowd Counting via Multifaceted Attention [109.89185492364386]
Large-scale variations often exist within crowd images.
Neither the fixed-size convolution kernels of CNNs nor the fixed-size attention of recent vision transformers can handle this kind of variation.
We propose a Multifaceted Attention Network (MAN) to improve transformer models in local spatial relation encoding.
arXiv Detail & Related papers (2022-03-05T01:36:43Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- TransFG: A Transformer Architecture for Fine-grained Recognition [27.76159820385425]
Recently, vision transformer (ViT) shows its strong performance in the traditional classification task.
We propose a novel transformer-based framework TransFG where we integrate all raw attention weights of the transformer into an attention map.
A contrastive loss is applied to further enlarge the distance between feature representations of similar sub-classes.
arXiv Detail & Related papers (2021-03-14T17:03:53Z)
- Image Fine-grained Inpainting [89.17316318927621]
We present a one-stage model that utilizes dense combinations of dilated convolutions to obtain larger and more effective receptive fields.
To better train this efficient generator, in addition to the frequently used VGG feature matching loss, we design a novel self-guided regression loss.
We also employ a discriminator with local and global branches to ensure local-global contents consistency.
arXiv Detail & Related papers (2020-02-07T03:45:25Z)