CAT: Cross Attention in Vision Transformer
- URL: http://arxiv.org/abs/2106.05786v1
- Date: Thu, 10 Jun 2021 14:38:32 GMT
- Title: CAT: Cross Attention in Vision Transformer
- Authors: Hezheng Lin, Xing Cheng, Xiangyu Wu, Fan Yang, Dong Shen, Zhongyuan
Wang, Qing Song, Wei Yuan
- Abstract summary: We propose a new attention mechanism in Transformer called Cross Attention.
It alternates attention within image patches, rather than over the whole image, to capture local information.
We build a hierarchical network called Cross Attention Transformer (CAT) for other vision tasks.
- Score: 39.862909079452294
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Since Transformer has found widespread use in NLP, the potential of
Transformer in CV has been realized and has inspired many new approaches.
However, the computation required to replace word tokens with image patches
for Transformer after tokenization of the image is vast (e.g., ViT), which
bottlenecks model training and inference. In this paper, we propose a new
attention mechanism in Transformer termed Cross Attention, which alternates
attention within image patches, instead of over the whole image, to capture
local information, with attention between image patches, which are divided
from single-channel feature maps, to capture global information. Both
operations require less computation than standard self-attention in
Transformer. By alternately applying attention within patches and between
patches, we implement cross attention to maintain performance at lower
computational cost and build a hierarchical network called Cross Attention
Transformer (CAT) for other vision tasks. Our base model achieves
state-of-the-art results on ImageNet-1K and improves the performance of other
methods on COCO and ADE20K, illustrating that our network has the potential
to serve as a general backbone. The code and models are available at
\url{https://github.com/linhezheng19/CAT}.
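To make the two attention patterns in the abstract concrete, below is a minimal PyTorch sketch, not the authors' implementation: inner-patch attention restricts self-attention to each local patch, while cross-patch attention treats each single-channel feature map separately and lets its patches attend to one another. The tensor layout, patch size, and the omission of projections and multi-head splitting are simplifying assumptions.

```python
# Minimal sketch (not the authors' code) of the two attention patterns from
# the abstract; projections, multi-head splitting, and normalization omitted.
import torch
import torch.nn.functional as F

def attention(x):
    # Plain scaled dot-product self-attention over the token dimension.
    q = k = v = x                                    # (..., tokens, dim)
    scores = (q @ k.transpose(-2, -1)) * x.shape[-1] ** -0.5
    return F.softmax(scores, dim=-1) @ v

def inner_patch_attention(x, patch=7):
    # Attention within each patch: tokens attend only to their local patch.
    B, H, W, C = x.shape
    x = x.view(B, H // patch, patch, W // patch, patch, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, patch * patch, C)
    x = attention(x)
    x = x.view(B, H // patch, W // patch, patch, patch, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

def cross_patch_attention(x, patch=7):
    # Attention between patches of each single-channel feature map (assumed
    # reading of the abstract): one token per patch and channel.
    B, H, W, C = x.shape
    x = x.view(B, H // patch, patch, W // patch, patch, C)
    x = x.permute(0, 5, 1, 3, 2, 4).reshape(B * C, (H // patch) * (W // patch), patch * patch)
    x = attention(x)
    x = x.view(B, C, H // patch, W // patch, patch, patch)
    return x.permute(0, 2, 4, 3, 5, 1).reshape(B, H, W, C)

x = torch.randn(2, 56, 56, 96)                       # (B, H, W, C) feature map
y = cross_patch_attention(inner_patch_attention(x))  # alternate the two attentions
print(y.shape)                                       # torch.Size([2, 56, 56, 96])
```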
Related papers
- Cross Aggregation Transformer for Image Restoration [48.390140041131886]
Recently, the Transformer architecture has been introduced into image restoration to replace the convolutional neural network (CNN), with surprising results.
To address the above issue, we propose a new image restoration model, Cross Aggregation Transformer (CAT).
The core of our CAT is the Rectangle-Window Self-Attention (Rwin-SA), which utilizes horizontal and vertical rectangle window attention in different heads in parallel to expand the attention area and aggregate features across different windows.
Furthermore, we propose the Locality Complementary Module to complement the self-attention mechanism, which incorporates the inductive bias of CNN (e.g., translation in
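The rectangle-window idea can be illustrated with a small sketch: one group of heads (here, a channel group) uses wide horizontal windows and another uses tall vertical windows. The window sizes and the split into two channel groups are illustrative assumptions, not the paper's configuration.

```python
# Minimal sketch (assumed window sizes, projections omitted): self-attention
# inside non-overlapping rectangle windows, with a horizontal window for one
# group of channels and a vertical window for the other, then merged.
import torch
import torch.nn.functional as F

def rect_window_attention(x, wh, ww):
    # Partition (B, H, W, C) into wh x ww windows and attend inside each.
    B, H, W, C = x.shape
    x = x.view(B, H // wh, wh, W // ww, ww, C)
    x = x.permute(0, 1, 3, 2, 4, 5).reshape(-1, wh * ww, C)
    attn = F.softmax((x @ x.transpose(-2, -1)) * C ** -0.5, dim=-1)
    x = (attn @ x).view(B, H // wh, W // ww, wh, ww, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)

x = torch.randn(2, 32, 32, 64)
horizontal = rect_window_attention(x[..., :32], wh=4, ww=16)  # wide windows
vertical   = rect_window_attention(x[..., 32:], wh=16, ww=4)  # tall windows
y = torch.cat([horizontal, vertical], dim=-1)                 # merge the two groups
```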
arXiv Detail & Related papers (2022-11-24T15:09:33Z)
- Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images.
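A compact sketch of this transposed attention, under simplifying assumptions (single head, learned projections and the learnable temperature omitted): the softmax is taken over a d x d channel-to-channel map, so the cost grows linearly with the number of tokens.

```python
# Minimal sketch of cross-covariance attention (single head, projections and
# learnable temperature omitted): the attention map is d x d over channels.
import torch
import torch.nn.functional as F

def cross_covariance_attention(x):
    B, N, d = x.shape                              # N tokens, d channels
    q = k = v = x                                  # learned projections omitted
    q = F.normalize(q, dim=1)                      # L2-normalize along the token axis
    k = F.normalize(k, dim=1)
    attn = F.softmax(q.transpose(-2, -1) @ k, dim=-1)   # (B, d, d), cost O(N * d^2)
    return v @ attn                                # back to (B, N, d)

y = cross_covariance_attention(torch.randn(2, 3136, 128))   # 56x56 tokens, linear in N
```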
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
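One hedged way to picture this simplification: keep only the query-key similarity map between the two images and pool it into a matching score, without the softmax weighting or value aggregation of full attention. The pooling choice and tensor shapes here are assumptions for illustration, not the paper's exact head.

```python
# Minimal sketch (assumed shapes and pooling): raw query-key similarities
# between two images, pooled into a single matching score without a softmax.
import torch

def matching_score(feats_a, feats_b):
    # feats_a, feats_b: (N, C) token features of the two images to compare.
    sim = feats_a @ feats_b.t()            # (N, N) query-key similarities
    # Take the best match per token of image A and average: one simple way
    # to turn the similarity map into a score.
    return sim.max(dim=1).values.mean()

score = matching_score(torch.randn(196, 256), torch.randn(196, 256))
```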
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
- Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
Less attention vIsion Transformer builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z)
- KVT: k-NN Attention for Boosting Vision Transformers [44.189475770152185]
We propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers.
The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations.
We verify, both theoretically and empirically, that $k$-NN attention is powerful in distilling noise from input tokens and in speeding up training.
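A small sketch of this sparse scheme (single head, projections omitted): for every query, only the k largest query-key logits are kept before the softmax; the value of k used below is an illustrative choice.

```python
# Minimal sketch (single head, projections omitted): keep only the top-k
# query-key logits per query and mask out the rest before the softmax.
import torch
import torch.nn.functional as F

def knn_attention(q, k, v, topk=32):
    logits = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5   # (B, N, N)
    kth = logits.topk(topk, dim=-1).values[..., -1:]           # k-th largest logit per query
    logits = logits.masked_fill(logits < kth, float('-inf'))   # drop all but the top-k keys
    return F.softmax(logits, dim=-1) @ v

x = torch.randn(2, 196, 64)
y = knn_attention(x, x, x)          # self-attention variant, output (2, 196, 64)
```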
arXiv Detail & Related papers (2021-05-28T06:49:10Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear time in both computation and memory, instead of quadratic.
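The linear complexity can be seen in a small sketch (projections and the symmetric step for the other branch omitted): the CLS token of one branch is the only query and attends to the patch tokens of the other branch, so the attention map has a single row.

```python
# Minimal sketch (projections and the second, symmetric branch omitted): the
# CLS token of one branch attends to the other branch's patch tokens; with a
# single query token the attention map is 1 x (N+1), hence linear cost.
import torch
import torch.nn.functional as F

def branch_cross_attention(cls_token, other_patch_tokens):
    # cls_token: (B, 1, C); other_patch_tokens: (B, N, C)
    q = cls_token
    k = v = torch.cat([cls_token, other_patch_tokens], dim=1)   # (B, N+1, C)
    attn = F.softmax((q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5, dim=-1)
    return attn @ v                                              # updated CLS token, (B, 1, C)

cls_small = torch.randn(2, 1, 192)       # CLS token of the small-patch branch
patches_large = torch.randn(2, 49, 192)  # patch tokens of the large-patch branch
fused_cls = branch_cross_attention(cls_small, patches_large)
```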
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.