KVT: k-NN Attention for Boosting Vision Transformers
- URL: http://arxiv.org/abs/2106.00515v1
- Date: Fri, 28 May 2021 06:49:10 GMT
- Title: KVT: k-NN Attention for Boosting Vision Transformers
- Authors: Pichao Wang and Xue Wang and Fan Wang and Ming Lin and Shuning Chang
and Wen Xie and Hao Li and Rong Jin
- Abstract summary: We propose a sparse attention scheme, dubbed k-NN attention, for boosting vision transformers.
The proposed k-NN attention naturally inherits the local bias of CNNs without introducing convolutional operations.
We verify, both theoretically and empirically, that $k$-NN attention is powerful in distilling noise from input tokens and in speeding up training.
- Score: 44.189475770152185
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Convolutional Neural Networks (CNNs) have dominated computer vision for
years, due to their ability to capture locality and translation invariance.
Recently, many vision transformer architectures have been proposed and they
show promising performance. A key component in vision transformers is the
fully-connected self-attention, which is more powerful than CNNs in modelling
long-range dependencies. However, since dense self-attention uses all image
patches (tokens) to compute the attention matrix, it may neglect the locality
of image patches and involve noisy tokens (e.g., cluttered background and
occlusion), leading to a slow training process and potential degradation of
performance. To address these problems, we propose a sparse attention scheme,
dubbed k-NN attention, for boosting vision transformers. Specifically, instead
of involving all the tokens for attention matrix calculation, we only select
the top-k similar tokens from the keys for each query to compute the attention
map. The proposed k-NN attention naturally inherits the local bias of CNNs
without introducing convolutional operations, as nearby tokens tend to be more
similar than others. In addition, the k-NN attention allows for the exploration
of long-range correlations while filtering out irrelevant tokens by
choosing the most similar tokens from the entire image. Despite its simplicity,
we verify, both theoretically and empirically, that $k$-NN attention is
powerful in distilling noise from input tokens and in speeding up training.
Extensive experiments are conducted on ten different vision transformer
architectures to verify that the proposed k-NN attention can work with any
existing transformer architecture to improve its prediction performance.
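To make the top-k selection concrete, here is a minimal PyTorch sketch of the idea described above: for each query, only the k largest key similarities survive the softmax. The function name, tensor shapes, and the mask-before-softmax strategy are illustrative assumptions, not the authors' exact implementation.

```python
import torch

def knn_attention(q, k, v, topk):
    """Minimal sketch of k-NN attention: each query attends only to its
    top-k most similar keys; all other logits are masked out before softmax.

    q, k, v: (batch, heads, tokens, head_dim) tensors; topk: keys kept per query.
    """
    scale = q.shape[-1] ** -0.5
    attn = (q @ k.transpose(-2, -1)) * scale            # (B, H, N, N) similarity logits
    kth = attn.topk(topk, dim=-1).values[..., -1:]      # k-th largest logit per query row
    attn = attn.masked_fill(attn < kth, float("-inf"))  # drop everything below the top-k
    attn = attn.softmax(dim=-1)                         # row-sparse attention map
    return attn @ v                                     # (B, H, N, head_dim)

# Toy usage: 2 images, 4 heads, 16 tokens, 32-dim heads, 5 neighbours per query.
q, k, v = (torch.randn(2, 4, 16, 32) for _ in range(3))
out = knn_attention(q, k, v, topk=5)
print(out.shape)  # torch.Size([2, 4, 16, 32])
```

In the dense limit (topk equal to the number of tokens) this reduces to standard self-attention, which is consistent with the claim that k-NN attention can be dropped into existing transformer architectures.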
Related papers
- LeMeViT: Efficient Vision Transformer with Learnable Meta Tokens for Remote Sensing Image Interpretation [37.72775203647514]
This paper proposes to use learnable meta tokens to formulate sparse tokens, which effectively learn key information and improve inference speed.
By employing Dual Cross-Attention (DCA) in the early stages with dense visual tokens, we obtain the hierarchical architecture LeMeViT with various sizes.
Experimental results in classification and dense prediction tasks show that LeMeViT has a significant $1.7\times$ speedup, fewer parameters, and competitive performance compared to the baseline models.
arXiv Detail & Related papers (2024-05-16T03:26:06Z) - Robustifying Token Attention for Vision Transformers [72.07710236246285]
Vision transformers (ViTs) still suffer from significant drops in accuracy in the presence of common corruptions.
We propose two general techniques to make attention more stable.
First, our Token-aware Average Pooling (TAP) module encourages the local neighborhood of each token to take part in the attention mechanism.
Second, we force the output tokens to aggregate information from a diverse set of input tokens rather than focusing on just a few.
arXiv Detail & Related papers (2023-03-20T14:04:40Z) - Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z) - Transformer Compressed Sensing via Global Image Tokens [4.722333456749269]
We propose a novel image decomposition that naturally embeds images into low-resolution inputs.
We replace CNN components in a well-known CS-MRI neural network with TNN blocks and demonstrate the improvements afforded by KD.
arXiv Detail & Related papers (2022-03-24T05:56:30Z) - XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens.
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens, and allows efficient processing of high-resolution images (a rough sketch of this channel-wise attention appears after this list).
arXiv Detail & Related papers (2021-06-17T17:33:35Z) - CAT: Cross Attention in Vision Transformer [39.862909079452294]
We propose a new attention mechanism in Transformer called Cross Attention.
It alternates attention within image patches instead of over the whole image to capture local information.
We build a hierarchical network called Cross Attention Transformer(CAT) for other vision tasks.
arXiv Detail & Related papers (2021-06-10T14:38:32Z) - DynamicViT: Efficient Vision Transformers with Dynamic Token
Sparsification [134.9393799043401]
We propose a dynamic token sparsification framework to prune redundant tokens based on the input.
By hierarchically pruning 66% of the input tokens, our method reduces FLOPs by 31%-37% and improves the throughput by over 40%.
DynamicViT models can achieve very competitive complexity/accuracy trade-offs compared to state-of-the-art CNNs and vision transformers on ImageNet.
arXiv Detail & Related papers (2021-06-03T17:57:41Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attentions have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image
Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear, rather than quadratic, computational and memory complexity.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
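For contrast with the token-level sparsity of k-NN attention, the XCiT entry above replaces the token-token attention map with a channel-channel (cross-covariance) one. The sketch below is a rough reading of that description: the tensor layout, L2-normalisation along the token dimension, and the fixed scalar temperature are assumptions, not the official implementation (where the temperature is a learnable per-head parameter).

```python
import torch
import torch.nn.functional as F

def xca_sketch(q, k, v, temperature=1.0):
    """Rough sketch of cross-covariance (channel-wise) attention.

    q, k, v: (batch, heads, tokens, dim). The attention map is (dim x dim),
    so the cost of the matrix products grows linearly with the token count.
    """
    # Work channel-first: (batch, heads, dim, tokens).
    qc, kc, vc = (t.transpose(-2, -1) for t in (q, k, v))
    # L2-normalise along the token dimension (assumption following the XCiT description).
    qc = F.normalize(qc, dim=-1)
    kc = F.normalize(kc, dim=-1)
    attn = (qc @ kc.transpose(-2, -1)) / temperature  # (batch, heads, dim, dim)
    attn = attn.softmax(dim=-1)
    out = attn @ vc                                   # (batch, heads, dim, tokens)
    return out.transpose(-2, -1)                      # back to (batch, heads, tokens, dim)
```

Because the attention map is dim x dim rather than tokens x tokens, the cost scales linearly with the number of tokens, which is the efficiency claim made in the XCiT summary.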