Shunted Self-Attention via Multi-Scale Token Aggregation
- URL: http://arxiv.org/abs/2111.15193v1
- Date: Tue, 30 Nov 2021 08:08:47 GMT
- Title: Shunted Self-Attention via Multi-Scale Token Aggregation
- Authors: Sucheng Ren, Daquan Zhou, Shengfeng He, Jiashi Feng, Xinchao Wang
- Abstract summary: Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA), which allows ViTs to model attention at hybrid scales within each attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy on ImageNet and outperforms the state-of-the-art Focal Transformer.
- Score: 124.16925784748601
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent Vision Transformer (ViT) models have demonstrated encouraging results
across various computer vision tasks, thanks to their competence in modeling
long-range dependencies of image patches or tokens via self-attention. These
models, however, usually assign similar receptive fields to every token
feature within each layer. Such a constraint inevitably limits the ability of
each self-attention layer to capture multi-scale features, leading to
performance degradation when handling images with multiple objects of different
scales. To address this issue, we propose a novel and generic strategy, termed
shunted self-attention (SSA), that allows ViTs to model attention at
hybrid scales within each attention layer. The key idea of SSA is to inject
heterogeneous receptive field sizes into tokens: before computing the
self-attention matrix, it selectively merges tokens to represent larger object
features while keeping certain tokens to preserve fine-grained features. This
novel merging scheme enables the self-attention to learn relationships between
objects of different sizes and simultaneously reduces the token count and
the computational cost. Extensive experiments across various tasks demonstrate
the superiority of SSA. Specifically, the SSA-based transformer achieves 84.0%
Top-1 accuracy on ImageNet and outperforms the state-of-the-art Focal Transformer
with only half the model size and computation cost, and surpasses
Focal Transformer by 1.3 mAP on COCO and 2.9 mIoU on ADE20K under similar
parameter and computation cost. Code has been released at
https://github.com/OliverRensu/Shunted-Transformer.
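The merging idea described in the abstract can be illustrated with a short sketch. The following PyTorch snippet is a minimal illustration, not the released implementation: the class name, the `merge_rates` argument, and the strided-convolution token merging are assumptions about how key/value tokens can be aggregated at different rates for different head groups within one layer; the actual model in the repository above contains additional components (e.g., local enhancement of the aggregated values) that are omitted here.

```python
# Minimal sketch of multi-scale key/value aggregation in one attention layer.
# Heads are split into groups; each group attends to key/value tokens merged
# at its own spatial rate (1 = no merging, 2 = merge 2x2 token neighborhoods).
import torch
import torch.nn as nn


class ShuntedSelfAttention(nn.Module):
    def __init__(self, dim, num_heads=4, merge_rates=(1, 2)):
        super().__init__()
        assert dim % num_heads == 0 and num_heads % len(merge_rates) == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.merge_rates = merge_rates
        self.group_dim = dim // len(merge_rates)      # channels per head group

        self.q = nn.Linear(dim, dim)
        # Per-group token aggregation over the 2-D token grid, followed by a
        # key/value projection dedicated to that group's heads.
        self.merge = nn.ModuleList([
            nn.Conv2d(dim, dim, kernel_size=r, stride=r) if r > 1 else nn.Identity()
            for r in merge_rates
        ])
        self.kv = nn.ModuleList([nn.Linear(dim, 2 * self.group_dim) for _ in merge_rates])
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        B, N, C = x.shape                             # N == H * W spatial tokens
        hpg = self.num_heads // len(self.merge_rates) # heads per group
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        outs = []
        for g, r in enumerate(self.merge_rates):
            grid = x.transpose(1, 2).reshape(B, C, H, W)
            merged = self.merge[g](grid).flatten(2).transpose(1, 2)   # (B, N/r^2, C)
            kv = self.kv[g](merged).reshape(B, -1, 2, hpg, self.head_dim)
            k, v = kv.permute(2, 0, 3, 1, 4)          # each: (B, hpg, N/r^2, head_dim)
            qg = q[:, g * hpg:(g + 1) * hpg]          # this group's query heads see all N tokens
            attn = (qg @ k.transpose(-2, -1)) * self.scale
            outs.append(attn.softmax(dim=-1) @ v)     # (B, hpg, N, head_dim)

        out = torch.cat(outs, dim=1).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)


# Toy usage: a 14x14 token grid with embedding dimension 64.
x = torch.randn(2, 14 * 14, 64)
ssa = ShuntedSelfAttention(dim=64, num_heads=4, merge_rates=(1, 2))
print(ssa(x, H=14, W=14).shape)                       # torch.Size([2, 196, 64])
```

Coarser merge rates shrink the key/value set for their heads (and hence the attention cost), while rate-1 heads keep the full-resolution tokens, which is the "hybrid scales per layer" behavior the abstract describes.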
Related papers
- Accelerating Transformers with Spectrum-Preserving Token Merging [43.463808781808645]
PiToMe prioritizes the preservation of informative tokens using an additional metric termed the energy score.
Experimental findings demonstrate that PiToMe saves 40-60% of the FLOPs of the base models.
arXiv Detail & Related papers (2024-05-25T09:37:01Z)
- TiC: Exploring Vision Transformer in Convolution [37.50285921899263]
We propose the Multi-Head Self-Attention Convolution (MSA-Conv).
MSA-Conv incorporates self-attention within generalized convolutions, including standard, dilated, and depthwise ones.
We present the Vision Transformer in Convolution (TiC) as a proof of concept for image classification with MSA-Conv.
arXiv Detail & Related papers (2023-10-06T10:16:26Z)
- Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformers have achieved impressive performance on many vision tasks, but may suffer from high redundancy when capturing local features in shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based way of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal but can be processed at a lower computational cost.
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- ViTAEv2: Vision Transformer Advanced by Exploring Inductive Bias for Image Recognition and Beyond [76.35955924137986]
We propose a Vision Transformer Advanced by Exploring intrinsic Inductive Bias from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
We obtain state-of-the-art classification performance, i.e., 88.5% Top-1 accuracy on the ImageNet validation set and the best 91.2% Top-1 accuracy on the ImageNet real validation set.
arXiv Detail & Related papers (2022-02-21T10:40:05Z)
- XCiT: Cross-Covariance Image Transformers [73.33400159139708]
We propose a "transposed" version of self-attention that operates across feature channels rather than tokens (see the sketch after this list).
The resulting cross-covariance attention (XCA) has linear complexity in the number of tokens and allows efficient processing of high-resolution images.
arXiv Detail & Related papers (2021-06-17T17:33:35Z)
- ViTAE: Vision Transformer Advanced by Exploring Intrinsic Inductive Bias [76.16156833138038]
We propose a novel Vision Transformer Advanced by Exploring intrinsic Inductive Bias from convolutions, i.e., ViTAE.
ViTAE has several spatial pyramid reduction modules to downsample and embed the input image into tokens with rich multi-scale context.
In each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network.
arXiv Detail & Related papers (2021-06-07T05:31:06Z)
- CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer that combines image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear computational and memory complexity instead of the otherwise quadratic cost.
arXiv Detail & Related papers (2021-03-27T13:03:17Z)
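As referenced in the XCiT entry above, the following is a minimal sketch of cross-covariance attention as summarized there: attention is computed over a d x d channel cross-covariance matrix rather than over N x N token pairs, so the cost grows linearly with the number of tokens. The L2-normalized queries/keys and the learnable per-head temperature follow the paper's description; the class name and defaults are illustrative, not the official XCiT code.

```python
# Sketch of channel-wise ("transposed") attention with linear cost in tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossCovarianceAttention(nn.Module):
    def __init__(self, dim, num_heads=4):
        super().__init__()
        self.num_heads = num_heads
        self.temperature = nn.Parameter(torch.ones(num_heads, 1, 1))
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 4, 1)       # each: (B, heads, head_dim, N)
        q = F.normalize(q, dim=-1)                 # L2-normalize along the token axis
        k = F.normalize(k, dim=-1)
        attn = (q @ k.transpose(-2, -1)) * self.temperature   # (B, heads, d, d)
        out = attn.softmax(dim=-1) @ v             # (B, heads, head_dim, N)
        return self.proj(out.permute(0, 3, 1, 2).reshape(B, N, C))


# Toy usage on 196 tokens of dimension 64.
x = torch.randn(2, 196, 64)
print(CrossCovarianceAttention(dim=64, num_heads=4)(x).shape)  # torch.Size([2, 196, 64])
```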
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it contains and is not responsible for any consequences of its use.