Global Interaction Modelling in Vision Transformer via Super Tokens
- URL: http://arxiv.org/abs/2111.13156v1
- Date: Thu, 25 Nov 2021 16:22:57 GMT
- Title: Global Interaction Modelling in Vision Transformer via Super Tokens
- Authors: Ammarah Farooq, Muhammad Awais, Sara Ahmed, Josef Kittler
- Abstract summary: Window-based local attention is one of the major techniques being adopted in recent works.
We present a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention.
In standard image classification on ImageNet-1K, the proposed Super-token-based transformer (STT-S25) achieves 83.5% accuracy.
- Score: 20.700750237972155
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: With the popularity of Transformer architectures in computer vision, the
research focus has shifted towards developing computationally efficient
designs. Window-based local attention is one of the major techniques being
adopted in recent works. These methods begin with a very small patch size and
a small embedding dimension and then perform strided convolutions (patch
merging) to reduce the feature map size and increase the embedding dimension,
forming a pyramidal, Convolutional Neural Network (CNN)-like design. In
this work, we investigate local and global information modelling in
transformers by presenting a novel isotropic architecture that adopts local
windows and special tokens, called Super tokens, for self-attention.
Specifically, a single Super token is assigned to each image window and
captures the rich local details of that window. These tokens are then employed
for cross-window communication and global representation learning. Hence, most
of the learning in the higher layers is independent of the number of image
patches $N$, and the class embedding is learned solely from the $N/M^2$ Super
tokens, where $M^2$ is the window size. In standard image classification on
ImageNet-1K, the proposed Super-token-based transformer (STT-S25) achieves
83.5\% accuracy, matching the Swin transformer (Swin-B) with roughly half the
number of parameters (49M) and double the inference throughput.
The proposed Super token transformer offers a lightweight and promising
backbone for visual recognition tasks.
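
As a purely illustrative instance of the token counts above: with, say, $N = 196$ patch tokens and a window of $M = 7$, only $N/M^2 = 4$ Super tokens take part in the global stage. The following is a minimal, hypothetical PyTorch sketch of the mechanism described in the abstract, not the authors' released code; the module name, the use of nn.MultiheadAttention, and the assumption that patch tokens arrive grouped window by window are all assumptions made here.

```python
# Minimal sketch of a Super-token style block (illustrative only).
# Assumptions not taken from the paper: module/variable names, the use of
# torch.nn.MultiheadAttention, and that patch tokens arrive already grouped
# window by window along the sequence dimension.
import torch
import torch.nn as nn


class SuperTokenBlock(nn.Module):
    def __init__(self, dim: int, window: int, heads: int = 8):
        super().__init__()
        self.window = window
        self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, patches: torch.Tensor, super_tokens: torch.Tensor):
        # patches:      (B, N, C)        patch tokens
        # super_tokens: (B, N // M^2, C) one Super token per window
        B, N, C = patches.shape
        m2 = self.window ** 2            # tokens per window (M^2)
        w = N // m2                      # number of windows = N / M^2
        # 1) Local attention: each Super token joins its own window, so it
        #    can absorb that window's local detail.
        win = patches.reshape(B * w, m2, C)
        sup = super_tokens.reshape(B * w, 1, C)
        local = torch.cat([sup, win], dim=1)             # (B*w, 1 + M^2, C)
        local, _ = self.local_attn(local, local, local)
        sup, win = local[:, :1, :], local[:, 1:, :]
        # 2) Global attention: only the Super tokens talk across windows, so
        #    this stage scales with the number of windows N / M^2, not with N.
        sup = sup.reshape(B, w, C)
        sup, _ = self.global_attn(sup, sup, sup)
        return win.reshape(B, N, C), sup


# Example: 196 patch tokens, 7x7 windows -> 4 Super tokens per image.
x = torch.randn(2, 196, 384)
s = torch.randn(2, 4, 384)
patches_out, supers_out = SuperTokenBlock(dim=384, window=7, heads=6)(x, s)
```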
Related papers
- Making Vision Transformers Efficient from A Token Sparsification View [26.42498120556985]
We propose a novel Semantic Token ViT (STViT) for efficient global and local vision transformers.
Our method achieves competitive results compared to the original networks on object detection and instance segmentation, with over 30% FLOPs reduction for the backbone.
In addition, we design an STViT-R(ecover) network on top of STViT to restore the detailed spatial information, making it work for downstream tasks.
arXiv Detail & Related papers (2023-03-15T15:12:36Z) - Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformers have achieved impressive performance on many vision tasks.
However, they may suffer from high redundancy in capturing local features in shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z) - Accurate Image Restoration with Attention Retractable Transformer [50.05204240159985]
We propose Attention Retractable Transformer (ART) for image restoration.
ART presents both dense and sparse attention modules in the network.
We conduct extensive experiments on image super-resolution, denoising, and JPEG compression artifact reduction tasks.
arXiv Detail & Related papers (2022-10-04T07:35:01Z) - Not All Tokens Are Equal: Human-centric Visual Analysis via Token Clustering Transformer [91.49837514935051]
We propose a novel Vision Transformer, called Token Clustering Transformer (TCFormer).
TCFormer merges tokens by progressive clustering, where the tokens can be merged from different locations with flexible shapes and sizes.
Experiments show that TCFormer consistently outperforms its counterparts on different challenging human-centric tasks and datasets.
arXiv Detail & Related papers (2022-04-19T05:38:16Z) - Lawin Transformer: Improving Semantic Segmentation Transformer with Multi-Scale Representations via Large Window Attention [16.75003034164463]
Multi-scale representations are crucial for semantic segmentation.
In this paper, we introduce multi-scale representations into the semantic segmentation ViT via a window attention mechanism.
Our resulting ViT, Lawin Transformer, is composed of an efficient hierarchical vision transformer (HVT) as the encoder and a LawinASPP as the decoder.
arXiv Detail & Related papers (2022-01-05T13:51:20Z) - Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attention is computed only within local windows.
This design significantly improves efficiency but lacks global feature reasoning in the early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z) - MlTr: Multi-label Classification with Transformer [35.14232810099418]
We propose a Multi-label Transformer architecture (MlTr) constructed with window partitioning, in-window pixel attention, and cross-window attention.
The proposed MlTr shows state-of-the-art results on various prevalent multi-label datasets such as MS-COCO, Pascal-VOC, and NUS-WIDE.
arXiv Detail & Related papers (2021-06-11T06:53:09Z) - Vision Transformers with Hierarchical Attention [61.16912607330001]
This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vision transformers.
We propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion.
We build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net.
arXiv Detail & Related papers (2021-06-06T17:01:13Z) - CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification [17.709880544501758]
We propose a dual-branch transformer to combine image patches of different sizes to produce stronger image features.
Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity.
Our proposed cross-attention requires only linear time and memory in the number of tokens, instead of the quadratic cost of full self-attention (see the sketch after this list).
arXiv Detail & Related papers (2021-03-27T13:03:17Z) - Tokens-to-Token ViT: Training Vision Transformers from Scratch on ImageNet [128.96032932640364]
We propose a new Tokens-To-Token Vision Transformer (T2T-ViT) for vision tasks.
T2T-ViT reduces the parameter count and MACs of vanilla ViT by half, while achieving more than 2.5% improvement when trained from scratch on ImageNet.
For example, a T2T-ViT of a size comparable to ResNet50 can achieve 80.7% top-1 accuracy on ImageNet.
arXiv Detail & Related papers (2021-01-28T13:25:28Z)
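
To unpack the linear-complexity claim in the CrossViT entry above: when a single class token from one branch is the only query attending over the other branch's $N$ patch tokens, the attention map per head has shape $1 \times N$, so the cost grows linearly with $N$ instead of quadratically. The sketch below is a hypothetical illustration of that single-query cross-attention pattern, not the CrossViT reference implementation; names and dimensions are assumptions.

```python
# Hedged sketch: single-query cross-attention, illustrating why a CLS-token
# query over the other branch's patches costs O(N) rather than O(N^2).
# Illustrative only; not the CrossViT reference implementation.
import torch
import torch.nn as nn


class SingleQueryCrossAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, cls_token: torch.Tensor, patches: torch.Tensor):
        # cls_token: (B, 1, C) query taken from one branch
        # patches:   (B, N, C) keys/values taken from the other branch
        # A single query over N keys gives a (1 x N) attention map per head,
        # so time and memory grow linearly with N.
        fused, _ = self.attn(cls_token, patches, patches)
        return fused  # (B, 1, C): class token enriched by the other branch


# Example: one branch's CLS token attends over 196 tokens of the other branch.
cls = torch.randn(2, 1, 256)
other = torch.randn(2, 196, 256)
out = SingleQueryCrossAttention(dim=256, heads=8)(cls, other)
```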
This list is automatically generated from the titles and abstracts of the papers in this site.