Vision Transformers with Hierarchical Attention
- URL: http://arxiv.org/abs/2106.03180v5
- Date: Tue, 26 Mar 2024 07:44:45 GMT
- Title: Vision Transformers with Hierarchical Attention
- Authors: Yun Liu, Yu-Huan Wu, Guolei Sun, Le Zhang, Ajad Chhatkuli, Luc Van Gool
- Abstract summary: This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vision transformers.
We propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion.
We build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net.
- Score: 61.16912607330001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vanilla vision transformers. To this end, we propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches, as commonly done, and each patch is viewed as a token. The proposed H-MHSA then learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models global dependencies among the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
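As a rough illustration of the hierarchical scheme described above (local attention inside small windows, global attention over merged tokens, then aggregation of the two branches), here is a minimal PyTorch-style sketch. The class name, the window and merge sizes, and the use of nn.MultiheadAttention are assumptions made for illustration and do not reproduce the authors' released implementation (see the linked repository for that); it also assumes H and W are divisible by both factors.

```python
# Minimal sketch of hierarchical attention: local windows -> merged tokens -> aggregation.
# Illustrative only; not the authors' HAT-Net code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAttentionSketch(nn.Module):
    """Local attention inside small windows, global attention over merged
    (pooled) tokens, then aggregation of the two branches."""

    def __init__(self, dim, num_heads=8, window=7, merge=2):
        super().__init__()
        self.window = window  # side length of a local window, in tokens (assumed value)
        self.merge = merge    # pooling factor used to merge tokens (assumed value)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, H, W):
        # x: (B, H*W, C) token sequence laid out on an H x W grid
        B, N, C = x.shape
        w, m = self.window, self.merge

        # ---- local relationship modeling: attention inside each window ----
        x2d = x.transpose(1, 2).reshape(B, C, H, W)
        xw = F.unfold(x2d, kernel_size=w, stride=w)             # (B, C*w*w, num_win)
        num_win = xw.shape[-1]
        xw = xw.reshape(B, C, w * w, num_win).permute(0, 3, 2, 1)
        xw = xw.reshape(B * num_win, w * w, C)
        local, _ = self.local_attn(xw, xw, xw)
        local = local.reshape(B, num_win, w * w, C).permute(0, 3, 2, 1)
        local = local.reshape(B, C * w * w, num_win)
        local = F.fold(local, (H, W), kernel_size=w, stride=w)  # back to (B, C, H, W)

        # ---- global relationship modeling: merge tokens, then attend ----
        merged = F.avg_pool2d(local, kernel_size=m, stride=m)   # fewer, coarser tokens
        g = merged.flatten(2).transpose(1, 2)                   # (B, (H/m)*(W/m), C)
        g, _ = self.global_attn(g, g, g)
        g = g.transpose(1, 2).reshape(B, C, H // m, W // m)
        g = F.interpolate(g, size=(H, W), mode="nearest")       # broadcast back to full grid

        # ---- aggregate local and global attentive features ----
        out = (local + g).flatten(2).transpose(1, 2)            # (B, H*W, C)
        return self.proj(out)
```

For example, with dim=64, num_heads=8, window=7, and merge=2, a 56x56 token map is processed as 64 local attentions over 49 tokens each, followed by one global attention over 784 merged tokens, instead of a single attention over all 3,136 tokens.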
Related papers
- DuoFormer: Leveraging Hierarchical Visual Representations by Local and Global Attention [1.5624421399300303]
We propose a novel hierarchical transformer model that adeptly integrates the feature extraction capabilities of Convolutional Neural Networks (CNNs) with the advanced representational potential of Vision Transformers (ViTs).
Addressing the lack of inductive biases and dependence on extensive training datasets in ViTs, our model employs a CNN backbone to generate hierarchical visual representations.
These representations are then adapted for transformer input through an innovative patch tokenization.
arXiv Detail & Related papers (2024-07-18T22:15:35Z) - Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformer has achieved impressive performance for many vision tasks.
However, it may suffer from high redundancy when capturing local features in its shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z) - MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into the transformer architecture (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
arXiv Detail & Related papers (2022-08-31T06:29:27Z) - Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z) - Global Interaction Modelling in Vision Transformer via Super Tokens [20.700750237972155]
Window-based local attention is one of the major techniques being adopted in recent works.
We present a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention.
In standard image classification on ImageNet-1K, the proposed Super-token-based transformer (STT-S25) achieves 83.5% accuracy.
arXiv Detail & Related papers (2021-11-25T16:22:57Z) - P2T: Pyramid Pooling Transformer for Scene Understanding [62.41912463252468]
Plugged with our pooling-based MHSA, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T); a sketch of this style of pooling-based attention appears after this list.
arXiv Detail & Related papers (2021-06-22T18:28:52Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings a great benefit: depth, width, resolution, and patch size can be scaled without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z) - OmniNet: Omnidirectional Representations from Transformers [49.23834374054286]
This paper proposes Omnidirectional Representations from Transformers (OmniNet).
In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network.
Experiments are conducted on autoregressive language modeling, machine translation, Long Range Arena (LRA), and image recognition.
arXiv Detail & Related papers (2021-03-01T15:31:54Z)
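The P2T and HVT entries above both cut attention cost by pooling tokens. As a rough, generic sketch of that idea (not the code of either paper), the module below lets full-resolution queries attend to keys/values built from a small pyramid of pooled tokens; the pool sizes, class name, and use of nn.MultiheadAttention are illustrative assumptions.

```python
# Generic sketch of pooling-based attention: queries keep full resolution,
# keys/values come from a small pyramid of pooled tokens. Illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPoolingAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=8, pool_sizes=(1, 2, 3, 6)):
        super().__init__()
        self.pool_sizes = pool_sizes  # output sizes of the pooling pyramid (assumed values)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, H*W, C) tokens on an H x W grid
        B, N, C = x.shape
        x2d = x.transpose(1, 2).reshape(B, C, H, W)

        # Build a short key/value sequence from pyramid-pooled feature maps.
        pooled = [
            F.adaptive_avg_pool2d(x2d, s).flatten(2).transpose(1, 2)  # (B, s*s, C)
            for s in self.pool_sizes
        ]
        kv = torch.cat(pooled, dim=1)

        # Full-resolution queries attend to the compact pooled summary.
        out, _ = self.attn(x, kv, kv)
        return out
```

With the default sizes, the key/value sequence has only 1 + 4 + 9 + 36 = 50 tokens regardless of input resolution, so the attention cost grows linearly with H*W rather than quadratically.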
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information it provides and is not responsible for any consequences arising from its use.