Vision Transformers with Hierarchical Attention
- URL: http://arxiv.org/abs/2106.03180v5
- Date: Tue, 26 Mar 2024 07:44:45 GMT
- Title: Vision Transformers with Hierarchical Attention
- Authors: Yun Liu, Yu-Huan Wu, Guolei Sun, Le Zhang, Ajad Chhatkuli, Luc Van Gool
- Abstract summary: This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vision transformers.
We propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion.
We build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net.
- Score: 61.16912607330001
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: This paper tackles the high computational/space complexity associated with Multi-Head Self-Attention (MHSA) in vanilla vision transformers. To this end, we propose Hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches, as commonly done, and view each patch as a token. The proposed H-MHSA then learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models global dependencies over the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with strong representation capacity. Since attention is computed over only a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of Hierarchical-Attention-based Transformer Networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection, and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
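As a rough illustration of the hierarchical computation described in the abstract, the PyTorch sketch below applies attention inside local token windows, merges each window into a single token for global attention, and adds the global context back to the local features. It is a minimal sketch of the idea only, not the authors' HAT-Net code (see the repository above); the class name, the mean-pooling merge, and the grid size are illustrative assumptions.

```python
# Minimal sketch of the hierarchical attention idea (local -> merge -> global -> aggregate).
# Not the HAT-Net reference implementation; assumes H and W are divisible by `grid`.
import torch
import torch.nn as nn
import torch.nn.functional as F


class HierarchicalAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=8, grid=7):
        super().__init__()
        self.grid = grid  # side length of a local window, in tokens
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, N, C) token sequence for an H x W token map, N = H * W
        B, N, C = x.shape
        g = self.grid

        # 1) Local step: attention inside each g x g window of tokens.
        x2d = x.transpose(1, 2).reshape(B, C, H, W)
        windows = F.unfold(x2d, kernel_size=g, stride=g)               # (B, C*g*g, L)
        L = windows.shape[-1]                                          # number of windows
        windows = windows.reshape(B, C, g * g, L).permute(0, 3, 2, 1)  # (B, L, g*g, C)
        windows = windows.reshape(B * L, g * g, C)
        local, _ = self.local_attn(windows, windows, windows)          # (B*L, g*g, C)

        # 2) Merge step: each window becomes a single token (mean pooling).
        merged = local.mean(dim=1).reshape(B, L, C)                    # (B, L, C)

        # 3) Global step: attention over the small set of merged tokens.
        global_ctx, _ = self.global_attn(merged, merged, merged)       # (B, L, C)

        # 4) Aggregate: broadcast global context back to every token in its window
        #    and add it to the locally attended features.
        out = local.reshape(B, L, g * g, C) + global_ctx.unsqueeze(2)
        out = out.permute(0, 3, 2, 1).reshape(B, C * g * g, L)
        out = F.fold(out, output_size=(H, W), kernel_size=g, stride=g)
        return out.flatten(2).transpose(1, 2)                          # (B, N, C)
```

For a 56 x 56 token map with dim=64, `HierarchicalAttentionSketch(64, grid=7)(x, 56, 56)` returns a tensor of the same shape as `x`, while each attention call only ever sees 49 local tokens or 64 merged tokens, which is where the claimed cost reduction comes from.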
Related papers
- Revisiting the Integration of Convolution and Attention for Vision Backbone [59.50256661158862]
Convolutions and multi-head self-attentions (MHSAs) are typically considered alternatives to each other for building vision backbones.
We propose in this work to use MHSAs and Convs in parallel at different granularity levels instead.
We empirically verify the potential of the proposed integration scheme, named GLMix: by offloading the burden of fine-grained features to lightweight Convs, it is sufficient to use MHSAs in a few semantic slots.
arXiv Detail & Related papers (2024-11-21T18:59:08Z) - Vision Transformer with Super Token Sampling [93.70963123497327]
Vision transformers have achieved impressive performance on many vision tasks.
However, they may suffer from high redundancy in capturing local features in shallow layers.
Super tokens attempt to provide a semantically meaningful tessellation of visual content.
arXiv Detail & Related papers (2022-11-21T03:48:13Z) - MAFormer: A Transformer Network with Multi-scale Attention Fusion for Visual Recognition [45.68567088645708]
We introduce Multi-scale Attention Fusion into the transformer (MAFormer).
MAFormer explores local aggregation and global feature extraction in a dual-stream framework for visual recognition.
Our MAFormer achieves state-of-the-art performance on common vision tasks.
arXiv Detail & Related papers (2022-08-31T06:29:27Z) - Shunted Self-Attention via Multi-Scale Token Aggregation [124.16925784748601]
Recent Vision Transformer (ViT) models have demonstrated encouraging results across various computer vision tasks.
We propose shunted self-attention (SSA) that allows ViTs to model the attentions at hybrid scales per attention layer.
The SSA-based transformer achieves 84.0% Top-1 accuracy and outperforms the state-of-the-art Focal Transformer on ImageNet.
arXiv Detail & Related papers (2021-11-30T08:08:47Z) - Global Interaction Modelling in Vision Transformer via Super Tokens [20.700750237972155]
Window-based local attention is one of the major techniques being adopted in recent works.
We present a novel isotropic architecture that adopts local windows and special tokens, called Super tokens, for self-attention.
In standard image classification on ImageNet-1K, the proposed Super-token-based transformer (STT-S25) achieves 83.5% accuracy.
arXiv Detail & Related papers (2021-11-25T16:22:57Z) - P2T: Pyramid Pooling Transformer for Scene Understanding [62.41912463252468]
With our pooling-based MHSA plugged in, we build a downstream-task-oriented transformer network, dubbed Pyramid Pooling Transformer (P2T); see the sketch after this list.
arXiv Detail & Related papers (2021-06-22T18:28:52Z) - OmniNet: Omnidirectional Representations from Transformers [49.23834374054286]
This paper proposes Omnidirectional Representations from Transformers (OmniNet).
In OmniNet, instead of maintaining a strictly horizontal receptive field, each token is allowed to attend to all tokens in the entire network.
Experiments are conducted on autoregressive language modeling, machine translation, Long Range Arena (LRA), and image recognition.
arXiv Detail & Related papers (2021-03-01T15:31:54Z)
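For the pooling-based MHSA mentioned in the P2T entry above, the sketch below illustrates the general idea of shortening the key/value sequence by pyramid pooling before attention, so that the attention cost scales with the pooled length rather than the full token count. The class name and pooling ratios are illustrative assumptions; this is not the P2T reference implementation.

```python
# Rough sketch of pooling-based multi-head self-attention in the spirit of P2T:
# keys and values come from multi-scale pooled versions of the token map,
# while queries stay at full resolution. Names and ratios are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PyramidPoolingAttentionSketch(nn.Module):
    def __init__(self, dim, num_heads=8, pool_ratios=(2, 4, 8)):
        super().__init__()
        self.pool_ratios = pool_ratios
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):
        # x: (B, N, C) tokens for an H x W feature map
        B, N, C = x.shape
        x2d = x.transpose(1, 2).reshape(B, C, H, W)
        # Build a short key/value sequence from pooled maps at several scales.
        pooled = [
            F.adaptive_avg_pool2d(x2d, (max(H // r, 1), max(W // r, 1))).flatten(2)
            for r in self.pool_ratios
        ]
        kv = torch.cat(pooled, dim=2).transpose(1, 2)  # (B, total pooled length, C)
        out, _ = self.attn(x, kv, kv)                  # full-resolution queries
        return out                                     # (B, N, C)
```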
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the accuracy of the information presented and is not responsible for any consequences arising from its use.