Slide-Transformer: Hierarchical Vision Transformer with Local
Self-Attention
- URL: http://arxiv.org/abs/2304.04237v1
- Date: Sun, 9 Apr 2023 13:37:59 GMT
- Title: Slide-Transformer: Hierarchical Vision Transformer with Local
Self-Attention
- Authors: Xuran Pan, Tianzhu Ye, Zhuofan Xia, Shiji Song, Gao Huang
- Abstract summary: The self-attention mechanism has been a key factor in the recent progress of the Vision Transformer (ViT).
We propose a novel local attention module, which leverages common convolution operations to achieve high efficiency, flexibility and generalizability.
Our module realizes the local attention paradigm in an efficient and flexible manner.
- Score: 34.26177289099421
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The self-attention mechanism has been a key factor in the recent progress of the
Vision Transformer (ViT), which enables adaptive feature extraction from global
contexts. However, existing self-attention methods either adopt sparse global
attention or window attention to reduce the computational complexity, which may
compromise local feature learning or be subject to handcrafted designs.
In contrast, local attention, which restricts the receptive field of each query
to its own neighboring pixels, enjoys the benefits of both convolution and
self-attention, namely local inductive bias and dynamic feature selection.
Nevertheless, current local attention modules either use the inefficient Im2Col
function or rely on specific CUDA kernels that are hard to generalize to
devices without CUDA support. In this paper, we propose a novel local attention
module, Slide Attention, which leverages common convolution operations to
achieve high efficiency, flexibility and generalizability. Specifically, we
first re-interpret the column-based Im2Col function from a new row-based
perspective and use Depthwise Convolution as an efficient substitution. On this
basis, we propose a deformed shifting module based on the re-parameterization
technique, which further relaxes the fixed key/value positions to deformed
features in the local region. In this way, our module realizes the local
attention paradigm in an efficient and flexible manner. Extensive experiments
show that our slide attention module is applicable to a variety of advanced
Vision Transformer models and compatible with various hardware devices, and
achieves consistently improved performance on comprehensive benchmarks. Code
is available at https://github.com/LeapLabTHU/Slide-Transformer.
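The two steps described in the abstract, replacing column-based Im2Col gathering with per-offset depthwise convolutions and relaxing the fixed shifts through a learnable, re-parameterizable branch, lend themselves to a compact sketch. The following PyTorch code is a minimal illustration under assumed names (SlideAttentionSketch, window_size, the zero-initialized deformed branch) and with simplified border handling; it is not the authors' released implementation, which is available at the repository linked above.

```python
# Hypothetical sketch of local attention built from shift-style depthwise
# convolutions; names and initialization choices are assumptions, not the
# authors' code.
import torch
import torch.nn as nn


class SlideAttentionSketch(nn.Module):
    """Each query attends to its k*k neighborhood. The k*k key/value maps are
    gathered by depthwise convolutions (the row-based view of Im2Col) instead
    of an unfold/Im2Col call, plus a learnable "deformed shifting" branch."""

    def __init__(self, dim, num_heads=4, window_size=3):
        super().__init__()
        assert dim % num_heads == 0
        self.heads, self.k = num_heads, window_size
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Conv2d(dim, dim * 3, 1)
        self.proj = nn.Conv2d(dim, dim, 1)

        self.shift_convs = nn.ModuleList()   # frozen one-hot "shift" kernels
        self.deform_convs = nn.ModuleList()  # learnable deformed-shift kernels
        for idx in range(self.k ** 2):
            shift = nn.Conv2d(dim * 2, dim * 2, self.k, padding=self.k // 2,
                              groups=dim * 2, bias=False)
            w = torch.zeros_like(shift.weight)           # (2*dim, 1, k, k)
            w[:, 0, idx // self.k, idx % self.k] = 1.0   # select one neighbor offset
            shift.weight.data.copy_(w)
            shift.weight.requires_grad = False
            deform = nn.Conv2d(dim * 2, dim * 2, self.k, padding=self.k // 2,
                               groups=dim * 2, bias=False)
            nn.init.zeros_(deform.weight)  # assumed: start out as a pure shift
            self.shift_convs.append(shift)
            self.deform_convs.append(deform)

    def forward(self, x):                                # x: (B, C, H, W)
        B, C, H, W = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=1)
        kv = torch.cat([k, v], dim=1)                    # (B, 2C, H, W)

        # k*k shifted (and deformed) key/value maps: (B, k*k, 2C, H, W).
        kv_win = torch.stack([s(kv) + d(kv) for s, d in
                              zip(self.shift_convs, self.deform_convs)], dim=1)

        h, d = self.heads, C // self.heads
        q = q.reshape(B, h, d, H * W)                    # (B, h, d, N)
        kv_win = kv_win.reshape(B, self.k ** 2, 2, h, d, H * W)
        k_win, v_win = kv_win[:, :, 0], kv_win[:, :, 1]  # (B, k*k, h, d, N)

        # Attention of every query over its own k*k neighbors (borders are
        # simply zero-padded here; a real implementation would mask them).
        attn = torch.einsum("bhdn,bwhdn->bwhn", q * self.scale, k_win).softmax(dim=1)
        out = torch.einsum("bwhn,bwhdn->bhdn", attn, v_win)
        return self.proj(out.reshape(B, C, H, W))


# Example: a (2, 64, 14, 14) input gives a same-shaped output.
# y = SlideAttentionSketch(dim=64)(torch.randn(2, 64, 14, 14))
```

In this reading, the structural re-parameterization the abstract mentions amounts to merging each frozen one-hot shift kernel with its learned counterpart into a single depthwise convolution at inference time, so the deformed shifting adds no extra branches after training.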
Related papers
- LoFLAT: Local Feature Matching using Focused Linear Attention Transformer [36.53651224633837]
We propose LoFLAT, a novel local feature matching method based on a Focused Linear Attention Transformer.
Our LoFLAT consists of three main modules: the Feature Extraction Module, the Feature Transformer Module, and the Matching Module.
The proposed LoFLAT outperforms the LoFTR method in terms of both efficiency and accuracy.
arXiv Detail & Related papers (2024-10-30T05:38:07Z)
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- DAT++: Spatially Dynamic Vision Transformer with Deformable Attention [87.41016963608067]
We present the Deformable Attention Transformer (DAT++), an efficient and effective vision backbone for visual recognition.
DAT++ achieves state-of-the-art results on various visual recognition benchmarks, with 85.9% ImageNet accuracy, 54.5 and 47.0 MS-COCO instance segmentation mAP, and 51.5 ADE20K semantic segmentation mIoU.
arXiv Detail & Related papers (2023-09-04T08:26:47Z)
- FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness.
arXiv Detail & Related papers (2023-08-01T10:37:12Z)
- CSformer: Bridging Convolution and Transformer for Compressive Sensing [65.22377493627687]
This paper proposes a hybrid framework that integrates the detailed spatial information captured by CNNs with the global context provided by transformers for enhanced representation learning.
The proposed approach is an end-to-end compressive image sensing method, composed of adaptive sampling and recovery.
The experimental results demonstrate the effectiveness of the dedicated transformer-based architecture for compressive sensing.
arXiv Detail & Related papers (2021-12-31T04:37:11Z)
- LCTR: On Awakening the Local Continuity of Transformer for Weakly Supervised Object Localization [38.376238216214524]
Weakly supervised object localization (WSOL) aims to learn an object localizer solely from image-level labels.
We propose a novel framework built upon the transformer, termed LCTR, which aims to enhance the local perception capability of global features.
arXiv Detail & Related papers (2021-12-10T01:48:40Z)
- X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operations.
The proposed X-volution achieves highly competitive visual understanding improvements.
arXiv Detail & Related papers (2021-06-04T04:32:02Z)
- LocalViT: Bringing Locality to Vision Transformers [132.42018183859483]
Locality is essential for images since it pertains to structures like lines, edges, shapes, and even objects.
We add locality to vision transformers by introducing depth-wise convolution into the feed-forward network (a minimal sketch of this idea appears after this list).
This seemingly simple solution is inspired by the comparison between feed-forward networks and inverted residual blocks.
arXiv Detail & Related papers (2021-04-12T17:59:22Z)
- Patch-NetVLAD: Multi-Scale Fusion of Locally-Global Descriptors for Place Recognition [29.282413482297255]
This paper introduces Patch-NetVLAD, which provides a novel formulation for combining the advantages of both local and global descriptor methods.
We show that Patch-NetVLAD outperforms both global and local feature descriptor-based methods with comparable compute.
It is also adaptable to user requirements, with a speed-optimised version operating over an order of magnitude faster than the state-of-the-art.
arXiv Detail & Related papers (2021-03-02T05:53:32Z)
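The LocalViT entry above describes its mechanism concretely enough to illustrate. Below is a hedged PyTorch sketch of a feed-forward block with a depthwise convolution inserted between the two pointwise layers, matching the inverted-residual analogy; class and argument names (LocalityFFNSketch, hw) are assumptions, not LocalViT's actual API.

```python
# Hypothetical sketch (not the LocalViT code) of a locality-enhanced transformer
# feed-forward network: 1x1 expand -> 3x3 depthwise conv -> 1x1 project.
import torch
import torch.nn as nn


class LocalityFFNSketch(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.expand = nn.Conv2d(dim, hidden_dim, 1)        # role of the first Linear
        self.dw = nn.Conv2d(hidden_dim, hidden_dim, 3,
                            padding=1, groups=hidden_dim)  # depthwise conv: local mixing
        self.project = nn.Conv2d(hidden_dim, dim, 1)       # role of the second Linear
        self.act = nn.GELU()

    def forward(self, tokens, hw):
        # tokens: (B, N, C) sequence from the attention block, with N = H * W.
        B, N, C = tokens.shape
        H, W = hw
        x = tokens.transpose(1, 2).reshape(B, C, H, W)     # sequence -> feature map
        x = self.act(self.expand(x))
        x = self.act(self.dw(x))
        x = self.project(x)
        return x.reshape(B, C, N).transpose(1, 2)          # feature map -> sequence


# Example: y = LocalityFFNSketch(192, 768)(torch.randn(2, 196, 192), (14, 14))
```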
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of this list (including all information) and is not responsible for any consequences of its use.