AxWin Transformer: A Context-Aware Vision Transformer Backbone with
Axial Windows
- URL: http://arxiv.org/abs/2305.01280v1
- Date: Tue, 2 May 2023 09:33:11 GMT
- Title: AxWin Transformer: A Context-Aware Vision Transformer Backbone with
Axial Windows
- Authors: Fangjian Lin, Yizhe Ma, Sitong Wu, Long Yu, Shengwei Tian
- Abstract summary: Recently Transformer has shown good performance in several vision tasks due to its powerful modeling capabilities.
We propose AxWin Attention, which models context information in both local windows and axial views.
Based on the AxWin Attention, we develop a context-aware vision transformer backbone, named AxWin Transformer.
- Score: 4.406336825345075
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recently Transformer has shown good performance in several vision tasks due
to its powerful modeling capabilities. To reduce the quadratic complexity
caused by the attention, some outstanding work restricts attention to local
regions or extends axial interactions. However, these methods often lack the
interaction of local and global information, balancing coarse and fine-grained
information. To address this problem, we propose AxWin Attention, which models
context information in both local windows and axial views. Based on the AxWin
Attention, we develop a context-aware vision transformer backbone, named AxWin
Transformer, which outperforms state-of-the-art methods in both
classification and downstream segmentation and detection tasks.
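The abstract describes attention restricted to local windows and axial views as a way around the quadratic cost of full attention. The sketch below only illustrates that general idea and is not the authors' implementation: the function names, the row-major token layout, the square non-overlapping windows, and the rule "same window, same row, or same column" are all assumptions made for illustration.

```python
def attention_mask(height, width, window):
    """Return mask[q][k]: may query token q attend to key token k?

    Assumed layout (hypothetical, not taken from the paper): tokens are
    numbered row-major on a height x width grid. A key is visible if it
    lies in the query's window x window tile (local view) or on the
    query's row or column (axial view).
    """
    n = height * width
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_row, q_col = divmod(q, width)
        for k in range(n):
            k_row, k_col = divmod(k, width)
            same_window = (
                q_row // window == k_row // window
                and q_col // window == k_col // window
            )
            same_axis = q_row == k_row or q_col == k_col
            mask[q][k] = same_window or same_axis
    return mask


def attended_pairs(mask):
    """Count the query-key pairs that the mask allows."""
    return sum(sum(row) for row in mask)


if __name__ == "__main__":
    height, width, window = 8, 8, 4
    mask = attention_mask(height, width, window)
    full_pairs = (height * width) ** 2  # every token attends to every token
    print("full attention pairs:", full_pairs)
    print("window + axial pairs:", attended_pairs(mask))
```

On an 8 x 8 grid with 4 x 4 windows, each query sees the 16 tokens in its window plus 4 more on its row and 4 more on its column, so 24 keys instead of 64. How the actual AxWin Attention splits and combines the two views is not specified in this abstract.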
Related papers
- ACC-ViT : Atrous Convolution's Comeback in Vision Transformers [5.224344210588584]
We introduce Atrous Attention, a fusion of regional and sparse attention, which can adaptively consolidate both local and global information.
We also propose a general vision transformer backbone, named ACC-ViT, following conventional practices for standard vision tasks.
ACC-ViT is therefore a strong vision backbone, which is also competitive in mobile-scale versions, ideal for niche applications with small datasets.
arXiv Detail & Related papers (2024-03-07T04:05:16Z)
- Attention Deficit is Ordered! Fooling Deformable Vision Transformers
with Collaborative Adversarial Patches [3.4673556247932225]
Deformable vision transformers significantly reduce the complexity of attention modeling.
Recent work has demonstrated adversarial attacks against conventional vision transformers.
We develop new collaborative attacks where a source patch manipulates attention to point to a target patch, which contains the adversarial noise to fool the model.
arXiv Detail & Related papers (2023-11-21T17:55:46Z)
- Xformer: Hybrid X-Shaped Transformer for Image Denoising [114.37510775636811]
We present a hybrid X-shaped vision Transformer, named Xformer, which performs notably well on image denoising tasks.
Xformer achieves state-of-the-art performance on the synthetic and real-world image denoising tasks.
arXiv Detail & Related papers (2023-03-11T16:32:09Z)
- Video Frame Interpolation with Transformer [55.12620857638253]
We introduce a novel framework, which takes advantage of Transformer to model long-range pixel correlation among video frames.
Our network is equipped with a novel cross-scale window-based attention mechanism, where cross-scale windows interact with each other.
arXiv Detail & Related papers (2022-05-15T09:30:28Z)
- V2X-ViT: Vehicle-to-Everything Cooperative Perception with Vision
Transformer [58.71845618090022]
We build a holistic attention model, namely V2X-ViT, to fuse information across on-road agents.
V2X-ViT consists of alternating layers of heterogeneous multi-agent self-attention and multi-scale window self-attention.
To validate our approach, we create a large-scale V2X perception dataset.
arXiv Detail & Related papers (2022-03-20T20:18:25Z)
- BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z)
- Blending Anti-Aliasing into Vision Transformer [57.88274087198552]
The discontinuous patch-wise tokenization process implicitly introduces jagged artifacts into attention maps.
The aliasing effect occurs when discrete patterns are used to produce high-frequency or continuous information, resulting in indistinguishable distortions.
We propose a plug-and-play Aliasing-Reduction Module (ARM) to alleviate the aforementioned issue.
arXiv Detail & Related papers (2021-10-28T14:30:02Z)
- Transformers Solve the Limited Receptive Field for Monocular Depth
Prediction [82.90445525977904]
We propose TransDepth, an architecture which benefits from both convolutional neural networks and transformers.
This is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels.
arXiv Detail & Related papers (2021-03-22T18:00:13Z)
This list is automatically generated from the titles and abstracts of the papers in this site.