Lightweight Structure-Aware Attention for Visual Understanding
- URL: http://arxiv.org/abs/2211.16289v1
- Date: Tue, 29 Nov 2022 15:20:14 GMT
- Title: Lightweight Structure-Aware Attention for Visual Understanding
- Authors: Heeseung Kwon, Francisco M. Castro, Manuel J. Marin-Jimenez, Nicolas Guil, Karteek Alahari
- Abstract summary: Vision Transformers (ViTs) have become a dominant paradigm for visual representation learning with self-attention operators.
We propose a novel attention operator, called lightweight structure-aware attention (LiSA), which offers better representation power with log-linear complexity.
Our experiments and ablation studies demonstrate that ViTs based on the proposed operator outperform self-attention and other existing operators.
- Score: 16.860625620412943
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers (ViTs) have become a dominant paradigm for visual
representation learning with self-attention operators. Although these operators
provide flexibility to the model with their adjustable attention kernels, they
suffer from inherent limitations: (1) the attention kernel is not
discriminative enough, resulting in high redundancy of the ViT layers, and (2)
the complexity in computation and memory is quadratic in the sequence length.
In this paper, we propose a novel attention operator, called lightweight
structure-aware attention (LiSA), which offers better representation power with
log-linear complexity. Our operator learns structural patterns by using a set
of relative position embeddings (RPEs). To achieve log-linear complexity, the
RPEs are approximated with fast Fourier transforms. Our experiments and
ablation studies demonstrate that ViTs based on the proposed operator
outperform self-attention and other existing operators, achieving
state-of-the-art results on ImageNet, and competitive results on other visual
understanding benchmarks such as COCO and Something-Something-V2. The source
code of our approach will be released online.
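The abstract describes two ingredients: a set of learned relative position embeddings (RPEs) that capture structural patterns, and an FFT-based approximation that brings the cost down to log-linear in the sequence length. The sketch below illustrates only that second ingredient and is not the authors' released LiSA code: it applies a one-dimensional relative-position kernel to token features as a circular convolution computed with FFTs, giving O(N log N) cost instead of the O(N^2) cost of a dense attention matrix. The function name `rpe_mix` and all shapes are assumptions made for the example.
```python
# Illustrative sketch only -- NOT the authors' LiSA implementation.
# A learned relative-position kernel is applied to token features as a
# circular convolution, computed via FFT in O(N log N).
import numpy as np

def rpe_mix(x: np.ndarray, rpe: np.ndarray) -> np.ndarray:
    """Mix token features with a relative-position kernel via FFT.

    x   : (n_tokens, dim) token features.
    rpe : (n_tokens,) relative-position weights, one per offset,
          treated here as a circular kernel.
    """
    n = x.shape[0]
    # Circular convolution in the token dimension becomes elementwise
    # multiplication in the frequency domain.
    x_f = np.fft.rfft(x, n=n, axis=0)      # (n//2 + 1, dim)
    k_f = np.fft.rfft(rpe, n=n)[:, None]   # (n//2 + 1, 1)
    return np.fft.irfft(x_f * k_f, n=n, axis=0)

# Toy usage: 16 tokens with 8-dimensional features.
x = np.random.randn(16, 8)
rpe = np.random.randn(16)
print(rpe_mix(x, rpe).shape)  # (16, 8)
```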
Related papers
- Fibottention: Inceptive Visual Representation Learning with Diverse Attention Across Heads [10.169639612525643]
Visual perception tasks are predominantly solved by ViTs.
Despite their effectiveness, ViTs encounter a computational bottleneck due to the cost of computing self-attention.
We propose the Fibottention architecture, which is built upon an approximation of self-attention.
arXiv Detail & Related papers (2024-06-27T17:59:40Z) - You Only Need Less Attention at Each Stage in Vision Transformers [19.660385306028047]
Vision Transformers (ViTs) capture the global information of images through self-attention modules.
We propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage.
Our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
arXiv Detail & Related papers (2024-06-01T12:49:16Z) - Betrayed by Attention: A Simple yet Effective Approach for Self-supervised Video Object Segmentation [76.68301884987348]
We propose a simple yet effective approach for self-supervised video object segmentation (VOS).
Our key insight is that the inherent structural dependencies present in DINO-pretrained Transformers can be leveraged to establish robust spatio-temporal segmentation correspondences in videos.
Our method demonstrates state-of-the-art performance across multiple unsupervised VOS benchmarks and excels in complex real-world multi-object video segmentation tasks.
arXiv Detail & Related papers (2023-11-29T18:47:17Z) - ViR: Towards Efficient Vision Retention Backbones [97.93707844681893]
We propose a new class of computer vision models, dubbed Vision Retention Networks (ViR).
ViR has dual parallel and recurrent formulations, which strike an optimal balance between fast inference and parallel training with competitive performance.
We have validated the effectiveness of ViR through extensive experiments with different dataset sizes and various image resolutions.
arXiv Detail & Related papers (2023-10-30T16:55:50Z) - FLatten Transformer: Vision Transformer using Focused Linear Attention [80.61335173752146]
Linear attention offers a much more efficient alternative with its linear complexity.
Current linear attention approaches either suffer from significant performance degradation or introduce additional computation overhead.
We propose a novel Focused Linear Attention module to achieve both high efficiency and expressiveness (a generic sketch of the linear-attention idea appears after this list).
arXiv Detail & Related papers (2023-08-01T10:37:12Z) - Multiscale Attention via Wavelet Neural Operators for Vision Transformers [0.0]
Transformers have achieved widespread success in computer vision. At their heart, there is a Self-Attention (SA) mechanism.
The standard SA mechanism has quadratic complexity in the sequence length, which impedes its utility for the long sequences appearing in high-resolution vision.
We introduce Multiscale Wavelet Attention (MWA) by leveraging wavelet neural operators, which incurs linear complexity in the sequence size.
arXiv Detail & Related papers (2023-03-22T09:06:07Z) - Synthesizer Based Efficient Self-Attention for Vision Tasks [10.822515889248676]
Self-attention module shows outstanding competence in capturing long-range relationships while enhancing performance on vision tasks, such as image classification and image captioning.
This paper proposes a self-attention plug-in module and its variants, namely Synthesizing Transformations (STT), for directly processing image tensor features.
arXiv Detail & Related papers (2022-01-05T02:07:32Z) - Dynamic Inference with Neural Interpreters [72.90231306252007]
We present Neural Interpreters, an architecture that factorizes inference in a self-attention network as a system of modules.
Inputs to the model are routed through a sequence of functions in a way that is learned end-to-end.
We show that Neural Interpreters perform on par with the vision transformer using fewer parameters, while being transferable to a new task in a sample-efficient manner.
arXiv Detail & Related papers (2021-10-12T23:22:45Z) - X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operation.
The proposed X-volution achieves highly competitive visual understanding improvements.
arXiv Detail & Related papers (2021-06-04T04:32:02Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
It brings a great benefit by scaling dimensions of depth/width/resolution/patch size without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
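The FLatten Transformer entry above motivates linear attention. As a generic illustration of why it is linear (this is not that paper's Focused Linear Attention module), the sketch below replaces softmax(QK^T)V with phi(Q)(phi(K)^T V): reordering the matrix products avoids ever forming the N x N attention map, so the cost grows linearly in the number of tokens N. The feature map phi(x) = elu(x) + 1 and all names are illustrative assumptions.
```python
# Minimal kernelized linear-attention sketch (NOT the FLatten module).
# softmax(Q K^T) V costs O(N^2 d); phi(Q) (phi(K)^T V) costs O(N d^2),
# i.e. linear in the number of tokens N.
import numpy as np

def elu_plus_one(x: np.ndarray) -> np.ndarray:
    # A common positive feature map used in linear attention.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q: np.ndarray, k: np.ndarray, v: np.ndarray) -> np.ndarray:
    """q, k: (N, d); v: (N, d_v). Returns (N, d_v)."""
    q, k = elu_plus_one(q), elu_plus_one(k)
    kv = k.T @ v                  # (d, d_v), computed once for all queries
    z = q @ k.sum(axis=0)         # (N,) per-query normalisation terms
    return (q @ kv) / z[:, None]

# Toy usage: 16 tokens, head dimension 8.
q, k, v = (np.random.randn(16, 8) for _ in range(3))
print(linear_attention(q, k, v).shape)  # (16, 8)
```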
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.