Learned Queries for Efficient Local Attention
- URL: http://arxiv.org/abs/2112.11435v1
- Date: Tue, 21 Dec 2021 18:52:33 GMT
- Title: Learned Queries for Efficient Local Attention
- Authors: Moab Arar, Ariel Shamir, Amit H. Bermano
- Abstract summary: The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
- Score: 11.123272845092611
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Vision Transformers (ViT) serve as powerful vision models. Unlike
convolutional neural networks, which dominated vision research in previous
years, vision transformers enjoy the ability to capture long-range dependencies
in the data. Nonetheless, an integral part of any transformer architecture, the
self-attention mechanism, suffers from high latency and inefficient memory
utilization, making it less suitable for high-resolution input images. To
alleviate these shortcomings, hierarchical vision models locally employ
self-attention on non-interleaving windows. This relaxation reduces the
complexity to be linear in the input size; however, it limits the cross-window
interaction, hurting the model performance. In this paper, we propose a new
shift-invariant local attention layer, called query and attend (QnA), that
aggregates the input locally in an overlapping manner, much like convolutions.
The key idea behind QnA is to introduce learned queries, which allow fast and
efficient implementation. We verify the effectiveness of our layer by
incorporating it into a hierarchical vision transformer model. We show
improvements in speed and memory complexity while achieving comparable accuracy
with state-of-the-art models. Finally, our layer scales especially well with
window size, requiring up to x10 less memory while being up to x5 faster than
existing methods.
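As a rough illustration of the learned-query idea, the sketch below implements a single-head, 1-D toy version in NumPy: one learned query vector attends over the keys and values of each overlapping window, so the layer aggregates locally much like a stride-1 convolution. This is a minimal sketch under assumed names and shapes (`qna_1d`, `w_k`, `w_v`, `window=3` are illustrative), not the paper's batched, multi-head, 2-D implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def qna_1d(x, w_k, w_v, learned_query, window=3):
    """Toy 1-D query-and-attend: a single learned query attends over
    each overlapping window of keys/values, like a stride-1 convolution."""
    n, d = x.shape
    keys = x @ w_k            # (n, d) key projection of the input tokens
    values = x @ w_v          # (n, d) value projection of the input tokens
    pad = window // 2
    keys = np.pad(keys, ((pad, pad), (0, 0)))
    values = np.pad(values, ((pad, pad), (0, 0)))
    out = np.zeros((n, d))
    for i in range(n):                       # one overlapping window per position
        k_win = keys[i:i + window]           # (window, d)
        v_win = values[i:i + window]
        attn = softmax(learned_query @ k_win.T / np.sqrt(d))  # (window,)
        out[i] = attn @ v_win                # attention-weighted local aggregation
    return out

# usage on random data
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(16, d))
w_k, w_v = rng.normal(size=(d, d)), rng.normal(size=(d, d))
q = rng.normal(size=(d,))
y = qna_1d(x, w_k, w_v, q)
print(y.shape)  # (16, 8)
```

Because the queries are learned and shared across all windows rather than extracted from each window's tokens, the key and value projections are computed once per token, which is what allows the fast and efficient implementation the abstract refers to.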
Related papers
- DRCT: Saving Image Super-resolution away from Information Bottleneck [7.765333471208582]
Vision Transformer-based approaches for low-level vision tasks have achieved widespread success.
Dense-residual-connected Transformer (DRCT) is proposed to mitigate the loss of spatial information.
Our approach surpasses state-of-the-art methods on benchmark datasets.
arXiv Detail & Related papers (2024-03-31T15:34:45Z) - Factorization Vision Transformer: Modeling Long Range Dependency with Local Window Cost [25.67071603343174]
We propose a factorization self-attention mechanism (FaSA) that enjoys both the advantages of local window cost and long-range dependency modeling capability.
FaViT achieves high performance and robustness, with computational complexity linear in the input image's spatial resolution.
Our FaViT-B2 significantly improves classification accuracy by 1% and robustness by 7%, while reducing model parameters by 14%.
arXiv Detail & Related papers (2023-12-14T02:38:12Z) - Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection [3.784298636620067]
Vision Transformer (ViT) models have demonstrated a breakthrough in a wide range of computer vision tasks.
These models struggle to capture high-frequency components of images, which can limit their ability to detect local textures and edge information.
We propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid.
arXiv Detail & Related papers (2023-08-31T19:56:14Z) - CageViT: Convolutional Activation Guided Efficient Vision Transformer [90.69578999760206]
This paper presents an efficient vision Transformer, called CageViT, that is guided by convolutional activation to reduce computation.
Our CageViT, unlike current Transformers, utilizes a new encoder to handle the rearranged tokens.
Experimental results demonstrate that the proposed CageViT outperforms the most recent state-of-the-art backbones by a large margin in terms of efficiency.
arXiv Detail & Related papers (2023-05-17T03:19:18Z) - Dynamic Spatial Sparsification for Efficient Vision Transformers and Convolutional Neural Networks [88.77951448313486]
We present a new approach for model acceleration by exploiting spatial sparsity in visual data.
We propose a dynamic token sparsification framework to prune redundant tokens (a toy pruning sketch appears after this list).
We extend our method to hierarchical models including CNNs and hierarchical vision Transformers.
arXiv Detail & Related papers (2022-07-04T17:00:51Z) - Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Local-to-Global Self-Attention in Vision Transformers [130.0369761612812]
Transformers have demonstrated great potential in computer vision tasks.
Some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows.
This design significantly improves the efficiency but lacks global feature reasoning in early stages.
In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage.
arXiv Detail & Related papers (2021-07-10T02:34:55Z) - Vision Xformers: Efficient Attention for Image Classification [0.0]
We modify the ViT architecture to work on longer sequence data by replacing the quadratic attention with efficient transformers.
We show that ViX performs better than ViT in image classification while consuming fewer computing resources.
arXiv Detail & Related papers (2021-07-05T19:24:23Z) - Less is More: Pay Less Attention in Vision Transformers [61.05787583247392]
The Less attention vIsion Transformer (LIT) builds upon the fact that convolutions, fully-connected layers, and self-attention have almost equivalent mathematical expressions for processing image patch sequences.
The proposed LIT achieves promising performance on image recognition tasks, including image classification, object detection and instance segmentation.
arXiv Detail & Related papers (2021-05-29T05:26:07Z) - Scalable Visual Transformers with Hierarchical Pooling [61.05787583247392]
We propose a Hierarchical Visual Transformer (HVT) which progressively pools visual tokens to shrink the sequence length.
This brings the benefit of scaling depth/width/resolution/patch-size dimensions without introducing extra computational complexity.
Our HVT outperforms the competitive baselines on ImageNet and CIFAR-100 datasets.
arXiv Detail & Related papers (2021-03-19T03:55:58Z)
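As referenced in the Dynamic Spatial Sparsification entry above, the following is a rough, hypothetical sketch of pruning redundant tokens by importance score: keep the top-scoring fraction of tokens and drop the rest. The scores here are random placeholders; the paper's actual prediction module, pruning ratios, and hierarchical extension are not described on this page.

```python
import numpy as np

def prune_tokens(tokens, scores, keep_ratio=0.7):
    """Keep the top-scoring fraction of tokens and drop the rest.
    `scores` stands in for a learned importance predictor."""
    n = tokens.shape[0]
    k = max(1, int(n * keep_ratio))
    keep = np.argsort(scores)[-k:]    # indices of the k most important tokens
    return tokens[np.sort(keep)]      # preserve the original token order

# usage with placeholder data
rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))   # e.g. 14x14 patch tokens
scores = rng.random(196)              # placeholder for predicted importance
pruned = prune_tokens(tokens, scores)
print(pruned.shape)                   # (137, 64) with keep_ratio=0.7
```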
This list is automatically generated from the titles and abstracts of the papers in this site.