Adaptive Attention Span in Computer Vision
- URL: http://arxiv.org/abs/2004.08708v1
- Date: Sat, 18 Apr 2020 21:32:47 GMT
- Title: Adaptive Attention Span in Computer Vision
- Authors: Jerrod Parker, Shakti Kumar, Joe Roussy
- Abstract summary: Recent developments in Transformers for language modeling have opened new areas of research in computer vision.
Results from late 2019 showed vast performance increases in both object detection and recognition when convolutions are replaced by local self-attention kernels.
Models using local self-attention kernels were also shown to have fewer parameters and FLOPs compared to equivalent architectures that use only convolutions.
- Score: 5.371337604556311
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent developments in Transformers for language modeling have opened new
areas of research in computer vision. Results from late 2019 showed vast
performance increases in both object detection and recognition when
convolutions are replaced by local self-attention kernels. Models using local
self-attention kernels were also shown to have fewer parameters and FLOPs
compared to equivalent architectures that only use convolutions. In this work
we propose a novel method for learning the local self-attention kernel size. We
then compare its performance to fixed-size local attention and convolution
kernels. The code for all our experiments and models is available at
https://github.com/JoeRoussy/adaptive-attention-in-cv
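The core mechanism, learning how far local self-attention should reach, can be illustrated with the soft-masking idea of Sukhbaatar et al. (2019), which this work adapts from 1D language modeling to 2D local attention. The sketch below is a minimal illustration under assumed names, shapes, and distance parameterization; the authors' actual implementation lives in the repository linked above.

```python
# Minimal sketch of a learnable attention span over a local 2D window.
# All names and shapes are illustrative, not the paper's implementation.
import torch
import torch.nn as nn

class AdaptiveSpanMask(nn.Module):
    """Soft mask m(d) = clamp((z + R - d) / R, 0, 1) over neighbor
    distances d, where z is a learnable span and R softens the cutoff."""

    def __init__(self, max_span: int, ramp: float = 1.0):
        super().__init__()
        self.max_span = max_span
        self.ramp = ramp
        # Span is parameterized in [0, 1] and scaled to [0, max_span].
        self.span = nn.Parameter(torch.tensor(0.5))

    def forward(self, attn_logits: torch.Tensor, dist: torch.Tensor) -> torch.Tensor:
        # attn_logits: (..., window) raw scores for each neighbor of a query
        # dist:        (window,) distance of each neighbor from the query pixel
        z = self.span.clamp(0, 1) * self.max_span
        mask = ((z + self.ramp - dist) / self.ramp).clamp(0, 1)
        # Mask the attention weights and renormalize: neighbors beyond the
        # learned span get zero weight, and gradients through the ramp
        # region let the model grow or shrink the span during training.
        weights = torch.softmax(attn_logits, dim=-1) * mask
        return weights / (weights.sum(dim=-1, keepdim=True) + 1e-8)
```

Because the mask is differentiable in the span parameter, the kernel size can be trained jointly with the rest of the network and regularized toward small spans to save computation.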
Related papers
- Laplacian-Former: Overcoming the Limitations of Vision Transformers in Local Texture Detection [3.784298636620067]
Vision Transformer (ViT) models have demonstrated a breakthrough in a wide range of computer vision tasks.
These models struggle to capture high-frequency components of images, which can limit their ability to detect local textures and edge information.
We propose a new technique, Laplacian-Former, that enhances the self-attention map by adaptively re-calibrating the frequency information in a Laplacian pyramid.
arXiv Detail & Related papers (2023-08-31T19:56:14Z)
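A Laplacian pyramid isolates exactly the high-frequency bands that plain ViTs tend to miss; Laplacian-Former re-injects this signal into its self-attention maps. The sketch below shows only the pyramid itself, assuming average-pool downsampling and bilinear upsampling; the paper's re-calibration and fusion steps are omitted.

```python
# Minimal sketch of a Laplacian pyramid over a feature map: each level
# keeps the high-frequency residual lost by one 2x downsampling.
import torch
import torch.nn.functional as F

def laplacian_pyramid(x: torch.Tensor, levels: int = 3):
    """x: (B, C, H, W) with H, W divisible by 2**levels.
    Returns band-pass residuals, finest scale first."""
    bands = []
    current = x
    for _ in range(levels):
        down = F.avg_pool2d(current, kernel_size=2)
        up = F.interpolate(down, size=current.shape[-2:],
                           mode="bilinear", align_corners=False)
        bands.append(current - up)  # high-frequency residual at this scale
        current = down
    return bands
```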
- Slide-Transformer: Hierarchical Vision Transformer with Local Self-Attention [34.26177289099421]
The self-attention mechanism has been a key factor in the recent progress of the Vision Transformer (ViT).
We propose a novel local attention module that leverages common convolution operations to realize the local attention paradigm with high efficiency, flexibility, and generalizability.
arXiv Detail & Related papers (2023-04-09T13:37:59Z)
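Slide-Transformer's efficiency rests on expressing the neighbor gathering of local attention with ordinary convolutions. A minimal sketch of that reduction, under assumed names and shapes: each one-hot depthwise kernel shifts the feature map by one offset, so k*k such kernels reproduce an im2col-style unfold with standard, hardware-friendly ops.

```python
# Minimal sketch: gathering a k x k key/value neighborhood with plain
# depthwise convolutions instead of an explicit im2col/unfold.
import torch
import torch.nn.functional as F

def gather_neighbors(x: torch.Tensor, k: int = 3) -> torch.Tensor:
    """x: (B, C, H, W) -> (B, C, k*k, H, W), the k*k shifted views of x."""
    B, C, H, W = x.shape
    # One depthwise kernel per offset; a one-hot kernel at position
    # (i, j) shifts the map by (i - k//2, j - k//2).
    weight = torch.zeros(C * k * k, 1, k, k, device=x.device, dtype=x.dtype)
    for idx in range(k * k):
        weight[idx::k * k, 0, idx // k, idx % k] = 1.0
    out = F.conv2d(x, weight, padding=k // 2, groups=C)
    return out.view(B, C, k * k, H, W)
```

In the full paper the shifts are further fused and made learnable; the point here is only that local attention's data movement needs nothing beyond common convolution primitives.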
- ParCNetV2: Oversized Kernel with Enhanced Attention [60.141606180434195]
We introduce a convolutional neural network architecture named ParCNetV2.
It extends position-aware circular convolution (ParCNet) with oversized convolutions and strengthens attention through bifurcate gate units.
Our method outperforms other pure convolutional neural networks as well as neural networks hybridizing CNNs and transformers.
arXiv Detail & Related papers (2022-11-14T07:22:55Z)
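The oversized-kernel ingredient can be shown in isolation. The sketch below is an assumed simplification: a depthwise kernel as large as the feature map combined with circular padding, so every output position sees the whole map using pure convolution; ParCNetV2's bifurcate gate units and the rest of the architecture are omitted.

```python
# Minimal sketch of a depthwise convolution with a feature-map-sized
# kernel and circular padding, giving a global receptive field.
import torch
import torch.nn as nn
import torch.nn.functional as F

class OversizedCircularConv(nn.Module):
    def __init__(self, channels: int, size: int):
        super().__init__()
        # One full-size (size x size) depthwise kernel per channel.
        self.weight = nn.Parameter(torch.randn(channels, 1, size, size) * 0.02)
        self.size = size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Circular padding by a total of size - 1 keeps the output shape
        # equal to the input shape while wrapping the borders.
        k = self.size
        pad_l, pad_r = k // 2, k - 1 - k // 2
        x = F.pad(x, (pad_l, pad_r, pad_l, pad_r), mode="circular")
        return F.conv2d(x, self.weight, groups=self.weight.shape[0])
```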
- Focal Sparse Convolutional Networks for 3D Object Detection [121.45950754511021]
We introduce two new modules to enhance the capability of sparse CNNs: focal sparse convolution (Focals Conv) and its multi-modal variant, focal sparse convolution with fusion.
For the first time, we show that spatially learnable sparsity in sparse convolution is essential for sophisticated 3D object detection.
arXiv Detail & Related papers (2022-04-26T17:34:10Z)
- Adaptive Split-Fusion Transformer [90.04885335911729]
We propose an Adaptive Split-Fusion Transformer (ASF-former) to treat convolutional and attention branches differently with adaptive weights.
Experiments on standard benchmarks such as ImageNet-1K show that our ASF-former outperforms its CNN and transformer counterparts, as well as hybrid pilot models, in terms of accuracy.
arXiv Detail & Related papers (2022-04-26T10:00:28Z)
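The adaptive part is the fusion: branch outputs are combined with input-dependent weights rather than a fixed sum. A minimal sketch, assuming a tiny gating head over pooled tokens; the branch modules and all names are placeholders, not the ASF-former architecture.

```python
# Minimal sketch of adaptively fusing a convolution branch and an
# attention branch with input-dependent weights.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    def __init__(self, dim: int, conv_branch: nn.Module, attn_branch: nn.Module):
        super().__init__()
        self.conv_branch = conv_branch
        self.attn_branch = attn_branch
        # Gating head: pooled features -> two normalized fusion weights.
        self.gate = nn.Sequential(nn.Linear(dim, 2), nn.Softmax(dim=-1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) token sequence; both branches preserve that shape.
        w = self.gate(x.mean(dim=1))                 # (B, 2) per-sample weights
        c, a = self.conv_branch(x), self.attn_branch(x)
        return w[:, :1, None] * c + w[:, 1:, None] * a
```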
- BOAT: Bilateral Local Attention Vision Transformer [70.32810772368151]
Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large.
Recent Vision Transformers adopt local self-attention mechanisms, where self-attention is computed within local windows.
We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention.
arXiv Detail & Related papers (2022-01-31T07:09:50Z)
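Image-space local attention is the familiar windowed kind; the feature-space half groups tokens by similarity instead of position, so distant but similar pixels can still interact. The sketch below is a crude stand-in for that idea, assuming tokens are ordered by a learned 1-D score and chunked into equal-size groups; BOAT's actual grouping is a balanced hierarchical clustering.

```python
# Minimal sketch of feature-space local attention: attend within groups
# of similar tokens rather than within spatial windows.
import torch
import torch.nn as nn

class FeatureSpaceLocalAttn(nn.Module):
    def __init__(self, dim: int, heads: int = 4, group: int = 64):
        super().__init__()
        self.group = group
        self.score = nn.Linear(dim, 1)   # crude 1-D similarity ordering
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, D = x.shape                # assumes N divisible by self.group
        order = self.score(x).squeeze(-1).argsort(dim=1)  # similar tokens adjacent
        inv = order.argsort(dim=1)                        # inverse permutation
        xs = x.gather(1, order.unsqueeze(-1).expand(B, N, D))
        g = xs.reshape(B * (N // self.group), self.group, D)
        out, _ = self.attn(g, g, g)      # attention within each group
        out = out.reshape(B, N, D)
        return out.gather(1, inv.unsqueeze(-1).expand(B, N, D))
```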
- Learned Queries for Efficient Local Attention [11.123272845092611]
The self-attention mechanism in vision transformers suffers from high latency and inefficient memory utilization.
We propose a new shift-invariant local attention layer, called query and attend (QnA), that aggregates the input locally in an overlapping manner.
We show improvements in speed and memory complexity while achieving comparable accuracy with state-of-the-art models.
arXiv Detail & Related papers (2021-12-21T18:52:33Z)
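QnA's key move is that the query is a learned parameter shared by every window rather than a function of the input, which makes the layer shift-invariant and lets window scores be computed cheaply. A single-head sketch under assumed names and shapes, without the paper's downsampling or multi-query details:

```python
# Minimal sketch of local attention with a learned, position-shared query.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LearnedQueryLocalAttn(nn.Module):
    def __init__(self, dim: int, k: int = 3):
        super().__init__()
        self.k = k
        self.query = nn.Parameter(torch.randn(dim) * 0.02)  # shared query
        self.to_kv = nn.Linear(dim, 2 * dim)
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, dim)
        B, H, W, D = x.shape
        keys, values = self.to_kv(x).chunk(2, dim=-1)

        def windows(t):
            # (B, H, W, D) -> (B, D, k*k, H*W): overlapping k x k windows.
            t = t.permute(0, 3, 1, 2)
            t = F.unfold(t, self.k, padding=self.k // 2)
            return t.view(B, D, self.k ** 2, H * W)

        keys, values = windows(keys), windows(values)
        # One score per neighbor from the shared query, softmax per window.
        logits = torch.einsum("d,bdnw->bnw", self.query, keys) * self.scale
        attn = logits.softmax(dim=1)
        out = torch.einsum("bnw,bdnw->bdw", attn, values)
        return out.reshape(B, D, H, W).permute(0, 2, 3, 1)
```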
- X-volution: On the unification of convolution and self-attention [52.80459687846842]
We propose a multi-branch elementary module composed of both convolution and self-attention operations.
The proposed X-volution achieves highly competitive visual understanding improvements.
arXiv Detail & Related papers (2021-06-04T04:32:02Z)
- Involution: Inverting the Inherence of Convolution for Visual Recognition [72.88582255910835]
We present a novel atomic operation for deep neural networks by inverting the principles of convolution, coined as involution.
The proposed involution operator could be leveraged as fundamental bricks to build the new generation of neural networks for visual recognition.
Our involution-based models improve convolutional ResNet-50 baselines by up to 1.6% top-1 accuracy, 2.5% and 2.4% bounding box AP, and 4.7% mean IoU in absolute terms.
arXiv Detail & Related papers (2021-03-10T18:40:46Z)
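Involution flips convolution's sharing pattern: instead of one kernel shared across all positions, the kernel is generated from the feature at each position and shared across channels. The sketch below follows the paper's bottleneck kernel generation loosely, with a single channel group and no stride for simplicity; names are illustrative.

```python
# Minimal sketch of involution: a per-position kernel generated from the
# input, applied to the k x k neighborhood and shared across channels.
import torch
import torch.nn as nn
import torch.nn.functional as F

class Involution2d(nn.Module):
    def __init__(self, channels: int, k: int = 3, reduction: int = 4):
        super().__init__()
        self.k = k
        # Kernel-generating bottleneck: per-pixel features -> k*k weights.
        self.reduce = nn.Conv2d(channels, channels // reduction, 1)
        self.expand = nn.Conv2d(channels // reduction, k * k, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, C, H, W = x.shape
        kernel = self.expand(F.relu(self.reduce(x)))        # (B, k*k, H, W)
        patches = F.unfold(x, self.k, padding=self.k // 2)  # (B, C*k*k, H*W)
        patches = patches.view(B, C, self.k ** 2, H, W)
        # Weight each neighbor by the kernel generated at its center pixel.
        return (kernel.unsqueeze(1) * patches).sum(dim=2)   # (B, C, H, W)
```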
- SA-Det3D: Self-Attention Based Context-Aware 3D Object Detection [9.924083358178239]
We propose two variants of self-attention for contextual modeling in 3D object detection.
We first incorporate the pairwise self-attention mechanism into the current state-of-the-art BEV, voxel and point-based detectors.
Next, we propose a self-attention variant that samples a subset of the most representative features by learning deformations over randomly sampled locations.
arXiv Detail & Related papers (2021-01-07T18:30:32Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences arising from its use.