Rethinking Query-Key Pairwise Interactions in Vision Transformers
- URL: http://arxiv.org/abs/2207.00188v2
- Date: Mon, 4 Jul 2022 02:23:46 GMT
- Title: Rethinking Query-Key Pairwise Interactions in Vision Transformers
- Authors: Cheng Li, Yangxin Liu
- Abstract summary: We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark.
- Score: 5.141895475956681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have achieved state-of-the-art performance in many visual
tasks. Due to the quadratic computational and memory complexities of
self-attention, recent works either apply attention only to low-resolution
inputs or restrict the receptive field to a small local region. To overcome
these limitations, we propose key-only attention, which excludes query-key
pairwise interactions and uses a compute-efficient saliency-gate to obtain
attention weights, modeling local-global interactions in all stages. Key-only
attention has linear computational and memory complexities w.r.t. input size. We
use an alternating layout to hybridize convolution and attention layers, instead
of the grafting suggested by previous works, so that all stages can benefit
from both spatial attention and convolutions. We leverage these improvements
to develop a new self-attention model family, LinGlos, which reaches
state-of-the-art accuracy in the parameter-limited setting of the ImageNet
classification benchmark and outperforms baselines significantly in downstream
tasks, e.g., COCO object detection and ADE20K semantic segmentation.
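The paper itself is not accompanied by code here, but a minimal PyTorch sketch can illustrate the general idea of key-only attention: attention weights are derived from the keys alone through a lightweight saliency gate, so no N x N query-key matrix is ever formed and the cost stays linear in the number of tokens. The module and parameter names below are illustrative assumptions, not the LinGlos implementation.

```python
import torch
import torch.nn as nn


class KeyOnlyAttention(nn.Module):
    """Hypothetical key-only attention block: a scalar saliency gate over the
    keys replaces the N x N query-key score matrix, so the cost is linear in
    the number of tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.saliency_gate = nn.Linear(dim, 1)  # one saliency logit per token
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        k = self.to_k(x)
        v = self.to_v(x)
        # Normalise saliency over positions: an (N, 1) vector instead of an
        # (N, N) pairwise attention matrix.
        attn = self.saliency_gate(k).softmax(dim=1)        # (B, N, 1)
        global_ctx = (attn * v).sum(dim=1, keepdim=True)   # (B, 1, dim)
        # Broadcast the pooled global context back to every token.
        return self.proj(x + global_ctx)


# Example: 196 tokens (14 x 14 patches), 96-dim embeddings.
tokens = torch.randn(2, 196, 96)
out = KeyOnlyAttention(96)(tokens)   # -> (2, 196, 96)
```

In the alternating layout described in the abstract, a block like this would simply be interleaved with convolutional blocks in every stage, rather than grafted only onto the later, low-resolution stages.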
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
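As a rough illustration of treating attention scores as a feature map, the sketch below refines the (heads x N x N) score tensor with a depthwise 2D convolution before the softmax; the layer names, kernel size, and residual placement are assumptions for illustration, not details taken from the DAPE V2 paper.

```python
import torch
import torch.nn as nn


class ConvProcessedAttention(nn.Module):
    """Sketch: treat raw attention scores (heads x N x N) as a feature map
    and refine them with a depthwise 2D convolution before the softmax."""

    def __init__(self, dim: int, num_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        # Depthwise convolution over the (query, key) axes, one filter per head.
        self.score_conv = nn.Conv2d(num_heads, num_heads, kernel_size,
                                    padding=kernel_size // 2, groups=num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (B, H, N, N)
        scores = scores + self.score_conv(scores)  # refine the score "feature map"
        out = scores.softmax(dim=-1) @ v            # (B, H, N, head_dim)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))


out = ConvProcessedAttention(96)(torch.randn(2, 196, 96))  # -> (2, 196, 96)
```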
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models [96.76995840807615]
HiRes-LLaVA is a novel framework designed to process high-resolution inputs of any size without altering the original contextual and geometric information.
HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler to compress the vision tokens based on themselves.
arXiv Detail & Related papers (2024-07-11T17:42:17Z)
- You Only Need Less Attention at Each Stage in Vision Transformers [19.660385306028047]
Vision Transformers (ViTs) capture the global information of images through self-attention modules.
We propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage.
Our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
arXiv Detail & Related papers (2024-06-01T12:49:16Z)
- Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention-based graph neural networks (GNNs) with fully-connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
arXiv Detail & Related papers (2023-07-04T02:50:44Z)
- RFAConv: Innovating Spatial Attention and Standard Convolutional Operation [7.2646541547165056]
We propose a novel attention mechanism called Receptive-Field Attention (RFA).
RFA not only focuses on receptive-field spatial features but also provides effective attention weights for large convolutional kernels.
It adds a nearly negligible increment in computational cost and parameters, while significantly improving network performance.
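One possible reading of receptive-field attention is sketched below: every k x k receptive field gets its own softmax-normalised weights, the unfolded features are reweighted, and a 1x1 convolution aggregates them. The weight-generation and aggregation choices here are assumptions; the actual RFAConv design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReceptiveFieldAttentionConv(nn.Module):
    """Sketch: attend over the k*k positions of each receptive field,
    reweight the unfolded features, then aggregate with a 1x1 convolution."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.k = k
        # One attention logit per channel and receptive-field position.
        self.weight_gen = nn.Conv2d(in_ch, in_ch * k * k, kernel_size=k,
                                    padding=k // 2, groups=in_ch)
        self.aggregate = nn.Conv2d(in_ch * k * k, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        k = self.k
        # Attention over the k*k positions of each receptive field.
        logits = self.weight_gen(x).view(b, c, k * k, h, w)
        attn = logits.softmax(dim=2)
        # Unfold the k x k receptive field around every spatial location.
        fields = F.unfold(x, k, padding=k // 2).view(b, c, k * k, h, w)
        weighted = (attn * fields).view(b, c * k * k, h, w)
        return self.aggregate(weighted)


out = ReceptiveFieldAttentionConv(32, 64)(torch.randn(1, 32, 56, 56))  # -> (1, 64, 56, 56)
```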
arXiv Detail & Related papers (2023-04-06T16:21:56Z)
- BiFormer: Vision Transformer with Bi-Level Routing Attention [26.374724782056557]
We propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness.
Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions.
Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented.
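The routing described above could look roughly like the sketch below over a 1-D token sequence; BiFormer itself routes over 2-D region grids and includes multi-head and other details, and `num_regions` and `topk` here are illustrative parameters.

```python
import torch


def bi_level_routing_attention(q, k, v, num_regions: int, topk: int):
    """Sketch of bi-level routing: (1) score region-to-region affinity with
    mean-pooled queries/keys and keep the top-k regions per query region,
    (2) run token-to-token attention only inside the kept regions.
    q, k, v: (batch, tokens, dim); tokens must divide evenly into regions."""
    b, n, d = q.shape
    r, m = num_regions, n // num_regions               # m tokens per region
    qr, kr, vr = (t.view(b, r, m, d) for t in (q, k, v))

    # Coarse level: region affinity from mean-pooled descriptors.
    region_aff = qr.mean(2) @ kr.mean(2).transpose(-2, -1)     # (B, R, R)
    routed = region_aff.topk(topk, dim=-1).indices             # (B, R, topk)

    # Gather key/value tokens of the routed regions for every query region.
    idx = routed[..., None, None].expand(b, r, topk, m, d)
    k_sel = torch.gather(kr.unsqueeze(1).expand(b, r, r, m, d), 2, idx)
    v_sel = torch.gather(vr.unsqueeze(1).expand(b, r, r, m, d), 2, idx)
    k_sel = k_sel.reshape(b, r, topk * m, d)
    v_sel = v_sel.reshape(b, r, topk * m, d)

    # Fine level: ordinary attention restricted to the routed token set.
    attn = (qr @ k_sel.transpose(-2, -1) / d ** 0.5).softmax(-1)  # (B, R, m, topk*m)
    return (attn @ v_sel).reshape(b, n, d)


q = k = v = torch.randn(1, 64, 32)
out = bi_level_routing_attention(q, k, v, num_regions=8, topk=2)  # -> (1, 64, 32)
```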
arXiv Detail & Related papers (2023-03-15T17:58:46Z)
- Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation [6.303594714446706]
The self-attention mechanism gauges pairwise correlations across the entire input sequence.
Despite favorable performance, calculating pairwise correlations is prohibitively costly.
This work addresses these constraints by architecting an accelerator, called SPRINT, which computes attention scores in an approximate manner.
arXiv Detail & Related papers (2022-09-01T17:18:19Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
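A simplified sketch of the cluster-then-attend idea follows, using a few plain k-means steps on the keys; the actual ClusTR clustering and aggregation procedure is more involved, and the cluster count and iteration count here are illustrative.

```python
import torch


def clustered_attention(q, k, v, num_clusters: int, iters: int = 3):
    """Sketch of content-based sparse attention: cluster the key tokens with
    a few k-means steps, average keys/values inside each cluster, then let
    every query attend over the much shorter clustered sequence."""
    b, n, d = k.shape
    # Initialise centroids from evenly spaced tokens.
    centroids = k[:, :: max(n // num_clusters, 1)][:, :num_clusters]   # (B, C, D)
    for _ in range(iters):
        assign = (k @ centroids.transpose(-2, -1)).argmax(-1)          # (B, N)
        one_hot = torch.nn.functional.one_hot(assign, num_clusters).float()
        counts = one_hot.sum(1).clamp(min=1).unsqueeze(-1)             # (B, C, 1)
        centroids = one_hot.transpose(1, 2) @ k / counts               # (B, C, D)
    # Aggregate values with the same assignment, then attend over the
    # clustered keys/values only: cost O(N * C) instead of O(N^2).
    v_clustered = one_hot.transpose(1, 2) @ v / counts                 # (B, C, D)
    attn = (q @ centroids.transpose(-2, -1) / d ** 0.5).softmax(-1)    # (B, N, C)
    return attn @ v_clustered                                          # (B, N, D)


q = k = v = torch.randn(2, 256, 64)
out = clustered_attention(q, k, v, num_clusters=16)  # -> (2, 256, 64)
```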
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification and for part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
We make use of new efficient attention algorithms, recently shown to be highly effective for Transformers.
This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches.
In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features.
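For intuition, the sketch below approximates softmax attention with a basic positive random-feature map so that attention reduces to two matrix products and scales linearly with sequence length; it is not the hybrid random features construction of the paper, only the generic idea it builds on, and the feature count is an illustrative parameter.

```python
import torch


def random_feature_attention(q, k, v, num_features: int = 64):
    """Sketch: map queries and keys through a positive random-feature map so
    softmax attention is approximated by two matrix products, avoiding the
    explicit N x N attention matrix."""
    d = q.shape[-1]
    w = torch.randn(num_features, d, device=q.device)     # random projections

    def phi(x):
        # Positive feature map whose inner products approximate
        # exp(<q, k> / sqrt(d)) in expectation.
        proj = (x / d ** 0.25) @ w.t()                     # (B, N, F)
        return torch.exp(proj - x.pow(2).sum(-1, keepdim=True)
                         / (2 * d ** 0.5)) / num_features ** 0.5

    q_f, k_f = phi(q), phi(k)
    kv = k_f.transpose(-2, -1) @ v                         # (B, F, D)
    normalizer = q_f @ k_f.sum(dim=1, keepdim=True).transpose(-2, -1)  # (B, N, 1)
    return (q_f @ kv) / normalizer.clamp(min=1e-6)


q, k, v = (torch.randn(1, 512, 64) for _ in range(3))
out = random_feature_attention(q, k, v)   # -> (1, 512, 64), no 512 x 512 matrix
```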
arXiv Detail & Related papers (2021-02-08T17:00:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.