Rethinking Query-Key Pairwise Interactions in Vision Transformers
- URL: http://arxiv.org/abs/2207.00188v2
- Date: Mon, 4 Jul 2022 02:23:46 GMT
- Title: Rethinking Query-Key Pairwise Interactions in Vision Transformers
- Authors: Cheng Li, Yangxin Liu
- Abstract summary: We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark.
- Score: 5.141895475956681
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Vision Transformers have achieved state-of-the-art performance in many visual
tasks. Due to the quadratic computational and memory complexities of
self-attention, recent works either apply attention only to low-resolution
inputs or restrict the receptive field to a small local region. To overcome
these limitations, we propose key-only attention, which excludes query-key
pairwise interactions and uses a compute-efficient saliency-gate to obtain
attention weights, modeling local-global interactions in all stages. Key-only
attention has linear computational and memory complexities w.r.t. input size. We
use an alternating layout to hybridize convolution and attention layers, instead
of the grafting suggested by previous works, so that all stages can benefit
from both spatial attention and convolutions. We leverage these improvements
to develop a new self-attention model family, LinGlos, which reaches
state-of-the-art accuracy in the parameter-limited setting of the ImageNet
classification benchmark and outperforms baselines significantly in downstream
tasks, e.g., COCO object detection and ADE20K semantic segmentation.
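The paper itself is not accompanied by code here, but a minimal PyTorch sketch can illustrate the general idea of key-only attention: attention weights are derived from the keys alone through a lightweight saliency gate, so no N x N query-key matrix is ever formed and the cost stays linear in the number of tokens. The module and parameter names below are illustrative assumptions, not the LinGlos implementation.

```python
import torch
import torch.nn as nn


class KeyOnlyAttention(nn.Module):
    """Hypothetical key-only attention block: a scalar saliency gate over the
    keys replaces the N x N query-key score matrix, so the cost is linear in
    the number of tokens."""

    def __init__(self, dim: int):
        super().__init__()
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.saliency_gate = nn.Linear(dim, 1)  # one saliency logit per token
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_tokens, dim)
        k = self.to_k(x)
        v = self.to_v(x)
        # Normalise saliency over positions: an (N, 1) vector instead of an
        # (N, N) pairwise attention matrix.
        attn = self.saliency_gate(k).softmax(dim=1)        # (B, N, 1)
        global_ctx = (attn * v).sum(dim=1, keepdim=True)   # (B, 1, dim)
        # Broadcast the pooled global context back to every token.
        return self.proj(x + global_ctx)


# Example: 196 tokens (14 x 14 patches), 96-dim embeddings.
tokens = torch.randn(2, 196, 96)
out = KeyOnlyAttention(96)(tokens)   # -> (2, 196, 96)
```

In the alternating layout described in the abstract, a block like this would simply be interleaved with convolutional blocks in every stage, rather than grafted only onto the later, low-resolution stages.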
Related papers
- DAPE V2: Process Attention Score as Feature Map for Length Extrapolation [63.87956583202729]
We conceptualize attention as a feature map and apply the convolution operator to mimic the processing methods in computer vision.
The novel insight, which can be adapted to various attention-related models, reveals that the current Transformer architecture has the potential for further evolution.
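As a rough illustration of treating attention scores as a feature map, the sketch below refines the (heads x N x N) score tensor with a depthwise 2D convolution before the softmax; the layer names, kernel size, and residual placement are assumptions for illustration, not details taken from the DAPE V2 paper.

```python
import torch
import torch.nn as nn


class ConvProcessedAttention(nn.Module):
    """Sketch: treat raw attention scores (heads x N x N) as a feature map
    and refine them with a depthwise 2D convolution before the softmax."""

    def __init__(self, dim: int, num_heads: int = 4, kernel_size: int = 3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.qkv = nn.Linear(dim, dim * 3)
        # Depthwise convolution over the (query, key) axes, one filter per head.
        self.score_conv = nn.Conv2d(num_heads, num_heads, kernel_size,
                                    padding=kernel_size // 2, groups=num_heads)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, n, self.num_heads, self.head_dim).transpose(1, 2)
        scores = q @ k.transpose(-2, -1) / self.head_dim ** 0.5   # (B, H, N, N)
        scores = scores + self.score_conv(scores)  # refine the score "feature map"
        out = scores.softmax(dim=-1) @ v            # (B, H, N, head_dim)
        return self.proj(out.transpose(1, 2).reshape(b, n, d))


out = ConvProcessedAttention(96)(torch.randn(2, 196, 96))  # -> (2, 196, 96)
```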
arXiv Detail & Related papers (2024-10-07T07:21:49Z)
- HiRes-LLaVA: Restoring Fragmentation Input in High-Resolution Large Vision-Language Models [96.76995840807615]
HiRes-LLaVA is a novel framework designed to process high-resolution inputs of any size without altering the original contextual and geometric information.
HiRes-LLaVA comprises two innovative components: (i) a SliceRestore adapter that reconstructs sliced patches into their original form, efficiently extracting both global and local features via down-up-sampling and convolution layers, and (ii) a Self-Mining Sampler to compress the vision tokens based on themselves.
arXiv Detail & Related papers (2024-07-11T17:42:17Z)
- You Only Need Less Attention at Each Stage in Vision Transformers [19.660385306028047]
Vision Transformers (ViTs) capture the global information of images through self-attention modules.
We propose the Less-Attention Vision Transformer (LaViT), which computes only a few attention operations at each stage.
Our architecture demonstrates exceptional performance across various vision tasks including classification, detection and segmentation.
arXiv Detail & Related papers (2024-06-01T12:49:16Z)
- Learning Feature Matching via Matchable Keypoint-Assisted Graph Neural Network [52.29330138835208]
Accurately matching local features between a pair of images is a challenging computer vision task.
Previous studies typically use attention-based graph neural networks (GNNs) with fully-connected graphs over keypoints within/across images.
We propose MaKeGNN, a sparse attention-based GNN architecture which bypasses non-repeatable keypoints and leverages matchable ones to guide message passing.
arXiv Detail & Related papers (2023-07-04T02:50:44Z)
- RFAConv: Innovating Spatial Attention and Standard Convolutional Operation [7.2646541547165056]
We propose a novel attention mechanism called Receptive-Field Attention (RFA).
RFA not only focuses on receptive-field spatial features but also provides effective attention weights for large convolutional kernels.
It adds a nearly negligible increment in computational cost and parameters, while significantly improving network performance.
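One possible reading of receptive-field attention is sketched below: every k x k receptive field gets its own softmax-normalised weights, the unfolded features are reweighted, and a 1x1 convolution aggregates them. The weight-generation and aggregation choices here are assumptions; the actual RFAConv design may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ReceptiveFieldAttentionConv(nn.Module):
    """Sketch: attend over the k*k positions of each receptive field,
    reweight the unfolded features, then aggregate with a 1x1 convolution."""

    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        self.k = k
        # One attention logit per channel and receptive-field position.
        self.weight_gen = nn.Conv2d(in_ch, in_ch * k * k, kernel_size=k,
                                    padding=k // 2, groups=in_ch)
        self.aggregate = nn.Conv2d(in_ch * k * k, out_ch, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        k = self.k
        # Attention over the k*k positions of each receptive field.
        logits = self.weight_gen(x).view(b, c, k * k, h, w)
        attn = logits.softmax(dim=2)
        # Unfold the k x k receptive field around every spatial location.
        fields = F.unfold(x, k, padding=k // 2).view(b, c, k * k, h, w)
        weighted = (attn * fields).view(b, c * k * k, h, w)
        return self.aggregate(weighted)


out = ReceptiveFieldAttentionConv(32, 64)(torch.randn(1, 32, 56, 56))  # -> (1, 64, 56, 56)
```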
arXiv Detail & Related papers (2023-04-06T16:21:56Z)
- BiFormer: Vision Transformer with Bi-Level Routing Attention [26.374724782056557]
We propose a novel dynamic sparse attention via bi-level routing to enable a more flexible allocation of computations with content awareness.
Specifically, for a query, irrelevant key-value pairs are first filtered out at a coarse region level, and then fine-grained token-to-token attention is applied in the union of remaining candidate regions.
Built with the proposed bi-level routing attention, a new general vision transformer, named BiFormer, is then presented.
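The routing described above could look roughly like the sketch below over a 1-D token sequence; BiFormer itself routes over 2-D region grids and includes multi-head and other details, and `num_regions` and `topk` here are illustrative parameters.

```python
import torch


def bi_level_routing_attention(q, k, v, num_regions: int, topk: int):
    """Sketch of bi-level routing: (1) score region-to-region affinity with
    mean-pooled queries/keys and keep the top-k regions per query region,
    (2) run token-to-token attention only inside the kept regions.
    q, k, v: (batch, tokens, dim); tokens must divide evenly into regions."""
    b, n, d = q.shape
    r, m = num_regions, n // num_regions               # m tokens per region
    qr, kr, vr = (t.view(b, r, m, d) for t in (q, k, v))

    # Coarse level: region affinity from mean-pooled descriptors.
    region_aff = qr.mean(2) @ kr.mean(2).transpose(-2, -1)     # (B, R, R)
    routed = region_aff.topk(topk, dim=-1).indices             # (B, R, topk)

    # Gather key/value tokens of the routed regions for every query region.
    idx = routed[..., None, None].expand(b, r, topk, m, d)
    k_sel = torch.gather(kr.unsqueeze(1).expand(b, r, r, m, d), 2, idx)
    v_sel = torch.gather(vr.unsqueeze(1).expand(b, r, r, m, d), 2, idx)
    k_sel = k_sel.reshape(b, r, topk * m, d)
    v_sel = v_sel.reshape(b, r, topk * m, d)

    # Fine level: ordinary attention restricted to the routed token set.
    attn = (qr @ k_sel.transpose(-2, -1) / d ** 0.5).softmax(-1)  # (B, R, m, topk*m)
    return (attn @ v_sel).reshape(b, n, d)


q = k = v = torch.randn(1, 64, 32)
out = bi_level_routing_attention(q, k, v, num_regions=8, topk=2)  # -> (1, 64, 32)
```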
arXiv Detail & Related papers (2023-03-15T17:58:46Z)
- Sparse Attention Acceleration with Synergistic In-Memory Pruning and On-Chip Recomputation [6.303594714446706]
The self-attention mechanism gauges pairwise correlations across the entire input sequence.
Despite favorable performance, calculating pairwise correlations is prohibitively costly.
This work addresses these constraints by architecting an accelerator, called SPRINT, which computes attention scores in an approximate manner.
arXiv Detail & Related papers (2022-09-01T17:18:19Z)
- ClusTR: Exploring Efficient Self-attention via Clustering for Vision Transformers [70.76313507550684]
We propose a content-based sparse attention method, as an alternative to dense self-attention.
Specifically, we cluster and then aggregate key and value tokens, as a content-based method of reducing the total token count.
The resulting clustered-token sequence retains the semantic diversity of the original signal, but can be processed at a lower computational cost.
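A simplified sketch of the cluster-then-attend idea follows, using a few plain k-means steps on the keys; the actual ClusTR clustering and aggregation procedure is more involved, and the cluster count and iteration count here are illustrative.

```python
import torch


def clustered_attention(q, k, v, num_clusters: int, iters: int = 3):
    """Sketch of content-based sparse attention: cluster the key tokens with
    a few k-means steps, average keys/values inside each cluster, then let
    every query attend over the much shorter clustered sequence."""
    b, n, d = k.shape
    # Initialise centroids from evenly spaced tokens.
    centroids = k[:, :: max(n // num_clusters, 1)][:, :num_clusters]   # (B, C, D)
    for _ in range(iters):
        assign = (k @ centroids.transpose(-2, -1)).argmax(-1)          # (B, N)
        one_hot = torch.nn.functional.one_hot(assign, num_clusters).float()
        counts = one_hot.sum(1).clamp(min=1).unsqueeze(-1)             # (B, C, 1)
        centroids = one_hot.transpose(1, 2) @ k / counts               # (B, C, D)
    # Aggregate values with the same assignment, then attend over the
    # clustered keys/values only: cost O(N * C) instead of O(N^2).
    v_clustered = one_hot.transpose(1, 2) @ v / counts                 # (B, C, D)
    attn = (q @ centroids.transpose(-2, -1) / d ** 0.5).softmax(-1)    # (B, N, C)
    return attn @ v_clustered                                          # (B, N, D)


q = k = v = torch.randn(2, 256, 64)
out = clustered_attention(q, k, v, num_clusters=16)  # -> (2, 256, 64)
```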
arXiv Detail & Related papers (2022-08-28T04:18:27Z)
- CloudAttention: Efficient Multi-Scale Attention Scheme For 3D Point Cloud Learning [81.85951026033787]
In this work, we adopt transformers and incorporate them into a hierarchical framework for shape classification and for part and scene segmentation.
We also compute efficient and dynamic global cross-attention by leveraging sampling and grouping at each iteration.
The proposed hierarchical model achieves state-of-the-art mean accuracy on shape classification and yields results on par with previous segmentation methods.
arXiv Detail & Related papers (2022-07-31T21:39:15Z)
- Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
We make use of new efficient attention algorithms, recently shown to be highly effective for Transformers.
This allows our attention-based controllers to scale to larger visual inputs, and facilitate the use of smaller patches.
In addition, we propose a new efficient algorithm approximating softmax attention with what we call hybrid random features.
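For intuition, the sketch below approximates softmax attention with a basic positive random-feature map so that attention reduces to two matrix products and scales linearly with sequence length; it is not the hybrid random features construction of the paper, only the generic idea it builds on, and the feature count is an illustrative parameter.

```python
import torch


def random_feature_attention(q, k, v, num_features: int = 64):
    """Sketch: map queries and keys through a positive random-feature map so
    softmax attention is approximated by two matrix products, avoiding the
    explicit N x N attention matrix."""
    d = q.shape[-1]
    w = torch.randn(num_features, d, device=q.device)     # random projections

    def phi(x):
        # Positive feature map whose inner products approximate
        # exp(<q, k> / sqrt(d)) in expectation.
        proj = (x / d ** 0.25) @ w.t()                     # (B, N, F)
        return torch.exp(proj - x.pow(2).sum(-1, keepdim=True)
                         / (2 * d ** 0.5)) / num_features ** 0.5

    q_f, k_f = phi(q), phi(k)
    kv = k_f.transpose(-2, -1) @ v                         # (B, F, D)
    normalizer = q_f @ k_f.sum(dim=1, keepdim=True).transpose(-2, -1)  # (B, N, 1)
    return (q_f @ kv) / normalizer.clamp(min=1e-6)


q, k, v = (torch.randn(1, 512, 64) for _ in range(3))
out = random_feature_attention(q, k, v)   # -> (1, 512, 64), no 512 x 512 matrix
```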
arXiv Detail & Related papers (2021-02-08T17:00:26Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.