Ripple Attention for Visual Perception with Sub-quadratic Complexity
- URL: http://arxiv.org/abs/2110.02453v1
- Date: Wed, 6 Oct 2021 02:00:38 GMT
- Title: Ripple Attention for Visual Perception with Sub-quadratic Complexity
- Authors: Lin Zheng, Huijie Pan, Lingpeng Kong
- Abstract summary: Transformer architectures are now central to modeling in natural language processing tasks.
We propose ripple attention, a sub-quadratic attention mechanism for visual perception.
In ripple attention, contributions of different tokens to a query are weighted with respect to their relative spatial distances in the 2D space.
- Score: 7.425337104538644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer architectures are now central to modeling in natural language
processing tasks. At their heart is the attention mechanism, which enables
effective modeling of long-term dependencies in a sequence. Recently,
transformers have been successfully applied in the computer vision domain,
where 2D images are first segmented into patches and then treated as 1D
sequences. Such linearization, however, impairs the notion of spatial locality
in images, which bears important visual cues. To bridge the gap, we propose
ripple attention, a sub-quadratic attention mechanism for visual perception. In
ripple attention, contributions of different tokens to a query are weighted
with respect to their relative spatial distances in the 2D space. To favor
correlations with vicinal tokens yet permit long-term dependencies, we derive
the spatial weights through a stick-breaking transformation. We further design
a dynamic programming algorithm that computes weighted contributions for all
queries in linear observed time, taking advantage of the summed-area table and
recent advances in linearized attention. Extensive experiments and analyses
demonstrate the effectiveness of ripple attention on various visual tasks.
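The two ingredients above (stick-breaking spatial weights and distance-aware aggregation) can be made concrete with a short sketch. Below is a minimal, quadratic-time toy in NumPy, assuming a grid of patch tokens: each key falls into a distance ring around the query (Chebyshev distance on the grid), per-ring weights come from a stick-breaking transformation, and scores use a simple positive feature map in the spirit of linearized attention. The names (`stick_breaking_weights`, `ripple_attention_toy`), the feature map, and the ring parameterization are illustrative assumptions, not the paper's implementation; the paper's dynamic-programming algorithm obtains the same kind of ring-weighted sums in linear time via summed-area tables.

```python
import numpy as np

def stick_breaking_weights(logits):
    """Stick-breaking transformation: w_r = v_r * prod_{j<r} (1 - v_j), v_r = sigmoid(logit_r).
    Favors nearby rings while still leaving mass for distant ones."""
    v = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    remaining = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    return v * remaining

def ripple_attention_toy(Q, K, V, coords, ring_logits):
    """Quadratic-time toy reference: every key is assigned to a distance ring
    around the query (Chebyshev distance on the patch grid), scores from a
    positive feature map (linearized attention) are re-weighted per ring, and
    the value aggregation is normalized by the total weighted score."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6          # simple positive feature map
    Qf, Kf = phi(Q), phi(K)
    w = stick_breaking_weights(ring_logits)            # one weight per ring
    out = np.zeros_like(V, dtype=float)
    for i in range(len(Q)):
        dist = np.abs(coords - coords[i]).max(axis=1)  # ring index of each token w.r.t. query i
        ring_w = w[np.minimum(dist, len(w) - 1)]       # clamp far-away tokens to the last ring
        scores = ring_w * (Kf @ Qf[i])                 # spatially re-weighted similarity scores
        out[i] = (scores[:, None] * V).sum(axis=0) / (scores.sum() + 1e-6)
    return out

# Example: a 4x4 grid of 16 patch tokens, 8-dim features, 3 distance rings.
rng = np.random.default_rng(0)
coords = np.stack(np.meshgrid(np.arange(4), np.arange(4), indexing="ij"), -1).reshape(16, 2)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
print(ripple_attention_toy(Q, K, V, coords, ring_logits=np.zeros(3)).shape)  # (16, 8)
```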
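The summed-area table the abstract refers to is the standard 2D prefix-sum primitive: after one pass over the grid, the sum of any axis-aligned window of per-cell statistics is available in constant time, which is what allows per-ring aggregates to be formed without re-scanning neighborhoods for every query. A minimal NumPy sketch of that primitive (illustrative names, not the paper's code):

```python
import numpy as np

def summed_area_table(x):
    """x: (H, W, d) per-cell statistics; one O(H*W) pass of cumulative sums."""
    return x.cumsum(axis=0).cumsum(axis=1)

def window_sum(sat, r0, c0, r1, c1):
    """Sum of x[r0:r1+1, c0:c1+1] recovered from the table in O(1) (inclusive bounds)."""
    total = sat[r1, c1].copy()
    if r0 > 0:
        total -= sat[r0 - 1, c1]
    if c0 > 0:
        total -= sat[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]
    return total

# Sanity check: the O(1) window sum matches a direct summation.
x = np.random.default_rng(1).random((5, 6, 4))
sat = summed_area_table(x)
assert np.allclose(window_sum(sat, 1, 2, 3, 5), x[1:4, 2:6].sum(axis=(0, 1)))
```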
Related papers
- Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention.
First, we prove that linear attention is not injective, making it prone to assigning identical attention weights to different query vectors.
Second, we confirm that effective local modeling is essential to the success of Softmax attention, an area in which linear attention falls short.
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction [60.964512894143475]
We present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction.
Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction.
arXiv Detail & Related papers (2024-10-24T17:58:05Z) - Elliptical Attention [1.7597562616011944]
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision.
We propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance.
arXiv Detail & Related papers (2024-06-19T18:38:11Z) - Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z) - Cross-Modal Learning with 3D Deformable Attention for Action Recognition [4.128256616073278]
We propose a new 3D deformable transformer for action recognition with adaptive attention fields and a cross-temporal learning scheme.
The proposed 3D deformable transformer was tested on the NTU RGB+D 60, NTU RGB+D 120, FineGYM, and PennAction datasets, and showed results better than or similar to pre-trained state-of-the-art methods.
arXiv Detail & Related papers (2022-12-12T00:31:08Z) - Graph Reasoning Transformer for Image Parsing [67.76633142645284]
We propose a novel Graph Reasoning Transformer (GReaT) for image parsing to enable image patches to interact following a relation reasoning pattern.
Compared to the conventional transformer, GReaT has higher interaction efficiency and a more purposeful interaction pattern.
Results show that GReaT achieves consistent performance gains with slight computational overheads on the state-of-the-art transformer baselines.
arXiv Detail & Related papers (2022-09-20T08:21:37Z) - Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracies in the parameter-limited setting of the ImageNet classification benchmark.
arXiv Detail & Related papers (2022-07-01T03:36:49Z) - Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Beyond Self-attention: External Attention using Two Linear Layers for
Visual Tasks [34.32609892928909]
We propose a novel attention mechanism which we call external attention, based on two external, small, learnable, and shared memories.
Our method provides comparable or superior performance to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
arXiv Detail & Related papers (2021-05-05T22:29:52Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
The proposed framework, CTL, utilizes a CNN backbone and a key-point estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.