Ripple Attention for Visual Perception with Sub-quadratic Complexity
- URL: http://arxiv.org/abs/2110.02453v1
- Date: Wed, 6 Oct 2021 02:00:38 GMT
- Title: Ripple Attention for Visual Perception with Sub-quadratic Complexity
- Authors: Lin Zheng, Huijie Pan, Lingpeng Kong
- Abstract summary: Transformer architectures are now central to modeling in natural language processing tasks.
We propose ripple attention, a sub-quadratic attention mechanism for visual perception.
In ripple attention, contributions of different tokens to a query are weighted with respect to their relative spatial distances in the 2D space.
- Score: 7.425337104538644
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer architectures are now central to modeling in natural language
processing tasks. At their heart is the attention mechanism, which enables
effective modeling of long-term dependencies in a sequence. Recently,
transformers have been successfully applied in the computer vision domain,
where 2D images are first segmented into patches and then treated as 1D
sequences. Such linearization, however, impairs the notion of spatial locality
in images, which bears important visual cues. To bridge the gap, we propose
ripple attention, a sub-quadratic attention mechanism for visual perception. In
ripple attention, contributions of different tokens to a query are weighted
with respect to their relative spatial distances in the 2D space. To favor
correlations with vicinal tokens yet permit long-term dependencies, we derive
the spatial weights through a stick-breaking transformation. We further design
a dynamic programming algorithm that computes weighted contributions for all
queries in linear observed time, taking advantage of the summed-area table and
recent advances in linearized attention. Extensive experiments and analyses
demonstrate the effectiveness of ripple attention on various visual tasks.
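The two ingredients above (stick-breaking spatial weights and distance-aware aggregation) can be made concrete with a short sketch. Below is a minimal, quadratic-time toy in NumPy, assuming a grid of patch tokens: each key falls into a distance ring around the query (Chebyshev distance on the grid), per-ring weights come from a stick-breaking transformation, and scores use a simple positive feature map in the spirit of linearized attention. The names (`stick_breaking_weights`, `ripple_attention_toy`), the feature map, and the ring parameterization are illustrative assumptions, not the paper's implementation; the paper's dynamic-programming algorithm obtains the same kind of ring-weighted sums in linear time via summed-area tables.

```python
import numpy as np

def stick_breaking_weights(logits):
    """Stick-breaking transformation: w_r = v_r * prod_{j<r} (1 - v_j), v_r = sigmoid(logit_r).
    Favors nearby rings while still leaving mass for distant ones."""
    v = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    remaining = np.cumprod(np.concatenate(([1.0], 1.0 - v[:-1])))
    return v * remaining

def ripple_attention_toy(Q, K, V, coords, ring_logits):
    """Quadratic-time toy reference: every key is assigned to a distance ring
    around the query (Chebyshev distance on the patch grid), scores from a
    positive feature map (linearized attention) are re-weighted per ring, and
    the value aggregation is normalized by the total weighted score."""
    phi = lambda x: np.maximum(x, 0.0) + 1e-6          # simple positive feature map
    Qf, Kf = phi(Q), phi(K)
    w = stick_breaking_weights(ring_logits)            # one weight per ring
    out = np.zeros_like(V, dtype=float)
    for i in range(len(Q)):
        dist = np.abs(coords - coords[i]).max(axis=1)  # ring index of each token w.r.t. query i
        ring_w = w[np.minimum(dist, len(w) - 1)]       # clamp far-away tokens to the last ring
        scores = ring_w * (Kf @ Qf[i])                 # spatially re-weighted similarity scores
        out[i] = (scores[:, None] * V).sum(axis=0) / (scores.sum() + 1e-6)
    return out

# Example: a 4x4 grid of 16 patch tokens, 8-dim features, 3 distance rings.
rng = np.random.default_rng(0)
coords = np.stack(np.meshgrid(np.arange(4), np.arange(4), indexing="ij"), -1).reshape(16, 2)
Q, K, V = (rng.standard_normal((16, 8)) for _ in range(3))
print(ripple_attention_toy(Q, K, V, coords, ring_logits=np.zeros(3)).shape)  # (16, 8)
```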
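The summed-area table the abstract refers to is the standard 2D prefix-sum primitive: after one pass over the grid, the sum of any axis-aligned window of per-cell statistics is available in constant time, which is what allows per-ring aggregates to be formed without re-scanning neighborhoods for every query. A minimal NumPy sketch of that primitive (illustrative names, not the paper's code):

```python
import numpy as np

def summed_area_table(x):
    """x: (H, W, d) per-cell statistics; one O(H*W) pass of cumulative sums."""
    return x.cumsum(axis=0).cumsum(axis=1)

def window_sum(sat, r0, c0, r1, c1):
    """Sum of x[r0:r1+1, c0:c1+1] recovered from the table in O(1) (inclusive bounds)."""
    total = sat[r1, c1].copy()
    if r0 > 0:
        total -= sat[r0 - 1, c1]
    if c0 > 0:
        total -= sat[r1, c0 - 1]
    if r0 > 0 and c0 > 0:
        total += sat[r0 - 1, c0 - 1]
    return total

# Sanity check: the O(1) window sum matches a direct summation.
x = np.random.default_rng(1).random((5, 6, 4))
sat = summed_area_table(x)
assert np.allclose(window_sum(sat, 1, 2, 3, 5), x[1:4, 2:6].sum(axis=(0, 1)))
```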
Related papers
- Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
We present two key perspectives to understand and alleviate the limitations of linear attention.
First, we prove that linear attention is not injective, making it prone to assigning identical attention weights to different query vectors.
Second, we confirm that effective local modeling is essential to the success of Softmax attention, an area in which linear attention falls short.
arXiv Detail & Related papers (2024-12-09T15:44:22Z) - Where Am I and What Will I See: An Auto-Regressive Model for Spatial Localization and View Prediction [60.964512894143475]
We present Generative Spatial Transformer (GST), a novel auto-regressive framework that jointly addresses spatial localization and view prediction.
Our model simultaneously estimates the camera pose from a single image and predicts the view from a new camera pose, effectively bridging the gap between spatial awareness and visual prediction.
arXiv Detail & Related papers (2024-10-24T17:58:05Z) - Elliptical Attention [1.7597562616011944]
Pairwise dot-product self-attention is key to the success of transformers that achieve state-of-the-art performance across a variety of applications in language and vision.
We propose using a Mahalanobis distance metric for computing the attention weights to stretch the underlying feature space in directions of high contextual relevance.
arXiv Detail & Related papers (2024-06-19T18:38:11Z) - Cohere3D: Exploiting Temporal Coherence for Unsupervised Representation
Learning of Vision-based Autonomous Driving [73.3702076688159]
We propose a novel contrastive learning algorithm, Cohere3D, to learn coherent instance representations in a long-term input sequence.
We evaluate our algorithm by finetuning the pretrained model on various downstream perception, prediction, and planning tasks.
arXiv Detail & Related papers (2024-02-23T19:43:01Z) - Cross-Modal Learning with 3D Deformable Attention for Action Recognition [4.128256616073278]
We propose a new 3D deformable transformer for action recognition with adaptive attention fields and a cross-temporal learning scheme.
The proposed 3D deformable transformer was tested on the NTU RGB+D 60, NTU RGB+D 120, FineGYM, and PennAction datasets, and showed results better than or similar to pre-trained state-of-the-art methods.
arXiv Detail & Related papers (2022-12-12T00:31:08Z) - Graph Reasoning Transformer for Image Parsing [67.76633142645284]
We propose a novel Graph Reasoning Transformer (GReaT) for image parsing to enable image patches to interact following a relation reasoning pattern.
Compared to the conventional transformer, GReaT has higher interaction efficiency and a more purposeful interaction pattern.
Results show that GReaT achieves consistent performance gains with slight computational overheads on the state-of-the-art transformer baselines.
arXiv Detail & Related papers (2022-09-20T08:21:37Z) - Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency-gate to obtain attention weights.
We develop a new self-attention model family, LinGlos, which reaches state-of-the-art accuracies in the parameter-limited setting of the ImageNet classification benchmark.
arXiv Detail & Related papers (2022-07-01T03:36:49Z) - Vicinity Vision Transformer [53.43198716947792]
We present a Vicinity Attention that introduces a locality bias to vision transformers with linear complexity.
Our approach achieves state-of-the-art image classification accuracy with 50% fewer parameters than previous methods.
arXiv Detail & Related papers (2022-06-21T17:33:53Z) - Beyond Self-attention: External Attention using Two Linear Layers for
Visual Tasks [34.32609892928909]
We propose a novel attention mechanism which we call external attention, based on two external, small, learnable, and shared memories.
Our method provides comparable or superior performance to the self-attention mechanism and some of its variants, with much lower computational and memory costs.
arXiv Detail & Related papers (2021-05-05T22:29:52Z) - Spatial-Temporal Correlation and Topology Learning for Person
Re-Identification in Videos [78.45050529204701]
We propose a novel framework to pursue discriminative and robust representation by modeling cross-scale spatial-temporal correlation.
The proposed framework, CTL, utilizes a CNN backbone and a key-point estimator to extract semantic local features from the human body.
It explores a context-reinforced topology to construct multi-scale graphs by considering both global contextual information and the physical connections of the human body.
arXiv Detail & Related papers (2021-04-15T14:32:12Z)
This list is automatically generated from the titles and abstracts of the papers on this site.