Rethinking and Improving Relative Position Encoding for Vision
Transformer
- URL: http://arxiv.org/abs/2107.14222v1
- Date: Thu, 29 Jul 2021 17:55:10 GMT
- Title: Rethinking and Improving Relative Position Encoding for Vision
Transformer
- Authors: Kan Wu and Houwen Peng and Minghao Chen and Jianlong Fu and Hongyang
Chao
- Abstract summary: Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens.
We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
- Score: 61.559777439200744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Relative position encoding (RPE) is important for transformers to capture
sequence ordering of input tokens. General efficacy has been proven in natural
language processing. In computer vision, however, its efficacy is not well
studied and even remains controversial, e.g., can relative position encoding
work as well as absolute position encoding? To clarify this,
we first review existing relative position encoding methods and analyze their
pros and cons when applied in vision transformers. We then propose new relative
position encoding methods dedicated to 2D images, called image RPE (iRPE). Our
methods consider directional relative distance modeling as well as the
interactions between queries and relative position embeddings in the self-attention
mechanism. The proposed iRPE methods are simple and lightweight. They can be
easily plugged into transformer blocks. Experiments demonstrate that solely due
to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc)
and 1.3% (mAP) stable improvements over their original versions on ImageNet and
COCO respectively, without tuning any extra hyperparameters such as learning
rate and weight decay. Our ablation and analysis also yield interesting
findings, some of which run counter to previous understanding. Code and models
are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
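To make the bias-style relative encodings concrete, below is a minimal sketch of a bucketed 2D relative position bias added to the attention logits of a vision transformer block. The class name, the clipping-based bucketing, and the hyperparameters are illustrative assumptions, not the authors' iRPE implementation, which is available in the repository linked above.
```python
# Minimal sketch of a bucketed 2D relative position bias added to attention
# logits. Names and the bucketing scheme are illustrative; the official iRPE
# code (github.com/microsoft/Cream/tree/main/iRPE) differs in the details.
import torch
import torch.nn as nn

class RelPosBiasAttention(nn.Module):
    def __init__(self, dim, num_heads, grid_size, max_offset=7):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # One learnable bias per (head, clipped 2D offset) bucket.
        num_buckets = (2 * max_offset + 1) ** 2
        self.bias_table = nn.Parameter(torch.zeros(num_heads, num_buckets))
        # Precompute the bucket index for every query/key patch pair.
        ys, xs = torch.meshgrid(torch.arange(grid_size),
                                torch.arange(grid_size), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)   # (N, 2)
        rel = coords[:, None, :] - coords[None, :, :]                # (N, N, 2) offsets
        rel = rel.clamp(-max_offset, max_offset) + max_offset        # shift to >= 0
        self.register_buffer("bucket_idx",
                             rel[..., 0] * (2 * max_offset + 1) + rel[..., 1])

    def forward(self, x):  # x: (B, N, C) with N == grid_size ** 2
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                         # each (B, H, N, d)
        attn = (q @ k.transpose(-2, -1)) * self.scale                # (B, H, N, N)
        attn = attn + self.bias_table[:, self.bucket_idx]            # relative position bias
        attn = attn.softmax(dim=-1)
        return self.proj((attn @ v).transpose(1, 2).reshape(B, N, C))

block = RelPosBiasAttention(dim=192, num_heads=3, grid_size=14)
out = block(torch.randn(2, 14 * 14, 192))                            # (2, 196, 192)
```
Dropping such a module into a DeiT-style block leaves all other hyperparameters untouched, which is the plug-and-play property claimed in the abstract.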
Related papers
- Cameras as Relative Positional Encoding [37.675563572777136]
Multi-view transformers must use camera geometry to ground visual tokens in 3D space. We show how relative camera conditioning improves performance in feedforward novel view synthesis. We then verify that these benefits persist across different tasks, including stereo depth estimation and discriminative cognition, as well as larger model sizes.
arXiv Detail & Related papers (2025-07-14T17:22:45Z) - Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models [35.471513870514585]
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models. Existing RoPE variants enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. We introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure.
arXiv Detail & Related papers (2025-05-22T09:05:01Z) - PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible, data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel training algorithm by exploiting a compact representation of products of Householder matrices.
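As a rough illustration of the accumulated-Householder idea (not the paper's efficient parallel algorithm), the sketch below builds one data-dependent reflection per position and lets the product of reflections between a key position and a query position enter the attention logit; the projection producing the reflection vector and the product ordering are assumptions.
```python
# Naive O(T^2 d^2) illustration of position encoding via accumulated
# Householder reflections. The projection `to_v` and the product ordering
# are assumptions; see the PaTH paper for the actual scheme and its
# efficient parallel training algorithm.
import torch

def householder(v):
    """Reflection I - 2 v v^T / (v^T v) for a vector v of shape (d,)."""
    v = v / (v.norm() + 1e-8)
    return torch.eye(v.shape[0]) - 2.0 * torch.outer(v, v)

def path_logits(q, k, x, to_v):
    """Causal logits where the key at position j is transformed by the product
    of data-dependent reflections at positions j+1 ... i before meeting q_i."""
    T, d = q.shape
    H = [householder(to_v(x[t])) for t in range(T)]    # one reflection per position
    logits = torch.full((T, T), float("-inf"))         # positions j > i stay masked
    for i in range(T):
        P = torch.eye(d)                                # running product of reflections
        for j in range(i, -1, -1):
            logits[i, j] = q[i] @ (P @ k[j]) / d ** 0.5
            P = P @ H[j]                                # extend the product one step back
    return logits

T, d = 6, 8
to_v = torch.nn.Linear(d, d, bias=False)                # data-dependent reflection vectors
print(path_logits(torch.randn(T, d), torch.randn(T, d), torch.randn(T, d), to_v).shape)
```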
arXiv Detail & Related papers (2025-05-22T08:36:09Z) - A 2D Semantic-Aware Position Encoding for Vision Transformers [32.86183384267028]
Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. Existing position encoding techniques, largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches such as absolute and relative position encoding primarily focus on 1D linear positional relationships, often overlooking the semantic similarity between distant yet contextually related patches.
arXiv Detail & Related papers (2025-05-14T15:17:34Z) - Toward Relative Positional Encoding in Spiking Transformers [52.62008099390541]
Spiking neural networks (SNNs) are bio-inspired networks that model how neurons in the brain communicate through discrete spikes.
In this paper, we introduce an approximate method for relative positional encoding (RPE) in Spiking Transformers.
arXiv Detail & Related papers (2025-01-28T06:42:37Z) - Positional Prompt Tuning for Efficient 3D Representation Learning [16.25423192020736]
Point cloud analysis has developed rapidly and performs well on multiple downstream tasks such as point cloud classification and segmentation.
Noting the simplicity of the position encoding structure in Transformer-based architectures, we treat the position encoding as a high-dimensional component, paired with the patch encoder to provide multi-scale information.
Our proposed PEFT method, training only 1.05% of the parameters, achieves state-of-the-art results on several mainstream datasets, e.g., 95.01% accuracy on the ScanObjectNN OBJ_BG dataset.
arXiv Detail & Related papers (2024-08-21T12:18:34Z) - Real-Time Motion Prediction via Heterogeneous Polyline Transformer with
Relative Pose Encoding [121.08841110022607]
Existing agent-centric methods have demonstrated outstanding performance on public benchmarks.
We introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers.
By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods.
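A hedged sketch of the K-nearest-neighbor attention idea described above: each agent attends only to its K nearest neighbors, and the pairwise-relative pose (neighbor position and heading expressed in the agent's local frame) is embedded and added to keys and values. The pose MLP and the way the encoding is injected are illustrative assumptions, not the exact KNARPE design.
```python
# Hedged sketch of K-nearest-neighbor attention with a relative pose
# encoding added to keys and values. Details are illustrative assumptions.
import torch
import torch.nn as nn

class KNNRelPoseAttention(nn.Module):
    def __init__(self, dim, k_neighbors=8):
        super().__init__()
        self.k = k_neighbors
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.pose_mlp = nn.Sequential(nn.Linear(3, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, feats, xy, yaw):
        # feats: (N, dim) agent features, xy: (N, 2) positions, yaw: (N,) headings
        N, dim = feats.shape
        dists = torch.cdist(xy, xy)                               # (N, N) pairwise distances
        knn = dists.topk(min(self.k, N), largest=False).indices   # (N, K) nearest neighbors
        # Pairwise-relative pose of each neighbor in the query agent's local frame.
        rel_xy = xy[knn] - xy[:, None, :]                         # (N, K, 2)
        cos, sin = torch.cos(-yaw)[:, None], torch.sin(-yaw)[:, None]
        local = torch.stack([cos * rel_xy[..., 0] - sin * rel_xy[..., 1],
                             sin * rel_xy[..., 0] + cos * rel_xy[..., 1]], dim=-1)
        rel_yaw = (yaw[knn] - yaw[:, None])[..., None]            # (N, K, 1)
        pose = self.pose_mlp(torch.cat([local, rel_yaw], dim=-1)) # (N, K, dim)
        q = self.q_proj(feats)[:, None, :]                        # (N, 1, dim)
        k = self.k_proj(feats)[knn] + pose                        # (N, K, dim)
        v = self.v_proj(feats)[knn] + pose
        attn = ((q * k).sum(-1) / dim ** 0.5).softmax(-1)         # (N, K) over neighbors only
        return (attn[..., None] * v).sum(1)                       # (N, dim)

att = KNNRelPoseAttention(dim=64, k_neighbors=4)
out = att(torch.randn(10, 64), torch.randn(10, 2), torch.randn(10))  # (10, 64)
```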
arXiv Detail & Related papers (2023-10-19T17:59:01Z) - Bridging Vision and Language Encoders: Parameter-Efficient Tuning for
Referring Image Segmentation [72.27914940012423]
We investigate efficient tuning for referring image segmentation.
We propose a novel adapter called Bridger to facilitate cross-modal information exchange.
We also design a lightweight decoder for image segmentation.
arXiv Detail & Related papers (2023-07-21T12:46:15Z) - Pure Transformer with Integrated Experts for Scene Text Recognition [11.089203218000854]
Scene text recognition (STR) is the task of reading text in cropped images of natural scenes.
Recently, the transformer architecture has been widely adopted in STR owing to its strong capability for capturing long-range dependencies.
This work proposes the use of a transformer-only model as a simple baseline which outperforms hybrid CNN-transformer models.
arXiv Detail & Related papers (2022-11-09T15:26:59Z) - Camera Pose Auto-Encoders for Improving Pose Regression [6.700873164609009]
We introduce Camera Pose Auto-Encoders (PAEs) to encode camera poses using APRs as their teachers.
We show that the resulting latent pose representations can closely reproduce APR performance and demonstrate their effectiveness for related tasks.
We also show that train images can be reconstructed from the learned pose encoding, paving the way for integrating visual information from the train set at a low memory cost.
arXiv Detail & Related papers (2022-07-12T13:47:36Z) - Attribute Surrogates Learning and Spectral Tokens Pooling in
Transformers for Few-shot Learning [50.95116994162883]
Vision transformers have been thought of as a promising alternative to convolutional neural networks for visual recognition.
This paper presents hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling.
HCTransformers surpass the DINO baseline by large margins of 9.7% in 5-way 1-shot accuracy and 9.17% in 5-way 5-shot accuracy on miniImageNet.
arXiv Detail & Related papers (2022-03-17T03:49:58Z) - Conformer-based End-to-end Speech Recognition With Rotary Position
Embedding [11.428057887454008]
We introduce rotary position embedding (RoPE) into the convolution-augmented transformer (conformer).
RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into a self-attention module.
Our model achieves a relative word error rate reduction of 8.70% and 7.27% over the conformer on test-clean and test-other sets of the LibriSpeech corpus respectively.
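For reference, a minimal sketch of rotary position embedding in the common "rotate-half" formulation: each channel pair is rotated by an angle proportional to the token position, so query-key dot products depend only on the relative offset. The base frequency and channel split are standard defaults, not necessarily the configuration used in the conformer paper.
```python
# Minimal rotary position embedding (RoPE) sketch: channels are rotated by
# position-dependent angles so that q_m . k_n depends only on m - n.
import torch

def rope(x, base=10000.0):
    # x: (seq_len, dim) with even dim
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)     # (half,)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs  # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q, k = torch.randn(16, 64), torch.randn(16, 64)
scores = rope(q) @ rope(k).T          # relative position enters through the rotation
```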
arXiv Detail & Related papers (2021-07-13T08:07:22Z) - Learnable Fourier Features for Multi-Dimensional Spatial Positional
Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
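A minimal sketch of the learnable Fourier-feature idea: coordinates are projected with a learnable frequency matrix, passed through sine and cosine, and fed to a small MLP; the layer sizes and normalization here are illustrative assumptions.
```python
# Hedged sketch of a learnable Fourier-feature positional encoding for
# multi-dimensional coordinates. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class LearnableFourierPE(nn.Module):
    def __init__(self, pos_dim=2, fourier_dim=64, out_dim=128):
        super().__init__()
        self.proj = nn.Linear(pos_dim, fourier_dim // 2, bias=False)  # learnable frequencies
        self.mlp = nn.Sequential(nn.Linear(fourier_dim, out_dim), nn.GELU(),
                                 nn.Linear(out_dim, out_dim))

    def forward(self, pos):                       # pos: (..., pos_dim), e.g. (y, x) patch coords
        f = self.proj(pos)                        # (..., fourier_dim // 2)
        fourier = torch.cat([f.cos(), f.sin()], dim=-1) / f.shape[-1] ** 0.5
        return self.mlp(fourier)                  # (..., out_dim) positional embedding

pe = LearnableFourierPE()
coords = torch.rand(196, 2)                       # normalized 2D patch positions
print(pe(coords).shape)                           # torch.Size([196, 128])
```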
arXiv Detail & Related papers (2021-06-05T04:40:18Z) - Transformer-Based Deep Image Matching for Generalizable Person
Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention computation with softmax weighting and keeps only the query-key similarity.
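A hedged sketch of such a similarity-only decoder: cross-image query-key similarities are computed without softmax weighting over values and pooled directly into a matching score; the pooling and projections are illustrative assumptions rather than the paper's exact design.
```python
# Hedged sketch of a decoder that keeps only the query-key similarity:
# the cross-image similarity map is pooled into a matching score instead
# of being softmax-normalized and applied to values.
import torch
import torch.nn as nn

class SimilarityDecoder(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)

    def forward(self, tokens_a, tokens_b):
        # tokens_a, tokens_b: (N, dim) patch tokens from the two images
        q = self.q_proj(tokens_a)
        k = self.k_proj(tokens_b)
        sim = q @ k.T / q.shape[-1] ** 0.5         # (N, N) query-key similarities, no softmax
        best = sim.max(dim=1).values               # best match for each query token
        return best.mean()                         # scalar image-to-image matching score

dec = SimilarityDecoder(dim=256)
score = dec(torch.randn(196, 256), torch.randn(196, 256))
```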
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
This list is automatically generated from the titles and abstracts of the papers on this site.