Rethinking and Improving Relative Position Encoding for Vision
Transformer
- URL: http://arxiv.org/abs/2107.14222v1
- Date: Thu, 29 Jul 2021 17:55:10 GMT
- Title: Rethinking and Improving Relative Position Encoding for Vision
Transformer
- Authors: Kan Wu and Houwen Peng and Minghao Chen and Jianlong Fu and Hongyang
Chao
- Abstract summary: Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens.
We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
- Score: 61.559777439200744
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Relative position encoding (RPE) is important for transformers to capture the
sequence ordering of input tokens. Its general efficacy has been proven in natural
language processing. In computer vision, however, its efficacy is not well
studied and even remains controversial, e.g., can relative position encoding
work as well as absolute position encoding? To clarify this,
we first review existing relative position encoding methods and analyze their
pros and cons when applied in vision transformers. We then propose new relative
position encoding methods dedicated to 2D images, called image RPE (iRPE). Our
methods consider directional relative distance modeling as well as the
interactions between queries and relative position embeddings in the self-attention
mechanism. The proposed iRPE methods are simple and lightweight. They can be
easily plugged into transformer blocks. Experiments demonstrate that solely due
to the proposed encoding methods, DeiT and DETR obtain up to 1.5% (top-1 Acc)
and 1.3% (mAP) stable improvements over their original versions on ImageNet and
COCO respectively, without tuning any extra hyperparameters such as learning
rate and weight decay. Our ablation and analysis also yield interesting
findings, some of which run counter to previous understanding. Code and models
are open-sourced at https://github.com/microsoft/Cream/tree/main/iRPE.
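For intuition, here is a minimal sketch (not the authors' exact implementation) of the basic idea behind a 2D relative position encoding in self-attention: a learnable per-head bias indexed by the 2D offset between query and key patches is added to the attention logits. The class name `RelativePositionBias2D` and the plain offset indexing are illustrative assumptions; iRPE additionally studies piecewise distance bucketing and contextual variants that interact with the query.

```python
import torch
import torch.nn as nn

class RelativePositionBias2D(nn.Module):
    """Illustrative 2D relative position bias for patch-token self-attention.

    A simplified sketch of the idea behind image RPE: every (query, key) pair
    gets a learnable per-head bias indexed by its 2D relative offset. The iRPE
    paper further studies piecewise bucketing of the offsets and "contextual"
    variants where the bias also interacts with the query content.
    """

    def __init__(self, num_heads: int, height: int, width: int):
        super().__init__()
        # One learnable scalar per head for every possible (dy, dx) offset.
        self.bias_table = nn.Parameter(
            torch.zeros(num_heads, 2 * height - 1, 2 * width - 1))

        # Precompute the offset of every key patch relative to every query patch.
        ys, xs = torch.meshgrid(
            torch.arange(height), torch.arange(width), indexing="ij")
        coords = torch.stack([ys.flatten(), xs.flatten()], dim=-1)   # (N, 2)
        rel = coords[:, None, :] - coords[None, :, :]                # (N, N, 2)
        rel[..., 0] += height - 1                                    # shift offsets to be >= 0
        rel[..., 1] += width - 1
        self.register_buffer("rel_index", rel)

    def forward(self, attn_logits: torch.Tensor) -> torch.Tensor:
        """attn_logits: (batch, num_heads, N, N) pre-softmax attention scores."""
        bias = self.bias_table[:, self.rel_index[..., 0], self.rel_index[..., 1]]
        return attn_logits + bias.unsqueeze(0)   # broadcast over the batch
```

In a DeiT-style block, such a bias would be added to `q @ k.transpose(-2, -1) / sqrt(d)` before the softmax, which is why it can be plugged in without touching other hyperparameters.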
Related papers
- Positional Prompt Tuning for Efficient 3D Representation Learning [16.25423192020736]
Point cloud analysis has developed significantly and performs well on multiple downstream tasks such as point cloud classification and segmentation.
Noting the simplicity of the position encoding structure in Transformer-based architectures, we treat the position encoding as a high-dimensional component and use the patch encoder to provide multi-scale information.
Our proposed PEFT method, which trains only 1.05% of the parameters, achieves state-of-the-art results on several mainstream datasets, e.g., 95.01% accuracy on the ScanObjectNN OBJ_BG dataset.
arXiv Detail & Related papers (2024-08-21T12:18:34Z)
- Real-Time Motion Prediction via Heterogeneous Polyline Transformer with Relative Pose Encoding [121.08841110022607]
Existing agent-centric methods have demonstrated outstanding performance on public benchmarks.
We introduce the K-nearest neighbor attention with relative pose encoding (KNARPE), a novel attention mechanism allowing the pairwise-relative representation to be used by Transformers.
By sharing contexts among agents and reusing the unchanged contexts, our approach is as efficient as scene-centric methods, while performing on par with state-of-the-art agent-centric methods.
arXiv Detail & Related papers (2023-10-19T17:59:01Z)
- Bridging Vision and Language Encoders: Parameter-Efficient Tuning for Referring Image Segmentation [72.27914940012423]
We investigate efficient tuning for referring image segmentation.
We propose a novel adapter called Bridger to facilitate cross-modal information exchange.
We also design a lightweight decoder for image segmentation.
arXiv Detail & Related papers (2023-07-21T12:46:15Z)
- Pure Transformer with Integrated Experts for Scene Text Recognition [11.089203218000854]
Scene text recognition (STR) involves the task of reading text in cropped images of natural scenes.
Recently, the transformer architecture has been widely adopted in STR because of its strong capability to capture long-term dependencies.
This work proposes the use of a transformer-only model as a simple baseline which outperforms hybrid CNN-transformer models.
arXiv Detail & Related papers (2022-11-09T15:26:59Z)
- Camera Pose Auto-Encoders for Improving Pose Regression [6.700873164609009]
We introduce Camera Pose Auto-Encoders (PAEs) to encode camera poses using APRs as their teachers.
We show that the resulting latent pose representations can closely reproduce APR performance and demonstrate their effectiveness for related tasks.
We also show that training images can be reconstructed from the learned pose encoding, paving the way for integrating visual information from the training set at a low memory cost.
arXiv Detail & Related papers (2022-07-12T13:47:36Z)
- Attribute Surrogates Learning and Spectral Tokens Pooling in Transformers for Few-shot Learning [50.95116994162883]
Vision transformers have been thought of as a promising alternative to convolutional neural networks for visual recognition.
This paper presents hierarchically cascaded transformers that exploit intrinsic image structures through spectral tokens pooling.
HCTransformers surpass the DINO baseline by a large margin of 9.7% in 5-way 1-shot accuracy and 9.17% in 5-way 5-shot accuracy on miniImageNet.
arXiv Detail & Related papers (2022-03-17T03:49:58Z)
- Conformer-based End-to-end Speech Recognition With Rotary Position Embedding [11.428057887454008]
We introduce rotary position embedding (RoPE) into the convolution-augmented transformer (conformer).
RoPE encodes absolute positional information into the input sequence with a rotation matrix, and then naturally incorporates explicit relative position information into the self-attention module (a rough sketch of the rotary mechanism appears after this list).
Our model achieves relative word error rate reductions of 8.70% and 7.27% over the conformer on the test-clean and test-other sets of the LibriSpeech corpus, respectively.
arXiv Detail & Related papers (2021-07-13T08:07:22Z)
- Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
arXiv Detail & Related papers (2021-06-05T04:40:18Z)
- Transformer-Based Deep Image Matching for Generalizable Person Re-identification [114.56752624945142]
We investigate the possibility of applying Transformers for image matching and metric learning given pairs of images.
We find that the Vision Transformer (ViT) and the vanilla Transformer with decoders are not adequate for image matching due to their lack of image-to-image attention.
We propose a new simplified decoder, which drops the full attention implementation with the softmax weighting, keeping only the query-key similarity.
arXiv Detail & Related papers (2021-05-30T05:38:33Z)
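Since the conformer entry above leans on the rotary mechanism, here is a rough, self-contained sketch of rotary position embedding using one common pairing convention (not the conformer implementation from that paper): each query/key feature pair is rotated by an angle proportional to its absolute position, so the resulting dot products depend only on relative offsets.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding sketch (split-half pairing convention).

    x: (batch, seq_len, dim) queries or keys, with an even dim. After the
    rotation, the dot product q_m . k_n depends only on the offset m - n.
    """
    _, seq_len, dim = x.shape
    half = dim // 2
    # One frequency per feature pair; positions 0 .. seq_len - 1.
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()          # (seq_len, half)
    x1, x2 = x[..., :half], x[..., half:]          # split features into rotation pairs
    return torch.cat([x1 * cos - x2 * sin,
                      x1 * sin + x2 * cos], dim=-1)
```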
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.