A 2D Semantic-Aware Position Encoding for Vision Transformers
- URL: http://arxiv.org/abs/2505.09466v1
- Date: Wed, 14 May 2025 15:17:34 GMT
- Title: A 2D Semantic-Aware Position Encoding for Vision Transformers
- Authors: Xi Chen, Shiyang Zhou, Muqi Huang, Jiaxu Feng, Yun Xiong, Kun Zhou, Biao Yang, Yuhui Zhang, Huishuai Bao, Sijia Peng, Chuan Li, Feng Shi
- Abstract summary: Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. Existing position encoding techniques, which are largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches like absolute position encoding and relative position encoding primarily focus on 1D linear position relationships, often neglecting the semantic similarity between distant yet contextually related patches.
- Score: 32.86183384267028
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. However, existing position encoding techniques, which are largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches like absolute position encoding and relative position encoding primarily focus on 1D linear position relationships, often neglecting the semantic similarity between distant yet contextually related patches. These limitations hinder model generalization, translation equivariance, and the ability to effectively handle repetitive or structured patterns in images. In this paper, we propose 2-Dimensional Semantic-Aware Position Encoding ($\text{SaPE}^2$), a novel position encoding method with semantic awareness that dynamically adapts position representations by leveraging local content instead of fixed linear position relationships or spatial coordinates. Our method enhances the model's ability to generalize across varying image resolutions and scales, improves translation equivariance, and better aggregates features for visually similar but spatially distant patches. By integrating $\text{SaPE}^2$ into vision transformers, we bridge the gap between position encoding and perceptual similarity, thereby improving performance on computer vision tasks.
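The abstract gives no implementation details, so the following is only a rough sketch of the general idea of content-dependent position encoding: instead of looking up a fixed table by patch coordinates, a small convolutional module derives each patch's position code from its local content. The class and parameter names below are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class ContentAwarePositionEncoding(nn.Module):
    """Toy sketch: derive per-patch position codes from local patch content
    via a depthwise convolution over the 2D patch grid (hypothetical, not
    the paper's actual SaPE^2 formulation)."""

    def __init__(self, embed_dim: int):
        super().__init__()
        # A 3x3 depthwise conv aggregates each patch's local neighborhood;
        # because it is convolutional, the resulting codes are
        # translation-equivariant and work for any grid size / resolution.
        self.local = nn.Conv2d(embed_dim, embed_dim, kernel_size=3,
                               padding=1, groups=embed_dim)
        self.proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, patches: torch.Tensor, grid_hw: tuple) -> torch.Tensor:
        # patches: (batch, num_patches, embed_dim); grid_hw: (H, W) patch grid
        b, n, c = patches.shape
        h, w = grid_hw
        x = patches.transpose(1, 2).reshape(b, c, h, w)
        pos = self.local(x).reshape(b, c, n).transpose(1, 2)
        return patches + self.proj(pos)

# Usage: add content-dependent position codes before the transformer encoder.
x = torch.randn(2, 14 * 14, 768)                   # 14x14 patch grid, ViT-Base width
out = ContentAwarePositionEncoding(768)(x, (14, 14))
print(out.shape)                                   # torch.Size([2, 196, 768])
```

Because the codes are computed from content rather than coordinates, visually similar patches receive similar position representations even when they are spatially far apart, which is the property the abstract emphasizes.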
Related papers
- Cameras as Relative Positional Encoding [37.675563572777136]
Multi-view transformers must use camera geometry to ground visual tokens in 3D space. We show how relative camera conditioning improves performance in feedforward novel view synthesis. We then verify that these benefits persist across different tasks, such as stereo depth estimation and discriminative cognition, as well as larger model sizes.
arXiv Detail & Related papers (2025-07-14T17:22:45Z) - Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models [35.471513870514585]
Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models. Existing RoPE variants enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. We introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure.
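As a rough geometric illustration of the summary above (hypothetical coordinates only, not the paper's actual rotary formulation), the snippet below places image-token positions on a circle orthogonal to the text-token axis, so every image token ends up equidistant from a given text position:

```python
import numpy as np

def cone_layout(num_text: int, num_image: int, radius: float = 1.0) -> np.ndarray:
    """Toy illustration: text tokens advance along a line, image tokens sit on a
    circle orthogonal to that line, so each image token is equidistant from any
    given text token (illustrative coordinates, not the method itself)."""
    text = np.stack([np.arange(num_text),            # position along the text axis
                     np.zeros(num_text),
                     np.zeros(num_text)], axis=1)
    angles = 2 * np.pi * np.arange(num_image) / num_image
    image = np.stack([np.full(num_image, num_text),  # all image tokens share one text index
                      radius * np.cos(angles),
                      radius * np.sin(angles)], axis=1)
    return np.concatenate([text, image])

pos = cone_layout(num_text=4, num_image=6)
# Distance from the last text token to every image token is identical:
d = np.linalg.norm(pos[4:] - pos[3], axis=1)
print(np.allclose(d, d[0]))   # True
```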
arXiv Detail & Related papers (2025-05-22T09:05:01Z) - Distractors-Immune Representation Learning with Cross-modal Contrastive Regularization for Change Captioning [71.14084801851381]
Change captioning aims to succinctly describe the semantic change between a pair of similar images.
Most existing methods directly capture the difference between them, which risks producing error-prone difference features.
We propose a distractors-immune representation learning network that correlates the corresponding channels of two image representations.
arXiv Detail & Related papers (2024-07-16T13:00:33Z) - Smooth image-to-image translations with latent space interpolations [64.8170758294427]
Multi-domain image-to-image (I2I) translations can transform a source image according to the style of a target domain.
We show that our regularization techniques can improve the state-of-the-art I2I translations by a large margin.
arXiv Detail & Related papers (2022-10-03T11:57:30Z) - A Multi-level Alignment Training Scheme for Video-and-Language Grounding [9.866172676211905]
A good multi-modality encoder should be able to well capture both inputs' semantics and encode them in the shared feature space.
We developed a multi-level alignment training scheme to directly shape the encoding process.
Our framework achieved performance comparable to previous state-of-the-art methods on multiple video QA and retrieval datasets.
arXiv Detail & Related papers (2022-04-22T21:46:52Z) - SAC-GAN: Structure-Aware Image-to-Image Composition for Self-Driving [18.842432515507035]
We present a compositional approach to image augmentation for self-driving applications.
It is an end-to-end neural network trained to seamlessly compose an object, represented as a cropped patch from an object image, into a background scene image.
We evaluate our network, coined SAC-GAN for structure-aware composition, on prominent self-driving datasets in terms of quality, composability, and generalizability of the composite images.
arXiv Detail & Related papers (2021-12-13T12:24:50Z) - Global and Local Alignment Networks for Unpaired Image-to-Image Translation [170.08142745705575]
The goal of unpaired image-to-image translation is to produce an output image reflecting the target domain's style.
Due to the lack of attention to the content change in existing methods, semantic information from source images suffers from degradation during translation.
We introduce a novel approach, Global and Local Alignment Networks (GLA-Net)
Our method effectively generates sharper and more realistic images than existing approaches.
arXiv Detail & Related papers (2021-11-19T18:01:54Z) - Rethinking and Improving Relative Position Encoding for Vision Transformer [61.559777439200744]
Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens.
We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE)
arXiv Detail & Related papers (2021-07-29T17:55:10Z) - Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
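The summary does not describe the construction, so the sketch below shows a generic learnable Fourier-feature positional encoding: a trainable linear projection of 2D coordinates, followed by sine/cosine features and a small MLP. It is an assumption-laden illustration rather than the paper's exact recipe, and all names are made up for the example.

```python
import torch
import torch.nn as nn

class LearnableFourierPE(nn.Module):
    """Sketch of a learnable Fourier-feature positional encoding for 2D patch
    coordinates (illustrative; parameter names are not from the paper)."""

    def __init__(self, num_freqs: int = 32, embed_dim: int = 768):
        super().__init__()
        self.freq = nn.Linear(2, num_freqs, bias=False)    # trainable frequencies
        self.mlp = nn.Sequential(
            nn.Linear(2 * num_freqs, embed_dim),
            nn.GELU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, coords: torch.Tensor) -> torch.Tensor:
        # coords: (..., 2) continuous 2D positions, e.g. normalized (row, col)
        f = self.freq(coords)                               # (..., num_freqs)
        fourier = torch.cat([torch.cos(f), torch.sin(f)], dim=-1)
        fourier = fourier / f.shape[-1] ** 0.5              # scale as in random-feature PEs
        return self.mlp(fourier)                            # (..., embed_dim)

# Usage: encode a 14x14 grid of normalized patch centers.
ys, xs = torch.meshgrid(torch.linspace(0, 1, 14), torch.linspace(0, 1, 14),
                        indexing="ij")
coords = torch.stack([ys, xs], dim=-1).reshape(-1, 2)      # (196, 2)
pe = LearnableFourierPE()(coords)                           # (196, 768)
print(pe.shape)
```

Since the coordinates are continuous rather than table indices, this style of encoding extends naturally to other resolutions and grid sizes.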
arXiv Detail & Related papers (2021-06-05T04:40:18Z) - Unpaired Image-to-Image Translation via Latent Energy Transport [61.62293304236371]
Image-to-image translation aims to preserve source contents while translating to discriminative target styles between two visual domains.
In this paper, we propose to deploy an energy-based model (EBM) in the latent space of a pretrained autoencoder for this task.
Our model is the first to be applicable to 1024$\times$1024-resolution unpaired image translation.
arXiv Detail & Related papers (2020-12-01T17:18:58Z) - Code-Aligned Autoencoders for Unsupervised Change Detection in Multimodal Remote Sensing Images [18.133760118780128]
Image translation with convolutional autoencoders has recently been used as an approach to multimodal change detection in bitemporal satellite images.
A main challenge is the alignment of the code spaces by reducing the contribution of change pixels to the learning of the translation function.
We propose to extract relational pixel information captured by domain-specific affinity matrices at the input and use this to enforce alignment of the code spaces.
arXiv Detail & Related papers (2020-04-15T11:24:51Z)