Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
- URL: http://arxiv.org/abs/2505.16416v2
- Date: Sat, 04 Oct 2025 09:54:36 GMT
- Title: Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
- Authors: Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han,
- Abstract summary: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models. When extended to vision-language models (VLMs), RoPE and its variants enforce relative positional dependencies separately within text and image tokens. We introduce Circle-RoPE, a novel encoding scheme designed to eliminate spurious cross-modal biases.
- Score: 49.122200327049676
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to vision-language models (VLMs), RoPE and its variants enforce relative positional dependencies separately within text and image tokens, introducing unintended cross-modal positional biases. For example, image tokens depicting semantically consistent content are assigned distinct positional encodings solely due to spatial location variations. As a result, such tokens exhibit entirely different relative positional relationships with their corresponding text tokens, ultimately leading to misaligned cross-modal representations. To address this, we propose Per-Token Distance, a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme designed to eliminate spurious cross-modal biases. Our key idea is to project image token indices onto a *ring* that is orthogonal to the linear axis of text token indices, thereby forming a cone-like structure in the positional encoding space. In this configuration, each text token (point on the linear text axis) becomes the apex of a cone and maintains an equal distance to all image tokens (points on the circular image *ring*), reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered strategy that applies different RoPE variants across layers. Extensive experiments demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for VLMs. The code is available at https://github.com/lose4578/CircleRoPE.
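The cone-like geometry described in the abstract can be illustrated with a minimal sketch (this is not the authors' implementation; the `radius` and `center` values are illustrative assumptions): text token indices lie on a line, image token indices are placed on a ring in the plane orthogonal to that line, and every text token is then equidistant from all image tokens.

```python
import math

def text_position(t):
    """Text token t sits at index t on the linear axis (x-axis)."""
    return (float(t), 0.0, 0.0)

def image_position(k, num_image_tokens, radius=1.0, center=5.0):
    """Image token k is projected onto a ring of the given radius,
    centered on the text axis at `center`, in the plane orthogonal to
    the axis. `radius` and `center` are illustrative choices, not
    values from the paper."""
    theta = 2.0 * math.pi * k / num_image_tokens
    return (center, radius * math.cos(theta), radius * math.sin(theta))

def dist(p, q):
    """Euclidean distance between two 3D points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

n = 8
ring = [image_position(k, n) for k in range(n)]
apex = text_position(2)

# Every image token on the ring is the same distance from the text token
# (the cone apex), so no spurious cross-modal relative-position bias arises,
# while the angular spacing along the ring still encodes intra-image layout.
distances = [dist(apex, p) for p in ring]
print(max(distances) - min(distances))  # ~0: all distances are equal
```

Here the equal distances follow directly from the geometry: the squared distance from the apex to any ring point is the squared axial offset plus the squared radius, independent of the angle.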
Related papers
- TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection [62.95726973851089]
TokenCLIP is a token-wise adaptation framework for anomaly learning. It enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning.
arXiv Detail & Related papers (2025-10-24T05:51:31Z) - CoPatch: Zero-Shot Referring Image Segmentation by Leveraging Untapped Spatial Knowledge in CLIP [26.827036116024914]
CoPatch is a zero-shot RIS framework that enhances spatial representations in both text and image modalities. We show that CoPatch significantly improves spatial grounding in zero-shot RIS across RefCOCO, RefCOCO+, RefCOCOg, and PhraseCut (+2-7 mIoU) without requiring any additional training.
arXiv Detail & Related papers (2025-09-27T04:12:10Z) - MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models [25.406556604989607]
Hallucinations pose a significant challenge in Large Vision-Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling.
arXiv Detail & Related papers (2025-07-12T08:09:35Z) - SeqPE: Transformer with Sequential Position Encoding [76.22159277300891]
SeqPE represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign.
arXiv Detail & Related papers (2025-06-16T09:16:40Z) - HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition [16.46501527058266]
We introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space. HypeVPR is designed to address the unique challenges of perspective-to-equirectangular (P2E) VPR.
arXiv Detail & Related papers (2025-06-05T08:47:15Z) - ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models [24.087014423545067]
A prevalent approach for enhancing Vision-Language Model (VLM) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously. We propose ID-Align, which alleviates these problems by reordering position IDs. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements.
arXiv Detail & Related papers (2025-05-27T17:36:23Z) - PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z) - A 2D Semantic-Aware Position Encoding for Vision Transformers [32.86183384267028]
Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. Existing position encoding techniques, which are largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches like absolute position encoding and relative position encoding primarily focus on 1D linear position relationships, often overlooking the semantic similarity between distant yet contextually related patches.
arXiv Detail & Related papers (2025-05-14T15:17:34Z) - VRoPE: Rotary Position Embedding for Video Large Language Models [13.495442349395287]
Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs). Video adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations. We propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs.
arXiv Detail & Related papers (2025-02-17T10:53:57Z) - Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding [64.29499221878746]
Vision-Language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence. PyPE is a novel approach designed to enhance the perception of visual tokens within VLMs. Our method reduces the relative distance between interrelated visual elements and instruction tokens.
arXiv Detail & Related papers (2025-01-19T07:00:46Z) - Scale-wise Bidirectional Alignment Network for Referring Remote Sensing Image Segmentation [12.893224628061516]
The goal of referring remote sensing image segmentation (RRSIS) is to extract specific pixel-level regions within an aerial image via a natural language expression. We propose an innovative framework called Scale-wise Bidirectional Alignment Network (SBANet) to address these challenges. Our proposed method achieves superior performance in comparison to previous state-of-the-art methods on the RRSIS-D and RefSegRS datasets.
arXiv Detail & Related papers (2025-01-01T14:24:04Z) - VXP: Voxel-Cross-Pixel Large-scale Image-LiDAR Place Recognition [40.603362112697255]
Cross-modal place recognition methods are flexible GPS alternatives under varying environmental conditions and sensor setups. We propose Voxel-Cross-Pixel (VXP), a novel camera-to-LiDAR place recognition framework. VXP enforces local similarities in a self-supervised manner and effectively brings global context from images and LiDAR scans into a shared feature space.
arXiv Detail & Related papers (2024-03-21T17:49:26Z) - CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z) - Towards Few-shot Entity Recognition in Document Images: A Graph Neural Network Approach Robust to Image Manipulation [38.09501948846373]
We introduce the topological adjacency relationship among the tokens, emphasizing their relative position information.
We incorporate these graphs into the pre-trained language model by adding graph neural network layers on top of the language model embeddings.
Experiments on two benchmark datasets show that LAGER significantly outperforms strong baselines under different few-shot settings.
arXiv Detail & Related papers (2023-05-24T07:34:33Z) - Global and Local Alignment Networks for Unpaired Image-to-Image Translation [170.08142745705575]
The goal of unpaired image-to-image translation is to produce an output image reflecting the target domain's style.
Due to the lack of attention to the content change in existing methods, semantic information from source images suffers from degradation during translation.
We introduce a novel approach, Global and Local Alignment Networks (GLA-Net).
Our method effectively generates sharper and more realistic images than existing approaches.
arXiv Detail & Related papers (2021-11-19T18:01:54Z) - Rethinking and Improving Relative Position Encoding for Vision Transformer [61.559777439200744]
Relative position encoding (RPE) is important for transformer to capture sequence ordering of input tokens.
We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
arXiv Detail & Related papers (2021-07-29T17:55:10Z) - Learnable Fourier Features for Multi-Dimensional Spatial Positional Encoding [96.9752763607738]
We propose a novel positional encoding method based on learnable Fourier features.
Our experiments show that our learnable feature representation for multi-dimensional positional encoding outperforms existing methods.
arXiv Detail & Related papers (2021-06-05T04:40:18Z)
This list is automatically generated from the titles and abstracts of the papers on this site.