Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
- URL: http://arxiv.org/abs/2505.16416v1
- Date: Thu, 22 May 2025 09:05:01 GMT
- Title: Circle-RoPE: Cone-like Decoupled Rotary Positional Embedding for Large Vision-Language Models
- Authors: Chengcheng Wang, Jianyuan Guo, Hongguang Li, Yuchuan Tian, Ying Nie, Chang Xu, Kai Han
- Abstract summary: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models. When extended to vision-language models, RoPE variants enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. We introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure.
- Score: 35.471513870514585
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Rotary Position Embedding (RoPE) is a widely adopted technique for encoding relative positional information in large language models (LLMs). However, when extended to large vision-language models (LVLMs), its variants introduce unintended cross-modal positional biases. Specifically, they enforce relative positional dependencies between text token indices and image tokens, causing spurious alignments. This issue arises because image tokens representing the same content but located at different spatial positions are assigned distinct positional biases, leading to inconsistent cross-modal associations. To address this, we propose Per-Token Distance (PTD) - a simple yet effective metric for quantifying the independence of positional encodings across modalities. Informed by this analysis, we introduce Circle-RoPE, a novel encoding scheme that maps image token indices onto a circular trajectory orthogonal to the linear path of text token indices, forming a cone-like structure. This configuration ensures that each text token maintains an equal distance to all image tokens, reducing artificial cross-modal biases while preserving intra-image spatial information. To further enhance performance, we propose a staggered layer strategy that applies different RoPE variants across layers. This design leverages the complementary strengths of each RoPE variant, thereby enhancing the model's overall performance. Our experimental results demonstrate that our method effectively preserves spatial information from images while reducing relative positional bias, offering a more robust and flexible positional encoding framework for LVLMs. The code is available at [https://github.com/lose4578/CircleRoPE](https://github.com/lose4578/CircleRoPE).
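To make the cone-like geometry concrete, here is a minimal sketch, not the authors' released implementation: it places text token positions along a line and image token positions on a circle in an orthogonal plane, then checks the equal-distance property the abstract describes. The 3-D coordinates, the radius `r = 1.0`, and the helper names `text_position` / `image_position` are illustrative assumptions, not symbols from the paper.

```python
import numpy as np

# Illustrative layout (assumed, not the paper's exact construction):
# text token t advances along the x-axis; image token k sits on a circle
# of radius r in the plane x = 0, which is orthogonal to the text axis.

def text_position(t: int) -> np.ndarray:
    """Text token t lies at distance t along the (1, 0, 0) axis."""
    return np.array([float(t), 0.0, 0.0])

def image_position(k: int, num_image_tokens: int, r: float = 1.0) -> np.ndarray:
    """Image token k is mapped onto a circle of radius r orthogonal to the text path."""
    theta = 2.0 * np.pi * k / num_image_tokens
    return np.array([0.0, r * np.cos(theta), r * np.sin(theta)])

num_image_tokens = 8
for t in range(4):
    dists = [np.linalg.norm(text_position(t) - image_position(k, num_image_tokens))
             for k in range(num_image_tokens)]
    # Every image token is equidistant from this text token: sqrt(t^2 + r^2).
    assert np.allclose(dists, np.hypot(t, 1.0))
    print(f"text token {t}: distance to each image token = {dists[0]:.4f}")
```

Because the distance from a text token at index t to every image token is identical (sqrt(t^2 + r^2) in this sketch), no image token receives a smaller relative positional offset than any other, which is the equal-distance property the abstract attributes to the cone-like structure; intra-image spatial information survives as the angular position on the circle.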
Related papers
- MCA-LLaVA: Manhattan Causal Attention for Reducing Hallucination in Large Vision-Language Models [25.406556604989607]
Hallucinations pose a significant challenge in Large Vision Language Models (LVLMs), with misalignment between multimodal features identified as a key contributing factor. MCA-LLaVA integrates the one-dimensional sequence order and two-dimensional spatial position of image tokens for positional modeling.
arXiv Detail & Related papers (2025-07-12T08:09:35Z) - SeqPE: Transformer with Sequential Position Encoding [76.22159277300891]
SeqPE represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign.
arXiv Detail & Related papers (2025-06-16T09:16:40Z) - HypeVPR: Exploring Hyperbolic Space for Perspective to Equirectangular Visual Place Recognition [16.46501527058266]
We introduce HypeVPR, a novel hierarchical embedding framework in hyperbolic space. HypeVPR is designed to address the unique challenges of perspective-to-equirectangular (P2E) VPR.
arXiv Detail & Related papers (2025-06-05T08:47:15Z) - ID-Align: RoPE-Conscious Position Remapping for Dynamic High-Resolution Adaptation in Vision-Language Models [24.087014423545067]
A prevalent approach for enhancing Vision-Language Model (VLM) performance is to encode both the high-resolution version and the thumbnail of an image simultaneously. We propose ID-Align, which alleviates the positional problems this introduces by reordering position IDs. Our experiments conducted within the LLaVA-Next framework demonstrate that ID-Align achieves significant improvements.
arXiv Detail & Related papers (2025-05-27T17:36:23Z) - PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible, data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z) - A 2D Semantic-Aware Position Encoding for Vision Transformers [32.86183384267028]
Vision transformers have demonstrated significant advantages in computer vision tasks due to their ability to capture long-range dependencies and contextual relationships through self-attention. Existing position encoding techniques, largely borrowed from natural language processing, fail to effectively capture semantic-aware positional relationships between image patches. Traditional approaches like absolute position encoding and relative position encoding primarily focus on 1D linear position relationships, often overlooking the semantic similarity between distant yet contextually related patches.
arXiv Detail & Related papers (2025-05-14T15:17:34Z) - VRoPE: Rotary Position Embedding for Video Large Language Models [13.495442349395287]
Rotary Position Embedding (RoPE) has shown strong performance in text-based Large Language Models (LLMs). Video adaptations, such as RoPE-3D, attempt to encode spatial and temporal dimensions separately but suffer from two major limitations. We propose Video Rotary Position Embedding (VRoPE), a novel positional encoding method tailored for Video-LLMs.
arXiv Detail & Related papers (2025-02-17T10:53:57Z) - Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding [64.29499221878746]
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence. PyPE is a novel approach designed to enhance the perception of visual tokens within VLMs. Our method reduces the relative distance between interrelated visual elements and instruction tokens.
arXiv Detail & Related papers (2025-01-19T07:00:46Z) - CricaVPR: Cross-image Correlation-aware Representation Learning for Visual Place Recognition [73.51329037954866]
We propose a robust global representation method with cross-image correlation awareness for visual place recognition.
Our method uses the attention mechanism to correlate multiple images within a batch.
Our method outperforms state-of-the-art methods by a large margin with significantly less training time.
arXiv Detail & Related papers (2024-02-29T15:05:11Z) - Towards Few-shot Entity Recognition in Document Images: A Graph Neural Network Approach Robust to Image Manipulation [38.09501948846373]
We introduce the topological adjacency relationship among the tokens, emphasizing their relative position information.
We incorporate these graphs into the pre-trained language model by adding graph neural network layers on top of the language model embeddings.
Experiments on two benchmark datasets show that LAGER significantly outperforms strong baselines under different few-shot settings.
arXiv Detail & Related papers (2023-05-24T07:34:33Z) - Rethinking and Improving Relative Position Encoding for Vision Transformer [61.559777439200744]
Relative position encoding (RPE) is important for transformers to capture the sequence ordering of input tokens.
We propose new relative position encoding methods dedicated to 2D images, called image RPE (iRPE).
arXiv Detail & Related papers (2021-07-29T17:55:10Z)