Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane
- URL: http://arxiv.org/abs/2602.03227v1
- Date: Tue, 03 Feb 2026 07:56:58 GMT
- Title: Spiral RoPE: Rotate Your Rotary Positional Embeddings in the 2D Plane
- Authors: Haoyu Liu, Sucheng Ren, Tingyu Zhu, Peng Wang, Cihang Xie, Alan Yuille, Zeyu Zheng, Feng Wang
- Abstract summary: Spiral RoPE is a simple yet effective extension that enables multi-directional positional encoding. Across a wide range of vision tasks including classification, segmentation, and generation, Spiral RoPE consistently improves performance.
- Score: 49.14270539697387
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Rotary Position Embedding (RoPE) is the de facto positional encoding in large language models due to its ability to encode relative positions and support length extrapolation. When adapted to vision transformers, the standard axial formulation decomposes two-dimensional spatial positions into horizontal and vertical components, implicitly restricting positional encoding to axis-aligned directions. We identify this directional constraint as a fundamental limitation of the standard axial 2D RoPE, which hinders the modeling of oblique spatial relationships that naturally exist in natural images. To overcome this limitation, we propose Spiral RoPE, a simple yet effective extension that enables multi-directional positional encoding by partitioning embedding channels into multiple groups associated with uniformly distributed directions. Each group is rotated according to the projection of the patch position onto its corresponding direction, allowing spatial relationships to be encoded beyond the horizontal and vertical axes. Across a wide range of vision tasks including classification, segmentation, and generation, Spiral RoPE consistently improves performance. Qualitative analysis of attention maps further shows that Spiral RoPE exhibits more concentrated activations on semantically relevant objects and better respects local object boundaries, highlighting the importance of multi-directional positional encoding in vision transformers.
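The mechanism described in the abstract (channel groups tied to evenly spaced directions, each rotated by the projection of the patch position onto its direction) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names, the choice to spread directions over [0, pi), the frequency schedule, and the default `base` value are all assumptions made for the example.

```python
import math

def direction_angles(num_dirs):
    # Assumption: directions evenly spaced over [0, pi); a direction and its
    # opposite define the same axis, so half a turn covers every orientation.
    return [math.pi * k / num_dirs for k in range(num_dirs)]

def spiral_rope_phases(x, y, num_dirs, pairs_per_dir, base=100.0):
    """Rotation angle for every channel pair of a patch at grid position (x, y)."""
    phases = []
    for theta in direction_angles(num_dirs):
        # Scalar projection of the patch position onto this group's direction.
        proj = x * math.cos(theta) + y * math.sin(theta)
        for j in range(pairs_per_dir):
            freq = base ** (-j / pairs_per_dir)  # hypothetical frequency schedule
            phases.append(proj * freq)
    return phases

def rotate(vec, phases):
    """Rotate consecutive channel pairs (vec[2i], vec[2i+1]) by phases[i]."""
    out = []
    for i, phi in enumerate(phases):
        a, b = vec[2 * i], vec[2 * i + 1]
        c, s = math.cos(phi), math.sin(phi)
        out += [a * c - b * s, a * s + b * c]
    return out

def score(q, k, pos_q, pos_k, num_dirs, pairs_per_dir):
    """Attention logit between a rotated query and a rotated key."""
    rq = rotate(q, spiral_rope_phases(*pos_q, num_dirs, pairs_per_dir))
    rk = rotate(k, spiral_rope_phases(*pos_k, num_dirs, pairs_per_dir))
    return sum(a * b for a, b in zip(rq, rk))

# Toy vectors: 4 directions x 1 channel pair each = 8 channels.
q = [0.5, -1.0, 2.0, 0.3, 1.0, 1.0, -0.7, 0.2]
k = [1.5, 0.4, -0.6, 0.9, 0.1, -1.2, 0.8, 0.3]

# Both pairs of positions share the offset (2, 3), so the logits should agree
# up to floating-point error: the score depends only on relative position.
s1 = score(q, k, (3, 5), (1, 2), num_dirs=4, pairs_per_dir=1)
s2 = score(q, k, (13, 25), (11, 22), num_dirs=4, pairs_per_dir=1)
```

With `num_dirs=2` the two directions in this sketch are the horizontal and vertical axes, so it falls back to the axial 2D formulation that the abstract contrasts against; larger values add oblique directions such as the diagonals.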
Related papers
- Rotary Positional Embeddings as Phase Modulation: Theoretical Bounds on the RoPE Base for Long-Context Transformers [0.5414847001704249]
Rotary positional embeddings (RoPE) are widely used in large language models to encode token positions. We derive principled lower bounds on the RoPE base parameter that are necessary to preserve positional coherence over a target context length. We extend this analysis to deep transformers, showing that repeated rotary modulation across layers compounds angular misalignment. Together, the lower and upper bounds define a precision- and depth-dependent feasibility region, a Goldilocks zone, for long-context transformers.
arXiv Detail & Related papers (2026-02-11T15:50:07Z)
- Untwisting RoPE: Frequency Control for Shared Attention in DiTs [84.14005261938284]
Positional encodings are essential to transformer-based generative models. We show that Rotary Positional Embeddings (RoPE) naturally decompose into frequency components with distinct positional sensitivities. We introduce a method for selectively modulating RoPE frequency bands so that attention reflects semantic similarity rather than strict positional alignment.
arXiv Detail & Related papers (2026-02-04T20:01:59Z)
- RSGround-R1: Rethinking Remote Sensing Visual Grounding through Spatial Reasoning [61.84363374647606]
Remote Sensing Visual Grounding (RSVG) aims to localize target objects in large-scale aerial imagery based on natural language descriptions. These descriptions often rely heavily on positional cues, posing unique challenges for Multimodal Large Language Models (MLLMs) in spatial reasoning. We propose a reasoning-guided, position-aware post-training framework, dubbed RSGround-R1, to progressively enhance spatial understanding.
arXiv Detail & Related papers (2026-01-29T12:35:57Z)
- Selective Rotary Position Embedding [84.22998043041198]
We introduce Selective RoPE, an input-dependent rotary embedding mechanism. We show that softmax attention already performs a hidden form of these rotations on query-key pairs. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling.
arXiv Detail & Related papers (2025-11-21T16:50:00Z)
- Head-wise Adaptive Rotary Positional Encoding for Fine-Grained Image Generation [35.66580960895196]
Rotary Position Embedding (RoPE) excels in 1D domains, but its application to image generation reveals significant limitations. HaroPE is a head-wise adaptive extension that inserts a learnable linear transformation parameterized via singular value decomposition. HaroPE consistently improves performance over strong RoPE baselines and other extensions.
arXiv Detail & Related papers (2025-10-12T07:46:28Z)
- Bidirectional Feature-aligned Motion Transformation for Efficient Dynamic Point Cloud Compression [97.66080040613726]
We propose a Bidirectional Feature-aligned Motion Transformation (Bi-FMT) framework that implicitly models motion in the feature space. Bi-FMT aligns features across both past and future frames to produce temporally consistent latent representations. We show Bi-FMT surpasses D-DPCC and AdaDPCC in both compression efficiency and runtime.
arXiv Detail & Related papers (2025-09-18T03:51:06Z)
- HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models [19.3827288035483]
We propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Tests show HoPE consistently exceeds existing positional encoding methods.
arXiv Detail & Related papers (2025-09-05T16:20:48Z)
- SeqPE: Transformer with Sequential Position Encoding [76.22159277300891]
SeqPE represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn their embeddings. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign.
arXiv Detail & Related papers (2025-06-16T09:16:40Z)
- ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices [25.99231204405503]
We propose ComRoPE, which generalizes Rotary Position Embedding (RoPE) by defining it in terms of trainable commuting angle matrices. We present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation. Our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research.
arXiv Detail & Related papers (2025-06-04T09:10:02Z)
- Rethinking RoPE: A Mathematical Blueprint for N-dimensional Positional Embedding [1.8142288667655782]
We propose a systematic mathematical framework for Rotary Position Embedding (RoPE). We derive the necessary and sufficient conditions for any valid $N$-dimensional RoPE based on two core properties of RoPE - relativity and reversibility. Our framework unifies and explains existing RoPE designs while enabling principled extensions to higher-dimensional modalities and tasks.
arXiv Detail & Related papers (2025-04-07T21:58:22Z)
- Rotation-Invariant Transformer for Point Cloud Matching [42.5714375149213]
We introduce RoITr, a Rotation-Invariant Transformer to cope with the pose variations in the point cloud matching task.
We propose a global transformer with rotation-invariant cross-frame spatial awareness learned by the self-attention mechanism.
RoITr surpasses the existing methods by at least 13 and 5 percentage points in terms of Inlier Ratio and Registration Recall.
arXiv Detail & Related papers (2023-03-14T20:55:27Z)
- RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information.
RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.