A Circular Argument: Does RoPE need to be Equivariant for Vision?
- URL: http://arxiv.org/abs/2511.08368v1
- Date: Wed, 12 Nov 2025 01:55:54 GMT
- Title: A Circular Argument: Does RoPE need to be Equivariant for Vision?
- Authors: Chase van de Geijn, Timo Lüddecke, Polina Turishcheva, Alexander S. Ecker
- Abstract summary: We mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. We propose Spherical RoPE, a method analogous to Mixed RoPE but with non-commutative generators.
- Score: 45.33536249657655
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Rotary Positional Encodings (RoPE) have emerged as a highly effective technique for one-dimensional sequences in Natural Language Processing, spurring recent progress toward generalizing RoPE to higher-dimensional data such as images and videos. The success of RoPE has been attributed to its positional equivariance, i.e. its status as a relative positional encoding. In this paper, we mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. Moreover, we show Mixed RoPE to be the analogously general solution for M-dimensional data, if we require commutative generators -- a property necessary for RoPE's equivariance. However, we question whether strict equivariance plays a large role in RoPE's performance. We propose Spherical RoPE, a method analogous to Mixed RoPE but with non-commutative generators. Empirically, we find Spherical RoPE to have equivalent or better learning behavior compared to its equivariant analogues. This suggests that relative positional embeddings are not as important as commonly believed, at least within computer vision. We expect this discovery to facilitate future work on positional encodings for vision that can be faster and generalize better by removing the preconception that they must be relative.
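The abstract's two key ideas can be illustrated numerically: 2-D rotations commute, which is what makes standard RoPE a relative (equivariant) encoding, while 3-D rotations (as in a Spherical-RoPE-style scheme) do not commute, so the same cancellation argument no longer applies. The following is a minimal sketch under those assumptions, not code from the paper; all names and values are illustrative.

```python
import numpy as np

def rot2d(angle):
    """2x2 rotation matrix: the commutative, equivariant RoPE case."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

theta = 0.5  # a single illustrative RoPE frequency
q, k = np.array([1.0, 2.0]), np.array([0.5, -1.0])

# Equivariance: the attention score depends only on the offset m - n,
# because rot2d(m*theta).T @ rot2d(n*theta) = rot2d((n - m)*theta).
s1 = (rot2d(7 * theta) @ q) @ (rot2d(4 * theta) @ k)    # positions 7, 4 (offset 3)
s2 = (rot2d(10 * theta) @ q) @ (rot2d(7 * theta) @ k)   # positions 10, 7 (offset 3)
assert np.isclose(s1, s2)

# Non-commuting generators, as in SO(3): the order of rotations matters,
# so a scheme built on them is no longer strictly relative.
def rot_x(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[1, 0, 0], [0, c, -s], [0, s, c]])

def rot_y(a):
    c, s = np.cos(a), np.sin(a)
    return np.array([[c, 0, s], [0, 1, 0], [-s, 0, c]])

assert not np.allclose(rot_x(0.3) @ rot_y(0.7), rot_y(0.7) @ rot_x(0.3))
```

The first assertion is the relative-position property the paper questions the importance of; the second shows why it is lost once the generators stop commuting.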
Related papers
- Selective Rotary Position Embedding [84.22998043041198]
We introduce Selective RoPE, an input-dependent rotary embedding mechanism. We show that softmax attention already performs a hidden form of these rotations on query-key pairs. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling.
arXiv Detail & Related papers (2025-11-21T16:50:00Z) - Do traveling waves make good positional encodings? [44.55744608160896]
We propose RollPE, a novel positional encoding mechanism based on traveling waves. We show it significantly outperforms traditional absolute positional embeddings. We derive a mathematical equivalence of RollPE to a particular configuration of RoPE.
arXiv Detail & Related papers (2025-11-11T14:32:45Z) - Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings [29.421443764865003]
We present an analysis indicating that what and where are entangled in the popular rotary position embedding (RoPE). We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings (PoPE), that eliminates the what-where confound.
arXiv Detail & Related papers (2025-09-05T14:22:27Z) - ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices [25.99231204405503]
We propose ComRoPE, which generalizes Rotary Positional Embedding (RoPE) by defining it in terms of trainable commuting angle matrices. We present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation. Our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research.
arXiv Detail & Related papers (2025-06-04T09:10:02Z) - PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible, data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel training algorithm by exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z) - Rotary Position Embedding for Vision Transformer [44.27871591624888]
This study provides a comprehensive analysis of Rotary Position Embedding (RoPE) when applied to the Vision Transformer (ViT).
RoPE demonstrates impressive extrapolation performance, i.e., maintaining precision while increasing image resolution at inference.
It eventually leads to performance improvement for ImageNet-1k, COCO detection, and ADE-20k segmentation.
arXiv Detail & Related papers (2024-03-20T04:47:13Z) - Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose Scaling Laws of RoPE-based Extrapolation to describe the relationship between extrapolation performance and the RoPE base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z) - RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage positional information.
RoPE encodes the absolute position with a rotation matrix while incorporating explicit relative position dependency in the self-attention formulation.
We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)
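The RoFormer entry above describes the construction the main paper builds on: each pair of feature dimensions is rotated by an angle proportional to the token's position, with per-pair frequencies theta_i = 10000^(-2i/d). Below is a hedged sketch of that standard construction; variable names are illustrative, and this is not the authors' implementation.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate each (even, odd) feature pair of x by pos * theta_i,
    with theta_i = base**(-2i/d), as in RoFormer."""
    d = x.shape[-1]
    assert d % 2 == 0
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)      # per-pair frequencies
    angle = pos * theta
    c, s = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * c - x2 * s
    out[1::2] = x1 * s + x2 * c
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The relative property: the score depends only on the position difference.
assert np.isclose(rope(q, 5) @ rope(k, 2), rope(q, 9) @ rope(k, 6))
```

Because each pair is rotated independently, this is exactly the commuting-generator setting that the main paper's Mixed RoPE generalizes and that Spherical RoPE deliberately relaxes.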
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.