Related papers: Selective Rotary Position Embedding

Selective Rotary Position Embedding

URL: http://arxiv.org/abs/2511.17388v1
Date: Fri, 21 Nov 2025 16:50:00 GMT
Title: Selective Rotary Position Embedding
Authors: Sajad Movahedi, Timur Carstensen, Arshia Afzal, Frank Hutter, Antonio Orvieto, Volkan Cevher
Abstract summary: We introduce Selective RoPE, an input-dependent rotary embedding mechanism. We show that softmax attention already performs a hidden form of these rotations on query-key pairs. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling.
Score: 84.22998043041198
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Position information is essential for language modeling. In softmax transformers, Rotary Position Embeddings (RoPE) encode positions through fixed-angle rotations, while in linear transformers, order is handled via input-dependent (selective) gating that decays past key-value associations. Selectivity has generally been shown to improve language-related tasks. Inspired by this, we introduce Selective RoPE, an input-dependent rotary embedding mechanism that generalizes RoPE and enables rotation in arbitrary angles for both linear and softmax transformers. We show that softmax attention already performs a hidden form of these rotations on query-key pairs, uncovering an implicit positional structure. We further show that in state-space models and gated linear transformers, the real part manages forgetting while the imaginary part encodes positions through rotations. We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling and on difficult sequence tasks like copying, state tracking, and retrieval.
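
Illustrative sketch (not the authors' implementation): the abstract contrasts fixed-angle rotations with input-dependent ones. The NumPy snippet below shows one way that contrast could look. The function names, the projection matrix w, and the choice to accumulate per-token angle increments with a cumulative sum are all assumptions made for illustration; the paper's actual parameterization may differ.

```python
import numpy as np

def rotate_pairs(x, angles):
    """Rotate consecutive feature pairs of x by per-position angles.

    x:      (seq_len, dim), dim even
    angles: (seq_len, dim // 2), one angle per position and feature pair
    """
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def fixed_angles(seq_len, dim, base=10000.0):
    """Standard RoPE: angle = position * fixed per-pair frequency."""
    freqs = base ** (-np.arange(dim // 2) / (dim // 2))
    return np.arange(seq_len)[:, None] * freqs[None, :]

def selective_angles(x, w):
    """Hypothetical input-dependent angles (an assumption, not the paper's
    formula): each token contributes an increment x_t @ w, and the angle at
    position t is the running sum of increments up to t."""
    return np.cumsum(x @ w, axis=0)

# Usage: rotate the same token features with fixed and with selective angles.
rng = np.random.default_rng(0)
x = rng.normal(size=(6, 8))          # 6 tokens, 8 features
w = rng.normal(size=(8, 4)) * 0.1    # hypothetical angle projection
x_fixed = rotate_pairs(x, fixed_angles(6, 8))
x_selective = rotate_pairs(x, selective_angles(x, w))
```

In both variants the dot product between a rotated query at position t and a rotated key at position s depends only on the difference of their angles. With fixed angles that difference is a function of t - s alone; in this sketch's selective variant it is the sum of the input-dependent increments between s and t.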

Related papers

Do traveling waves make good positional encodings? [44.55744608160896]
We propose RollPE, a novel positional encoding mechanism based on traveling waves. We show it significantly outperforms traditional absolute positional embeddings. We derive a mathematical equivalence of RollPE to a particular configuration of RoPE.
arXiv Detail & Related papers (2025-11-11T14:32:45Z)
PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z)
Rotary Offset Features in Large Language Models [0.9137554315375919]
We study the features and patterns that emerge in queries and keys when using rotary embeddings. We derive bounds predicting which rotary frequencies give rise to rotary offset features. We verify our predictions empirically across models of different sizes and architectures.
arXiv Detail & Related papers (2025-03-03T18:55:09Z)
LieRE: Lie Rotational Positional Encodings [5.32707456872718]
Transformer architectures rely on position encodings to model the structure of input data. We introduce Lie Relative algebras (LieRE) to increase the representational capacity of positional encodings in transformers. We demonstrate the effectiveness of LieRE on 2D and 3D vision tasks, showing that it generalizes well to higher input resolutions.
arXiv Detail & Related papers (2024-06-14T17:41:55Z)
Functional Interpolation for Relative Positions Improves Long Context Transformers [86.12843093589]
We propose a novel functional relative position encoding with progressive interpolation, FIRE, to improve Transformer generalization to longer contexts. We theoretically prove that this can represent some of the popular relative position encodings, such as T5's RPE, ALiBi, and KERPLE. We show that FIRE models have better generalization to longer contexts on both zero-shot language modeling and long text benchmarks.
arXiv Detail & Related papers (2023-10-06T17:59:11Z)
Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings [68.61185138897312]
We show that a frozen transformer language model encodes strong positional information through the shrinkage of self-attention variance. Our findings serve to justify the decision to discard positional embeddings and thus facilitate more efficient pretraining of transformer language models.
arXiv Detail & Related papers (2023-05-23T01:03:40Z)
Rotation-Invariant Transformer for Point Cloud Matching [42.5714375149213]
We introduce RoITr, a Rotation-Invariant Transformer to cope with the pose variations in the point cloud matching task. We propose a global transformer with rotation-invariant cross-frame spatial awareness learned by the self-attention mechanism. RoITr surpasses the existing methods by at least 13 and 5 percentage points in terms of Inlier Ratio and Registration Recall.
arXiv Detail & Related papers (2023-03-14T20:55:27Z)
A Length-Extrapolatable Transformer [98.54835576985664]
We focus on length extrapolation, i.e., training on short texts while evaluating longer sequences. We introduce a relative position embedding to explicitly maximize attention resolution. We evaluate different Transformer variants with language modeling.
arXiv Detail & Related papers (2022-12-20T18:56:20Z)
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
We propose a novel way to accelerate attention calculation for Transformers with relative positional encoding (RPE). Based upon the observation that relative positional encoding forms a Toeplitz matrix, we mathematically show that kernelized attention with RPE can be calculated efficiently using Fast Fourier Transform (FFT).
arXiv Detail & Related papers (2021-06-23T17:51:26Z)
RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information. RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in the self-attention formulation. We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)

This list is automatically generated from the titles and abstracts of the papers on this site.