Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings
- URL: http://arxiv.org/abs/2509.10534v1
- Date: Fri, 05 Sep 2025 14:22:27 GMT
- Title: Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings
- Authors: Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer
- Abstract summary: We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound.
- Score: 29.421443764865003
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The attention mechanism in a Transformer architecture matches key to query based on both content -- the what -- and position in a sequence -- the where. We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities, whereas RoPE's performance degrades significantly on longer sequences at test time without fine tuning or the use of position-interpolation methods.
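To make the entanglement concrete, here is a minimal NumPy sketch (illustrative only, not code from the paper) of standard RoPE: query and key vectors are rotated by position-dependent angles before the dot product, so a single attention logit mixes how well the contents match with how far apart the tokens are. PoPE itself is not sketched here, since the abstract does not give its formulation.

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Apply RoPE (half-split variant) to vector x at integer position pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)        # theta_i = base^(-2i/d)
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

rng = np.random.default_rng(0)
q, k = rng.normal(size=64), rng.normal(size=64)

# Same "what" (q and k are unchanged), different "where": the logit varies with the
# relative offset, so content match and positional match share a single score.
for q_pos, k_pos in [(10, 10), (10, 5), (10, 0)]:
    logit = rope_rotate(q, q_pos) @ rope_rotate(k, k_pos)
    print(f"relative offset {q_pos - k_pos:2d}: attention logit = {logit:+.3f}")
```

The loop prints a different logit for the same content pair at each offset, which is exactly the what-where confound that the diagnostic task described in the abstract probes.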
Related papers
- DoPE: Denoising Rotary Position Embedding [60.779039511252584]
Rotary Position Embedding (RoPE) in Transformer models has inherent limits that weaken length extrapolation. We reinterpret the attention map with positional encoding as a noisy feature map, and propose Denoising Rotary Position Embedding (DoPE). DoPE is a training-free method based on truncated matrix entropy to detect outlier frequency bands in the feature map.
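As a loose, hedged illustration of the kind of statistic this summary mentions (the function name and truncation level are assumptions, not the authors' code), the entropy of a truncated singular-value spectrum separates structured, low-rank attention maps from noise-like ones:

```python
import numpy as np

def truncated_matrix_entropy(attn_map, k=16):
    """Shannon entropy of the top-k normalized singular values of a matrix."""
    s = np.linalg.svd(attn_map, compute_uv=False)[:k]
    p = s / s.sum()
    return float(-(p * np.log(p + 1e-12)).sum())

rng = np.random.default_rng(0)
structured = rng.normal(size=(128, 4)) @ rng.normal(size=(4, 128))   # low-rank map
noise_like = rng.normal(size=(128, 128))
print(truncated_matrix_entropy(structured))   # low entropy: few dominant directions
print(truncated_matrix_entropy(noise_like))   # higher entropy: energy spread out
```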
arXiv Detail & Related papers (2025-11-12T09:32:35Z)
- A Circular Argument: Does RoPE need to be Equivariant for Vision? [45.33536249657655]
We mathematically show RoPE to be one of the most general solutions for equivariant positional embedding in one-dimensional data. We propose Spherical RoPE, a method analogous to Mixed RoPE but assuming non-commutative generators.
arXiv Detail & Related papers (2025-11-11T15:47:54Z)
- Do traveling waves make good positional encodings? [44.55744608160896]
We propose RollPE, a novel positional encoding mechanism based on traveling waves. We show it significantly outperforms traditional absolute positional embeddings. We derive a mathematical equivalence of RollPE to a particular configuration of RoPE.
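A toy sketch of the traveling-wave intuition (an assumption about the flavor of the idea, not RollPE's actual construction): each position takes a snapshot of a wave rolling across the feature channels, so a change in position appears as a phase shift of the same pattern.

```python
import numpy as np

def traveling_wave_code(pos, n_channels=16, wavelength=8.0, speed=1.0):
    """Snapshot at time `pos` of a wave traveling across the feature channels."""
    c = np.arange(n_channels)
    return np.cos(2 * np.pi * (c - speed * pos) / wavelength)

print(np.round(traveling_wave_code(0), 2))
print(np.round(traveling_wave_code(2), 2))   # same pattern, shifted by two channels
```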
arXiv Detail & Related papers (2025-11-11T14:32:45Z)
- Q-ROAR: Outlier-Aware Rescaling for RoPE Position Interpolation in Quantized Long-Context LLMs [0.9510848451801044]
We show that combining position interpolation (PI) with post-training quantization (PTQ) degrades accuracy due to coupled effects: long-context aliasing, dynamic range dilation, axis-grid anisotropy, and shifting, which together induce position-dependent logit noise. We propose Q-ROAR, a RoPE-aware, weight-only stabilization that groups RoPE dimensions into a few frequency bands and performs a small search over per-band scales for W_Q, W_K, with an optional symmetric variant to preserve logit scale.
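A hedged sketch of the band-grouping idea (the band count, scale values, and helper names are assumptions, not the released method): rotary dimension pairs are bucketed by frequency, and each band's columns of W_Q and W_K receive a scale chosen by a small search; scaling W_Q by s and W_K by 1/s leaves the logit scale intact, as in the symmetric variant mentioned above.

```python
import numpy as np

def rope_bands(head_dim, n_bands=4):
    """Group the head_dim // 2 rotary dimension pairs into contiguous frequency bands."""
    return np.array_split(np.arange(head_dim // 2), n_bands)

def apply_band_scales(w_q, w_k, bands, scales, head_dim):
    """Rescale W_Q / W_K columns band by band; s on W_Q and 1/s on W_K preserves
    the query-key logit scale (the symmetric flavor)."""
    w_q, w_k = w_q.copy(), w_k.copy()
    half = head_dim // 2
    for band, s in zip(bands, scales):
        cols = np.concatenate([band, band + half])   # both halves of each rotary pair
        w_q[:, cols] *= s
        w_k[:, cols] /= s
    return w_q, w_k

rng = np.random.default_rng(0)
d_model = head_dim = 64
w_q, w_k = rng.normal(size=(d_model, head_dim)), rng.normal(size=(d_model, head_dim))
w_q2, w_k2 = apply_band_scales(w_q, w_k, rope_bands(head_dim), [1.0, 0.9, 1.1, 1.2], head_dim)
```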
arXiv Detail & Related papers (2025-09-17T19:50:16Z)
- Positional Encoding via Token-Aware Phase Attention [62.1265709014944]
We show that Rotary Positional Embedding (RoPE) introduces an intrinsic distance-dependent bias in attention scores that limits RoPE's ability to model long contexts. This paper introduces Token-Aware Phase Attention (TAPA), a new positional encoding method that incorporates a learnable phase function into the attention mechanism.
arXiv Detail & Related papers (2025-09-16T03:53:32Z)
- HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models [19.3827288035483]
We propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations. Tests show HoPE consistently exceeds existing positional encoding methods.
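A minimal sketch of the hyperbolic analogy (an assumption about the general idea, not HoPE's actual formulation): where RoPE rotates each 2-D feature pair with cos/sin, a Lorentz boost uses cosh/sinh, so position enters through hyperbolic rather than circular functions.

```python
import numpy as np

def lorentz_boost_pair(x1, x2, rapidity):
    """Apply a 2-D Lorentz boost (cosh/sinh) to one feature pair, the hyperbolic
    counterpart of RoPE's circular cos/sin rotation of the same pair."""
    c, s = np.cosh(rapidity), np.sinh(rapidity)
    return c * x1 + s * x2, s * x1 + c * x2

pos, freq = 5, 0.1   # rapidity grows with position, analogous to a RoPE angle
print(lorentz_boost_pair(0.3, -0.7, rapidity=freq * pos))
```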
arXiv Detail & Related papers (2025-09-05T16:20:48Z)
- Context-aware Rotary Position Embedding [0.0]
Rotary Positional Embeddings (RoPE) have become a widely adopted solution due to their compatibility with relative position encoding and computational efficiency. We propose CARoPE (Context-Aware Rotary Positional Embedding), a novel generalization of RoPE that dynamically generates head-specific frequency patterns conditioned on token embeddings. CARoPE consistently outperforms RoPE and other common positional encoding baselines, achieving significantly lower perplexity, even at longer context lengths.
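A minimal sketch of the context-conditioning idea (the shapes and the softplus choice are assumptions, not the CARoPE implementation): instead of fixed constants, the rotary frequencies are produced per head from the token embedding.

```python
import numpy as np

def context_frequencies(token_emb, w_proj, n_heads, half):
    """Map each token embedding to positive, head-specific rotary frequencies."""
    f = token_emb @ w_proj                     # (..., n_heads * half)
    f = np.log1p(np.exp(f))                    # softplus keeps frequencies positive
    return f.reshape(*token_emb.shape[:-1], n_heads, half)

rng = np.random.default_rng(0)
emb = rng.normal(size=(2, 10, 64))             # (batch, seq, d_model)
w_proj = 0.1 * rng.normal(size=(64, 4 * 8))    # d_model -> n_heads * (head_dim // 2)
print(context_frequencies(emb, w_proj, n_heads=4, half=8).shape)   # (2, 10, 4, 8)
```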
arXiv Detail & Related papers (2025-07-30T20:32:19Z)
- SeqPE: Transformer with Sequential Position Encoding [76.22159277300891]
SeqPE represents each $n$-dimensional position index as a symbolic sequence and employs a lightweight sequential position encoder to learn its embedding. Experiments across language modeling, long-context question answering, and 2D image classification demonstrate that SeqPE not only surpasses strong baselines in perplexity, exact match (EM), and accuracy, but also enables seamless generalization to multi-dimensional inputs without requiring manual architectural redesign.
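A minimal PyTorch sketch of the idea (the tokenization scheme and the GRU encoder are assumptions, not SeqPE's actual implementation): an n-dimensional position index is spelled out as a short symbol sequence, and a small sequential encoder maps that sequence to a positional embedding, so unseen or higher-dimensional positions need no new lookup table.

```python
import torch
import torch.nn as nn

class SequentialPositionEncoder(nn.Module):
    def __init__(self, d_model=64, n_symbols=12):
        super().__init__()
        self.symbol_emb = nn.Embedding(n_symbols, d_model)   # digits 0-9 plus a separator
        self.encoder = nn.GRU(d_model, d_model, batch_first=True)

    def tokenize(self, position):
        """E.g. position (3, 17) -> symbol ids [3, SEP, 1, 7]."""
        SEP = 10
        ids = []
        for i, coord in enumerate(position):
            if i > 0:
                ids.append(SEP)
            ids.extend(int(digit) for digit in str(coord))
        return torch.tensor(ids)

    def forward(self, position):
        tokens = self.tokenize(position).unsqueeze(0)    # (1, seq_len)
        _, h = self.encoder(self.symbol_emb(tokens))     # final hidden state
        return h.squeeze(0).squeeze(0)                   # (d_model,) positional embedding

pe = SequentialPositionEncoder()
print(pe((3, 17)).shape)   # torch.Size([64])
```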
arXiv Detail & Related papers (2025-06-16T09:16:40Z)
- PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel algorithm for training by exploiting a compact representation of products of Householder matrices.
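A hedged sketch of the accumulation idea (the random stand-in vectors and helper names are assumptions, not the authors' parallel algorithm): each token contributes a Householder reflection, and the relative transform between two positions is the product of the reflections in between.

```python
import numpy as np

def householder(v):
    """Reflection H = I - 2 v v^T / (v^T v) determined by vector v."""
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

rng = np.random.default_rng(0)
d, seq_len = 8, 6
vs = rng.normal(size=(seq_len, d))   # in PaTH these are data-dependent; random stand-ins here

def relative_transform(i, j):
    """Accumulated product H_j ... H_{i+1}, acting as the transform from position i to j."""
    H = np.eye(d)
    for t in range(i + 1, j + 1):
        H = householder(vs[t]) @ H
    return H

print(np.round(relative_transform(1, 4), 3))   # an orthogonal 8 x 8 relative transform
```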
arXiv Detail & Related papers (2025-05-22T08:36:09Z)
- Eliminating Position Bias of Language Models: A Mechanistic Approach [119.34143323054143]
Position bias has proven to be a prevalent issue of modern language models (LMs). Our mechanistic analysis attributes the position bias to two components employed in nearly all state-of-the-art LMs: causal attention and relative positional encodings. By eliminating position bias, models achieve better performance and reliability in downstream tasks, including LM-as-a-judge, retrieval-augmented QA, molecule generation, and math reasoning.
arXiv Detail & Related papers (2024-07-01T09:06:57Z)
- Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose Scaling Laws of RoPE-based Extrapolation to describe the relationship between extrapolation performance and the base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
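A small numerical illustration (not from the paper) of the knob this scaling law studies: the RoPE base fixes the per-dimension rotation frequencies, and hence the wavelength of the slowest-rotating pair, which is the quantity that base tuning for longer contexts manipulates.

```python
import numpy as np

def rope_wavelengths(head_dim=128, base=10000.0):
    """Positions per full rotation for each rotary dimension pair."""
    half = head_dim // 2
    freqs = base ** (-np.arange(half) / half)
    return 2 * np.pi / freqs

for base in (1e4, 5e5, 1e6):
    longest = rope_wavelengths(base=base)[-1]
    print(f"base={base:>9.0f}  slowest pair completes one rotation every {longest:,.0f} positions")
```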
arXiv Detail & Related papers (2023-10-08T15:50:36Z)
- Conformer-based End-to-end Speech Recognition With Rotary Position Embedding [11.428057887454008]
We introduce rotary position embedding (RoPE) in the convolution-augmented transformer (conformer).
RoPE encodes absolute positional information into the input sequence by a rotation matrix, and then naturally incorporates explicit relative position information into a self-attention module.
Our model achieves a relative word error rate reduction of 8.70% and 7.27% over the conformer on test-clean and test-other sets of the LibriSpeech corpus respectively.
arXiv Detail & Related papers (2021-07-13T08:07:22Z)
- RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information.
RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)