Rotary Offset Features in Large Language Models
- URL: http://arxiv.org/abs/2503.01832v2
- Date: Fri, 22 Aug 2025 13:41:57 GMT
- Title: Rotary Offset Features in Large Language Models
- Authors: André Jonasson
- Abstract summary: We study the features and patterns that emerge in queries and keys when using rotary embeddings. We derive bounds predicting which rotary frequencies give rise to rotary offset features. We verify our predictions empirically across models of different sizes and architectures.
- Score: 0.9137554315375919
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based Large Language Models (LLMs) rely on positional encodings to provide sequence position information to their attention mechanism. Rotary Positional Encodings (RoPE), which encode relative position by rotating queries and keys, have become widely used in modern LLMs. We study the features and patterns that emerge in queries and keys when using rotary embeddings and introduce the concept of rotary offset features. Our analysis reveals that these features, which frequently exhibit large activations and are often interpreted as outliers, arise consistently across layers, attention heads, and model architectures. We derive bounds predicting which rotary frequencies give rise to rotary offset features and the minimum angle between the query-key pairs for these features. We verify our predictions empirically across models of different sizes and architectures.
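The abstract's statement that RoPE "encode[s] relative position by rotating queries and keys" can be illustrated with a minimal sketch. It is not the paper's code: the helper names `rope_rotate` and `dot`, the example vectors, and the frequency schedule theta_i = base^(-2i/d) with base 10000 (the common RoFormer-style choice) are all assumptions made for illustration.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate each consecutive pair (vec[2i], vec[2i+1]) by position * theta_i.

    Assumes the RoFormer-style schedule theta_i = base ** (-2i / d);
    real models may use a different base or pair layout.
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(d // 2):
        theta = base ** (-2.0 * i / d)
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * i], vec[2 * i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an unnormalized attention score."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors (head dimension 4).
q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Same relative offset (3) at two different absolute positions.
score_a = dot(rope_rotate(q, 10), rope_rotate(k, 7))
score_b = dot(rope_rotate(q, 110), rope_rotate(k, 107))
```

Under these assumptions `score_a` and `score_b` agree up to floating-point error, because the rotations applied to the query and the key cancel except for the angle set by their position difference. Rotation also leaves vector norms unchanged.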
Related papers
- Beyond Real: Imaginary Extension of Rotary Position Embeddings for Long-Context LLMs [72.8830548005884]
Rotary Position Embeddings (RoPE) have become a standard for encoding sequence order in Large Language Models.
Standard implementations utilize only the real component of the complex-valued dot product for attention score calculation.
We propose an extension that re-incorporates the discarded imaginary component.
arXiv Detail & Related papers (2025-12-08T12:59:54Z)
- Selective Rotary Position Embedding [84.22998043041198]
We introduce Selective RoPE, an input-dependent rotary embedding mechanism.
We show that softmax attention already performs a hidden form of these rotations on query-key pairs.
We validate our method by equipping gated transformers with Selective RoPE, demonstrating that its input-dependent rotations improve performance in language modeling.
arXiv Detail & Related papers (2025-11-21T16:50:00Z)
- HoPE: Hyperbolic Rotary Positional Encoding for Stable Long-Range Dependency Modeling in Large Language Models [19.3827288035483]
We propose Hyperbolic Rotary Positional Encoding (HoPE), which leverages hyperbolic functions to implement Lorentz rotations on token representations.
Tests show HoPE consistently outperforms existing positional encoding methods.
arXiv Detail & Related papers (2025-09-05T16:20:48Z)
- Attention Basin: Why Contextual Position Matters in Large Language Models [16.11590856103274]
We show that models systematically assign higher attention to items at the beginning and end of a sequence, while neglecting those in the middle.
We introduce Attention-Driven Reranking (AttnRank), a framework that estimates a model's intrinsic positional attention preferences.
AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead.
arXiv Detail & Related papers (2025-08-07T08:08:08Z)
- ComRoPE: Scalable and Robust Rotary Position Embedding Parameterized by Trainable Commuting Angle Matrices [25.99231204405503]
We propose ComRoPE, which generalizes Rotary Position Embedding (RoPE) by defining it in terms of trainable commuting angle matrices.
We present two types of trainable commuting angle matrices as sufficient solutions to the RoPE equation.
Our framework shows versatility in generalizing to existing RoPE formulations and offering new insights for future positional encoding research.
arXiv Detail & Related papers (2025-06-04T09:10:02Z)
- Rotary Masked Autoencoders are Versatile Learners [0.0]
We present the Rotary Masked Autoencoder (RoMAE).
RoMAE is an extension to the Masked Autoencoder (MAE) that enables representation learning with multidimensional continuous positional information.
We showcase RoMAE's performance on a variety of modalities including irregular and multivariate time-series, images, and audio.
arXiv Detail & Related papers (2025-05-26T21:45:18Z)
- PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible data-dependent position encoding scheme based on accumulated products of Householder transformations.
We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z)
- MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism [67.56918651825056]
We propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism.
Our MI-based model, MI-DETR, outperforms all existing DETR-like models on the COCO benchmark.
A series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
arXiv Detail & Related papers (2025-03-03T12:19:06Z)
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding [58.364933651703524]
We show that concentrated massive values consistently emerge in specific regions of attention queries.
These massive values play a critical role in interpreting contextual knowledge.
We trace the emergence of massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE).
arXiv Detail & Related papers (2025-02-03T17:47:03Z)
- Transformers Use Causal World Models in Maze-Solving Tasks [49.67445252528868]
We investigate the inner workings of transformer models trained on tasks across various domains.
We find that transformers are able to reason with respect to a greater number of active features than they see during training.
We observe that varying positional encodings can alter how world models (WMs) are encoded in a model's residual stream.
arXiv Detail & Related papers (2024-12-16T15:21:04Z)
- WaveRoRA: Wavelet Rotary Route Attention for Multivariate Time Series Forecasting [4.680374146155483]
We propose a wavelet learning framework to model complex temporal dependencies of the time series data.
The wavelet domain integrates both time and frequency information, allowing for the analysis of local characteristics of signals at different scales.
We propose a novel attention mechanism: Rotary Route Attention (RoRA).
arXiv Detail & Related papers (2024-10-30T02:36:55Z)
- PARE-Net: Position-Aware Rotation-Equivariant Networks for Robust Point Cloud Registration [8.668461141536383]
Learning rotation-invariant distinctive features is a fundamental requirement for point cloud registration.
Existing methods often use rotation-sensitive networks to extract features, while employing rotation augmentation to learn an approximate invariant mapping rudely.
We propose a novel position-aware rotation-equivariant network, for efficient, light-weighted, and robust registration.
arXiv Detail & Related papers (2024-07-14T10:26:38Z)
- LieRE: Lie Rotational Positional Encodings [5.32707456872718]
Transformer architectures rely on position encodings to model the structure of input data.
We introduce Lie Relative algebras (LieRE) to increase the representational capacity of positional encodings in transformers.
We demonstrate the effectiveness of LieRE on 2D and 3D vision tasks, showing that it generalizes well to higher input resolutions.
arXiv Detail & Related papers (2024-06-14T17:41:55Z)
- AttentionViz: A Global View of Transformer Attention [60.82904477362676]
We present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers.
The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention.
We create an interactive visualization tool, AttentionViz, based on these joint query-key embeddings.
arXiv Detail & Related papers (2023-05-04T23:46:49Z)
- SPE-Net: Boosting Point Cloud Analysis via Rotation Robustness Enhancement [118.20816888815658]
We propose a novel deep architecture tailored for 3D point cloud applications, named SPE-Net.
The embedded "Selective Position Encoding (SPE)" procedure relies on an attention mechanism that can effectively attend to the underlying rotation condition of the input.
We demonstrate the merits of the SPE-Net and the associated hypothesis on four benchmarks, showing evident improvements on both rotated and unrotated test data over SOTA methods.
arXiv Detail & Related papers (2022-11-15T15:59:09Z)
- Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z)
- RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information.
RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)