Rotary Outliers and Rotary Offset Features in Large Language Models
- URL: http://arxiv.org/abs/2503.01832v1
- Date: Mon, 03 Mar 2025 18:55:09 GMT
- Title: Rotary Outliers and Rotary Offset Features in Large Language Models
- Authors: André Jonasson
- Abstract summary: We study the features and patterns that emerge in queries and keys when using rotary embeddings. We find and analyze outliers across models in queries and keys and find that they are likely to be found in rotary features with partial cycles.
- Score: 1.9580473532948401
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Transformer-based Large Language Models (LLMs) rely on positional encodings to provide sequence position information to their attention mechanism. Rotary Positional Encodings (RoPE), which encode relative position by rotating queries and keys, have become widely used in modern LLMs. We study the features and patterns that emerge in queries and keys when using rotary embeddings. Our analysis reveals consistent patterns within the same model across layers and attention heads and across different models and architectures. We present and apply analysis techniques and show how the queries and keys use RoPE to construct various attention patterns, including attention sinks. We find and analyze outliers across models in queries and keys and find that they are likely to be found in rotary features with partial cycles. We derive bounds that indicate which rotary frequencies are likely to be selected as outlier features and the minimum angle that the query-key rotary pairs in these features tend to stay above, and we verify these bounds empirically on models with significant architectural differences.
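To make the mechanism concrete, the sketch below (a minimal illustration, not the paper's code) rotates query/key feature pairs by position-dependent angles using the standard RoPE frequency schedule theta_i = base^(-2i/d), checks that the query-key dot product depends only on the relative offset, and lists which rotary pairs complete only a partial cycle over a context window. The values base = 10000, head dimension d = 128, and context length max_pos = 4096 are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def rope_rotate(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive feature pairs of x by position-dependent RoPE angles."""
    d = x.shape[-1]
    assert d % 2 == 0, "head dimension must be even"
    i = np.arange(d // 2)
    theta = base ** (-2.0 * i / d)       # per-pair rotary frequency
    angle = pos * theta                  # rotation angle of each pair at this position
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[..., 0::2], x[..., 1::2]  # split features into rotary pairs
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# The rotations make the query-key dot product depend only on the offset m - n.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
same_offset = np.allclose(
    rope_rotate(q, 100) @ rope_rotate(k, 90),  # offset 10
    rope_rotate(q, 20) @ rope_rotate(k, 10),   # offset 10
)
print(same_offset)  # True

# "Partial cycle" features: rotary pairs whose total rotation over the context
# window stays below 2*pi, i.e. theta_i * (max_pos - 1) < 2*pi.
d, base, max_pos = 128, 10000.0, 4096   # illustrative values, not from the paper
theta = base ** (-2.0 * np.arange(d // 2) / d)
partial = np.nonzero(theta * (max_pos - 1) < 2 * np.pi)[0]
print(partial)  # indices of low-frequency pairs that never complete a full cycle
```

The low-frequency indices printed at the end correspond to the "partial cycle" features the abstract refers to: their query-key angles stay within a limited range over the whole context window, which is where the outliers are reported to be found.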
Related papers
- Attention Basin: Why Contextual Position Matters in Large Language Models [16.11590856103274]
We show that models systematically assign higher attention to items at the beginning and end of a sequence, while neglecting those in the middle. We introduce Attention-Driven Reranking (AttnRank), a framework that estimates a model's intrinsic positional attention preferences. AttnRank is a model-agnostic, training-free, and plug-and-play method with minimal computational overhead.
arXiv Detail & Related papers (2025-08-07T08:08:08Z)
- Rotary Masked Autoencoders are Versatile Learners [0.0]
We present the Rotary Masked Autoencoder (RoMAE), an extension to the Masked Autoencoder (MAE) that enables representation learning with multidimensional continuous positional information. We showcase RoMAE's performance on a variety of modalities including irregular and multivariate time-series, images, and audio.
arXiv Detail & Related papers (2025-05-26T21:45:18Z)
- PaTH Attention: Position Encoding via Accumulating Householder Transformations [56.32365080761523]
PaTH is a flexible data-dependent position encoding scheme based on accumulated products of Householder transformations. We derive an efficient parallel algorithm for training through exploiting a compact representation of products of Householder matrices.
arXiv Detail & Related papers (2025-05-22T08:36:09Z)
- MI-DETR: An Object Detection Model with Multi-time Inquiries Mechanism [67.56918651825056]
We propose a new decoder architecture with the parallel Multi-time Inquiries (MI) mechanism. Our MI-based model, MI-DETR, outperforms all existing DETR-like models on the COCO benchmark. A series of diagnostic and visualization experiments demonstrate the effectiveness, rationality, and interpretability of MI.
arXiv Detail & Related papers (2025-03-03T12:19:06Z)
- Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding [58.364933651703524]
We show that concentrated massive values consistently emerge in specific regions of attention queries. These massive values play a critical role in interpreting contextual knowledge. We trace the emergence of massive values and find that such concentration is caused by Rotary Positional Encoding (RoPE).
arXiv Detail & Related papers (2025-02-03T17:47:03Z)
- Transformers Use Causal World Models in Maze-Solving Tasks [49.67445252528868]
We investigate the inner workings of transformer models trained on tasks across various domains. We find that transformers are able to reason with respect to a greater number of active features than they see during training. We observe that varying positional encodings can alter how world models are encoded in a model's residual stream.
arXiv Detail & Related papers (2024-12-16T15:21:04Z)
- WaveRoRA: Wavelet Rotary Route Attention for Multivariate Time Series Forecasting [4.680374146155483]
We propose a wavelet learning framework to model complex temporal dependencies of the time series data.
The wavelet domain integrates both time and frequency information, allowing for the analysis of local characteristics of signals at different scales.
We propose a novel attention mechanism: Rotary Route Attention (RoRA).
arXiv Detail & Related papers (2024-10-30T02:36:55Z)
- AttentionViz: A Global View of Transformer Attention [60.82904477362676]
We present a new visualization technique designed to help researchers understand the self-attention mechanism in transformers.
The main idea behind our method is to visualize a joint embedding of the query and key vectors used by transformer models to compute attention.
We create an interactive visualization tool, AttentionViz, based on these joint query-key embeddings.
arXiv Detail & Related papers (2023-05-04T23:46:49Z)
- Compositional Attention: Disentangling Search and Retrieval [66.7108739597771]
Multi-head, key-value attention is the backbone of the Transformer model and its variants.
Standard attention heads learn a rigid mapping between search and retrieval.
We propose a novel attention mechanism, called Compositional Attention, that replaces the standard head structure.
arXiv Detail & Related papers (2021-10-18T15:47:38Z)
- RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information.
RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)
- Fixed Encoder Self-Attention Patterns in Transformer-Based Machine Translation [73.11214377092121]
We propose to replace all but one attention head of each encoder layer with simple fixed -- non-learnable -- attentive patterns.
Our experiments with different data sizes and multiple language pairs show that fixing the attention heads on the encoder side of the Transformer at training time does not impact the translation quality.
arXiv Detail & Related papers (2020-02-24T13:53:06Z)