HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
- URL: http://arxiv.org/abs/2410.21216v2
- Date: Thu, 05 Dec 2024 07:09:27 GMT
- Title: HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
- Authors: Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, Wei Liu
- Abstract summary: Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion.
We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information.
- Score: 19.42279057349193
- License:
- Abstract: Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. Firstly, we present empirical analyses on various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs), and find that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE's expressiveness and extrapolation. Inspired by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without constraints imposed by long-term decay, contradictory factors that limit spontaneous attention optimization and model extrapolation performance are removed. (2) Components representing positions and semantics are optimized. These enhance the model's context awareness and extrapolation, as validated by extensive experiments.
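The mechanism described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: standard RoPE rotates each feature pair by a position-dependent angle, and the HoPE-like variant shown here (`apply_hope_like` is a hypothetical name, and the `keep_frac` split is an assumed simplification of which components the paper retains) keeps only the high-frequency rotations while making the low-frequency pairs position-independent.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """Standard RoPE angles: theta_i = pos / base**(2i/dim), one per feature pair.
    Frequencies are in descending order, so the first pairs rotate fastest."""
    return pos / base ** (np.arange(0, dim, 2) / dim)

def rotate_pairs(x, ang):
    """Rotate consecutive feature pairs (x[0::2], x[1::2]) by the given angles."""
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_rope(x, pos, base=10000.0):
    """Standard RoPE: rotate every pair."""
    return rotate_pairs(x, rope_angles(pos, x.shape[-1], base))

def apply_hope_like(x, pos, keep_frac=0.5, base=10000.0):
    """Illustrative HoPE-style variant (NOT the paper's exact construction):
    keep only the highest-frequency rotations; the remaining slow-rotating
    pairs become position-independent (their angles are zeroed)."""
    ang = rope_angles(pos, x.shape[-1], base)
    n_keep = int(len(ang) * keep_frac)
    ang[n_keep:] = 0.0  # low-frequency pairs no longer depend on position
    return rotate_pairs(x, ang)

# RoPE's defining property: the score q_m . k_n depends only on the offset m - n.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
s1 = apply_rope(q, 100) @ apply_rope(k, 90)
s2 = apply_rope(q, 20) @ apply_rope(k, 10)
assert np.allclose(s1, s2)  # same offset of 10 -> same score
```

The relative-position property holds for both variants; the difference is that the HoPE-like variant removes the slow-rotating components, which the paper identifies as the source of long-term decay and limited extrapolation.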
Related papers
- Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding [64.29499221878746]
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence.
PyPE is a novel approach designed to enhance the perception of visual tokens within VLMs.
Our method reduces the relative distance between interrelated visual elements and instruction tokens.
arXiv Detail & Related papers (2025-01-19T07:00:46Z)
- Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization [23.936687072300053]
We show that Rotary Position Embedding (RoPE) enables periodic attention by implicitly achieving Non-Uniform Discrete Fourier Transform.
This periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation.
We propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization.
arXiv Detail & Related papers (2024-12-23T17:44:01Z)
- When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training [51.23520027773028]
Extending context window sizes allows large language models to process longer sequences and handle more complex tasks.
We observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding.
We develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16.
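The precision effect is easy to reproduce without any deep-learning stack. The following sketch is our illustration, not the paper's code (`to_bf16` is a hypothetical helper): it simulates bfloat16 by truncating a float32 to its top 16 bits, which matches how bf16 is laid out.

```python
import numpy as np

def to_bf16(x):
    """Simulate bfloat16 by keeping only the top 16 bits of a float32.
    bf16 has an 8-bit significand, so integers above 256 are not all exact."""
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Distinct large position ids collide once represented in bfloat16, so any
# rotation angle computed as pos * freq stops being a faithful function of
# position, degrading RoPE's intended relative-position behaviour at long range.
p1 = float(to_bf16(10000.0))
p2 = float(to_bf16(10001.0))
assert p1 == p2  # both truncate to 9984.0

# Small positions are unaffected, consistent with the issue surfacing only
# in long-context training.
assert float(to_bf16(100.0)) == 100.0
```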
arXiv Detail & Related papers (2024-11-20T17:22:31Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- On the token distance modeling ability of higher RoPE attention dimension [76.55792402912027]
We investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies.
We identify a particular type of attention heads, which we named Positional Heads, from various length-extrapolated models.
These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing.
arXiv Detail & Related papers (2024-10-11T10:47:02Z)
- Towards Inducing Long-Context Abilities in Multilingual Neural Machine Translation Models [4.625277907331917]
This work addresses the challenge of transitioning pre-trained NMT models from absolute Sinusoidal PEs to Relative PEs.
We demonstrate that parameter-efficient fine-tuning, using only a small amount of high-quality data, can successfully facilitate this transition.
We find that a small amount of long-context data in a few languages is sufficient for cross-lingual length generalization.
arXiv Detail & Related papers (2024-08-21T07:23:34Z)
- Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective [35.947737679664016]
This paper offers a straightforward yet in-depth understanding of RoPE extensions from an attention perspective.
Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
arXiv Detail & Related papers (2024-06-19T07:23:33Z)
- Base of RoPE Bounds Context Length [37.11078116104313]
Rotary position embedding (RoPE) is a technique that encodes the position information with a rotation matrix.
In this paper, we find that LLMs may obtain a superficial long-context ability based on out-of-distribution (OOD) theory.
Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
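One back-of-the-envelope way to see why the base bounds usable context (our reading, not the paper's formal derivation; `longest_rope_wavelength` is a hypothetical helper): RoPE's slowest pair advances by base**(-(dim - 2)/dim) radians per token, so its wavelength caps how far apart positions remain distinguishable within a single revolution.

```python
import math

def longest_rope_wavelength(dim, base=10000.0):
    """Wavelength (in tokens) of RoPE's slowest-rotating feature pair:
    2*pi / theta_min, where theta_min = base**(-(dim - 2) / dim)."""
    theta_min = base ** (-(dim - 2) / dim)
    return 2 * math.pi / theta_min

# With the common head dim 128 and base 10000, the slowest pair completes one
# full revolution only after roughly 54k tokens; raising the base stretches
# this out, one intuition for why larger bases support longer contexts.
print(round(longest_rope_wavelength(128)))            # ~54k tokens
print(round(longest_rope_wavelength(128, base=1e6)))  # far longer
```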
arXiv Detail & Related papers (2024-05-23T14:03:31Z)
- Length Generalization of Causal Transformers without Position Encoding [59.802708262402824]
Generalizing to longer sentences is important for recent Transformer-based language models.
We study the length generalization property of Transformers without position encodings.
We find that although NoPE can extend to sequences longer than the commonly used explicit position encodings, it still has a limited context length.
arXiv Detail & Related papers (2024-04-18T14:38:32Z)
- Resonance RoPE: Improving Context Length Generalization of Large Language Models [37.749813693281254]
This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE).
We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios.
We present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios.
arXiv Detail & Related papers (2024-02-29T19:02:03Z)
- Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose Scaling Laws of RoPE-based Extrapolation to describe the relationship between the extrapolation performance and base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.