HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
- URL: http://arxiv.org/abs/2410.21216v2
- Date: Thu, 05 Dec 2024 07:09:27 GMT
- Title: HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation
- Authors: Yuhan Chen, Ang Lv, Jian Luan, Bin Wang, Wei Liu
- Abstract summary: Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion.
We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information.
- Score: 19.42279057349193
- License:
- Abstract: Many positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive opinion: tokens farther away from the current position carry less relevant information. We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information from arbitrary positions. Firstly, we present empirical analyses on various PEs, demonstrating that models inherently learn attention with only a local-decay pattern while forming a U-shape pattern globally, contradicting the principle of long-term decay. Furthermore, we conduct a detailed analysis of rotary position encoding (RoPE, a prevalent relative positional encoding in LLMs), and find that the U-shape attention is caused by some learned components, which are also the key factor limiting RoPE's expressiveness and extrapolation. Inspired by these insights, we propose High-frequency rotary Position Encoding (HoPE). HoPE replaces the specific components in RoPE with position-independent ones, retaining only high-frequency signals, which also breaks the principle of long-term decay in theory. HoPE achieves two major advantages: (1) Without constraints imposed by long-term decay, contradictory factors that limit spontaneous attention optimization and model extrapolation performance are removed. (2) Components representing positions and semantics are optimized. These enhance the model's context awareness and extrapolation, as validated by extensive experiments.
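The mechanism described above can be sketched in a few lines of NumPy. This is an illustrative sketch, not the paper's implementation: standard RoPE rotates each feature pair by a position-dependent angle, and the HoPE-like variant shown here (`apply_hope_like` is a hypothetical name, and the `keep_frac` split is an assumed simplification of which components the paper retains) keeps only the high-frequency rotations while making the low-frequency pairs position-independent.

```python
import numpy as np

def rope_angles(pos, dim, base=10000.0):
    """Standard RoPE angles: theta_i = pos / base**(2i/dim), one per feature pair.
    Frequencies are in descending order, so the first pairs rotate fastest."""
    return pos / base ** (np.arange(0, dim, 2) / dim)

def rotate_pairs(x, ang):
    """Rotate consecutive feature pairs (x[0::2], x[1::2]) by the given angles."""
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def apply_rope(x, pos, base=10000.0):
    """Standard RoPE: rotate every pair."""
    return rotate_pairs(x, rope_angles(pos, x.shape[-1], base))

def apply_hope_like(x, pos, keep_frac=0.5, base=10000.0):
    """Illustrative HoPE-style variant (NOT the paper's exact construction):
    keep only the highest-frequency rotations; the remaining slow-rotating
    pairs become position-independent (their angles are zeroed)."""
    ang = rope_angles(pos, x.shape[-1], base)
    n_keep = int(len(ang) * keep_frac)
    ang[n_keep:] = 0.0  # low-frequency pairs no longer depend on position
    return rotate_pairs(x, ang)

# RoPE's defining property: the score q_m . k_n depends only on the offset m - n.
rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)
s1 = apply_rope(q, 100) @ apply_rope(k, 90)
s2 = apply_rope(q, 20) @ apply_rope(k, 10)
assert np.allclose(s1, s2)  # same offset of 10 -> same score
```

The relative-position property holds for both variants; the difference is that the HoPE-like variant removes the slow-rotating components, which the paper identifies as the source of long-term decay and limited extrapolation.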
Related papers
- Advancing General Multimodal Capability of Vision-language Models with Pyramid-descent Visual Position Encoding [64.29499221878746]
Vision-language Models (VLMs) have shown remarkable capabilities in advancing general artificial intelligence.
PyPE is a novel approach designed to enhance the perception of visual tokens within VLMs.
Our method reduces the relative distance between interrelated visual elements and instruction tokens.
arXiv Detail & Related papers (2025-01-19T07:00:46Z)
- Fourier Position Embedding: Enhancing Attention's Periodic Extension for Length Generalization [23.936687072300053]
We show that Rotary Position Embedding (RoPE) enables periodic attention by implicitly achieving Non-Uniform Discrete Fourier Transform.
This periodicity is undermined by the spectral damage caused by: 1) linear layers and activation functions outside of attention; 2) insufficiently trained frequency components brought by time-domain truncation.
We propose Fourier Position Embedding (FoPE), which enhances attention's frequency-domain properties to improve both its periodic extension and length generalization.
arXiv Detail & Related papers (2024-12-23T17:44:01Z)
- When Precision Meets Position: BFloat16 Breaks Down RoPE in Long-Context Training [51.23520027773028]
Extending context window sizes allows large language models to process longer sequences and handle more complex tasks.
We observe that using RoPE with BFloat16 format results in numerical issues, causing it to deviate from its intended relative positional encoding.
We develop AnchorAttention, a plug-and-play attention method that alleviates numerical issues caused by BFloat16.
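The precision effect is easy to reproduce without any deep-learning stack. The following sketch is our illustration, not the paper's code (`to_bf16` is a hypothetical helper): it simulates bfloat16 by truncating a float32 to its top 16 bits, which matches how bf16 is laid out.

```python
import numpy as np

def to_bf16(x):
    """Simulate bfloat16 by keeping only the top 16 bits of a float32.
    bf16 has an 8-bit significand, so integers above 256 are not all exact."""
    bits = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    return (bits & np.uint32(0xFFFF0000)).view(np.float32)

# Distinct large position ids collide once represented in bfloat16, so any
# rotation angle computed as pos * freq stops being a faithful function of
# position, degrading RoPE's intended relative-position behaviour at long range.
p1 = float(to_bf16(10000.0))
p2 = float(to_bf16(10001.0))
assert p1 == p2  # both truncate to 9984.0

# Small positions are unaffected, consistent with the issue surfacing only
# in long-context training.
assert float(to_bf16(100.0)) == 100.0
```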
arXiv Detail & Related papers (2024-11-20T17:22:31Z)
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z)
- On the token distance modeling ability of higher RoPE attention dimension [76.55792402912027]
We investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies.
We identify a particular type of attention heads, which we named Positional Heads, from various length-extrapolated models.
These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing.
arXiv Detail & Related papers (2024-10-11T10:47:02Z)
- Towards Inducing Long-Context Abilities in Multilingual Neural Machine Translation Models [4.625277907331917]
This work addresses the challenge of transitioning pre-trained NMT models from absolute Sinusoidal PEs to Relative PEs.
We demonstrate that parameter-efficient fine-tuning, using only a small amount of high-quality data, can successfully facilitate this transition.
We find that a small amount of long-context data in a few languages is sufficient for cross-lingual length generalization.
arXiv Detail & Related papers (2024-08-21T07:23:34Z)
- Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective [35.947737679664016]
This paper offers a straightforward yet in-depth understanding of RoPE extensions from an attention perspective.
Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
arXiv Detail & Related papers (2024-06-19T07:23:33Z)
- Base of RoPE Bounds Context Length [37.11078116104313]
Rotary position embedding (RoPE) is a technique that encodes the position information with a rotation matrix.
In this paper, we find that LLMs may obtain a superficial long-context ability based on out-of-distribution (OOD) theory.
Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
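One back-of-the-envelope way to see why the base bounds usable context (our reading, not the paper's formal derivation; `longest_rope_wavelength` is a hypothetical helper): RoPE's slowest pair advances by base**(-(dim - 2)/dim) radians per token, so its wavelength caps how far apart positions remain distinguishable within a single revolution.

```python
import math

def longest_rope_wavelength(dim, base=10000.0):
    """Wavelength (in tokens) of RoPE's slowest-rotating feature pair:
    2*pi / theta_min, where theta_min = base**(-(dim - 2) / dim)."""
    theta_min = base ** (-(dim - 2) / dim)
    return 2 * math.pi / theta_min

# With the common head dim 128 and base 10000, the slowest pair completes one
# full revolution only after roughly 54k tokens; raising the base stretches
# this out, one intuition for why larger bases support longer contexts.
print(round(longest_rope_wavelength(128)))            # ~54k tokens
print(round(longest_rope_wavelength(128, base=1e6)))  # far longer
```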
arXiv Detail & Related papers (2024-05-23T14:03:31Z)
- Length Generalization of Causal Transformers without Position Encoding [59.802708262402824]
Generalizing to longer sentences is important for recent Transformer-based language models.
We study the length generalization property of Transformers without position encodings.
We find that although NoPE can extend to sequences longer than the commonly used explicit position encodings, it still has a limited context length.
arXiv Detail & Related papers (2024-04-18T14:38:32Z)
- Resonance RoPE: Improving Context Length Generalization of Large Language Models [37.749813693281254]
This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE).
We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios.
We present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios.
arXiv Detail & Related papers (2024-02-29T19:02:03Z)
- Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose Scaling Laws of RoPE-based Extrapolation to describe the relationship between the extrapolation performance and base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.