Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective
- URL: http://arxiv.org/abs/2406.13282v2
- Date: Tue, 29 Oct 2024 11:29:31 GMT
- Title: Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective
- Authors: Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
- Abstract summary: This paper offers a straightforward yet in-depth understanding of RoPE extensions from an attention perspective.
Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
- Score: 35.947737679664016
- License:
- Abstract: Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate the RoPE trained on comparatively short texts to far longer texts. Substantial effort has been devoted to boosting extrapolation by extending the formulation of RoPE; however, few of these works attempt to explain their inner workings comprehensively. In this paper, we offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective on two benchmarking tasks. A broad array of experiments reveals several valuable findings: 1) Keeping attention patterns close to those at the pretrained length improves extrapolation; 2) Large attention uncertainty leads to retrieval errors; 3) Using longer continual pretraining lengths for RoPE extensions can reduce attention uncertainty and significantly enhance extrapolation.
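Finding 2) hinges on what the abstract calls attention uncertainty. A common proxy for this, used here as an illustrative assumption rather than the paper's exact metric, is the entropy of each query's post-softmax attention distribution: the flatter the distribution, the less decisively the model can retrieve a single relevant token. A minimal NumPy sketch:

```python
import numpy as np

def attention_entropy(scores: np.ndarray) -> np.ndarray:
    """Entropy (in nats) of each query's attention distribution.

    scores: (num_queries, num_keys) pre-softmax logits, e.g. q @ k.T / sqrt(d)
            computed after RoPE has been applied to q and k.
    Higher entropy = attention mass spread over many keys = more "uncertain"
    attention, which the paper links to retrieval errors.
    """
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)             # softmax
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)

# Toy example: a sharply peaked query vs. a completely flat one over 8 keys.
peaked = np.array([[8.0, 0, 0, 0, 0, 0, 0, 0]])
flat = np.zeros((1, 8))
print(attention_entropy(peaked))  # ~0.02 nats: confident retrieval
print(attention_entropy(flat))    # ~2.08 nats (= ln 8): maximal uncertainty
```

Under this proxy, finding 3) can be read as continual pretraining on longer sequences keeping such entropies lower at extrapolated positions.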
Related papers
- HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation [19.42279057349193]
Positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive bias.
We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information.
arXiv Detail & Related papers (2024-10-28T17:01:52Z)
- Why Does the Effective Context Length of LLMs Fall Short? [68.34573617977013]
In this work, we introduce ShifTed Rotary position embeddING (STRING).
STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths.
Experimental results show that STRING dramatically improves the performance of the latest large-scale models.
arXiv Detail & Related papers (2024-10-24T13:51:50Z)
- On the token distance modeling ability of higher RoPE attention dimension [76.55792402912027]
We investigate the correlation between a hidden dimension of an attention head and its contribution to capturing long-distance dependencies.
We identify a particular type of attention head, which we name Positional Heads, across various length-extrapolated models.
These heads exhibit a strong focus on long-range information interaction and play a pivotal role in long input processing.
arXiv Detail & Related papers (2024-10-11T10:47:02Z)
- Round and Round We Go! What makes Rotary Positional Encodings useful? [15.543752938828831]
We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level.
We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies.
We propose a modification of RoPE that fixes some highlighted issues and improves performance.
arXiv Detail & Related papers (2024-10-08T17:07:01Z)
- Mixture of In-Context Experts Enhance LLMs' Long Context Awareness [51.65245442281049]
Large language models (LLMs) exhibit uneven awareness of different contextual positions.
We introduce a novel method called "Mixture of In-Context Experts" (MoICE) to address this challenge.
MoICE comprises two key components: a router integrated into each attention head within LLMs and a lightweight router-only training optimization strategy.
arXiv Detail & Related papers (2024-06-28T01:46:41Z)
- Base of RoPE Bounds Context Length [37.11078116104313]
Rotary position embedding (RoPE) is a technique that encodes position information with a rotation matrix (a minimal sketch of this rotation appears after this list).
In this paper, we find that LLMs may obtain a superficial long-context ability based on out-of-distribution (OOD) theory.
Our work reveals the relationship between context length and the RoPE base both theoretically and empirically, which may shed light on future long-context training.
arXiv Detail & Related papers (2024-05-23T14:03:31Z)
- Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding [40.289596031245374]
All Transformer-based models, including large language models (LLMs), suffer from a preset length limit.
Numerous methods have emerged to enhance the length extrapolation of Transformers.
This survey aims to enable the reader to gain a deep understanding of existing methods and provide stimuli for future research.
arXiv Detail & Related papers (2023-12-28T14:42:24Z)
- Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose the Scaling Laws of RoPE-based Extrapolation to describe the relationship between the extrapolation performance and the base value.
We achieve extrapolation up to a context length of 1 million tokens with only a 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z)
- PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents [78.27865456183397]
We propose PEARL, a prompting framework to improve reasoning over long documents.
Each stage of PEARL is implemented via zero-shot or few-shot prompting with minimal human input.
We evaluate PEARL on a challenging subset of the QuALITY dataset, which contains questions that require complex reasoning over long narrative texts.
arXiv Detail & Related papers (2023-05-23T23:06:04Z)
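Several entries above ("Base of RoPE Bounds Context Length", "Scaling Laws of RoPE-based Extrapolation", and the main paper itself) rest on how RoPE turns a token position into a rotation of each two-dimensional slice of the query/key vectors, with per-slice frequencies set by the base (10000 in the original RoPE). Below is a minimal NumPy sketch of that standard formulation; the extension knobs noted in the comments (position interpolation, base scaling) are common techniques cited for illustration, not the specific recipes proposed in these papers.

```python
import numpy as np

def rope_angles(positions: np.ndarray, dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angles theta[p, i] = p * base**(-2i/dim) for position p and
    2-D feature pair i. Larger bases rotate high-index pairs more slowly;
    this frequency spectrum is the lever the base-vs-context-length and
    scaling-law entries above analyze."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # (dim/2,)
    return np.outer(positions, inv_freq)               # (seq_len, dim/2)

def apply_rope(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive (even, odd) feature pairs of query/key vectors x
    (shape: seq_len x dim, dim even) by their position-dependent angles."""
    angles = rope_angles(positions, x.shape[-1], base)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    rotated = np.stack([x1 * cos - x2 * sin,
                        x1 * sin + x2 * cos], axis=-1)
    return rotated.reshape(x.shape)

# Common RoPE-extension knobs (illustrative, not the papers' exact methods):
#   position interpolation:   apply_rope(q, positions / scale)            # squeeze positions
#   base ("NTK-style") scaling: apply_rope(q, positions, base=10000.0 * scale)
# Both keep the angles seen at long positions closer to the range covered
# during pretraining, which is what the extrapolation analyses above study.

seq_len, dim = 4096, 64
q = np.random.randn(seq_len, dim)
q_rope = apply_rope(q, np.arange(seq_len))
print(q_rope.shape)  # (4096, 64)
```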
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.