Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective
- URL: http://arxiv.org/abs/2406.13282v1
- Date: Wed, 19 Jun 2024 07:23:33 GMT
- Title: Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective
- Authors: Meizhi Zhong, Chen Zhang, Yikun Lei, Xikai Liu, Yan Gao, Yao Hu, Kehai Chen, Min Zhang
- Abstract summary: This paper offers a straightforward yet in-depth understanding of RoPE extensions from an attention perspective.
Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
- Score: 35.947737679664016
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Enabling LLMs to handle lengthy context is currently a research hotspot. Most LLMs are built upon rotary position embedding (RoPE), a popular position encoding method. Therefore, a prominent path is to extrapolate the RoPE trained on comparatively short texts to far longer texts. Substantial effort has been dedicated to boosting extrapolation by extending the formulation of RoPE; however, few works have attempted to explain the inner workings of these extensions comprehensively. In this paper, we offer a straightforward yet in-depth understanding of RoPE extensions from an attention perspective and on two benchmarking tasks. A broad array of experiments reveals several valuable findings: 1) Maintaining attention patterns close to those at the pretrained length improves extrapolation; 2) Large attention uncertainty leads to retrieval errors; 3) Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
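Finding 2) links retrieval errors to attention uncertainty. A common proxy for such uncertainty is the entropy of each query's attention distribution; the NumPy sketch below computes it (the function name and the choice of Shannon entropy are illustrative assumptions, not necessarily the paper's exact metric).

```python
import numpy as np

def attention_entropy(logits: np.ndarray) -> np.ndarray:
    """Shannon entropy of each query's attention distribution.

    logits: (num_queries, num_keys) pre-softmax attention scores.
    Higher entropy means attention is spread over many keys,
    i.e., higher "attention uncertainty".
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=-1, keepdims=True)
    return -(probs * np.log(probs + 1e-12)).sum(axis=-1)
```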
Related papers
- Mixture of In-Context Experts Enhance LLMs' Long Context Awareness [51.65245442281049]
Large language models (LLMs) exhibit uneven awareness of different contextual positions.
We introduce a novel method called "Mixture of In-Context Experts" (MoICE) to address this challenge.
MoICE comprises two key components: a router integrated into each attention head within LLMs and a lightweight router-only training optimization strategy.
arXiv Detail & Related papers (2024-06-28T01:46:41Z)
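As a rough illustration of MoICE's routing idea above, the sketch below softly mixes attention scores computed under several candidate RoPE settings; the gating scheme and names are our assumptions, not MoICE's exact design.

```python
import numpy as np

def routed_attention_scores(q, keys_per_expert, router_logits):
    """Hypothetical per-head router: softly mix attention scores obtained
    under different RoPE settings (the "in-context experts")."""
    gates = np.exp(router_logits - router_logits.max())
    gates /= gates.sum()                          # softmax over experts
    d = q.shape[-1]
    return sum(g * (q @ k.T) / np.sqrt(d)         # weighted mix of score maps
               for g, k in zip(gates, keys_per_expert))
```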
- Base of RoPE Bounds Context Length [37.11078116104313]
Rotary position embedding (RoPE) is a technique that encodes the position information with a rotation matrix.
In this paper, we find, based on out-of-distribution (OOD) theory, that LLMs may obtain a superficial long-context ability.
Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long-context training.
arXiv Detail & Related papers (2024-05-23T14:03:31Z)
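For reference, RoPE rotates each two-dimensional slice of a query or key vector by an angle proportional to the token position, with per-pair frequencies theta_i = b^(-2i/d) derived from the base b. A minimal sketch of the standard formulation:

```python
import numpy as np

def rope(x: np.ndarray, pos: int, base: float = 10000.0) -> np.ndarray:
    """Rotate a query/key vector x (even dimension d) for token position `pos`."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # theta_i = base^(-2i/d)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    out = np.empty_like(x, dtype=float)
    out[0::2] = x[0::2] * cos - x[1::2] * sin   # rotate each (x_2i, x_2i+1) pair
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out
```

Because the post-rotation score q . k depends only on the relative offset between two positions, enlarging the base stretches the rotation wavelengths, which is the lever most RoPE extension methods manipulate.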
- HiRoPE: Length Extrapolation for Code Models [31.844937849746312]
We introduce Hierarchical Rotary Position Embedding (HiRoPE).
HiRoPE extends the traditional rotary position embedding into a hierarchical format based on the hierarchical structure of source code.
We introduce a new long code understanding task with real-world code projects, in hopes of promoting further development in this field.
arXiv Detail & Related papers (2024-03-28T03:11:38Z)
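A speculative sketch of the hierarchical idea above: split the head dimension so one half encodes the fine-grained token position and the other encodes a coarse structural index (e.g., the enclosing function). The split and naming are our assumptions, not HiRoPE's exact recipe.

```python
import numpy as np

def rotate(x, pos, base=10000.0):
    """Standard RoPE rotation of vector x (even length) at position `pos`."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)
    cos, sin = np.cos(pos * freqs), np.sin(pos * freqs)
    out = np.empty_like(x, dtype=float)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

def hierarchical_rope(x, token_idx, func_idx):
    """Hypothetical two-level RoPE: low dims see the within-function token
    position, high dims see the function index (the coarse level)."""
    half = x.shape[-1] // 2                      # both halves must be even
    return np.concatenate([rotate(x[:half], token_idx),
                           rotate(x[half:], func_idx)])
```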
- Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding [78.36702055076456]
This paper introduces Multi-scale Positional Encoding (Ms-PoE), a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle relevant information located in the middle of the context.
arXiv Detail & Related papers (2024-03-05T04:58:37Z)
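The core trick of Ms-PoE, as we understand it, is head-wise position re-scaling: each attention head reads position indices compressed by a different ratio, so some heads retain a high-resolution view of the middle of the context. A hedged sketch (the ratio range is illustrative):

```python
import numpy as np

def ms_poe_positions(num_heads: int, seq_len: int,
                     min_ratio: float = 1.2, max_ratio: float = 1.8):
    """Assign each head its own compressed position indices.

    Returns an array of shape (num_heads, seq_len); head h would apply
    RoPE with positions pos / ratio[h] instead of the raw indices.
    """
    ratios = np.linspace(min_ratio, max_ratio, num_heads)
    pos = np.arange(seq_len, dtype=float)
    return pos[None, :] / ratios[:, None]
```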
- Extending LLMs' Context Window with 100 Samples [42.52554295241792]
Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window.
Recent studies have sought to extend the context window by modifying rotary position embedding (RoPE).
We introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window.
arXiv Detail & Related papers (2024-01-13T07:57:01Z)
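Both knobs named in the entry above can be sketched in a few lines. The base adjustment below follows the widely used NTK-aware rule b' = b * s^(d/(d-2)) for a length-scaling factor s, and the logit temperature is a generic stand-in; the paper's exact formulas may differ.

```python
import numpy as np

def adjusted_rope_freqs(d: int, base: float = 10000.0, scale: float = 4.0):
    """NTK-aware base adjustment: enlarge the base so rotations at the
    extended length stay within the range seen during pretraining."""
    new_base = base * scale ** (d / (d - 2))
    return new_base ** (-np.arange(0, d, 2) / d)

def scaled_attention_logits(q, k, temperature: float = 1.2):
    """Scale attention logits by a (tuned) temperature before the softmax."""
    return temperature * (q @ k.T) / np.sqrt(q.shape[-1])
```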
- Length Extrapolation of Transformers: A Survey from the Perspective of Positional Encoding [40.98734594005952]
The Transformer has taken the field of natural language processing (NLP) by storm since its birth.
Large language models (LLMs) built upon it have captured worldwide attention due to their superior abilities.
All Transformer-based models, including these powerful LLMs, suffer from a preset length limit and can hardly generalize from short training sequences to longer inference ones.
arXiv Detail & Related papers (2023-12-28T14:42:24Z)
- CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs).
CLEX extends the context window to over 4x or almost 8x training length, with no deterioration in performance.
Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z)
- Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose the Scaling Laws of RoPE-based Extrapolation to describe the relationship between extrapolation performance and the RoPE base value.
We achieve extrapolation up to a context length of 1 million tokens with only a 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z)
- PEARL: Prompting Large Language Models to Plan and Execute Actions Over Long Documents [78.27865456183397]
We propose PEARL, a prompting framework to improve reasoning over long documents.
Each stage of PEARL is implemented via zero-shot or few-shot prompting with minimal human input.
We evaluate PEARL on a challenging subset of the QuALITY dataset, which contains questions that require complex reasoning over long narrative texts.
arXiv Detail & Related papers (2023-05-23T23:06:04Z)
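As a rough illustration of such a staged prompting pipeline, the sketch below chains three zero-shot prompts; the stage names and prompt wording are our assumptions, not PEARL's released prompts.

```python
from typing import Callable

def pearl_pipeline(llm: Callable[[str], str], document: str, question: str) -> str:
    """Hypothetical plan-then-execute prompting over a long document."""
    actions = llm("List reusable actions (e.g., FIND, SUMMARIZE) for answering "
                  "questions about this document:\n" + document)
    plan = llm(f"Using only these actions:\n{actions}\n"
               f"Write a step-by-step plan to answer: {question}")
    return llm(f"Execute the plan step by step over the document and give the "
               f"final answer.\nPlan:\n{plan}\nDocument:\n{document}")
```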
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.