Base of RoPE Bounds Context Length
- URL: http://arxiv.org/abs/2405.14591v1
- Date: Thu, 23 May 2024 14:03:31 GMT
- Title: Base of RoPE Bounds Context Length
- Authors: Xin Men, Mingyu Xu, Bingning Wang, Qingyu Zhang, Hongyu Lin, Xianpei Han, Weipeng Chen
- Abstract summary: Rotary position embedding (RoPE) is a technique that encodes the position information with a rotation matrix.
In this paper, we find that LLMs may obtain only a superficial long-context ability when extension relies on the OOD theory alone.
Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Position embedding is a core component of current Large Language Models (LLMs). Rotary position embedding (RoPE), a technique that encodes the position information with a rotation matrix, has been the de facto choice for position embedding in many LLMs, such as the Llama series. RoPE has further been utilized to extend long-context capability, roughly by adjusting the *base* parameter of RoPE to mitigate out-of-distribution (OOD) problems in position embedding. However, in this paper, we find that LLMs may obtain a superficial long-context ability based on the OOD theory. We revisit the role of RoPE in LLMs and propose a novel property of long-term decay; we derive that the *base of RoPE bounds context length*: there is an absolute lower bound on the base value required to obtain a certain context-length capability. Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
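The bound claimed in the abstract can be illustrated with a short sketch (hypothetical code, not the authors' implementation; function names and the crude wavelength criterion are assumptions). RoPE rotates each 2-D slice of the query/key vectors at angular frequency theta_i = base^(-2i/d), so the slowest-rotating pair has wavelength 2*pi*base^((d-2)/d) tokens; requiring that wavelength to cover the target context length gives an approximate lower bound on the base.

```python
import math

def rope_wavelengths(base: float, dim: int):
    """Wavelength (in tokens) of each 2-D rotary pair: 2*pi / theta_i,
    with theta_i = base ** (-2 * i / dim)."""
    return [2 * math.pi * base ** (2 * i / dim) for i in range(dim // 2)]

def min_base_for_context(context_len: int, dim: int) -> float:
    """Smallest base whose slowest-rotating pair spans at least
    `context_len` tokens (a crude proxy for the paper's lower bound)."""
    # Solve 2*pi * base**((dim-2)/dim) >= context_len for base.
    return (context_len / (2 * math.pi)) ** (dim / (dim - 2))

# Llama-style head dimension 128 with the common default base 10000:
longest = max(rope_wavelengths(10000.0, 128))   # ~5.4e4 tokens
print(f"longest wavelength at base 10000: {longest:.0f} tokens")
print(f"rough min base for a 1M-token context: {min_base_for_context(1_000_000, 128):.2e}")
```

Note the bound grows superlinearly in the target context length, which matches the intuition that extending context by raising the base quickly requires very large base values.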
Related papers
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z) - HoPE: A Novel Positional Encoding Without Long-Term Decay for Enhanced Context Awareness and Extrapolation [19.42279057349193]
Positional encodings (PEs) are designed to exhibit long-term decay, based on an entrenched and long-standing inductive assumption.
We argue that long-term decay is outdated in the era of LLMs, as LLMs are now applied to tasks demanding precise retrieval of in-context information.
arXiv Detail & Related papers (2024-10-28T17:01:52Z) - Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective [35.947737679664016]
This paper offers a straightforward yet in-depth understanding of RoPE extensions from an attention perspective.
Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
arXiv Detail & Related papers (2024-06-19T07:23:33Z) - Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign)
It is a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs)
With careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z) - LongEmbed: Extending Embedding Models for Long Context Retrieval [87.60404151086715]
This paper explores context window extension of embedding models, pushing the limit to 32k without requiring additional training.
First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark.
Experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models by several folds.
arXiv Detail & Related papers (2024-04-18T11:29:23Z) - Found in the Middle: How Language Models Use Long Contexts Better via Plug-and-Play Positional Encoding [78.36702055076456]
This paper introduces Multi-scale Positional Encoding (Ms-PoE), a simple yet effective plug-and-play approach to enhance the capacity of LLMs to handle relevant information located in the middle of the context.
arXiv Detail & Related papers (2024-03-05T04:58:37Z) - Resonance RoPE: Improving Context Length Generalization of Large Language Models [37.749813693281254]
This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE)
We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios.
We present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios.
arXiv Detail & Related papers (2024-02-29T19:02:03Z) - Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose *Scaling Laws of RoPE-based Extrapolation* to describe the relationship between extrapolation performance and the base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
arXiv Detail & Related papers (2023-10-08T15:50:36Z) - RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding (RoPE) to effectively leverage the positional information.
RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)
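The RoFormer property summarized above, absolute positions encoded by rotation yet only relative offsets surviving in the attention score, can be sketched in a few lines (an illustrative toy, not the RoFormer code; vectors and dimensions are made up). Rotating q at position m and k at position n makes their dot product depend only on m − n.

```python
import math

def rope_rotate(vec, pos, base=10000.0):
    """Rotate consecutive 2-D pairs of `vec` by pos * theta_i,
    where theta_i = base ** (-i / len(vec)) for even i."""
    d = len(vec)
    out = []
    for i in range(0, d, 2):
        theta = pos * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out += [x * c - y * s, x * s + y * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q, k = [0.3, -1.2, 0.7, 0.5], [1.1, 0.4, -0.6, 0.9]
# The same offset (m - n = 3) at two different absolute positions
# yields the same attention score, since R(m)^T R(n) = R(n - m):
s1 = dot(rope_rotate(q, 5), rope_rotate(k, 2))
s2 = dot(rope_rotate(q, 103), rope_rotate(k, 100))
print(abs(s1 - s2) < 1e-9)  # True: the score depends only on m - n
```

This relative-position invariance is exactly what breaks at inference-time positions beyond the training length, motivating the base adjustments studied in the main paper above.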
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.