Scaling Laws of RoPE-based Extrapolation
- URL: http://arxiv.org/abs/2310.05209v2
- Date: Wed, 13 Mar 2024 08:14:47 GMT
- Title: Scaling Laws of RoPE-based Extrapolation
- Authors: Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, Dahua Lin
- Abstract summary: We propose Scaling Laws of RoPE-based Extrapolation to describe the relationship between extrapolation performance and the rotary base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
- Score: 103.33995311915864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The extrapolation capability of Large Language Models (LLMs) based on Rotary
Position Embedding is currently a topic of considerable interest. The
mainstream approach to addressing extrapolation with LLMs involves modifying
RoPE by replacing 10000, the rotary base of $\theta_n={10000}^{-2n/d}$ in the
original RoPE, with a larger value and providing longer fine-tuning text. In
this work, we first observe that fine-tuning a RoPE-based LLM with either a
smaller or larger base in pre-training context length could significantly
enhance its extrapolation performance. After that, we propose
\textbf{\textit{Scaling Laws of RoPE-based Extrapolation}}, a unified framework
from the periodic perspective, to describe the relationship between the
extrapolation performance and base value as well as tuning context length. In
this process, we also explain the origin of the RoPE-based extrapolation issue
by \textbf{\textit{critical dimension for extrapolation}}. Besides these
observations and analyses, we achieve extrapolation up to 1 million context
length within only 16K training length on LLaMA2 7B and 13B.
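To make the periodic view in the abstract concrete, here is a minimal NumPy sketch (not the authors' released code) of RoPE with an adjustable rotary base. It counts how many dimension pairs complete a full rotation period 2*pi/theta_n within the training context, which is one way to illustrate the critical dimension for extrapolation; the interleaved pairing, the head dimension of 128, and the 4K training length are assumptions chosen to mirror a LLaMA2-style setup.

```python
import numpy as np

def rope_inv_freq(base: float = 10000.0, dim: int = 128) -> np.ndarray:
    """Per-pair rotary frequencies theta_n = base^(-2n/d), n = 0, ..., d/2 - 1."""
    return base ** (-2.0 * np.arange(dim // 2) / dim)

def critical_dim_pairs(base: float, dim: int, train_len: int) -> int:
    """Count the dimension pairs whose full rotation period 2*pi/theta_n fits
    inside the training context; pairs beyond this never see a complete period
    during training (an illustration of the critical dimension for extrapolation)."""
    periods = 2.0 * np.pi / rope_inv_freq(base, dim)
    return int(np.sum(periods <= train_len))

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate interleaved (even, odd) feature pairs of x (seq_len, dim) by their
    position-dependent angles -- the usual RoPE transform."""
    _, dim = x.shape
    angles = positions[:, None] * rope_inv_freq(base, dim)[None, :]  # (seq_len, dim/2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out

# LLaMA2-style head dim (128) and 4K pre-training context, with a smaller,
# the default, and a larger rotary base:
for b in (500.0, 10000.0, 1_000_000.0):
    print(f"base={b:>9.0f}  pairs with a full period in 4K: "
          f"{critical_dim_pairs(b, dim=128, train_len=4096)} / 64")
```

Under these assumptions, the default base 10000 gives only about 92 of the 128 dimensions a complete period inside a 4K training window, base 500 covers all of them, and base 10^6 covers far fewer; this is the kind of base-versus-training-length trade-off the scaling laws above formalize.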
Related papers
- What is Wrong with Perplexity for Long-context Language Modeling? [71.34933096461124]
Long-context inputs are crucial for large language models (LLMs) in tasks such as extended conversations, document summarization, and many-shot in-context learning.
Perplexity (PPL) has proven unreliable for assessing long-context capabilities.
We propose LongPPL, a novel metric that focuses on key tokens by employing a long-short context contrastive method to identify them.
arXiv Detail & Related papers (2024-10-31T09:39:28Z) - Round and Round We Go! What makes Rotary Positional Encodings useful? [15.543752938828831]
We study the internals of a trained Gemma 7B model to understand how RoPE is being used at a mechanical level.
We find that Gemma learns to use RoPE to construct robust "positional" attention patterns by exploiting the highest frequencies.
We propose a modification of RoPE that fixes some highlighted issues and improves performance.
arXiv Detail & Related papers (2024-10-08T17:07:01Z) - Extending Context Window of Large Language Models from a Distributional Perspective [29.313701168816507]
We propose to optimize the context window extending task from the view of rotary angle distribution.
We present a novel extension strategy that minimizes the disturbance between rotary angle distributions.
Our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods.
arXiv Detail & Related papers (2024-10-02T12:40:11Z) - Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective [35.947737679664016]
This paper offers a straightforward yet in-depth understanding of RoPE extensions from an attention perspective.
Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
arXiv Detail & Related papers (2024-06-19T07:23:33Z) - Base of RoPE Bounds Context Length [37.11078116104313]
Rotary position embedding (RoPE) is a technique that encodes the position information with a rotation matrix.
In this paper, we find that LLMs may obtain a superficial long-context ability, based on out-of-distribution (OOD) theory.
Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
arXiv Detail & Related papers (2024-05-23T14:03:31Z) - Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign), a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs).
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable to strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z) - LongEmbed: Extending Embedding Models for Long Context Retrieval [87.60404151086715]
This paper explores context window extension of embedding models, pushing the limit to 32k without requiring additional training.
First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark.
Experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models by several folds (see the sketch after this list).
arXiv Detail & Related papers (2024-04-18T11:29:23Z) - Resonance RoPE: Improving Context Length Generalization of Large Language Models [37.749813693281254]
This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE).
We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios.
We present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios.
arXiv Detail & Related papers (2024-02-29T19:02:03Z) - CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs).
CLEX extends the context window to over 4x or almost 8x training length, with no deterioration in performance.
Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z)
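The training-free extension strategies mentioned in the LongEmbed entry above can be pictured with plain position interpolation, sketched below; this is a generic illustration under assumed names, not that paper's exact recipe: the positions of an over-length input are linearly rescaled so every rotary angle stays within the range seen during training.

```python
import numpy as np

def interpolated_rope_angles(seq_len: int, train_len: int,
                             base: float = 10000.0, dim: int = 128) -> np.ndarray:
    """Training-free position interpolation: linearly rescale the positions of an
    over-length input so the rotary angles stay within the range seen during
    training, then form the per-pair RoPE angles."""
    inv_freq = base ** (-2.0 * np.arange(dim // 2) / dim)
    positions = np.arange(seq_len, dtype=np.float64)
    if seq_len > train_len:
        positions *= train_len / seq_len            # e.g. 16K positions squeezed into [0, 4K)
    return positions[:, None] * inv_freq[None, :]   # (seq_len, dim/2), fed to cos/sin

# Running a 4K-trained model on a 16K input without any further training:
angles = interpolated_rope_angles(seq_len=16384, train_len=4096)
print(angles.shape, float(angles[:, 0].max()))      # lowest-frequency-pair angle stays below 4096
```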