Scaling Laws of RoPE-based Extrapolation
- URL: http://arxiv.org/abs/2310.05209v2
- Date: Wed, 13 Mar 2024 08:14:47 GMT
- Title: Scaling Laws of RoPE-based Extrapolation
- Authors: Xiaoran Liu, Hang Yan, Shuo Zhang, Chenxin An, Xipeng Qiu, Dahua Lin
- Abstract summary: We propose textbftextitScaling Laws of RoPE-based Extrapolation to describe the relationship between the extrapolation performance and base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
- Score: 103.33995311915864
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The extrapolation capability of Large Language Models (LLMs) based on Rotary
Position Embedding is currently a topic of considerable interest. The
mainstream approach to addressing extrapolation with LLMs involves modifying
RoPE by replacing 10000, the rotary base of $\theta_n={10000}^{-2n/d}$ in the
original RoPE, with a larger value and providing longer fine-tuning text. In
this work, we first observe that fine-tuning a RoPE-based LLM with either a
smaller or larger base in pre-training context length could significantly
enhance its extrapolation performance. After that, we propose
\textbf{\textit{Scaling Laws of RoPE-based Extrapolation}}, a unified framework
from the periodic perspective, to describe the relationship between the
extrapolation performance and base value as well as tuning context length. In
this process, we also explain the origin of the RoPE-based extrapolation issue
by \textbf{\textit{critical dimension for extrapolation}}. Besides these
observations and analyses, we achieve extrapolation up to 1 million context
length within only 16K training length on LLaMA2 7B and 13B.
Related papers
- Understanding the RoPE Extensions of Long-Context LLMs: An Attention Perspective [35.947737679664016]
This paper offers a straightforward yet in-depth understanding of RoPE extensions from an attention perspective.
Using longer continual pretraining lengths for RoPE extensions could reduce attention uncertainty and significantly enhance extrapolation.
arXiv Detail & Related papers (2024-06-19T07:23:33Z) - Base of RoPE Bounds Context Length [37.11078116104313]
Rotary position embedding (RoPE) is a technique that encodes the position information with a rotation matrix.
In this paper, we find that LLMs may obtain a superficial long-context ability based on the OOD theory.
Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
arXiv Detail & Related papers (2024-05-23T14:03:31Z) - Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign)
It is a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs)
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves it's best performance and comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z) - LongEmbed: Extending Embedding Models for Long Context Retrieval [87.60404151086715]
This paper explores context window extension of embedding models, pushing the limit to 32k without requiring additional training.
First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark.
Experiments show that training-free context window extension strategies like positionRo can effectively extend the context window of existing embedding models by several folds.
arXiv Detail & Related papers (2024-04-18T11:29:23Z) - Resonance RoPE: Improving Context Length Generalization of Large Language Models [37.749813693281254]
This paper addresses the challenge of train-short-test-long (TSTL) scenarios in Large Language Models (LLMs) equipped with Rotary Position Embedding (RoPE)
We introduce Resonance RoPE, a novel approach designed to narrow the generalization gap in TSTL scenarios.
We present PosGen, a new synthetic benchmark specifically designed for fine-grained behavior analysis in TSTL scenarios.
arXiv Detail & Related papers (2024-02-29T19:02:03Z) - Extending LLMs' Context Window with 100 Samples [42.52554295241792]
Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window.
Recent studies have sought to extend the context window by modifying rotary position embedding (RoPE)
We introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window.
arXiv Detail & Related papers (2024-01-13T07:57:01Z) - CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs)
CLEX extends the context window to over 4x or almost 8x training length, with no deterioration in performance.
Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z) - KERPLE: Kernelized Relative Positional Embedding for Length
Extrapolation [72.71398034617607]
KERPLE is a framework that generalizes relative position embedding for extrapolation by kernelizing positional differences.
The diversity of CPD kernels allows us to derive various RPEs that enable length extrapolation in a principled way.
arXiv Detail & Related papers (2022-05-20T01:25:57Z) - RoFormer: Enhanced Transformer with Rotary Position Embedding [9.01819510933327]
We propose a novel method named Rotary Position Embedding(RoPE) to effectively leverage the positional information.
RoPE encodes the absolute position with a rotation matrix and meanwhile incorporates the explicit relative position dependency in self-attention formulation.
We evaluate the enhanced transformer with rotary position embedding, also called RoFormer, on various long text classification benchmark datasets.
arXiv Detail & Related papers (2021-04-20T09:54:06Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.