LongRoPE2: Near-Lossless LLM Context Window Scaling
- URL: http://arxiv.org/abs/2502.20082v1
- Date: Thu, 27 Feb 2025 13:41:07 GMT
- Title: LongRoPE2: Near-Lossless LLM Context Window Scaling
- Authors: Ning Shang, Li Lyna Zhang, Siyuan Wang, Gaokai Zhang, Gilsinia Lopez, Fan Yang, Weizhu Chen, Mao Yang
- Abstract summary: LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; and (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences.
- Score: 46.936900701411965
- License: http://creativecommons.org/licenses/by-nc-nd/4.0/
- Abstract: LongRoPE2 is a novel approach that extends the effective context window of pre-trained large language models (LLMs) to the target length, while preserving the performance on the original shorter context window. This is achieved by three contributions: (1) a hypothesis that insufficient training in higher RoPE dimensions contributes to the persistent out-of-distribution (OOD) issues observed in existing methods; (2) an effective RoPE rescaling algorithm that adopts evolutionary search guided by "needle-driven" perplexity to address the insufficient training problem; (3) a mixed context window training approach that fine-tunes model weights to adopt rescaled RoPE for long-context sequences while preserving the short-context performance with the original RoPE. Extensive experiments on LLaMA3-8B and Phi3-mini-3.8B across various benchmarks validate the hypothesis and demonstrate the effectiveness of LongRoPE2. Remarkably, LongRoPE2 extends LLaMA3-8B to achieve a 128K effective context length while retaining over 98.5% of short-context performance, using only 10B tokens -- 80x fewer than Meta's approach, which fails to reach the target effective context length. Code will be available at https://github.com/microsoft/LongRoPE.
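The rescaling step at the heart of this family of methods can be pictured as dividing each RoPE frequency by a dimension-specific factor so that rotation angles at extended positions stay within the range seen during pre-training. The NumPy sketch below is only an illustration of that idea under assumed names (`rope_frequencies`, `rescaled_rope_angles`, `rescale_factors`); the linearly spaced factors are placeholders, not the values LongRoPE2's evolutionary search with needle-driven perplexity would actually select.

```python
# Illustrative sketch of per-dimension RoPE rescaling (not the authors' implementation).
import numpy as np

def rope_frequencies(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Standard RoPE inverse frequencies: theta_k = base ** (-2k / head_dim)."""
    i = np.arange(0, head_dim, 2)
    return base ** (-i / head_dim)

def rescaled_rope_angles(positions: np.ndarray,
                         head_dim: int,
                         rescale_factors: np.ndarray,
                         base: float = 10000.0) -> np.ndarray:
    """Rotation angles with a per-dimension rescale factor applied.

    In LongRoPE-style rescaling the factors (one per frequency pair) are found
    by search; here they are simply an input to the sketch.
    """
    inv_freq = rope_frequencies(head_dim, base) / rescale_factors
    return np.outer(positions, inv_freq)  # angles[p, k] = p * theta_k / factor_k

# Hypothetical usage: extend a 4k window to 16k. Factors near 1.0 leave the
# high-frequency (low-index) dimensions untouched; factors near the extension
# ratio slow down the less-trained low-frequency (high-index) dimensions.
head_dim = 64
orig_len, target_len = 4096, 16384
factors = np.linspace(1.0, target_len / orig_len, head_dim // 2)
angles = rescaled_rope_angles(np.arange(target_len), head_dim, factors)
print(angles.shape)  # (16384, 32)
```

In LongRoPE-style methods the lowest dimensions typically keep factors close to 1 to preserve short-context behavior, while the higher, insufficiently trained dimensions receive the largest rescaling; the mixed context window training described above then adapts the weights to the rescaled RoPE.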
Related papers
- Effective Length Extrapolation via Dimension-Wise Positional Embeddings Manipulation [60.22622442950905]
Large Language Models (LLMs) often struggle to process and generate coherent context when the number of input tokens exceeds the pre-trained length.
Recent advancements in long-context extension have significantly expanded the context window of LLMs.
We propose Dimension-Wise Positional Embeddings Manipulation (DPE) to extrapolate the context window of LLMs.
arXiv Detail & Related papers (2025-04-26T08:46:10Z)
- Why Does the Effective Context Length of LLMs Fall Short? [68.34573617977013]
In this work, we introduce ShifTed Rotray position embeddING (STRING).
STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths.
Experimental results show that STRING dramatically improves the performance of the latest large-scale models.
arXiv Detail & Related papers (2024-10-24T13:51:50Z)
- Extending Context Window of Large Language Models from a Distributional Perspective [29.313701168816507]
We propose to optimize the context window extending task from the view of rotary angle distribution.
We present a novel extension strategy that minimizes the disturbance between rotary angle distributions.
Our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods.
arXiv Detail & Related papers (2024-10-02T12:40:11Z)
- LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models [72.71150585370147]
LongRecipe is an efficient training strategy for extending the context window of large language models.
It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies.
LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and it reduces computational training resources by over 85% compared with full-sequence training.
arXiv Detail & Related papers (2024-08-31T17:19:30Z)
- Base of RoPE Bounds Context Length [37.11078116104313]
Rotary position embedding (RoPE) is a technique that encodes the position information with a rotation matrix.
In this paper, we find, based on the OOD theory, that LLMs may obtain only a superficial long-context ability.
Our work reveals the relationship between context length and RoPE base both theoretically and empirically, which may shed light on future long context training.
arXiv Detail & Related papers (2024-05-23T14:03:31Z)
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens [7.833740464264734]
Current extended context windows are limited to around 128k tokens.
LongRoPE extends the context window of pre-trained LLMs to an impressive 2048k tokens.
arXiv Detail & Related papers (2024-02-21T12:30:33Z)
- E^2-LLM: Efficient and Extreme Length Extension of Large Language Models [74.1254067728251]
We propose an Efficient and Extreme length extension method for Large Language Models, called E^2-LLM, with only one training procedure and dramatically reduced cost.
Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E^2-LLM on challenging long-context tasks.
arXiv Detail & Related papers (2024-01-13T02:11:20Z)
- PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training [91.99700930388998]
We propose Positional Skip-wisE training (PoSE), which simulates long inputs using a fixed training context window.
PoSE greatly reduces memory and time overhead compared with full-length fine-tuning.
We have successfully extended the LLaMA model to 128k tokens using a 2k training context window.
arXiv Detail & Related papers (2023-09-19T08:03:38Z)
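To make the positional skip-wise idea in the last entry concrete, the following is a minimal sketch of manipulating position ids so that a short training window covers relative distances drawn from a much longer target window. The two-chunk split, the uniform random skip, and the name `skip_wise_position_ids` are assumptions for illustration, not PoSE's exact chunking or sampling scheme.

```python
# Hypothetical two-chunk variant of positional skip-wise position ids.
import random

def skip_wise_position_ids(train_len: int, target_len: int) -> list[int]:
    """Position ids for a train_len-token window that span a target_len context:
    the window is split into two chunks and a random skip is inserted between them."""
    assert target_len >= train_len
    chunk = train_len // 2
    skip = random.randint(0, target_len - train_len)
    first = list(range(chunk))                            # ids 0 .. chunk-1
    second = list(range(chunk + skip, skip + train_len))  # shifted second chunk
    return first + second

# Hypothetical usage: a 2k training window exposed to relative distances
# drawn from a 128k target context.
ids = skip_wise_position_ids(2048, 131072)
print(len(ids), ids[1023], ids[1024])  # 2048, 1023, 1024 + random skip
```

Because RoPE attention depends only on relative positions, training on such skipped ids exposes the model to long-range rotations without ever materializing a full-length batch, which is where the memory and time savings come from.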
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.