Extending Context Window of Large Language Models from a Distributional Perspective
- URL: http://arxiv.org/abs/2410.01490v2
- Date: Thu, 3 Oct 2024 05:43:30 GMT
- Title: Extending Context Window of Large Language Models from a Distributional Perspective
- Authors: Yingsheng Wu, Yuxuan Gu, Xiaocheng Feng, Weihong Zhong, Dongliang Xu, Qing Yang, Hongtao Liu, Bing Qin
- Abstract summary: We propose to optimize the context window extension task from the perspective of rotary angle distribution.
We present a novel extension strategy that minimizes the disturbance between rotary angle distributions.
Our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods.
- Score: 29.313701168816507
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Scaling the rotary position embedding (RoPE) has become a common method for extending the context window of RoPE-based large language models (LLMs). However, existing scaling methods often rely on empirical approaches and lack a profound understanding of the internal distribution within RoPE, resulting in suboptimal performance in extending the context window length. In this paper, we propose to optimize the context window extension task from the perspective of rotary angle distribution. Specifically, we first estimate the distribution of the rotary angles within the model and analyze the extent to which length extension perturbs this distribution. Then, we present a novel extension strategy that minimizes the disturbance between rotary angle distributions to maintain consistency with the pre-training phase, enhancing the model's capability to generalize to longer sequences. Experimental results against strong baseline methods demonstrate that our approach reduces the distributional disturbance by up to 72% when extending LLaMA2's context window to 8k, and by up to 32% when extending to 16k. On the LongBench-E benchmark, our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods. Furthermore, our method maintains the model's performance on the Hugging Face Open LLM benchmark after context window extension, with only an average performance fluctuation ranging from -0.12 to +0.22.
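As an illustration of the pipeline the abstract outlines, the sketch below estimates per-dimension histograms of RoPE rotary angles at the pre-training length and then picks, from a handful of candidate scaling factors, the one that least disturbs those histograms at the extended length. The total-variation disturbance measure and the candidate set are assumptions made for illustration only; this is not the authors' released implementation.

```python
# Minimal sketch: choose the RoPE scaling that least disturbs the trained
# rotary-angle distribution. Disturbance metric and candidates are assumptions.
import numpy as np

def rotary_angles(num_positions, dim=128, base=10000.0, scale=1.0):
    """Rotary angles theta_{m,j} = (m / scale) * base^(-2j/dim), folded into [0, 2*pi)."""
    positions = np.arange(num_positions)[:, None] / scale       # (L, 1)
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)             # (dim/2,)
    return (positions * inv_freq) % (2 * np.pi)                  # (L, dim/2)

def angle_histograms(angles, bins=64):
    """Per-dimension normalized histograms of the rotary angles."""
    hists = [np.histogram(angles[:, j], bins=bins, range=(0, 2 * np.pi), density=True)[0]
             for j in range(angles.shape[1])]
    return np.stack(hists) * (2 * np.pi / bins)                  # each row sums to ~1

def disturbance(train_len, target_len, scale, **kw):
    """Mean total-variation distance between pre-training and extended angle distributions."""
    p = angle_histograms(rotary_angles(train_len, **kw))
    q = angle_histograms(rotary_angles(target_len, scale=scale, **kw))
    return 0.5 * np.abs(p - q).sum(axis=1).mean()

if __name__ == "__main__":
    train_len, target_len = 4096, 8192
    candidates = [1.0, 1.5, 2.0, 4.0]        # hypothetical scaling factors to compare
    scores = {s: disturbance(train_len, target_len, s) for s in candidates}
    for s, d in sorted(scores.items()):
        print(f"scale={s:<4} disturbance={d:.4f}")
    print("least-disturbing scale:", min(scores, key=scores.get))
```

In this toy setup, scale = target_len / train_len corresponds to plain position interpolation, so the search simply compares it against milder scalings under the same disturbance measure.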
Related papers
- Why Does the Effective Context Length of LLMs Fall Short? [68.34573617977013]
In this work, we introduce ShifTed Rotary position embeddING (STRING).
STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths.
Experimental results show that STRING dramatically improves the performance of the latest large-scale models.
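As a rough illustration of the general idea only, the toy below folds relative distances that exceed the pre-trained range back onto well-trained (smaller) offsets, while a local window keeps exact distances. The window size, shift rule, and mapping are assumptions for illustration and not STRING's actual construction.

```python
# Hypothetical sketch of "reuse well-trained positions for distant tokens".
import numpy as np

def shifted_relative_positions(seq_len, trained_len, local_window=128):
    """Relative-position matrix in which distances beyond the well-trained range are
    folded back into it, except inside a local window that keeps exact offsets."""
    q = np.arange(seq_len)[:, None]
    k = np.arange(seq_len)[None, :]
    rel = q - k                                     # causal relative distances
    shift = max(seq_len - trained_len, 0)           # how far inference exceeds training
    folded = np.where(rel > local_window, np.maximum(rel - shift, local_window + 1), rel)
    return np.where(rel >= 0, folded, rel)          # future (masked) entries left untouched

print(shifted_relative_positions(seq_len=12, trained_len=8, local_window=2))
# no shifted distance exceeds the well-trained range of 8
```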
arXiv Detail & Related papers (2024-10-24T13:51:50Z)
- Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning on Large-Language Models.
We learn the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model.
Our method runs in 2.7 hours with around 35GB of memory for the 13B models on a single A100 GPU.
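For intuition, here is a toy, forward-pass-only sketch of learning Bernoulli pruning masks with a REINFORCE-style policy-gradient estimator; the toy linear model, sparsity penalty, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
# Toy sketch: learn probabilistic pruning masks from forward-pass losses only.
import numpy as np

rng = np.random.default_rng(0)
d, n = 16, 256
W = rng.normal(size=d)                       # "pretrained" weights of a toy linear model
X = rng.normal(size=(n, d))
y = X[:, :4] @ W[:4]                         # only the first 4 features actually matter

def loss(mask):
    """Forward pass only: task loss of the pruned model plus a sparsity penalty."""
    pred = X @ (W * mask)
    return np.mean((pred - y) ** 2) + 0.05 * mask.sum()

logits = np.zeros(d)                         # keep-probabilities, in logit space
lr, num_samples = 0.5, 16
for step in range(300):
    p = 1.0 / (1.0 + np.exp(-logits))
    losses, masks = [], []
    for _ in range(num_samples):
        m = (rng.random(d) < p).astype(float)        # sample a binary pruning mask
        losses.append(loss(m)); masks.append(m)
    baseline = np.mean(losses)                       # variance-reduction baseline
    # REINFORCE: grad_logits log Bernoulli(m; p) = m - p
    grads = sum((l - baseline) * (m - p) for l, m in zip(losses, masks)) / num_samples
    logits -= lr * grads                             # descend the expected pruned loss

print("keep probabilities:", np.round(1.0 / (1.0 + np.exp(-logits)), 2))
```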
arXiv Detail & Related papers (2024-06-15T09:31:03Z)
- Exploring Context Window of Large Language Models via Decomposed Positional Vectors [107.19556541244654]
Transformer-based large language models (LLMs) typically have a limited context window.
In this study, we explore the positional information within and beyond the context window.
arXiv Detail & Related papers (2024-05-28T09:50:46Z)
- LongEmbed: Extending Embedding Models for Long Context Retrieval [87.60404151086715]
This paper explores context window extension of embedding models, pushing the limit to 32k without requiring additional training.
First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark.
Experiments show that training-free context window extension strategies such as position interpolation can effectively extend the context window of existing embedding models by several folds.
arXiv Detail & Related papers (2024-04-18T11:29:23Z)
- Extending LLMs' Context Window with 100 Samples [42.52554295241792]
Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window.
Recent studies have sought to extend the context window by modifying rotary position embedding (RoPE).
We introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window.
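The sketch below shows the two ingredients this summary names, using common heuristics as stand-ins: an NTK-style increase of the RoPE base for the longer window and a log-length scaling of the attention logits. Neither formula, nor the example lengths, is claimed to be this paper's exact recipe.

```python
# Illustrative stand-ins for "adjust RoPE's base" and "scale attention logits".
import math
import numpy as np

def adjusted_rope_frequencies(dim, train_len, target_len, base=10000.0):
    """Raise the RoPE base for the longer window (NTK-style heuristic exponent)."""
    factor = target_len / train_len
    new_base = base * factor ** (dim / (dim - 2))
    return new_base ** (-np.arange(0, dim, 2) / dim)

def scaled_attention_logits(q, k, train_len, seq_len):
    """Scale q.k logits by a log-length factor, one common choice for keeping
    attention entropy closer to its value at the pre-training length."""
    scale = max(1.0, math.log(seq_len) / math.log(train_len))
    d = q.shape[-1]
    return scale * (q @ k.T) / math.sqrt(d)

if __name__ == "__main__":
    inv_freq = adjusted_rope_frequencies(dim=128, train_len=4096, target_len=16384)
    q = np.random.default_rng(0).normal(size=(4, 64))
    k = np.random.default_rng(1).normal(size=(8, 64))
    logits = scaled_attention_logits(q, k, train_len=4096, seq_len=16384)
    print(inv_freq[:4], logits.shape)
```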
arXiv Detail & Related papers (2024-01-13T07:57:01Z)
- CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs).
CLEX extends the context window to over 4x, or almost 8x, the training length with no deterioration in performance.
Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z)
- Scaling Laws of RoPE-based Extrapolation [103.33995311915864]
We propose Scaling Laws of RoPE-based Extrapolation to describe the relationship between the extrapolation performance and base value.
We achieve extrapolation up to 1 million context length within only 16K training length on LLaMA2 7B and 13B.
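For context, the toy computation below only reproduces the standard RoPE wavelength formula and shows how the base value controls how many dimension pairs complete a full rotation within the training length; the precise law relating this quantity to extrapolation is the paper's contribution and is not reproduced here. The base values shown are arbitrary examples.

```python
# The RoPE base sets each dimension pair's rotation period (wavelength).
import numpy as np

def rope_wavelengths(dim=128, base=10000.0):
    """Wavelength of dimension pair j: 2*pi * base^(2j/dim) tokens per full rotation."""
    j = np.arange(0, dim, 2)
    return 2 * np.pi * base ** (j / dim)

train_len = 16_000
for base in (500.0, 10_000.0, 1_000_000.0):
    wl = rope_wavelengths(base=base)
    covered = int((wl <= train_len).sum())
    print(f"base={base:>9.0f}: {covered}/{len(wl)} dimension pairs complete a "
          f"full period within {train_len} tokens")
```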
arXiv Detail & Related papers (2023-10-08T15:50:36Z)
- Learning to Reach Goals via Diffusion [16.344212996721346]
We present a novel perspective on goal-conditioned reinforcement learning by framing it within the context of denoising diffusion models.
We then learn a goal-conditioned policy to reverse these deviations, analogous to the score function.
This approach, which we call Merlin, can reach specified goals from arbitrary initial states without learning a separate value function.
arXiv Detail & Related papers (2023-10-04T00:47:02Z)
- Extending Context Window of Large Language Models via Positional Interpolation [26.076599895589098]
We present Position Interpolation that extends the context window sizes of RoPE-based pretrained LLMs to up to 32768 with minimal fine-tuning (within 1000 steps).
We demonstrate strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization from LLaMA 7B to 65B.
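Position Interpolation itself is simple to sketch: position indices are rescaled by train_len / target_len before the rotary angles are computed, so all positions fall within the pre-trained range (the paper then applies the brief fine-tuning noted above). The snippet below illustrates the rescaling; the lengths and dimensions are arbitrary examples.

```python
# Minimal sketch of Position Interpolation's position rescaling.
import numpy as np

def rope_angles(positions, dim=128, base=10000.0):
    """Standard RoPE angles: position times base^(-2j/dim) for each dimension pair j."""
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return positions[:, None] * inv_freq[None, :]

def interpolated_positions(target_len, train_len):
    """Position Interpolation: map indices 0..target_len-1 linearly into [0, train_len)."""
    return np.arange(target_len) * (train_len / target_len)

train_len, target_len = 2048, 8192
scaled = interpolated_positions(target_len, train_len)
assert scaled.max() < train_len      # every rescaled index stays below the trained length
angles = rope_angles(scaled)         # so rotary angles stay within the pre-training range
print(scaled[:4], angles.shape)
```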
arXiv Detail & Related papers (2023-06-27T16:26:26Z)