Extending Context Window of Large Language Models via Positional
Interpolation
- URL: http://arxiv.org/abs/2306.15595v2
- Date: Wed, 28 Jun 2023 04:26:05 GMT
- Title: Extending Context Window of Large Language Models via Positional
Interpolation
- Authors: Shouyuan Chen, Sherman Wong, Liangjian Chen, Yuandong Tian
- Abstract summary: We present Position Interpolation, which extends the context window sizes of RoPE-based pretrained LLMs to up to 32768 tokens with minimal fine-tuning (within 1000 steps).
We demonstrate strong empirical results on various tasks that require long context, including passkey retrieval, language modeling, and long document summarization, with LLaMA models from 7B to 65B.
- Score: 26.076599895589098
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: We present Position Interpolation (PI) that extends the context window sizes
of RoPE-based pretrained LLMs such as LLaMA models to up to 32768 with minimal
fine-tuning (within 1000 steps), while demonstrating strong empirical results
on various tasks that require long context, including passkey retrieval,
language modeling, and long document summarization from LLaMA 7B to 65B.
Meanwhile, models extended via Position Interpolation preserve quality
relatively well on tasks within their original context window. To achieve this
goal, Position Interpolation linearly down-scales the input position indices to
match the original context window size, rather than extrapolating beyond the
trained context length which may lead to catastrophically high attention scores
that completely ruin the self-attention mechanism. Our theoretical study shows
that the upper bound of interpolation is at least $\sim 600 \times$ smaller
than that of extrapolation, further demonstrating its stability. Models
extended via Position Interpolation retain their original architecture and can
reuse most pre-existing optimization and infrastructure.
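The mechanism is simple enough to sketch. Below is a minimal, hedged illustration of how Position Interpolation rescales RoPE position indices, assuming a LLaMA-style rotary embedding with a 2048-token pre-training window extended to 32768 tokens; the function names and constants are illustrative, not the authors' reference implementation.

```python
# Minimal sketch of Position Interpolation (PI) on RoPE position indices.
# Assumes a LLaMA-style rotary embedding; names are illustrative only.
import torch

def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
    """Rotary angles m * theta_i for every position m and dimension pair i."""
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    return torch.outer(positions.float(), inv_freq)  # shape: (seq_len, dim // 2)

def pi_positions(seq_len: int, original_ctx: int = 2048, extended_ctx: int = 32768) -> torch.Tensor:
    """Linearly down-scale position indices so the extended window maps back
    into [0, original_ctx), the range of positions seen during pre-training."""
    scale = original_ctx / extended_ctx  # e.g. 2048 / 32768 = 1/16
    return torch.arange(seq_len) * scale

# Extrapolation would feed raw positions 0..32767 into rope_angles; PI instead feeds
# fractional positions 0, 1/16, 2/16, ..., so every rotary angle stays in the trained range.
angles = rope_angles(pi_positions(32768), dim=128)
```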
Related papers
- Two are better than one: Context window extension with multi-grained self-injection [111.1376461868317]
SharedLLM is a novel approach grounded in the design philosophy of multi-grained context compression and query-aware information retrieval.
We introduce a specialized tree-style data structure to efficiently encode, store and retrieve multi-grained contextual information for text chunks.
arXiv Detail & Related papers (2024-10-25T06:08:59Z) - Extending Context Window of Large Language Models from a Distributional Perspective [29.313701168816507]
We propose to optimize the context window extending task from the view of rotary angle distribution.
We present a novel extension strategy that minimizes the disturbance between rotary angle distributions.
Our method achieves an average improvement of up to 4.33% over existing state-of-the-art methods.
arXiv Detail & Related papers (2024-10-02T12:40:11Z) - Exploring Context Window of Large Language Models via Decomposed Positional Vectors [107.19556541244654]
Transformer-based large language models (LLMs) typically have a limited context window.
In this study, we explore the positional information within and beyond the context window.
arXiv Detail & Related papers (2024-05-28T09:50:46Z) - Long Context Alignment with Short Instructions and Synthesized Positions [56.1267385315404]
This paper introduces Step-Skipping Alignment (SkipAlign)
It is a new technique designed to enhance the long-context capabilities of Large Language Models (LLMs)
With a careful selection of the base model and alignment datasets, SkipAlign with only 6B parameters achieves its best performance, comparable with strong baselines like GPT-3.5-Turbo-16K on LongBench.
arXiv Detail & Related papers (2024-05-07T01:56:22Z) - LongEmbed: Extending Embedding Models for Long Context Retrieval [87.60404151086715]
This paper explores context window extension of embedding models, pushing the limit to 32k without requiring additional training.
First, we examine the performance of current embedding models for long context retrieval on our newly constructed LongEmbed benchmark.
Experiments show that training-free context window extension strategies like position interpolation can effectively extend the context window of existing embedding models by several folds.
arXiv Detail & Related papers (2024-04-18T11:29:23Z) - LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens [7.833740464264734]
Current extended context windows are limited to around 128k tokens.
LongRoPE extends the context window of pre-trained LLMs to an impressive 2048k tokens.
arXiv Detail & Related papers (2024-02-21T12:30:33Z) - LOCOST: State-Space Models for Long Document Abstractive Summarization [76.31514220737272]
We propose LOCOST: an encoder-decoder architecture based on state-space models for conditional text generation with long context inputs.
With a computational complexity of $O(L \log L)$, this architecture can handle significantly longer sequences than state-of-the-art models that are based on sparse attention patterns.
arXiv Detail & Related papers (2024-01-31T15:33:37Z) - Extending LLMs' Context Window with 100 Samples [42.52554295241792]
Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window.
Recent studies have sought to extend the context window by modifying rotary position embedding (RoPE)
We introduce a novel extension to RoPE which combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window (see the sketch after this list).
arXiv Detail & Related papers (2024-01-13T07:57:01Z) - FoundationPose: Unified 6D Pose Estimation and Tracking of Novel Objects [55.77542145604758]
FoundationPose is a unified foundation model for 6D object pose estimation and tracking.
Our approach can be instantly applied at test-time to a novel object without fine-tuning.
arXiv Detail & Related papers (2023-12-13T18:28:09Z) - Local Context Attention for Salient Object Segmentation [5.542044768017415]
We propose a novel Local Context Attention Network (LCANet) to generate locally reinforced feature maps in a uniform representational architecture.
The proposed network introduces an Attentional Correlation Filter (ACF) module to generate explicit local attention by calculating the correlation feature map between coarse prediction and global context.
Comprehensive experiments are conducted on several salient object segmentation datasets, demonstrating the superior performance of the proposed LCANet against the state-of-the-art methods.
arXiv Detail & Related papers (2020-09-24T09:20:06Z)
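For the RoPE base-frequency and logit-scaling approach above ("Extending LLMs' Context Window with 100 Samples"), a rough sketch of the two knobs follows; the specific base value, the log-length scaling rule, and all function names are assumptions for illustration rather than the paper's exact recipe.

```python
# Hedged sketch of two context-extension knobs: a larger RoPE base frequency
# and scaled attention logits. Constants and names are illustrative assumptions.
import math
import torch

def rope_inv_freq(dim: int, base: float) -> torch.Tensor:
    """Inverse frequencies used by rotary position embedding."""
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

# Knob 1: enlarge the RoPE base so rotary angles grow more slowly with position,
# letting longer sequences reuse angle ranges the model already saw in training.
inv_freq_pretrained = rope_inv_freq(dim=128, base=10_000.0)
inv_freq_extended = rope_inv_freq(dim=128, base=500_000.0)  # illustrative larger base

# Knob 2: scale the attention logits so attention entropy stays stable as the
# context grows; a log-length factor is one common choice (an assumption here).
def scaled_attention_logits(q: torch.Tensor, k: torch.Tensor, train_len: int = 4096) -> torch.Tensor:
    d, seq_len = q.size(-1), q.size(-2)
    scale = max(1.0, math.log(seq_len) / math.log(train_len))
    return (q * scale) @ k.transpose(-2, -1) / math.sqrt(d)
```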