PoSE: Efficient Context Window Extension of LLMs via Positional
Skip-wise Training
- URL: http://arxiv.org/abs/2309.10400v3
- Date: Wed, 21 Feb 2024 13:37:07 GMT
- Title: PoSE: Efficient Context Window Extension of LLMs via Positional
Skip-wise Training
- Authors: Dawei Zhu and Nan Yang and Liang Wang and Yifan Song and Wenhao Wu and
Furu Wei and Sujian Li
- Abstract summary: We propose Positional Skip-wisE (PoSE) training, which simulates long inputs using a fixed context window.
PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning.
We have successfully extended the LLaMA model to 128k tokens using a 2k training context window.
- Score: 91.99700930388998
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Large Language Models (LLMs) are trained with a pre-defined context length,
restricting their use in scenarios requiring long inputs. Previous efforts to
adapt LLMs to a longer length usually require fine-tuning at the target
length (Full-length fine-tuning), which incurs an intensive training cost. To
decouple train length from target length for efficient context window
extension, we propose Positional Skip-wisE (PoSE) training that smartly
simulates long inputs using a fixed context window. This is achieved by first
dividing the original context window into several chunks, then designing
distinct skipping bias terms to manipulate the position indices of each chunk.
These bias terms and the lengths of each chunk are altered for every training
example, allowing the model to adapt to all positions within target length.
Experimental results show that PoSE greatly reduces memory and time overhead
compared with Full-length fine-tuning, with minimal impact on performance.
Leveraging this advantage, we have successfully extended the LLaMA model to
128k tokens using a 2k training context window. Furthermore, we empirically
confirm that PoSE is compatible with all RoPE-based LLMs and position
interpolation strategies. Notably, our method can potentially support infinite
length, limited only by memory usage during inference. With ongoing progress in
efficient inference, we believe PoSE can further scale the context window
beyond 128k.
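To make the chunk-and-skip idea concrete, the sketch below samples manipulated position indices for a single training example: the fixed training window is split into randomly sized chunks, and each chunk receives a skipping bias so the indices jointly land anywhere within the target window. This is a minimal illustration under assumed uniform sampling choices, not the authors' released implementation; the function name pose_position_ids is hypothetical.

```python
import random

def pose_position_ids(train_len: int, target_len: int, num_chunks: int = 2):
    """Sample skip-wise position indices for one training example.

    Minimal sketch: partition the fixed training window into `num_chunks`
    randomly sized chunks, then offset each chunk by a randomly drawn
    skipping bias so the indices jointly cover the target window.
    The uniform sampling choices here are illustrative assumptions.
    """
    # Randomly partition the training window into chunk lengths summing to train_len.
    cuts = sorted(random.sample(range(1, train_len), num_chunks - 1))
    bounds = [0] + cuts + [train_len]
    chunk_lens = [b - a for a, b in zip(bounds, bounds[1:])]

    position_ids = []
    prev_end = 0                      # next admissible start index
    remaining = train_len
    for length in chunk_lens:
        remaining -= length
        # Largest start that still leaves room for the remaining chunks
        # inside the target window.
        max_start = target_len - remaining - length
        start = random.randint(prev_end, max_start)   # skipping bias applied here
        position_ids.extend(range(start, start + length))
        prev_end = start + length
    return position_ids

# Example: simulate a 128k-token target window with a 2k training window.
ids = pose_position_ids(train_len=2048, target_len=128 * 1024)
assert len(ids) == 2048 and max(ids) < 128 * 1024
```

Because a fresh set of chunk lengths and biases is drawn for every example, the model eventually sees relative distances spanning the full target window while each forward pass still processes only train_len tokens.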
Related papers
- Why Does the Effective Context Length of LLMs Fall Short? [68.34573617977013]
In this work, we introduce ShifTed Rotary position embeddING (STRING).
STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths.
Experimental results show that STRING dramatically improves the performance of the latest large-scale models.
arXiv Detail & Related papers (2024-10-24T13:51:50Z)
- LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models [72.71150585370147]
LongRecipe is an efficient training strategy for extending the context window of large language models.
It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies.
LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full-sequence training.
arXiv Detail & Related papers (2024-08-31T17:19:30Z)
- LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens [7.833740464264734]
Current extended context windows are limited to around 128k tokens.
LongRoPE extends the context window of pre-trained LLMs to an impressive 2048k tokens.
arXiv Detail & Related papers (2024-02-21T12:30:33Z)
- InfLLM: Training-Free Long-Context Extrapolation for LLMs with an Efficient Context Memory [93.20588235940453]
In this paper, we introduce a training-free memory-based method, InfLLM.
InfLLM stores distant contexts in additional memory units and employs an efficient mechanism to look up token-relevant units for attention.
Even when the sequence length is scaled to 1,024K, InfLLM still effectively captures long-distance dependencies.
arXiv Detail & Related papers (2024-02-07T06:50:42Z)
- Extending LLMs' Context Window with 100 Samples [42.52554295241792]
Large Language Models (LLMs) are known to have limited extrapolation ability beyond their pre-trained context window.
Recent studies have sought to extend the context window by modifying rotary position embedding (RoPE).
We introduce a novel extension to RoPE that combines adjusting RoPE's base frequency and scaling the attention logits to help LLMs efficiently adapt to a larger context window (see the generic RoPE sketch after this list).
arXiv Detail & Related papers (2024-01-13T07:57:01Z)
- E^2-LLM: Efficient and Extreme Length Extension of Large Language Models [74.1254067728251]
We propose an Efficient and Extreme length extension method for Large Language Models, called E^2-LLM, with only one training procedure and dramatically reduced cost.
Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E^2-LLM on challenging long-context tasks.
arXiv Detail & Related papers (2024-01-13T02:11:20Z)
- CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs).
CLEX extends the context window to over 4x or almost 8x the training length, with no deterioration in performance.
Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z)
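Several of the entries above modify the rotary position embedding itself, either by interpolating positions or by adjusting RoPE's base frequency (the abstract also notes PoSE's compatibility with such position interpolation strategies). The sketch below shows both knobs on a generic RoPE angle computation; the defaults and scale factors are illustrative assumptions, not values taken from any of the listed papers.

```python
import torch

def rope_angles(positions: torch.Tensor, dim: int = 128,
                base: float = 10000.0, pos_scale: float = 1.0) -> torch.Tensor:
    """Rotary angles theta[p, i] = (p / pos_scale) * base**(-2i / dim).

    Two common context-extension knobs, shown generically:
      * pos_scale > 1  -> linear position interpolation (squeeze positions
        back into the pre-trained range);
      * larger `base`  -> base-frequency adjustment (slower rotation of
        high-frequency dimensions).
    Defaults and scale factors are illustrative, not paper-specific.
    """
    inv_freq = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    return torch.outer(positions.float() / pos_scale, inv_freq)

pos = torch.arange(8192)
orig   = rope_angles(pos)                      # unmodified RoPE
interp = rope_angles(pos, pos_scale=4.0)       # 4x linear position interpolation
rebase = rope_angles(pos, base=10000.0 * 16)   # base-frequency ("NTK-style") scaling
```

Either variant keeps the rotary angles seen at long distances within the range the model was pre-trained on, which is why such adjustments pair naturally with training schemes like PoSE.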
This list is automatically generated from the titles and abstracts of the papers on this site.