CLEX: Continuous Length Extrapolation for Large Language Models
- URL: http://arxiv.org/abs/2310.16450v3
- Date: Sun, 24 Mar 2024 17:14:11 GMT
- Title: CLEX: Continuous Length Extrapolation for Large Language Models
- Authors: Guanzheng Chen, Xin Li, Zaiqiao Meng, Shangsong Liang, Lidong Bing
- Abstract summary: We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs).
CLEX extends the context window to over 4x or almost 8x the training length, with no deterioration in performance.
Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
- Score: 68.43814043853347
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Transformer-based Large Language Models (LLMs) are pioneering advances in many natural language processing tasks; however, their exceptional capabilities are restricted within the preset context window of the Transformer. Position Embedding (PE) scaling methods, while effective in extending the context window to a specific length, demonstrate either notable limitations in their extrapolation abilities or sacrifice partial performance within the context window. Length extrapolation methods, although theoretically capable of extending the context window beyond the training sequence length, often underperform in practical long-context applications. To address these challenges, we propose Continuous Length EXtrapolation (CLEX) for LLMs. We generalise the PE scaling approaches to model the continuous dynamics by ordinary differential equations over the length scaling factor, thereby overcoming the constraints of current PE scaling methods designed for specific lengths. Moreover, by extending the dynamics to desired context lengths beyond the training sequence length, CLEX facilitates length extrapolation with impressive performance in practical tasks. We demonstrate that CLEX can be seamlessly incorporated into LLMs equipped with Rotary Position Embedding, such as LLaMA and GPT-NeoX, with negligible impact on training and inference latency. Experimental results reveal that CLEX can effectively extend the context window to over 4x or almost 8x the training length, with no deterioration in performance. Furthermore, when evaluated on the practical LongBench benchmark, our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k. Our code is available at https://github.com/DAMO-NLP-SG/CLEX.
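The core mechanism described above treats the RoPE frequency scaling as continuous dynamics over the length scaling factor, rather than fixing it to a single target length. The snippet below is a minimal, hypothetical PyTorch sketch of that idea, not the authors' implementation (see the linked repository for that): the small `dynamics` network standing in for the learned ODE, the Euler integration, and all names and hyperparameters are illustrative assumptions.

```python
# Hypothetical sketch: per-dimension RoPE frequency scaling modelled as
# continuous dynamics over the length scaling factor t (CLEX-style idea).
import torch
import torch.nn as nn

class ContinuousRoPEScaling(nn.Module):
    def __init__(self, head_dim: int, base: float = 10000.0, hidden: int = 32):
        super().__init__()
        # standard RoPE inverse frequencies for the even dimensions
        self.register_buffer(
            "inv_freq",
            1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim)),
        )
        # stand-in for the learned dynamics d(log_scale)/dt
        n = self.inv_freq.numel()
        self.dynamics = nn.Sequential(
            nn.Linear(n + 1, hidden), nn.SiLU(), nn.Linear(hidden, n)
        )

    def scaled_inv_freq(self, t: float, steps: int = 16) -> torch.Tensor:
        """Integrate from scale 1.0 up to scale t with a simple Euler solver."""
        log_scale = torch.zeros_like(self.inv_freq)  # no scaling at t = 1
        dt = (t - 1.0) / steps
        s = 1.0
        for _ in range(steps):
            inp = torch.cat([log_scale, torch.tensor([s])])
            log_scale = log_scale + dt * self.dynamics(inp)
            s += dt
        return self.inv_freq * torch.exp(log_scale)

# usage: frequencies for a context 4x longer than the training length
rope = ContinuousRoPEScaling(head_dim=128)
print(rope.scaled_inv_freq(t=4.0).shape)  # torch.Size([64])
```

Because the scaling factor is continuous, the same module can be queried at scales never seen during training, which is the extrapolation behaviour the abstract refers to.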
Related papers
- Why Does the Effective Context Length of LLMs Fall Short? [68.34573617977013]
In this work, we introduce ShifTed Rotray position embeddING (STRING).
STRING shifts well-trained positions to overwrite the original ineffective positions during inference, enhancing performance within their existing training lengths.
Experimental results show that STRING dramatically improves the performance of the latest large-scale models.
arXiv Detail & Related papers (2024-10-24T13:51:50Z)
- LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models [72.71150585370147]
LongRecipe is an efficient training strategy for extending the context window of large language models.
It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies.
LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full-sequence training.
arXiv Detail & Related papers (2024-08-31T17:19:30Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- E^2-LLM: Efficient and Extreme Length Extension of Large Language Models [74.1254067728251]
We propose an Efficient and Extreme length extension method for Large Language Models, called E^2-LLM, with only one training procedure and dramatically reduced cost.
Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E^2-LLM on challenging long-context tasks.
arXiv Detail & Related papers (2024-01-13T02:11:20Z)
- PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training [91.99700930388998]
We propose Positional Skip-wisE (PoSE) training, which simulates long inputs using a fixed context window; a rough sketch of the idea follows this entry.
PoSE greatly reduces memory and time overhead compared with full-length fine-tuning.
We have successfully extended the LLaMA model to 128k tokens using a 2k training context window.
arXiv Detail & Related papers (2023-09-19T08:03:38Z)
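The PoSE entry above describes simulating long inputs inside a fixed training window. Below is a rough, assumption-laden sketch of such skip-wise position ids: the training chunk is split into contiguous pieces and each piece's position ids are offset by a random skip so that relative positions cover the longer target context. The chunking and sampling details here are illustrative, not the paper's exact recipe.

```python
# Illustrative sketch of skip-wise position ids (PoSE-style idea, details assumed).
import random

def skipwise_position_ids(train_len: int, target_len: int, num_chunks: int = 2) -> list[int]:
    """Position ids for a train_len-token example that emulate a target_len context."""
    # split the fixed training window into contiguous chunks
    bounds = sorted(random.sample(range(1, train_len), num_chunks - 1))
    starts, ends = [0] + bounds, bounds + [train_len]
    # slack we may skip over without exceeding the target window
    slack = target_len - train_len
    # non-decreasing random offsets; the first chunk stays unshifted
    offsets = [0] + sorted(random.randint(0, slack) for _ in range(num_chunks - 1))
    position_ids: list[int] = []
    for (s, e), off in zip(zip(starts, ends), offsets):
        position_ids.extend(range(s + off, e + off))
    return position_ids

# example: a 2k-token training chunk whose position ids span a 16k window
ids = skipwise_position_ids(train_len=2048, target_len=16384)
assert len(ids) == 2048 and max(ids) < 16384
```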
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the information presented and is not responsible for any consequences of its use.