E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
- URL: http://arxiv.org/abs/2401.06951v3
- Date: Thu, 22 Feb 2024 12:49:10 GMT
- Title: E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
- Authors: Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge
Zhang, Jiakai Wang, Haoran Que, Yukang Chen, Wenbo Su, Tiezheng Ge, Jie Fu,
Wenhu Chen, Bo Zheng
- Abstract summary: We propose an Efficient and Extreme length extension method for Large Language Models, called E^2-LLM, with only one training procedure and dramatically reduced cost.
Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E^2-LLM on challenging long-context tasks.
- Score: 74.1254067728251
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Typically, training LLMs with long context sizes is computationally
expensive, requiring extensive training hours and GPU resources. Existing
long-context extension methods usually need additional training procedures to
support the corresponding long-context windows, where long-context training
data (e.g., 32k) is needed and high GPU training costs are incurred. To address
these issues, we propose an Efficient and Extreme length extension method for
Large Language Models, called E^2-LLM, which requires only one training
procedure, dramatically reduces the computation cost, and removes the need to
collect long-context data. Concretely, first, the training data for E^2-LLM
only needs to be of short length (e.g., 4k), which greatly reduces the tuning
cost. Second, the training procedure on the short context window is performed
only once, and different evaluation context windows can be supported at
inference. Third, in E^2-LLM, based on RoPE position embeddings, we introduce
two augmentation methods that vary the scale and position index parameters
across training samples. This aims to make the model more robust to the
different relative position distances encountered when directly interpolating
to an arbitrary context length at inference. Comprehensive experimental results
on multiple benchmark datasets demonstrate the effectiveness of E^2-LLM on
challenging long-context tasks.
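To make the third point of the abstract concrete, here is a minimal Python sketch of how RoPE position interpolation with per-sample scale and position-index augmentation could look. The sampling ranges and the helper names (`rope_angles`, `apply_rope`, `augmented_positions`) are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Rotary angles for (possibly interpolated) position indices.

    A scale > 1 compresses the position indices (position interpolation),
    which lets a short training window mimic a longer evaluation window.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Interpolated positions: positions / scale, as in RoPE position interpolation.
    return torch.outer(positions.float() / scale, inv_freq)  # (seq_len, head_dim // 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features x of shape (seq_len, head_dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def augmented_positions(train_len: int, target_len: int) -> tuple[torch.Tensor, float]:
    """Per-sample augmentation (assumed form): draw a RoPE scale and a position
    offset so that short (e.g., 4k) samples cover many relative-distance regimes."""
    scale = float(torch.empty(1).uniform_(1.0, target_len / train_len))
    max_offset = max(int(scale * train_len) - train_len, 0)
    offset = int(torch.randint(0, max_offset + 1, (1,)))
    positions = torch.arange(train_len) + offset
    return positions, scale
```

At inference, the augmentation would be dropped and `scale` simply fixed to the ratio between the desired evaluation window and the original RoPE window, which is the "direct interpolation" the abstract refers to.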
Related papers
- How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among models of similar size at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models [13.091271774417867]
Long-context modeling capabilities are important for large language models (LLMs) in various applications.
We propose a data mining framework, ProLong, that assigns each training sample a long-dependency score.
Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies.
arXiv Detail & Related papers (2024-05-28T07:36:56Z)
- Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum [30.46329559544246]
We introduce dataset decomposition, a novel variable sequence length training technique.
We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach.
Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
arXiv Detail & Related papers (2024-05-21T22:26:01Z)
- CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs).
CLEX extends the context window to over 4x or almost 8x the training length with no deterioration in performance.
Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training [91.99700930388998]
We propose Positional Skip-wisE training that simulates long inputs using a fixed context window (a sketch of this idea appears after the list).
PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning.
We have successfully extended the LLaMA model to 128k tokens using a 2k training context window.
arXiv Detail & Related papers (2023-09-19T08:03:38Z)
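The PoSE summary above only states that long inputs are simulated within a fixed window, so the following Python sketch shows one plausible reading of that skip-wise idea: the short training window is split into chunks, and random skips are inserted between their position indices. The function name, chunking, and sampling scheme are illustrative assumptions rather than the PoSE authors' actual recipe.

```python
import random

def skipwise_position_ids(train_len: int, target_len: int, num_chunks: int = 2) -> list[int]:
    """Assumed sketch of skip-wise position ids: split the fixed training window
    into chunks and insert a random positional skip before each later chunk, so
    relative distances up to roughly target_len are observed without ever
    attending over genuinely long text."""
    # Random chunk boundaries inside the fixed training window.
    bounds = sorted(random.sample(range(1, train_len), num_chunks - 1))
    chunk_lens = [b - a for a, b in zip([0] + bounds, bounds + [train_len])]

    # Distribute the "extra" position budget among the gaps between chunks.
    budget = target_len - train_len
    gap_cuts = sorted(random.randint(0, budget) for _ in range(num_chunks - 1))
    gaps = [b - a for a, b in zip([0] + gap_cuts[:-1], gap_cuts)]

    position_ids, next_pos = [], 0
    for i, length in enumerate(chunk_lens):
        if i > 0:
            next_pos += gaps[i - 1]  # skip ahead in position space only
        position_ids.extend(range(next_pos, next_pos + length))
        next_pos += length
    return position_ids
```

For instance, `skipwise_position_ids(2048, 131072)` returns 2048 position ids whose largest relative distance can approach 128k, matching the 2k-window / 128k-token setting quoted in the entry.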