E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
- URL: http://arxiv.org/abs/2401.06951v3
- Date: Thu, 22 Feb 2024 12:49:10 GMT
- Title: E^2-LLM: Efficient and Extreme Length Extension of Large Language Models
- Authors: Jiaheng Liu, Zhiqi Bai, Yuanxing Zhang, Chenchen Zhang, Yu Zhang, Ge
Zhang, Jiakai Wang, Haoran Que, Yukang Chen, Wenbo Su, Tiezheng Ge, Jie Fu,
Wenhu Chen, Bo Zheng
- Abstract summary: We propose an Efficient and Extreme length extension method for Large Language Models, called E^2-LLM, with only one training procedure and dramatically reduced cost.
Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E^2-LLM on challenging long-context tasks.
- Score: 74.1254067728251
- License: http://creativecommons.org/publicdomain/zero/1.0/
- Abstract: Typically, training LLMs with long context sizes is computationally
expensive, requiring extensive training hours and GPU resources. Existing
long-context extension methods usually need additional training procedures to
support the corresponding long-context windows, where long-context training
data (e.g., 32k) is needed and high GPU training costs are incurred. To address
these issues, we propose an Efficient and Extreme length extension method for
Large Language Models, called E^2-LLM, which requires only one training
procedure, dramatically reduces the computation cost, and removes the need to
collect long-context data. Concretely, first, the training data for E^2-LLM
only needs to be of short length (e.g., 4k), which greatly reduces the tuning
cost. Second, the training procedure on the short context window is performed
only once, and different evaluation context windows can be supported at
inference. Third, in E^2-LLM, based on RoPE position embeddings, we introduce
two augmentation methods that vary the scale and position index parameters
across training samples. This aims to make the model more robust to the
different relative position distances encountered when directly interpolating
to an arbitrary context length at inference. Comprehensive experimental results
on multiple benchmark datasets demonstrate the effectiveness of E^2-LLM on
challenging long-context tasks.
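To make the third point of the abstract concrete, here is a minimal Python sketch of how RoPE position interpolation with per-sample scale and position-index augmentation could look. The sampling ranges and the helper names (`rope_angles`, `apply_rope`, `augmented_positions`) are illustrative assumptions based on the abstract, not the authors' released implementation.

```python
import torch

def rope_angles(positions: torch.Tensor, head_dim: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Rotary angles for (possibly interpolated) position indices.

    A scale > 1 compresses the position indices (position interpolation),
    which lets a short training window mimic a longer evaluation window.
    """
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    # Interpolated positions: positions / scale, as in RoPE position interpolation.
    return torch.outer(positions.float() / scale, inv_freq)  # (seq_len, head_dim // 2)

def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
    """Rotate query/key features x of shape (seq_len, head_dim) by the given angles."""
    x1, x2 = x[..., 0::2], x[..., 1::2]
    cos, sin = angles.cos(), angles.sin()
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def augmented_positions(train_len: int, target_len: int) -> tuple[torch.Tensor, float]:
    """Per-sample augmentation (assumed form): draw a RoPE scale and a position
    offset so that short (e.g., 4k) samples cover many relative-distance regimes."""
    scale = float(torch.empty(1).uniform_(1.0, target_len / train_len))
    max_offset = max(int(scale * train_len) - train_len, 0)
    offset = int(torch.randint(0, max_offset + 1, (1,)))
    positions = torch.arange(train_len) + offset
    return positions, scale
```

At inference, the augmentation would be dropped and `scale` simply fixed to the ratio between the desired evaluation window and the original RoPE window, which is the "direct interpolation" the abstract refers to.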
Related papers
- How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information.
ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among models of similar size at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
- LongSkywork: A Training Recipe for Efficiently Extending Context Length in Large Language Models [61.12177317970258]
LongSkywork is a long-context Large Language Model capable of processing up to 200,000 tokens.
We develop two novel methods for creating synthetic data.
LongSkywork achieves outstanding performance on a variety of long-context benchmarks.
arXiv Detail & Related papers (2024-06-02T03:34:41Z)
- Long Context is Not Long at All: A Prospector of Long-Dependency Data for Large Language Models [13.091271774417867]
Long-context modeling capabilities are important for large language models (LLMs) in various applications.
We propose a data mining framework, ProLong, that assigns each training sample a long-dependency score.
Comprehensive experiments on multiple benchmarks indicate that ProLong effectively identifies documents that carry long dependencies.
arXiv Detail & Related papers (2024-05-28T07:36:56Z)
- Dataset Decomposition: Faster LLM Training with Variable Sequence Length Curriculum [30.46329559544246]
We introduce dataset decomposition, a novel variable sequence length training technique.
We train an 8k context-length 1B model at the same cost as a 2k context-length model trained with the baseline approach.
Experiments on a web-scale corpus demonstrate that our approach significantly enhances performance on standard language evaluations and long-context benchmarks.
arXiv Detail & Related papers (2024-05-21T22:26:01Z)
- CLEX: Continuous Length Extrapolation for Large Language Models [68.43814043853347]
We propose Continuous Length EXtrapolation (CLEX) for Large Language Models (LLMs).
CLEX extends the context window to over 4x or almost 8x the training length with no deterioration in performance.
Our model trained on a 4k length exhibits competitive performance against state-of-the-art open-source models trained on context lengths up to 32k.
arXiv Detail & Related papers (2023-10-25T08:13:02Z)
- Effective Long-Context Scaling of Foundation Models [90.57254298730923]
We present a series of long-context LLMs that support effective context windows of up to 32,768 tokens.
Our models achieve consistent improvements on most regular tasks and significant improvements on long-context tasks over Llama 2.
arXiv Detail & Related papers (2023-09-27T21:41:49Z)
- PoSE: Efficient Context Window Extension of LLMs via Positional Skip-wise Training [91.99700930388998]
We propose Positional Skip-wisE training that simulates long inputs using a fixed context window (a sketch of this idea appears after the list).
PoSE greatly reduces memory and time overhead compared with Full-length fine-tuning.
We have successfully extended the LLaMA model to 128k tokens using a 2k training context window.
arXiv Detail & Related papers (2023-09-19T08:03:38Z)
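The PoSE summary above only states that long inputs are simulated within a fixed window, so the following Python sketch shows one plausible reading of that skip-wise idea: the short training window is split into chunks, and random skips are inserted between their position indices. The function name, chunking, and sampling scheme are illustrative assumptions rather than the PoSE authors' actual recipe.

```python
import random

def skipwise_position_ids(train_len: int, target_len: int, num_chunks: int = 2) -> list[int]:
    """Assumed sketch of skip-wise position ids: split the fixed training window
    into chunks and insert a random positional skip before each later chunk, so
    relative distances up to roughly target_len are observed without ever
    attending over genuinely long text."""
    # Random chunk boundaries inside the fixed training window.
    bounds = sorted(random.sample(range(1, train_len), num_chunks - 1))
    chunk_lens = [b - a for a, b in zip([0] + bounds, bounds + [train_len])]

    # Distribute the "extra" position budget among the gaps between chunks.
    budget = target_len - train_len
    gap_cuts = sorted(random.randint(0, budget) for _ in range(num_chunks - 1))
    gaps = [b - a for a, b in zip([0] + gap_cuts[:-1], gap_cuts)]

    position_ids, next_pos = [], 0
    for i, length in enumerate(chunk_lens):
        if i > 0:
            next_pos += gaps[i - 1]  # skip ahead in position space only
        position_ids.extend(range(next_pos, next_pos + length))
        next_pos += length
    return position_ids
```

For instance, `skipwise_position_ids(2048, 131072)` returns 2048 position ids whose largest relative distance can approach 128k, matching the 2k-window / 128k-token setting quoted in the entry.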