Training Long-Context LLMs Efficiently via Chunk-wise Optimization
- URL: http://arxiv.org/abs/2505.16710v1
- Date: Thu, 22 May 2025 14:11:34 GMT
- Title: Training Long-Context LLMs Efficiently via Chunk-wise Optimization
- Authors: Wenhao Li, Yuxin Zhang, Gen Luo, Daohai Yu, Rongrong Ji
- Abstract summary: We present Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. We also introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer.
- Score: 60.05884946552877
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: While long-context large language models (LLMs) exhibit remarkable document processing capabilities, their prohibitively high training costs often hinder customized applications. To mitigate this issue, we propose Sequential Chunk-wise Optimization (SeCO), a memory-efficient training paradigm that partitions lengthy inputs into manageable chunks. Each chunk independently constructs its computational graph and performs localized backpropagation, ensuring that only one chunk's forward activations are stored in memory. Building on SeCO, we further introduce Sparse Chunk-wise Optimization (SpaCO), which reduces computational overhead by selectively propagating gradients to specific chunks and incorporates a carefully designed compensation factor to ensure unbiased gradient estimation. SpaCO decouples the computational cost of backpropagation from the context length, enabling training time to gradually converge to inference time as sequences become longer. Implemented as lightweight training wrappers, both SeCO and SpaCO offer substantial practical benefits. For example, when fine-tuning an 8B model with LoRA on a single RTX 3090 GPU, SeCO expands the maximum sequence length from 1K to 16K tokens, while SpaCO demonstrates accelerated training speed, achieving up to 3x faster training than SeCO under the same experimental setup. These innovations provide new insights into optimizing long-context models, making them more accessible for practical applications. We have open-sourced the code at https://github.com/wenhaoli-xmu/seco.
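The abstract describes the mechanism only at a high level, so below is a minimal sketch of what chunk-wise training in the spirit of SeCO/SpaCO can look like. It is an illustration under stated assumptions, not the authors' implementation (which lives at https://github.com/wenhaoli-xmu/seco): it assumes a HuggingFace-style causal LM that accepts a legacy tuple-of-tuples `past_key_values`; the names `chunkwise_step`, `detach_cache`, `chunk_len`, and `keep_ratio` are invented for this sketch; and cross-chunk gradient flow is simply cut by detaching the cache, which is a simplification of SeCO's localized backpropagation.

```python
# Hypothetical sketch of chunk-wise optimization; not the authors' code.
import random

import torch
import torch.nn.functional as F


def detach_cache(past_key_values):
    """Detach cached key/value tensors so earlier chunks' graphs can be freed.

    Assumes the legacy tuple-of-tuples KV-cache layout; newer library versions
    wrap the cache in an object and would need a different helper.
    """
    return tuple(tuple(t.detach() for t in layer) for layer in past_key_values)


def chunkwise_step(model, optimizer, input_ids, chunk_len=512, keep_ratio=1.0):
    """One optimizer step over a long sequence, processed chunk by chunk.

    keep_ratio = 1.0 -> every chunk is back-propagated (SeCO-like).
    keep_ratio < 1.0 -> only a random subset of chunks is back-propagated,
    scaled by 1/keep_ratio so the gradient stays unbiased in expectation
    (SpaCO-like).
    """
    optimizer.zero_grad()
    past = None
    chunks = torch.split(input_ids, chunk_len, dim=1)

    for chunk in chunks:
        out = model(input_ids=chunk, past_key_values=past, use_cache=True)

        # Next-token loss within the chunk (chunk-boundary tokens are
        # ignored here for brevity).
        logits = out.logits[:, :-1, :]
        labels = chunk[:, 1:]
        loss = F.cross_entropy(
            logits.reshape(-1, logits.size(-1)), labels.reshape(-1)
        )

        if random.random() < keep_ratio:
            # Localized backpropagation: only this chunk's activations are
            # alive, so peak memory is bounded by a single chunk.
            (loss / (keep_ratio * len(chunks))).backward()

        # Detach the cache so this chunk's graph can be released before the
        # next chunk is processed.
        past = detach_cache(out.past_key_values)

    optimizer.step()
```

With `keep_ratio=1.0` every chunk is back-propagated, mirroring SeCO's memory saving; with, say, `keep_ratio=0.25` only about a quarter of the chunks are back-propagated while their gradients are scaled by 4, so the estimate stays unbiased in expectation, which is the rough intuition behind SpaCO's compensation factor described above.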
Related papers
- InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs). We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
arXiv Detail & Related papers (2025-02-13T02:52:01Z)
- Adjoint sharding for very long context training of state space models [7.723642550918118]
Adjoint sharding is a technique that shards the gradient calculation during training to reduce memory requirements by orders of magnitude. We show the proposed adjoint sharding algorithm reduces memory usage by up to 3x with a 1.27B parameter large language model on 1M context length training. This allows increasing the maximum context length during training or fine-tuning of a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances.
arXiv Detail & Related papers (2025-01-01T01:10:59Z)
- LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models [72.71150585370147]
LongRecipe is an efficient training strategy for extending the context window of large language models.
It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies.
LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full-sequence training.
arXiv Detail & Related papers (2024-08-31T17:19:30Z)
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handles long contexts, for example extending the context window of Llama2-7b-chat from 4K to 100K tokens with only 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens. Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
- Optimizing Large Model Training through Overlapped Activation Recomputation [24.28543166026873]
We present Lynx, a new recomputation framework to reduce overhead by overlapping recomputation with communication in training pipelines. Our comprehensive evaluation using GPT models with 1.3B-23B parameters shows that Lynx outperforms existing recomputation approaches by up to 1.37x.
arXiv Detail & Related papers (2024-06-13T02:31:36Z)
- LoCoCo: Dropping In Convolutions for Long Context Compression [77.26610232994508]
This paper presents a novel approach, Dropping In Convolutions for Long Context Compression (LoCoCo).
LoCoCo employs only a fixed-size Key-Value (KV) cache and can enhance efficiency in both inference and fine-tuning stages.
arXiv Detail & Related papers (2024-06-08T01:35:11Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of pre-trained large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks and merging them hierarchically; a toy sketch of the idea follows this entry.
A token reduction technique precedes each merging step, keeping memory usage efficient.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
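Because the HOMER entry above describes its mechanism in only a couple of sentences, here is the promised toy, model-agnostic sketch of the divide-and-conquer idea: split a long sequence into chunks, prune each chunk's least salient tokens, and merge neighbouring chunks pairwise until a single sequence remains. The function names, the norm-based saliency score, and the embedding-level setting are all placeholders of mine; HOMER itself operates on transformer hidden states layer by layer and differs in detail.

```python
# Toy illustration of chunk-then-merge with token reduction; not HOMER's code.
import torch


def reduce_tokens(chunk: torch.Tensor, keep: int) -> torch.Tensor:
    """Keep the `keep` highest-scoring tokens of a (seq, dim) chunk,
    preserving their original order. Norm-based scoring is a stand-in for
    whatever saliency measure a real method would use."""
    if chunk.size(0) <= keep:
        return chunk
    scores = chunk.norm(dim=-1)
    idx = scores.topk(keep).indices.sort().values
    return chunk[idx]


def hierarchical_merge(x: torch.Tensor, chunk_len: int = 256,
                       keep_per_chunk: int = 128) -> torch.Tensor:
    """Hierarchically merge a (seq, dim) sequence: prune each chunk, then
    join neighbouring chunks pairwise, repeating until one chunk is left."""
    chunks = list(torch.split(x, chunk_len, dim=0))
    while len(chunks) > 1:
        chunks = [reduce_tokens(c, keep_per_chunk) for c in chunks]
        chunks = [torch.cat(chunks[i:i + 2], dim=0)
                  for i in range(0, len(chunks), 2)]
    return chunks[0]
```

With the defaults above, the merged output is at most about 2 * keep_per_chunk tokens long regardless of the input length, which is the sense in which chunk-then-merge schemes keep memory bounded.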
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of this information and is not responsible for any consequences arising from its use.