Related papers: A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts

URL: http://arxiv.org/abs/2410.01485v1
Date: Wed, 2 Oct 2024 12:35:53 GMT
Title: A Little Goes a Long Way: Efficient Long Context Training and Inference with Partial Contexts
Authors: Suyu Ge, Xihui Lin, Yunan Zhang, Jiawei Han, Hao Peng
Abstract summary: LongGen finetunes a pretrained LLM into an efficient architecture during length extension. LongGen achieves 1.55x training speedup and reduces wall-clock time by 36%, compared to a full-attention baseline. During inference, LongGen reduces KV cache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding speedup.
Score: 38.867323730365406
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Training and serving long-context large language models (LLMs) incurs substantial overhead. To address this, two critical steps are often required: a pretrained LLM typically undergoes a separate stage for context length extension by training on long-context data, followed by architectural modifications to reduce the overhead of KV cache during serving. This paper argues that integrating length extension with a GPU-friendly KV cache reduction architecture not only reduces training overhead during length extension, but also achieves better long-context performance. This leads to our proposed LongGen, which finetunes a pretrained LLM into an efficient architecture during length extension. LongGen builds on three key insights: (1) Sparse attention patterns, such as window attention (attending to recent tokens), attention sink (initial ones), and blockwise sparse attention (strided token blocks) are well-suited for building efficient long-context models, primarily due to their GPU-friendly memory access patterns, enabling efficiency gains not just theoretically but in practice as well. (2) It is essential for the model to have direct access to all tokens. A hybrid architecture with 1/3 full attention layers and 2/3 efficient ones achieves a balanced trade-off between efficiency and long-context performance. (3) Lightweight training on 5B long-context data is sufficient to extend the hybrid model's context length from 4K to 128K. We evaluate LongGen on both Llama-2 7B and Llama-2 70B, demonstrating its effectiveness across different scales. During training with 128K-long contexts, LongGen achieves 1.55x training speedup and reduces wall-clock time by 36%, compared to a full-attention baseline. During inference, LongGen reduces KV cache memory by 62%, achieving 1.67x prefilling speedup and 1.41x decoding speedup.
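
The three sparse patterns named in the abstract (window attention, attention sink, and blockwise sparse attention over strided token blocks) can be illustrated with a small mask-building sketch. This is a minimal illustration, not the LongGen implementation: the function names, the parameters (`window`, `n_sink`, `block`, `stride`), and the rule that every `stride`-th key block stays visible are all assumptions made for the example. The second helper gives a rough estimate of the KV cache kept by a hybrid stack, assuming full-attention layers cache every token and efficient layers cache only a fixed share.

```python
import numpy as np


def sparse_causal_mask(seq_len, window, n_sink, block, stride):
    """Return a boolean mask where mask[q, k] is True if query q may attend to key k.

    Hypothetical combination of three patterns (all parameters are assumptions):
      - window attention: the most recent `window` tokens
      - attention sink:   the first `n_sink` tokens
      - blockwise sparse: every `stride`-th block of `block` consecutive keys
    """
    q = np.arange(seq_len)[:, None]  # query positions, shape (seq_len, 1)
    k = np.arange(seq_len)[None, :]  # key positions, shape (1, seq_len)

    causal = k <= q                                 # never attend to future tokens
    in_window = (q - k) < window                    # recent tokens
    is_sink = k < n_sink                            # initial tokens
    in_strided_block = (k // block) % stride == 0   # strided key blocks

    return causal & (in_window | is_sink | in_strided_block)


def kv_fraction(full_share, kept_share):
    """Rough share of the full KV cache kept by a hybrid stack.

    Assumes `full_share` of the layers cache all tokens and the remaining
    layers cache only `kept_share` of them (an illustrative simplification).
    """
    return full_share + (1.0 - full_share) * kept_share


mask = sparse_causal_mask(seq_len=16, window=4, n_sink=2, block=4, stride=2)
hybrid_share = kv_fraction(full_share=1 / 3, kept_share=0.1)
```

Because all three patterns depend only on token positions, the mask is static and can be built once per sequence length, which is in line with the abstract's point about GPU-friendly memory access. The `kept_share` value used here is illustrative and is not taken from the paper.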

Related papers

From 128K to 4M: Efficient Training of Ultra-Long Context Large Language Models [54.44375226381814]
Long-context capabilities are essential for a wide range of applications, including document and video understanding, in-context learning, and inference-time scaling. We introduce an efficient training recipe for building ultra-long context LLMs from aligned instruct models, pushing the boundaries of context lengths from 128K to 1M, 2M, and 4M tokens. Our approach achieves state-of-the-art performance across a diverse set of long-context benchmarks.
arXiv Detail & Related papers (2025-04-08T16:58:58Z)
LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention [26.54297116028556]
Large language models (LLMs) have shown remarkable potential in processing long sequences and complex reasoning tasks. We introduce LServe, an efficient system that accelerates long-sequence LLM serving via hybrid sparse attention. On average, LServe accelerates LLM prefilling by up to 2.9x and decoding by 1.3-2.1x over vLLM.
arXiv Detail & Related papers (2025-02-20T18:59:52Z)
ParallelComp: Parallel Long-Context Compressor for Length Extrapolation [51.68913021512016]
ParallelComp is a training-free method for long-context extrapolation. It extends context length from 4K to 128K while maintaining high throughput and preserving perplexity. Our analysis offers new insights into attention biases in parallel attention mechanisms.
arXiv Detail & Related papers (2025-02-20T07:10:43Z)
InfiniteHiP: Extending Language Model Context Up to 3 Million Tokens on a Single GPU [48.105361428245736]
We introduce InfiniteHiP, an inference framework for large language models (LLMs). We dynamically eliminate irrelevant context tokens through a modular hierarchical token pruning algorithm. Our framework achieves an 18.95x speedup in attention decoding for a 1 million token context without requiring additional training.
arXiv Detail & Related papers (2025-02-13T02:52:01Z)
Adjoint sharding for very long context training of state space models [7.723642550918118]
Adjoint sharding is a technique that shards the gradient calculation during training to reduce memory requirements by orders of magnitude. We show the proposed adjoint sharding algorithm reduces memory usage by up to 3X with a 1.27B parameter large language model on 1M context length training. This allows increasing the maximum context length during training or fine-tuning of a 1.27B parameter model from 35K tokens to above 100K tokens on a training infrastructure composed of five AWS P4 instances.
arXiv Detail & Related papers (2025-01-01T01:10:59Z)
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads [22.462489968597]
Caching all Key and Value states across all attention heads consumes substantial memory. We introduce DuoAttention, a framework that only applies a full KV cache to retrieval heads while using a light-weight, constant-length KV cache for streaming heads. Our method significantly reduces long-context inference memory by up to 2.55x for MHA and 1.67x for GQA models.
arXiv Detail & Related papers (2024-10-14T17:59:58Z)
How to Train Long-Context Language Models (Effectively) [75.5418485597276]
We study continued training and supervised fine-tuning (SFT) of a language model (LM) to make effective use of long-context information. ProLong-8B, which is initialized from Llama-3 and trained on 40B tokens, demonstrates state-of-the-art long-context performance among similarly sized models at a length of 128K.
arXiv Detail & Related papers (2024-10-03T16:46:52Z)
LongRecipe: Recipe for Efficient Long Context Generalization in Large Language Models [72.71150585370147]
LongRecipe is an efficient training strategy for extending the context window of large language models. It simulates long-sequence inputs while maintaining training efficiency and significantly improves the model's understanding of long-range dependencies. LongRecipe can utilize long sequences while requiring only 30% of the target context window size, and reduces computational training resources by over 85% compared to full sequence training.
arXiv Detail & Related papers (2024-08-31T17:19:30Z)
Training-Free Long-Context Scaling of Large Language Models [114.53296002607993]
We propose Dual Chunk Attention, which enables Llama2 70B to support context windows of more than 100k tokens without continual training. By decomposing the attention for long sequences into chunk-based modules, DCA manages to effectively capture the relative positional information of tokens.
arXiv Detail & Related papers (2024-02-27T12:39:23Z)
E^2-LLM: Efficient and Extreme Length Extension of Large Language Models [74.1254067728251]
We propose an Efficient and Extreme length extension method for Large Language Models, called E^2-LLM, with only one training procedure and dramatically reduced cost. Comprehensive experimental results on multiple benchmark datasets demonstrate the effectiveness of our E^2-LLM on challenging long-context tasks.
arXiv Detail & Related papers (2024-01-13T02:11:20Z)
LongLoRA: Efficient Fine-tuning of Long-Context Large Language Models [67.58275666573496]
LongLoRA is an efficient fine-tuning approach that extends the context sizes of pre-trained large language models. We demonstrate strong empirical results on various tasks on Llama2 models from 7B/13B to 70B.
arXiv Detail & Related papers (2023-09-21T17:59:11Z)

This list is automatically generated from the titles and abstracts of the papers on this site.