Recurrent Context Compression: Efficiently Expanding the Context Window of LLM
- URL: http://arxiv.org/abs/2406.06110v1
- Date: Mon, 10 Jun 2024 08:50:59 GMT
- Title: Recurrent Context Compression: Efficiently Expanding the Context Window of LLM
- Authors: Chensen Huang, Guibo Zhu, Xuepeng Wang, Yifei Luo, Guojing Ge, Haoran Chen, Dong Yi, Jinqiao Wang
- Abstract summary: This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of Transformer-based large language models (LLMs).
We validated our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M.
- Score: 22.595457889113668
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To extend the context length of Transformer-based large language models (LLMs) and improve comprehension capabilities, we often face limitations due to computational resources and bounded memory storage capacity. This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of LLMs within constrained storage space. We also investigate the issue of poor model responses when both instructions and context are compressed in downstream tasks, and propose an instruction reconstruction method to mitigate this problem. We validated the effectiveness of our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M. Finally, our method demonstrated competitive performance in long-text question-answering tasks compared to non-compressed methods, while significantly saving storage resources in long-text inference tasks. Our code, models, and demo are available at https://github.com/WUHU-G/RCC_Transformer
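The abstract does not describe the RCC architecture in detail, but the core idea of folding a long context, chunk by chunk, into a small recurrent memory that the decoder then consumes can be sketched roughly as follows. The class name, slot count, and attention-based pooling below are illustrative assumptions, not the authors' implementation.

```python
# A minimal sketch of recurrent chunk-wise context compression, written in
# PyTorch. Names and sizes are assumptions; this does not reproduce the
# authors' RCC architecture or training recipe.
import torch
import torch.nn as nn

class RecurrentCompressor(nn.Module):
    def __init__(self, d_model=256, num_slots=8, nhead=4):
        super().__init__()
        # Learnable "memory slot" queries that summarize each chunk.
        self.slots = nn.Parameter(torch.randn(num_slots, d_model) * 0.02)
        self.attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, chunk_embeds, memory=None):
        # chunk_embeds: (batch, chunk_len, d_model) token embeddings of one chunk.
        # memory: (batch, num_slots, d_model) compressed state from earlier chunks.
        batch = chunk_embeds.size(0)
        queries = self.slots.unsqueeze(0).expand(batch, -1, -1)
        if memory is not None:
            # Prepend the running memory so information recurs across chunks.
            chunk_embeds = torch.cat([memory, chunk_embeds], dim=1)
        summary, _ = self.attn(queries, chunk_embeds, chunk_embeds)
        return self.norm(summary)  # new memory: (batch, num_slots, d_model)

def compress_long_context(token_embeds, compressor, chunk_len=512):
    # Fold a long sequence of token embeddings into a fixed-size memory.
    memory = None
    for start in range(0, token_embeds.size(1), chunk_len):
        memory = compressor(token_embeds[:, start:start + chunk_len], memory)
    return memory  # handed to the decoder LLM in place of the raw long context

# Usage: 4096 token embeddings collapse into 8 memory vectors.
x = torch.randn(1, 4096, 256)
print(compress_long_context(x, RecurrentCompressor()).shape)  # (1, 8, 256)
```

The paper additionally proposes an instruction reconstruction step so that responses do not degrade when the instruction itself is compressed; that part is omitted from this sketch.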
Related papers
- Context Embeddings for Efficient Answer Generation in RAG [10.702520553261756]
We present COCOM, an effective context compression method, reducing long contexts to only a handful of Context Embeddings.
Our method demonstrates a speed-up of up to 5.69x while achieving higher performance compared to existing efficient context compression methods.
arXiv Detail & Related papers (2024-07-12T13:30:44Z)
- KV Cache Compression, But What Must We Give in Return? A Comprehensive Benchmark of Long Context Capable Approaches [52.02764371205856]
Long context capability is a crucial competency for large language models (LLMs).
This work provides a taxonomy of current methods and evaluates more than ten state-of-the-art approaches across seven categories of long-context tasks.
arXiv Detail & Related papers (2024-07-01T17:59:47Z)
- Training-Free Exponential Context Extension via Cascading KV Cache [49.608367376911694]
We introduce a novel mechanism that leverages cascading sub-cache buffers to selectively retain the most relevant tokens.
Our method reduces prefill stage latency by a factor of 6.8 when compared to flash attention on 1M tokens.
arXiv Detail & Related papers (2024-06-24T03:59:17Z)
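The cascading sub-cache idea in the entry above lends itself to a small data-structure sketch. The toy cache below assumes recency-based eviction and a fixed thinning ratio between levels; the paper's actual policy of retaining the most relevant tokens is not reproduced here.

```python
# A toy sketch of a cascading KV cache: a chain of bounded buffers where
# evicted entries are thinned out as they cascade to deeper levels.
from collections import deque

class CascadingKVCache:
    def __init__(self, levels=3, capacity=4, keep_every=2):
        self.buffers = [deque() for _ in range(levels)]
        self.evictions = [0] * levels
        self.capacity = capacity      # max entries per sub-cache
        self.keep_every = keep_every  # 1-in-N evicted entries cascade downward

    def add(self, token_kv):
        self._push(0, token_kv)

    def _push(self, level, token_kv):
        if level >= len(self.buffers):
            return  # the oldest, most thinned-out history is dropped entirely
        buf = self.buffers[level]
        buf.append(token_kv)
        if len(buf) > self.capacity:
            evicted = buf.popleft()
            self.evictions[level] += 1
            # Deeper levels keep a sparser sample of ever-older tokens, so the
            # cache spans a long history with a small, fixed footprint.
            if self.evictions[level] % self.keep_every == 0:
                self._push(level + 1, evicted)

    def tokens(self):
        # Oldest (deepest, sparsest) entries first, most recent sub-cache last.
        return [t for buf in reversed(self.buffers) for t in buf]

cache = CascadingKVCache()
for i in range(20):
    cache.add(f"kv_{i}")
print(cache.tokens())  # a bounded, progressively thinned view of 20 tokens
```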
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the context-length limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts.
LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA.
Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
arXiv Detail & Related papers (2024-04-11T17:57:22Z)
- StreamingDialogue: Prolonged Dialogue Learning via Long Context Compression with Minimal Losses [67.92595110412094]
StreamingDialogue compresses long dialogue history into conv-attn sinks with minimal losses.
Our method outperforms strong baselines in dialogue tasks.
arXiv Detail & Related papers (2024-03-13T07:44:14Z)
- Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs.
It targets effective, efficient, and flexible compression of long contexts.
It achieves a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z)
- Extending Context Window of Large Language Models via Semantic Compression [21.35020344956721]
Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses.
We propose a novel semantic compression method that enables generalization to texts 6-8 times longer, without incurring significant computational costs or requiring fine-tuning.
arXiv Detail & Related papers (2023-12-15T07:04:33Z)
- Compressed Context Memory For Online Language Model Interaction [39.72054168889216]
This paper presents a context key/value compression method for Transformer language models in online scenarios.
As the context lengthens, the attention process demands increasing memory and computations, which in turn reduces the throughput of the language model.
We propose a compressed context memory system that continually compresses the accumulating attention key/value pairs into a compact memory space.
arXiv Detail & Related papers (2023-12-06T10:50:43Z)
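The entry above describes continually folding accumulating attention key/value pairs into a compact memory. A rough stand-in is sketched below; it compresses by average-pooling groups of KV vectors, whereas the paper trains a dedicated compression module, so the names and numbers here are placeholders.

```python
# A minimal sketch of an online compressed KV memory: fresh key/value pairs
# accumulate in a pending buffer and are periodically pooled into compact slots.
import torch

class CompressedKVMemory:
    def __init__(self, d_head=64, group_size=8, max_pending=64):
        self.group_size = group_size    # how many fresh KV pairs fold into one slot
        self.max_pending = max_pending  # compress once this many pairs accumulate
        self.mem_k = torch.empty(0, d_head)
        self.mem_v = torch.empty(0, d_head)
        self.pend_k = torch.empty(0, d_head)
        self.pend_v = torch.empty(0, d_head)

    def append(self, k, v):
        # k, v: (t, d_head) keys/values of newly processed tokens.
        self.pend_k = torch.cat([self.pend_k, k])
        self.pend_v = torch.cat([self.pend_v, v])
        if self.pend_k.size(0) >= self.max_pending:
            self._compress()

    def _compress(self):
        t, d = self.pend_k.shape
        t = (t // self.group_size) * self.group_size  # ragged tail waits for next round
        if t == 0:
            return
        # Average-pool each group of KV pairs into a single memory slot.
        self.mem_k = torch.cat([self.mem_k, self.pend_k[:t].reshape(-1, self.group_size, d).mean(dim=1)])
        self.mem_v = torch.cat([self.mem_v, self.pend_v[:t].reshape(-1, self.group_size, d).mean(dim=1)])
        self.pend_k, self.pend_v = self.pend_k[t:], self.pend_v[t:]

    def kv(self):
        # Attention reads the compact memory plus the uncompressed recent tail.
        return (torch.cat([self.mem_k, self.pend_k]),
                torch.cat([self.mem_v, self.pend_v]))

mem = CompressedKVMemory()
for _ in range(10):
    mem.append(torch.randn(16, 64), torch.randn(16, 64))
k, v = mem.kv()
print(k.shape)  # far fewer rows than the 160 token KV pairs appended
```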
- Compressing Context to Enhance Inference Efficiency of Large Language Models [26.75216730927996]
This paper proposes a method called Selective Context to enhance the inference efficiency of large language models (LLMs).
We test our approach using common data sources requiring long context processing: arXiv papers, news articles, and long conversations.
Experimental results show that Selective Context significantly reduces memory cost and decreases generation latency.
arXiv Detail & Related papers (2023-10-09T23:03:24Z)
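The Selective Context entry above prunes low-information content before inference. The toy sketch below keeps only the most informative tokens as scored by self-information; the paper computes self-information with a causal language model, while this stand-in uses a smoothed unigram estimate purely for illustration.

```python
# A toy sketch of selective-context style pruning: drop the lowest-information
# tokens, where information is approximated by unigram self-information.
import math
from collections import Counter

def self_information(tokens, counts, total):
    # -log p(token) under a unigram model with add-one smoothing.
    return [-math.log((counts[t] + 1) / (total + len(counts))) for t in tokens]

def selective_context(text, reduction=0.5):
    tokens = text.split()
    counts = Counter(tokens)
    scores = self_information(tokens, counts, len(tokens))
    keep = int(len(tokens) * (1 - reduction))
    # Keep the `keep` most surprising (informative) tokens, preserving order.
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)[:keep]
    return " ".join(tokens[i] for i in sorted(ranked))

doc = ("the model the model the model introduces a recurrent compression "
       "module that shrinks the context the model reads at inference time")
print(selective_context(doc, reduction=0.4))  # frequent filler tokens are dropped first
```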