Compressed Context Memory For Online Language Model Interaction
- URL: http://arxiv.org/abs/2312.03414v2
- Date: Tue, 6 Feb 2024 05:53:02 GMT
- Title: Compressed Context Memory For Online Language Model Interaction
- Authors: Jang-Hyun Kim, Junyoung Yeom, Sangdoo Yun, Hyun Oh Song
- Abstract summary: This paper presents a context key/value compression method for Transformer language models in online scenarios.
As the context lengthens, the attention process demands increasing memory and computations, which in turn reduces the throughput of the language model.
We propose a compressed context memory system that continually compresses the accumulating attention key/value pairs into a compact memory space.
- Score: 39.72054168889216
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper presents a context key/value compression method for Transformer
language models in online scenarios, where the context continually expands. As
the context lengthens, the attention process demands increasing memory and
computations, which in turn reduces the throughput of the language model. To
address this challenge, we propose a compressed context memory system that
continually compresses the accumulating attention key/value pairs into a
compact memory space, facilitating language model inference in a limited memory
space of computing environments. Our compression process involves integrating a
lightweight conditional LoRA into the language model's forward pass during
inference, without the need for fine-tuning the model's entire set of weights.
We achieve efficient training by modeling the recursive compression process as
a single parallelized forward computation. Through evaluations on conversation,
personalization, and multi-task learning, we demonstrate that our approach
achieves the performance level of a full context model with $5\times$ smaller
context memory size. We further demonstrate the applicability of our approach
in a streaming setting with an unlimited context length, outperforming the
sliding window approach. Code is available at
https://github.com/snu-mllab/context-memory.
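The core idea in the abstract, folding each new chunk of attention key/value pairs into a fixed-size memory instead of letting the cache grow, can be illustrated with a toy Python sketch. This is not the authors' implementation (the repository above is the reference for that). The learned conditional-LoRA compression is replaced here by plain mean pooling, the memory update is a simple running average, and the names `compress_chunk`, `CompressedMemory`, and `attend` are hypothetical, chosen only for this illustration.

```python
import numpy as np


def compress_chunk(keys, values, n_slots):
    """Placeholder compressor: mean-pool a chunk's KV pairs into n_slots pairs.

    Assumption: the paper learns this step; pooling only stands in for it.
    Requires the chunk to contain at least n_slots pairs.
    """
    k_groups = np.array_split(keys, n_slots, axis=0)
    v_groups = np.array_split(values, n_slots, axis=0)
    comp_k = np.stack([g.mean(axis=0) for g in k_groups])
    comp_v = np.stack([g.mean(axis=0) for g in v_groups])
    return comp_k, comp_v


class CompressedMemory:
    """Fixed-size KV memory updated recursively, one context chunk at a time."""

    def __init__(self, n_slots, dim):
        self.n_slots = n_slots
        self.keys = np.zeros((n_slots, dim))
        self.values = np.zeros((n_slots, dim))
        self.steps = 0

    def update(self, keys, values):
        comp_k, comp_v = compress_chunk(keys, values, self.n_slots)
        # Running average over all chunks seen so far keeps the size constant.
        w = 1.0 / (self.steps + 1)
        self.keys = (1.0 - w) * self.keys + w * comp_k
        self.values = (1.0 - w) * self.values + w * comp_v
        self.steps += 1


def attend(query, keys, values):
    """Single-head scaled dot-product attention for one query vector."""
    scores = keys @ query / np.sqrt(query.shape[-1])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ values


rng = np.random.default_rng(0)
dim, chunk_len, n_slots, n_steps = 16, 8, 2, 5

memory = CompressedMemory(n_slots, dim)
full_keys, full_values = [], []
for _ in range(n_steps):
    # Random stand-ins for the KV pairs produced by each new interaction.
    k = rng.normal(size=(chunk_len, dim))
    v = rng.normal(size=(chunk_len, dim))
    memory.update(k, v)
    full_keys.append(k)
    full_values.append(v)

full_k = np.concatenate(full_keys)
full_v = np.concatenate(full_values)

query = rng.normal(size=dim)
out_full = attend(query, full_k, full_v)
out_mem = attend(query, memory.keys, memory.values)

# The full cache grows with every step; the compressed memory does not.
print("full cache pairs:", full_k.shape[0])
print("compressed memory pairs:", memory.keys.shape[0])
```

After five chunks of eight pairs each, the uncompressed cache holds 40 key/value pairs while the memory still holds 2, so the attention cost at inference time stays constant. With an untrained pooling compressor the two attention outputs will generally differ. Per the abstract, the point of the learned compression is to close that gap.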
Related papers
- The Compressor-Retriever Architecture for Language Model OS [20.56093501980724]
This paper explores the concept of using a language model as the core component of an operating system (OS).
A key challenge in realizing such an LM OS is managing the life-long context and ensuring statefulness across sessions.
We introduce compressor-retriever, a model-agnostic architecture designed for life-long context management.
arXiv Detail & Related papers (2024-09-02T23:28:15Z)
- Recurrent Context Compression: Efficiently Expanding the Context Window of LLM [22.595457889113668]
This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of Transformer-based large language models (LLMs).
We validated our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M.
arXiv Detail & Related papers (2024-06-10T08:50:59Z)
- Layer-Condensed KV Cache for Efficient Inference of Large Language Models [44.24593677113768]
We propose a novel method that only computes and caches the KVs of a small number of layers.
Our method achieves up to 26$\times$ higher throughput than standard transformers.
arXiv Detail & Related papers (2024-05-17T08:59:46Z)
- Hierarchical Context Merging: Better Long Context Understanding for Pre-trained LLMs [61.40047491337793]
We present Hierarchical cOntext MERging (HOMER), a new training-free scheme designed to overcome the limitations of large language models.
HOMER uses a divide-and-conquer algorithm, dividing long inputs into manageable chunks.
A token reduction technique precedes each merging, ensuring memory usage efficiency.
arXiv Detail & Related papers (2024-04-16T06:34:08Z)
- Improving Retrieval Augmented Open-Domain Question-Answering with Vectorized Contexts [83.57864140378035]
This paper proposes a method to cover longer contexts in Open-Domain Question-Answering tasks.
It leverages a small encoder language model that effectively encodes contexts, and the encoding applies cross-attention with the original inputs.
After fine-tuning, there is improved performance across two held-in datasets, four held-out datasets, and also in two in-context learning settings.
arXiv Detail & Related papers (2024-04-02T15:10:11Z)
- Context Compression for Auto-regressive Transformers with Sentinel Tokens [37.07722536907739]
We propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones.
Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach.
arXiv Detail & Related papers (2023-10-12T09:18:19Z)
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
- Does compressing activations help model parallel training? [64.59298055364336]
We present the first empirical study on the effectiveness of compression methods for model parallelism.
We implement and evaluate three common classes of compression algorithms.
We evaluate these methods across more than 160 settings and 8 popular datasets.
arXiv Detail & Related papers (2023-01-06T18:58:09Z)
- Training Language Models with Memory Augmentation [28.4608705738799]
We present a novel training approach designed for training language models with memory augmentation.
Our approach uses a training objective that directly takes in-batch examples as accessible memory.
We demonstrate significant gains over previous memory-augmented approaches.
arXiv Detail & Related papers (2022-05-25T11:37:29Z)
- LaMemo: Language Modeling with Look-Ahead Memory [50.6248714811912]
We propose Look-Ahead Memory (LaMemo) that enhances the recurrence memory by incrementally attending to the right-side tokens.
LaMemo embraces bi-directional attention and segment recurrence with an additional overhead only linearly proportional to the memory length.
Experiments on widely used language modeling benchmarks demonstrate its superiority over the baselines equipped with different types of memory.
arXiv Detail & Related papers (2022-04-15T06:11:25Z)
This list is automatically generated from the titles and abstracts of the papers in this site.