Context Compression for Auto-regressive Transformers with Sentinel
Tokens
- URL: http://arxiv.org/abs/2310.08152v2
- Date: Sun, 15 Oct 2023 09:15:02 GMT
- Title: Context Compression for Auto-regressive Transformers with Sentinel
Tokens
- Authors: Siyu Ren, Qi Jia, Kenny Q. Zhu
- Abstract summary: We propose a plug-and-play approach that is able to incrementally compress the intermediate activation of a specified span of tokens into compact ones.
Experiments on both in-domain language modeling and zero-shot open-ended document generation demonstrate the advantage of our approach.
- Score: 37.07722536907739
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The quadratic complexity of the attention module makes it gradually become
the bulk of compute in Transformer-based LLMs during generation. Moreover, the
excessive key-value cache that arises when dealing with long inputs also brings
severe issues on memory footprint and inference latency. In this work, we
propose a plug-and-play approach that is able to incrementally compress the
intermediate activation of a specified span of tokens into compact ones,
thereby reducing both memory and computational cost when processing subsequent
context. Experiments on both in-domain language modeling and zero-shot
open-ended document generation demonstrate the advantage of our approach over
sparse attention baselines in terms of fluency, n-gram matching, and semantic
similarity. At last, we comprehensively profile the benefit of context
compression on improving the system throughout. Code is available at
https://github.com/DRSY/KV_Compression.
Related papers
- UIO-LLMs: Unbiased Incremental Optimization for Long-Context LLMs [111.12010207132204]
UIO-LLMs is an incremental optimization approach for memory-enhanced transformers under long-context settings.
We refine the training process using the Truncated Backpropagation Through Time (TBPTT) algorithm.
UIO-LLMs successfully handle long context, such as extending the context window of Llama2-7b-chat from 4K to 100K tokens with minimal 2% additional parameters.
arXiv Detail & Related papers (2024-06-26T08:44:36Z) - In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs)
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z) - SubGen: Token Generation in Sublinear Time and Memory [48.35076900702408]
Large language models (LLMs) have extensive memory requirements for token generation.
In this work, we focus on developing an efficient compression technique for the KV cache.
We have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $ell$ sampling on values.
Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach.
arXiv Detail & Related papers (2024-02-08T22:17:40Z) - Compressed Context Memory For Online Language Model Interaction [39.72054168889216]
This paper presents a context key/value compression method for Transformer language models in online scenarios.
As the context lengthens, the attention process demands increasing memory and computations, which in turn reduces the throughput of the language model.
We propose a compressed context memory system that continually compresses the accumulating attention key/value pairs into a compact memory space.
arXiv Detail & Related papers (2023-12-06T10:50:43Z) - Resource-Efficient Separation Transformer [14.666016177212837]
This paper explores Transformer-based speech separation with a reduced computational cost.
Our main contribution is the development of the Resource-Efficient Separation Transformer (RE-SepFormer), a self-attention-based architecture.
The RE-SepFormer reaches a competitive performance on the popular WSJ0-2Mix and WHAM! datasets in both causal and non-causal settings.
arXiv Detail & Related papers (2022-06-19T23:37:24Z) - CenterCLIP: Token Clustering for Efficient Text-Video Retrieval [67.21528544724546]
In CLIP, the essential visual tokenization process, which produces discrete visual token sequences, generates many homogeneous tokens due to the redundancy nature of consecutive frames in videos.
This significantly increases computation costs and hinders the deployment of video retrieval models in web applications.
In this paper, we design a multi-segment token clustering algorithm to find the most representative tokens and drop the non-essential ones.
arXiv Detail & Related papers (2022-05-02T12:02:09Z) - Cluster-Former: Clustering-based Sparse Transformer for Long-Range
Dependency Encoding [90.77031668988661]
Cluster-Former is a novel clustering-based sparse Transformer to perform attention across chunked sequences.
The proposed framework is pivoted on two unique types of Transformer layer: Sliding-Window Layer and Cluster-Former Layer.
Experiments show that Cluster-Former achieves state-of-the-art performance on several major QA benchmarks.
arXiv Detail & Related papers (2020-09-13T22:09:30Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.