Compressing Context to Enhance Inference Efficiency of Large Language
Models
- URL: http://arxiv.org/abs/2310.06201v1
- Date: Mon, 9 Oct 2023 23:03:24 GMT
- Title: Compressing Context to Enhance Inference Efficiency of Large Language
Models
- Authors: Yucheng Li, Bo Dong, Chenghua Lin, Frank Guerin
- Abstract summary: This paper proposes a method called Selective Context to enhance the inference efficiency of large language models (LLMs).
We test our approach using common data sources requiring long context processing: arXiv papers, news articles, and long conversations.
Experimental results show that Selective Context significantly reduces memory cost and decreases generation latency.
- Score: 26.75216730927996
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large language models (LLMs) achieved remarkable performance across various
tasks. However, they face challenges in managing long documents and extended
conversations, due to significantly increased computational requirements, both
in memory and inference time, and potential context truncation when the input
exceeds the LLM's fixed context length. This paper proposes a method called
Selective Context that enhances the inference efficiency of LLMs by identifying
and pruning redundancy in the input context to make the input more compact. We
test our approach using common data sources requiring long context processing:
arXiv papers, news articles, and long conversations, on tasks of summarisation,
question answering, and response generation. Experimental results show that
Selective Context significantly reduces memory cost and decreases generation
latency while maintaining performance comparable to that achieved when full
context is used. Specifically, we achieve a 50% reduction in context cost,
resulting in a 36% reduction in inference memory usage and a 32% reduction in
inference time, with only a minor drop of 0.023 in BERTScore and 0.038 in
faithfulness across four downstream applications, indicating that our method
strikes a good balance between efficiency and performance.
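As a rough illustration of the pruning idea (the earlier version of this work, listed below, describes the filtering as self-information based), the sketch below scores tokens with a small causal language model and drops the least informative ones. It uses GPT-2 via Hugging Face transformers as a stand-in scorer and prunes at the token level, so treat it as a simplification, not the authors' implementation.

```python
# Minimal sketch of self-information-based context pruning (the general idea
# behind Selective Context), not the authors' code. GPT-2 is a stand-in scorer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def prune_context(text: str, keep_ratio: float = 0.5) -> str:
    """Drop the lowest self-information tokens until roughly keep_ratio remain."""
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(ids).logits                       # [1, seq_len, vocab]
    # Self-information of token t given its prefix: -log p(t | t_<t).
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    targets = ids[:, 1:]
    self_info = -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)[0]
    # Always keep the first token, then the most informative of the rest.
    k = max(1, int(keep_ratio * self_info.numel()))
    keep = torch.topk(self_info, k).indices + 1
    keep = torch.cat([torch.tensor([0]), keep]).sort().values
    return tokenizer.decode(ids[0, keep])

print(prune_context("Large language models achieved remarkable performance across various tasks."))
```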
Related papers
- Squeezed Attention: Accelerating Long Context Length LLM Inference [64.11145320159126]
We propose Squeezed Attention as a mechanism to accelerate LLM applications where a large portion of the input prompt is fixed.
We use K-means clustering offline to group the keys for the fixed context based on semantic similarity and represent each cluster with a single centroid value.
We then compute exact attention using only these important keys from the fixed context, thereby reducing bandwidth and computational costs.
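A toy sketch of that recipe (not the authors' code): keys for the fixed prompt are clustered offline with K-means, and at query time only keys from the highest-scoring clusters take part in exact attention. The shapes, cluster counts, and top-k rule below are illustrative assumptions.

```python
# Hedged sketch of centroid-based key selection followed by exact attention.
import torch
from sklearn.cluster import KMeans

d, n_keys, n_clusters, top_clusters = 64, 4096, 32, 4
keys = torch.randn(n_keys, d)      # toy stand-ins for precomputed fixed-prompt keys
values = torch.randn(n_keys, d)

# Offline: group keys by similarity and keep one centroid per cluster.
km = KMeans(n_clusters=n_clusters, n_init=10).fit(keys.numpy())
centroids = torch.tensor(km.cluster_centers_, dtype=torch.float32)
labels = torch.tensor(km.labels_, dtype=torch.long)

# Online: score centroids against the query and keep keys from the best clusters.
query = torch.randn(d)
important = torch.topk(centroids @ query, top_clusters).indices
mask = torch.isin(labels, important)

# Exact attention restricted to the selected keys only.
scores = (keys[mask] @ query) / d ** 0.5
output = torch.softmax(scores, dim=0) @ values[mask]
```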
arXiv Detail & Related papers (2024-11-14T18:54:19Z)
- Reducing Distraction in Long-Context Language Models by Focused Learning [6.803882766744194]
We propose a novel training method that enhances Large Language Models' ability to discern relevant information.
During fine-tuning with long contexts, we employ a retriever to extract the most relevant segments.
We then introduce an auxiliary contrastive learning objective to explicitly ensure that outputs from the original context and the retrieved sub-context are closely aligned.
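The summary suggests a training signal that keeps full-context outputs close to outputs obtained from the retrieved sub-context. As a hedged sketch only, one simple way to express such an auxiliary alignment term is a KL penalty; the paper's exact contrastive formulation may differ.

```python
# Illustrative auxiliary objective: align full-context predictions with
# predictions from the retrieved sub-context. The KL form and alpha weighting
# are assumptions, not the paper's exact loss.
import torch
import torch.nn.functional as F

def focused_loss(logits_full, logits_retrieved, labels, alpha=0.5):
    # Standard next-token loss on the full-context outputs.
    lm_loss = F.cross_entropy(
        logits_full.view(-1, logits_full.size(-1)), labels.view(-1)
    )
    # Auxiliary term: keep the two output distributions close.
    align = F.kl_div(
        F.log_softmax(logits_full, dim=-1),
        F.softmax(logits_retrieved, dim=-1),
        reduction="batchmean",
    )
    return lm_loss + alpha * align

# Toy shapes: batch 2, sequence 16, vocab 100.
logits_a, logits_b = torch.randn(2, 16, 100), torch.randn(2, 16, 100)
labels = torch.randint(0, 100, (2, 16))
print(focused_loss(logits_a, logits_b, labels))
```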
arXiv Detail & Related papers (2024-11-08T19:27:42Z)
- Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression [7.673616185468932]
Supplying irrelevant context to large language models can result in poorer responses, increased inference latency, and higher costs.
This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content.
The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and minimizes generation latency.
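The abstract does not spell out the filtering criterion, so the following is only a generic illustration of instruction-conditioned filtering: rank context sentences by embedding similarity to the instruction and keep the top fraction. The sentence-transformers model and keep ratio are placeholders, not details from the paper.

```python
# Generic instruction-conditioned filtering sketch (placeholder encoder/ratio).
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def compress(instruction: str, sentences: list[str], keep_ratio: float = 0.5) -> str:
    q = encoder.encode(instruction, convert_to_tensor=True)
    s = encoder.encode(sentences, convert_to_tensor=True)
    sims = util.cos_sim(q, s)[0]                 # similarity of each sentence
    k = max(1, int(keep_ratio * len(sentences)))
    keep = sims.topk(k).indices.sort().values.tolist()
    return " ".join(sentences[i] for i in keep)  # keep original order
```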
arXiv Detail & Related papers (2024-08-28T02:31:15Z)
- FactorLLM: Factorizing Knowledge via Mixture of Experts for Large Language Models [50.331708897857574]
We introduce FactorLLM, a novel approach that decomposes well-trained dense FFNs into sparse sub-networks without requiring any further modifications.
FactorLLM achieves performance comparable to the source model, retaining up to 85% of its performance while obtaining over a 30% increase in inference speed.
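A conceptual sketch of factorizing one dense FFN into several smaller experts behind a router, i.e. the general Mixture-of-Experts pattern the summary points to; the sizes, top-1 routing rule, and initialization are assumptions, not FactorLLM's actual design.

```python
# Toy Mixture-of-Experts FFN: route each input to one small expert sub-network.
import torch
import torch.nn as nn

class FactorizedFFN(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, n_experts=4):
        super().__init__()
        d_expert = d_ff // n_experts
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_expert), nn.GELU(),
                          nn.Linear(d_expert, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                      # x: [batch, d_model]
        best = self.router(x).argmax(dim=-1)   # top-1 routing per token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = best == i
            if mask.any():
                out[mask] = expert(x[mask])
        return out

y = FactorizedFFN()(torch.randn(8, 768))
```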
arXiv Detail & Related papers (2024-08-15T16:45:16Z)
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
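A minimal sketch of the digest-token idea, assuming a single cross-attention layer in which a handful of learnable query vectors attend over the context embeddings; the dimensions and single-layer design are assumptions rather than the paper's architecture.

```python
# Learnable digest tokens condense a long context via one cross-attention pass.
import torch
import torch.nn as nn

class DigestCompressor(nn.Module):
    def __init__(self, d_model=768, n_digest=16, n_heads=8):
        super().__init__()
        self.digest = nn.Parameter(torch.randn(n_digest, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, context_embeds):         # [batch, seq_len, d_model]
        batch = context_embeds.size(0)
        queries = self.digest.unsqueeze(0).expand(batch, -1, -1)
        # Digest tokens attend over the long context and absorb its content.
        compressed, _ = self.cross_attn(queries, context_embeds, context_embeds)
        return compressed                      # [batch, n_digest, d_model]

out = DigestCompressor()(torch.randn(2, 1024, 768))
```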
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
- Recurrent Context Compression: Efficiently Expanding the Context Window of LLM [22.595457889113668]
This work introduces a method called Recurrent Context Compression (RCC), designed to efficiently expand the context window length of Transformer-based large language models (LLMs).
We validated our approach on multiple tasks, achieving a compression rate of up to 32x on text reconstruction tasks with a BLEU4 score close to 0.95, and nearly 100% accuracy on a passkey retrieval task with a sequence length of 1M.
arXiv Detail & Related papers (2024-06-10T08:50:59Z)
- LLoCO: Learning Long Contexts Offline [63.3458260335454]
We propose LLoCO, a novel approach to processing long contexts.
LLoCO learns contexts offline through context compression and in-domain parameter-efficient finetuning with LoRA.
Our approach extends the effective context window of a 4k token LLaMA2-7B model to handle up to 128k tokens.
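The parameter-efficient finetuning half of that recipe can be sketched with the peft library as below; this omits the context-compression component entirely, and the rank, target modules, and other hyperparameters are placeholders rather than the paper's settings.

```python
# Generic LoRA finetuning setup via peft; values below are illustrative only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# The LLaMA-2 checkpoint is gated on the Hub; any causal LM id works for the sketch.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
                  lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only the low-rank adapters are trained
```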
arXiv Detail & Related papers (2024-04-11T17:57:22Z)
- Extending Context Window of Large Language Models via Semantic Compression [21.35020344956721]
Large Language Models (LLMs) often impose limitations on the length of the text input to ensure the generation of fluent and relevant responses.
We propose a novel semantic compression method that enables generalization to texts 6-8 times longer, without incurring significant computational costs or requiring fine-tuning.
arXiv Detail & Related papers (2023-12-15T07:04:33Z)
- Making Retrieval-Augmented Language Models Robust to Irrelevant Context [55.564789967211844]
An important desideratum of RALMs is that retrieved information helps model performance when it is relevant.
Recent work has shown that retrieval augmentation can sometimes have a negative effect on performance.
arXiv Detail & Related papers (2023-10-02T18:52:35Z)
- OverPrompt: Enhancing ChatGPT through Efficient In-Context Learning [49.38867353135258]
We propose OverPrompt, leveraging the in-context learning capability of LLMs to handle multiple task inputs.
Our experiments show that OverPrompt can achieve cost-efficient zero-shot classification without causing significant detriment to task performance.
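A hedged illustration of that batching idea: pack several classification inputs into one prompt so a single call labels them all. The prompt wording and numbering scheme are assumptions, not taken from the paper.

```python
# Build one prompt that asks the model to label many inputs at once.
def build_batched_prompt(task: str, inputs: list[str]) -> str:
    numbered = "\n".join(f"{i + 1}. {text}" for i, text in enumerate(inputs))
    return (
        f"{task}\n"
        f"Classify each of the following items and answer as a numbered list:\n"
        f"{numbered}"
    )

prompt = build_batched_prompt(
    "Label the sentiment of each review as positive or negative.",
    ["The battery lasts forever.", "Screen cracked after a week."],
)
print(prompt)
```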
arXiv Detail & Related papers (2023-05-24T10:08:04Z)
- Unlocking Context Constraints of LLMs: Enhancing Context Efficiency of LLMs with Self-Information-Based Content Filtering [4.1372815372396525]
This paper proposes a method called Selective Context that employs self-information to filter out less informative content.
We demonstrate the effectiveness of our approach on tasks of summarisation and question answering across different data sources.
arXiv Detail & Related papers (2023-04-24T13:55:47Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.