Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression
- URL: http://arxiv.org/abs/2408.15491v1
- Date: Wed, 28 Aug 2024 02:31:15 GMT
- Title: Enhancing and Accelerating Large Language Models via Instruction-Aware Contextual Compression
- Authors: Haowen Hou, Fei Ma, Binwen Bai, Xinxin Zhu, Fei Yu
- Abstract summary: Supplying irrelevant context to large language models can result in poorer responses, increased inference latency, and higher costs.
This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content.
The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and minimizes generation latency.
- Score: 7.673616185468932
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) have garnered widespread attention due to their remarkable performance across various tasks. However, to mitigate the issue of hallucinations, LLMs often incorporate a retrieval-augmented pipeline that supplies them with rich external knowledge and context. Nevertheless, challenges arise when the retriever returns inaccurate or coarse-grained context. Supplying irrelevant context to the LLMs can result in poorer responses, increased inference latency, and higher costs. This paper introduces a method called Instruction-Aware Contextual Compression, which filters out less informative content, thereby accelerating and enhancing the use of LLMs. The experimental results demonstrate that Instruction-Aware Contextual Compression notably reduces memory consumption and minimizes generation latency while maintaining performance levels comparable to those achieved with the use of the full context. Specifically, we achieved a 50% reduction in context-related costs, resulting in a 5% reduction in inference memory usage and a 2.2-fold increase in inference speed, with only a minor drop of 0.047 in Rouge-1. These findings suggest that our method strikes an effective balance between efficiency and performance.
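The abstract describes Instruction-Aware Contextual Compression only at a high level, so the snippet below is a minimal, hypothetical sketch of the general idea rather than the authors' implementation: context sentences are scored for relevance to the instruction and only the top fraction is kept, mirroring the 50% context reduction reported above. The lexical-overlap scorer, the sentence-level granularity, and the keep_ratio default are assumptions for illustration.

```python
# Hypothetical sketch of instruction-aware contextual compression.
# The lexical-overlap scorer below is a stand-in for whatever relevance
# model the paper actually uses; it is not the published method.
import re


def relevance(sentence: str, instruction: str) -> float:
    """Score a context sentence by word overlap with the instruction (stand-in scorer)."""
    sent = set(re.findall(r"\w+", sentence.lower()))
    instr = set(re.findall(r"\w+", instruction.lower()))
    return len(sent & instr) / len(sent) if sent else 0.0


def compress_context(context: str, instruction: str, keep_ratio: float = 0.5) -> str:
    """Keep only the most instruction-relevant sentences of the retrieved context.

    keep_ratio=0.5 mirrors the 50% context reduction reported in the abstract.
    """
    sentences = re.split(r"(?<=[.!?])\s+", context.strip())
    ranked = sorted(range(len(sentences)),
                    key=lambda i: relevance(sentences[i], instruction),
                    reverse=True)
    keep = sorted(ranked[:max(1, int(len(sentences) * keep_ratio))])  # preserve original order
    return " ".join(sentences[i] for i in keep)


if __name__ == "__main__":
    ctx = ("The Eiffel Tower was completed in 1889. It is located in Paris. "
           "Paris hosts many museums. The tower is 330 metres tall.")
    print(compress_context(ctx, "How tall is the Eiffel Tower?"))
```

The compressed context would then replace the full retrieved passage in the LLM prompt, which is where the reported reductions in memory use and generation latency would come from.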
Related papers
- BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression [91.23933111083389]
BRIEF (Bridging Retrieval and Inference through Evidence Fusion) is a lightweight approach that performs query-aware multi-hop reasoning.
Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
arXiv Detail & Related papers (2024-10-20T04:24:16Z)
- Efficient Diffusion as Low Light Enhancer [63.789138528062225]
Reflectance-Aware Trajectory Refinement (RATR) is a simple yet effective module to refine the teacher trajectory using the reflectance component of images.
Reflectance-aware Diffusion with Distilled Trajectory (ReDDiT) is an efficient and flexible distillation framework tailored for Low-Light Image Enhancement (LLIE).
arXiv Detail & Related papers (2024-10-16T08:07:18Z)
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings; a minimal sketch of this digest-token mechanism appears after this entry.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
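As a rough illustration of the digest-token idea summarized in the In-Context Former entry above, the PyTorch sketch below lets a small set of learnable tokens cross-attend to the contextual word embeddings and emit a much shorter compressed sequence. The hidden size, the number of digest tokens, and the use of `nn.MultiheadAttention` are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch: learnable digest tokens cross-attend to context embeddings
# and return a short compressed sequence (in the spirit of the In-Context Former).
import torch
import torch.nn as nn


class DigestCompressor(nn.Module):
    def __init__(self, hidden: int = 768, num_digest: int = 16, num_heads: int = 12):
        super().__init__()
        # A small number of learnable "digest" tokens serve as cross-attention queries.
        self.digest_tokens = nn.Parameter(torch.randn(num_digest, hidden) * 0.02)
        self.cross_attn = nn.MultiheadAttention(hidden, num_heads, batch_first=True)

    def forward(self, context_emb: torch.Tensor) -> torch.Tensor:
        """context_emb: (batch, context_len, hidden) -> (batch, num_digest, hidden)."""
        queries = self.digest_tokens.unsqueeze(0).expand(context_emb.size(0), -1, -1)
        # Digest tokens query the contextual word embeddings (keys and values).
        compressed, _ = self.cross_attn(queries, context_emb, context_emb)
        return compressed


if __name__ == "__main__":
    ctx = torch.randn(2, 512, 768)      # 512 contextual word embeddings per example
    out = DigestCompressor()(ctx)       # (2, 16, 768): a far shorter sequence to pass on
    print(out.shape)
```

Downstream computation then operates on a handful of digest vectors instead of hundreds of context tokens; the specific sizes and compression ratio in this sketch are illustrative and not derived from the paper's reported figures.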
- Compressing Context to Enhance Inference Efficiency of Large Language Models [26.75216730927996]
This paper proposes a method called Selective Context to enhance the inference efficiency of large language models (LLMs).
We test our approach using common data sources requiring long context processing: arXiv papers, news articles, and long conversations.
Experimental results show that Selective Context significantly reduces memory cost and decreases generation latency.
arXiv Detail & Related papers (2023-10-09T23:03:24Z)
- Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
The Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK) aims to redefine the evaluation protocol for compressed Large Language Models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically assess compressed LLMs' abilities in language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z)
- Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications [63.29358103217275]
Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks.
We propose two conjectures on the nature of the damage: one is that certain knowledge is forgotten (or erased) after compression.
We introduce a variant called Inference-time Dynamic Prompting (IDP) that can effectively increase prompt diversity without incurring any inference overhead.
arXiv Detail & Related papers (2023-10-02T03:12:06Z)
- In-context Autoencoder for Context Compression in a Large Language Model [70.7621953091318]
We propose the In-context Autoencoder (ICAE) to compress a long context into short compact memory slots.
ICAE is first pretrained using both autoencoding and language modeling objectives on massive text data; a minimal sketch of this two-objective setup follows this entry.
arXiv Detail & Related papers (2023-07-13T17:59:21Z)
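As a loose illustration of the two-objective pretraining mentioned in the ICAE entry above, the toy sketch below compresses a context into a few memory-slot vectors and trains a decoder both to reconstruct the context (autoencoding) and to predict its continuation (language modeling) from those slots. The tiny GRU modules, the slot construction, and the equal loss weighting are stand-ins for illustration; the actual ICAE builds on a large language model rather than these toy components.

```python
# Hypothetical toy sketch of ICAE-style pretraining: compress a long context into a few
# memory-slot vectors, then train a decoder with both an autoencoding objective
# (reconstruct the context) and a language-modeling objective (predict the continuation).
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB, HIDDEN, SLOTS = 1000, 64, 8

embed = nn.Embedding(VOCAB, HIDDEN)
encoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)   # stand-in for the real context encoder
slot_proj = nn.Linear(HIDDEN, SLOTS * HIDDEN)         # maps the encoded context to memory slots
decoder = nn.GRU(HIDDEN, HIDDEN, batch_first=True)    # stand-in for the decoding model
lm_head = nn.Linear(HIDDEN, VOCAB)


def slot_conditioned_loss(slots: torch.Tensor, target_ids: torch.Tensor) -> torch.Tensor:
    """Teacher-forced next-token loss for decoding target_ids conditioned on the slots."""
    inputs = torch.cat([slots, embed(target_ids[:, :-1])], dim=1)
    hidden, _ = decoder(inputs)
    logits = lm_head(hidden[:, SLOTS - 1:, :])         # positions aligned with target tokens
    return F.cross_entropy(logits.reshape(-1, VOCAB), target_ids.reshape(-1))


def pretrain_step(context_ids: torch.Tensor, continuation_ids: torch.Tensor) -> torch.Tensor:
    _, h = encoder(embed(context_ids))                 # h: (1, batch, HIDDEN)
    slots = slot_proj(h[-1]).view(-1, SLOTS, HIDDEN)   # (batch, SLOTS, HIDDEN) memory slots
    ae_loss = slot_conditioned_loss(slots, context_ids)        # autoencoding objective
    lm_loss = slot_conditioned_loss(slots, continuation_ids)   # language-modeling objective
    return ae_loss + lm_loss                           # equal weighting is an assumption


if __name__ == "__main__":
    loss = pretrain_step(torch.randint(0, VOCAB, (2, 128)), torch.randint(0, VOCAB, (2, 32)))
    loss.backward()
    print(float(loss))
```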
- Improving Language Models via Plug-and-Play Retrieval Feedback [42.786225163763376]
Large language models (LLMs) exhibit remarkable performance across various NLP tasks.
They often generate incorrect or hallucinated information, which hinders their practical applicability in real-world scenarios.
We introduce ReFeed, a novel pipeline designed to enhance LLMs by providing automatic retrieval feedback in a plug-and-play framework; a minimal sketch of such a feedback loop follows this entry.
arXiv Detail & Related papers (2023-05-23T12:29:44Z)
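The ReFeed entry above describes the pipeline only at a high level; the sketch below shows one plausible way a plug-and-play retrieval-feedback loop could be wired: draft an answer, retrieve evidence relevant to that draft, and ask the model to revise. The `generate` and `retrieve` callables and the prompt format are placeholders, not the paper's actual interfaces.

```python
# Hypothetical sketch of a retrieval-feedback loop in the spirit of ReFeed:
# draft an answer, retrieve evidence for the draft, then ask the LLM to revise it.
from typing import Callable, List


def refeed_answer(
    question: str,
    generate: Callable[[str], str],        # placeholder LLM call
    retrieve: Callable[[str], List[str]],  # placeholder retriever queried with the draft
    rounds: int = 1,
) -> str:
    answer = generate(f"Question: {question}\nAnswer:")
    for _ in range(rounds):
        evidence = retrieve(answer)        # feedback: documents relevant to the draft answer
        prompt = (
            f"Question: {question}\n"
            f"Draft answer: {answer}\n"
            "Retrieved evidence:\n"
            + "\n".join(f"- {doc}" for doc in evidence)
            + "\nRevise the draft so it is consistent with the evidence:"
        )
        answer = generate(prompt)
    return answer
```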
This list is automatically generated from the titles and abstracts of the papers in this site.