Understanding and Improving Information Preservation in Prompt Compression for LLMs
- URL: http://arxiv.org/abs/2503.19114v1
- Date: Mon, 24 Mar 2025 20:06:11 GMT
- Title: Understanding and Improving Information Preservation in Prompt Compression for LLMs
- Authors: Weronika Łajewska, Momchil Hardalov, Laura Aina, Neha Anna John, Hang Su, Lluís Màrquez,
- Abstract summary: In information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods.
- Score: 10.912320980464571
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Recent advancements in large language models (LLMs) have enabled their successful application to a broad range of tasks. However, in information-intensive tasks, the prompt length can grow fast, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. Recently, various prompt compression techniques have been introduced to optimize the trade-off between reducing input length and retaining performance. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods. We focus on three key aspects, besides compression ratio: (i) downstream task performance, (ii) grounding in the input context, and (iii) information preservation. Through this framework, we investigate state-of-the-art soft and hard compression methods, showing that they struggle to preserve key details from the original prompt, limiting their performance on complex tasks. We demonstrate that modifying soft prompting methods to better control the granularity of the compressed information can significantly improve their effectiveness -- up to +23% in downstream task performance, more than +8 BERTScore points in grounding, and 2.7x more entities preserved in compression.
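As a rough illustration of what such a framework could look like in practice, the sketch below computes one proxy metric per aspect for a single example: exact match for downstream task performance, BERTScore against the original context for grounding, and named-entity overlap for information preservation. The metric choices, libraries (bert_score, spaCy), and function names are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch (not the paper's code): one proxy metric per evaluation
# aspect, computed for a single (prompt, compressed prompt, prediction) triple.
import spacy
from bert_score import score as bert_score

nlp = spacy.load("en_core_web_sm")  # assumed NER model for entity extraction

def task_performance(prediction: str, gold_answer: str) -> float:
    """Downstream task performance, here approximated by exact match."""
    return float(prediction.strip().lower() == gold_answer.strip().lower())

def grounding(prediction: str, original_context: str) -> float:
    """Grounding: BERTScore F1 of the model output against the original context."""
    _, _, f1 = bert_score([prediction], [original_context], lang="en")
    return f1.item()

def entity_preservation(original_prompt: str, compressed_prompt: str) -> float:
    """Information preservation: share of named entities in the original
    prompt that still appear in the compressed prompt."""
    original = {e.text.lower() for e in nlp(original_prompt).ents}
    compressed = {e.text.lower() for e in nlp(compressed_prompt).ents}
    return len(original & compressed) / max(len(original), 1)
```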
Related papers
- Dynamic Compressing Prompts for Efficient Inference of Large Language Models [38.604760935983364]
Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques.
While prompt compression is a straightforward solution, existing methods struggle to retain essential information, adapt to context changes, and remain effective across different tasks.
Our method reduces the number of prompt tokens while preserving performance as much as possible.
arXiv Detail & Related papers (2025-04-15T09:20:45Z) - Efficient Token Compression for Vision Transformer with Spatial Information Preserved [59.79302182800274]
Token compression is essential for reducing the computational and memory requirements of transformer models.
We propose an efficient and hardware-compatible token compression method called Prune and Merge.
arXiv Detail & Related papers (2025-03-30T14:23:18Z) - Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability [67.77534983324229]
In this paper, we investigate the ability of Large Language Models to develop a unified compression method that discretizes uninformative tokens.
Experiments show Selection-p achieves state-of-the-art performance across numerous classification tasks.
It exhibits superior transferability to different models compared to prior work.
arXiv Detail & Related papers (2024-10-15T17:05:25Z) - TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning [11.167198972934736]
Large language models (LLMs) such as GPT-4 have led to a surge in the size of prompts required for optimal performance. We propose a novel and efficient reinforcement learning (RL) based task-aware prompt compression method. We demonstrate that our RL-guided compression method improves task performance by 8% to 189% over state-of-the-art compression techniques.
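As a loose illustration of RL-guided compression (not TACO-RL's actual algorithm), the sketch below trains a per-token keep/drop policy with REINFORCE, using a downstream task score on the compressed prompt as the reward; all names, shapes, and the reward function are assumptions.

```python
# Hypothetical REINFORCE-style sketch of RL-guided prompt compression:
# a policy emits a keep-probability per token, a keep/drop mask is sampled,
# and the downstream task score on the compressed prompt is used as the reward.
import torch
import torch.nn as nn

class KeepPolicy(nn.Module):
    def __init__(self, hidden_size: int = 768):
        super().__init__()
        self.scorer = nn.Linear(hidden_size, 1)

    def forward(self, token_embeds: torch.Tensor) -> torch.Tensor:
        # token_embeds: (seq_len, hidden) -> keep probability per token
        return torch.sigmoid(self.scorer(token_embeds)).squeeze(-1)

def reinforce_step(policy, token_embeds, task_reward_fn, optimizer):
    probs = policy(token_embeds)
    mask = torch.bernoulli(probs).detach()            # sample keep/drop per token
    log_prob = (mask * probs.clamp_min(1e-8).log()
                + (1 - mask) * (1 - probs).clamp_min(1e-8).log()).sum()
    reward = task_reward_fn(mask)                     # e.g. downstream task score
    loss = -reward * log_prob                         # REINFORCE objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return reward
```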
arXiv Detail & Related papers (2024-09-19T18:11:59Z) - LanguaShrink: Reducing Token Overhead with Psycholinguistics [8.123272461141815]
LanguaShrink is a prompt compression framework for large language models.
It reduces prompt length while preserving essential information.
Compared to existing prompt compression methods, LanguaShrink improves end-to-end latency by 1.43 times.
arXiv Detail & Related papers (2024-09-01T22:09:20Z) - Concise and Precise Context Compression for Tool-Using Language Models [60.606281074373136]
We propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models.
Results on API-Bank and APIBench show that our approach reaches performance comparable to the upper-bound baseline at compression ratios of up to 16x.
arXiv Detail & Related papers (2024-07-02T08:17:00Z) - LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression [43.048684907893104]
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency.
We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one.
Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.
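A hedged sketch of the token-classification formulation described above: an encoder fine-tuned with binary keep/drop labels scores each token, and only tokens above a threshold are kept. The checkpoint name and threshold below are placeholders, not LLMLingua-2's released configuration.

```python
# Sketch of prompt compression as token classification. "MODEL" stands in for
# an encoder fine-tuned with keep/drop labels; loading the base checkpoint as
# done here would yield an untrained classification head.
import torch
from transformers import AutoTokenizer, AutoModelForTokenClassification

MODEL = "xlm-roberta-large"  # placeholder for a fine-tuned compression encoder
tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForTokenClassification.from_pretrained(MODEL, num_labels=2)

def compress(prompt: str, keep_threshold: float = 0.5) -> str:
    enc = tokenizer(prompt, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**enc).logits                   # (1, seq_len, 2)
    keep_prob = logits.softmax(-1)[0, :, 1]            # P(keep) for each token
    kept = [tid for tid, p in zip(enc["input_ids"][0].tolist(), keep_prob.tolist())
            if p >= keep_threshold and tid not in tokenizer.all_special_ids]
    return tokenizer.decode(kept)
```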
arXiv Detail & Related papers (2024-03-19T17:59:56Z) - Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs.
It targets effective, efficient, and flexible compression of long contexts.
It achieves a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z) - RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective
Augmentation [61.53695868960846]
We propose compressing retrieved documents into textual summaries prior to in-context integration.
This not only reduces computational costs but also relieves LMs of the burden of identifying relevant information in long retrieved documents.
We show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.
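A rough sketch of the compress-then-integrate idea described above: retrieved documents are summarized first, and only the summary is placed in the prompt. The summarizer checkpoint and prompt template are illustrative stand-ins, not RECOMP's trained compressors.

```python
# Sketch: summarize retrieved documents before in-context integration.
# The checkpoint and prompt template below are assumptions for illustration.
from transformers import pipeline

summarizer = pipeline("summarization", model="facebook/bart-large-cnn")

def compress_then_prompt(question: str, retrieved_docs: list[str],
                         max_summary_tokens: int = 128) -> str:
    joined = "\n\n".join(retrieved_docs)
    summary = summarizer(joined, max_length=max_summary_tokens,
                         truncation=True)[0]["summary_text"]
    return f"Context: {summary}\n\nQuestion: {question}\nAnswer:"
```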
arXiv Detail & Related papers (2023-10-06T17:55:36Z) - Compressing LLMs: The Truth is Rarely Pure and Never Simple [90.05366363633568]
The Knowledge-Intensive Compressed LLM BenchmarK (LLM-KICK) aims to redefine the evaluation protocol for compressed Large Language Models.
LLM-KICK unveils many favorable merits and unfortunate plights of current SoTA compression methods.
LLM-KICK is designed to holistically assess compressed LLMs' abilities in language understanding, reasoning, generation, in-context retrieval, in-context summarization, etc.
arXiv Detail & Related papers (2023-10-02T17:42:37Z) - Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications [63.29358103217275]
Compressing Large Language Models (LLMs) often leads to reduced performance, especially on knowledge-intensive tasks.
We propose two conjectures on the nature of the damage: one is that certain knowledge is forgotten (or erased) after compression; the other is that the knowledge is displaced rather than erased and can still be elicited with suitable prompts.
We introduce a variant called Inference-time Dynamic Prompting (IDP) that can effectively increase prompt diversity without incurring any inference overhead.
arXiv Detail & Related papers (2023-10-02T03:12:06Z) - Compress, Then Prompt: Improving Accuracy-Efficiency Trade-off of LLM Inference with Transferable Prompt [96.24800696597707]
We introduce a new perspective to optimize this trade-off by prompting compressed models.
We propose a soft prompt learning method where we expose the compressed model to the prompt learning process.
Our experimental analysis suggests our soft prompt strategy greatly improves the performance of the 8x compressed LLaMA-7B model.
arXiv Detail & Related papers (2023-05-17T20:45:13Z)
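A minimal PyTorch sketch of the soft-prompt idea in the entry above: a small block of trainable embeddings is prepended to the input embeddings of the frozen compressed model, and only those embeddings are optimized. Prompt length and hidden size are arbitrary assumptions.

```python
# Sketch of soft prompt learning with a frozen compressed model: only the
# prepended prompt embeddings are trainable. Sizes are illustrative.
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    def __init__(self, n_prompt_tokens: int = 20, hidden_size: int = 4096):
        super().__init__()
        self.prompt = nn.Parameter(torch.randn(n_prompt_tokens, hidden_size) * 0.02)

    def forward(self, input_embeds: torch.Tensor) -> torch.Tensor:
        # input_embeds: (batch, seq_len, hidden); prepend the learned prompt.
        batch_size = input_embeds.size(0)
        prompt = self.prompt.unsqueeze(0).expand(batch_size, -1, -1)
        return torch.cat([prompt, input_embeds], dim=1)

# In training, the compressed LLM's weights stay frozen; the optimizer only
# receives the soft prompt parameters, e.g.:
#   soft_prompt = SoftPrompt()
#   optimizer = torch.optim.AdamW(soft_prompt.parameters(), lr=1e-3)
```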