DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression
- URL: http://arxiv.org/abs/2507.11942v1
- Date: Wed, 16 Jul 2025 06:16:06 GMT
- Title: DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression
- Authors: Yi Zhao, Zuchao Li, Hai Zhao, Baoyuan Qi, Guoming Liu
- Abstract summary: We propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC). This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression. Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements.
- Score: 63.83422894663496
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Task-agnostic prompt compression leverages the redundancy in natural language to reduce computational overhead and enhance information density within prompts, especially in long-context scenarios. Existing methods predominantly rely on information entropy as the metric to compress lexical units, aiming to achieve minimal information loss. However, these approaches overlook two critical aspects: (i) the importance of attention-critical tokens at the algorithmic level, and (ii) shifts in information entropy during the compression process. Motivated by these challenges, we propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC). This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression. Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements across a diverse range of tasks and LLMs, offering compelling evidence of its efficacy.
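As a rough illustration of the core idea, the sketch below scores each token by blending a surprisal-based entropy estimate with the attention it receives under a small causal LM, then prunes low-scoring tokens over several rounds so that entropy shifts caused by earlier removals are reflected in later scores. The choice of GPT-2, the blending weight, and the round-based schedule are assumptions for demonstration, not the authors' exact method.

```python
# Minimal sketch of entropy- plus attention-aware prompt compression.
# GPT-2, the 0.5 blending weight, and the iterative schedule are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def token_scores(ids: torch.Tensor, alpha: float = 0.5) -> torch.Tensor:
    """Blend per-token surprisal (an entropy proxy) with received attention."""
    with torch.no_grad():
        out = model(ids, output_attentions=True)
    logits = out.logits[0]                                      # (seq, vocab)
    # Surprisal of token t under the prediction made from tokens < t.
    logprobs = torch.log_softmax(logits[:-1], dim=-1)
    surprisal = -logprobs.gather(1, ids[0, 1:, None]).squeeze(-1)
    surprisal = torch.cat([surprisal.new_zeros(1), surprisal])  # pad position 0
    # Attention each token receives, averaged over layers, heads, and queries.
    attn = torch.stack(out.attentions).mean(dim=(0, 2))[0]      # (seq, seq)
    received = attn.mean(dim=0)                                 # (seq,)
    norm = lambda x: (x - x.min()) / (x.max() - x.min() + 1e-8)
    return alpha * norm(surprisal) + (1 - alpha) * norm(received)

def compress(prompt: str, ratio: float = 0.5, rounds: int = 4) -> str:
    """Drop low-score tokens over several rounds, re-scoring each time so that
    entropy shifts from earlier removals influence later decisions."""
    ids = tokenizer(prompt, return_tensors="pt").input_ids
    target = max(1, int(ids.shape[1] * ratio))
    while ids.shape[1] > target:
        scores = token_scores(ids)
        gap = ids.shape[1] - target
        n_drop = min(max(1, gap // rounds), gap)
        keep = torch.topk(scores, ids.shape[1] - n_drop).indices.sort().values
        ids = ids[:, keep]
    return tokenizer.decode(ids[0])

print(compress("The quick brown fox jumps over the lazy dog near the old river bank.", 0.6))
```

The actual DAC scoring and update rules are more refined; the sketch only shows the shape of an entropy-plus-attention loop that re-evaluates the prompt as it shrinks.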
Related papers
- When Tokens Talk Too Much: A Survey of Multimodal Long-Context Token Compression across Images, Videos, and Audios [27.220318661244242]
Multimodal large language models (MLLMs) process increasingly long and complex contexts. Token compression has emerged as a promising and critical approach, efficiently reducing the number of tokens during both training and inference. We present the first systematic survey and synthesis of the burgeoning field of multimodal long-context token compression.
arXiv Detail & Related papers (2025-07-27T09:33:56Z) - Adaptive Inference-Time Scaling via Cyclic Diffusion Search [68.58892778987936]
We introduce the challenge of adaptive inference-time scaling: dynamically adjusting computational effort during inference. We propose Adaptive Bi-directional Cyclic Diffusion (ABCD), a flexible, search-based inference framework. ABCD refines outputs through bi-directional diffusion cycles while adaptively controlling exploration depth and termination.
arXiv Detail & Related papers (2025-05-20T07:31:38Z) - Dynamic Compressing Prompts for Efficient Inference of Large Language Models [38.604760935983364]
Large Language Models (LLMs) have shown outstanding performance across a variety of tasks, partly due to advanced prompting techniques. While prompt compression is a straightforward solution, existing methods face the challenges of retaining essential information, adapting to context changes, and remaining effective across different tasks. Our method reduces the number of prompt tokens while preserving performance as much as possible.
arXiv Detail & Related papers (2025-04-15T09:20:45Z) - Understanding and Improving Information Preservation in Prompt Compression for LLMs [10.912320980464571]
In information-intensive tasks, the prompt length can grow quickly, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information. We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods.
arXiv Detail & Related papers (2025-03-24T20:06:11Z) - PowerAttention: Exponentially Scaling of Receptive Fields for Effective Sparse Attention [73.26995918610669]
Large Language Models (LLMs) face efficiency bottlenecks due to the quadratic complexity of the attention mechanism when processing long contexts. We introduce PowerAttention, a novel sparse attention design that facilitates effective and complete context extension. Experiments demonstrate that PowerAttention outperforms existing static sparse attention methods by 5% to 40%.
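To make the idea of exponentially scaling receptive fields concrete, here is a small, hypothetical sketch of a causal sparse-attention mask in which each query attends to a local window plus keys at power-of-2 offsets; the exact pattern used by PowerAttention may differ.

```python
# Illustrative only: a causal sparse-attention mask with a local window plus
# keys at power-of-2 distances, so information can propagate exponentially
# across layers. The actual PowerAttention pattern is not reproduced here.
import torch

def power_sparse_mask(seq_len: int, local: int = 4) -> torch.Tensor:
    """Boolean (seq, seq) mask; True marks key positions a query may attend to."""
    mask = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    for q in range(seq_len):
        mask[q, max(0, q - local + 1):q + 1] = True  # local causal window
        d = 1
        while q - d >= 0:                            # offsets 1, 2, 4, 8, ...
            mask[q, q - d] = True
            d *= 2
    return mask

print(power_sparse_mask(8).int())
```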
arXiv Detail & Related papers (2025-03-05T15:24:11Z) - Prompt Compression for Large Language Models: A Survey [31.578484271031908]
This survey provides an overview of prompt compression techniques, categorized into hard prompt methods and soft prompt methods.
We also examine the downstream adaptations of various prompt compression techniques.
arXiv Detail & Related papers (2024-10-16T09:13:23Z) - Perception Compressor: A Training-Free Prompt Compression Framework in Long Context Scenarios [17.720102137585503]
Perception Compressor is a training-free prompt compression framework for large language models. It includes a perception retriever that leverages guiding questions and instructions to retrieve the most relevant demonstrations. We conduct extensive experiments on long-context benchmarks, including LongBench and MuSiQue.
arXiv Detail & Related papers (2024-09-28T07:13:33Z) - TACO-RL: Task Aware Prompt Compression Optimization with Reinforcement Learning [11.167198972934736]
Large language models (LLMs) such as GPT-4 have led to a surge in the size of prompts required for optimal performance. We propose a novel and efficient reinforcement learning (RL) based task-aware prompt compression method. We demonstrate that our RL-guided compression method improves task performance by 8% to 189% over state-of-the-art compression techniques.
arXiv Detail & Related papers (2024-09-19T18:11:59Z) - QUITO-X: A New Perspective on Context Compression from the Information Bottleneck Theory [66.01597794579568]
We introduce information bottleneck theory (IB) to model the problem. We propose a cross-attention-based approach to approximate mutual information in IB. Our method achieves a 25% increase in compression rate compared to the state-of-the-art.
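As a hedged sketch of this kind of cross-attention scoring, the snippet below ranks context tokens by the attention the question pays them through a seq2seq model's cross-attention and keeps the top fraction; the model choice (flan-t5-small) and the keep ratio are illustrative assumptions, not the paper's exact procedure.

```python
# Rough sketch: rank context tokens by question-to-context cross-attention
# and keep the top fraction. Model and ratio are assumptions for illustration.
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tok = AutoTokenizer.from_pretrained("google/flan-t5-small")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-small").eval()

def compress_context(context: str, question: str, keep: float = 0.5) -> str:
    enc = tok(context, return_tensors="pt")
    dec = tok(question, return_tensors="pt")
    with torch.no_grad():
        out = model(input_ids=enc.input_ids,
                    decoder_input_ids=dec.input_ids,
                    output_attentions=True)
    # cross_attentions: one (batch, heads, q_len, ctx_len) tensor per layer.
    attn = torch.stack(out.cross_attentions).mean(dim=(0, 2))[0]  # (q_len, ctx_len)
    scores = attn.mean(dim=0)                                     # (ctx_len,)
    k = max(1, int(scores.numel() * keep))
    keep_idx = torch.topk(scores, k).indices.sort().values
    return tok.decode(enc.input_ids[0, keep_idx], skip_special_tokens=True)

print(compress_context(
    "Paris is the capital of France. It hosted the 2024 Summer Olympics.",
    "Which city hosted the 2024 Summer Olympics?",
))
```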
arXiv Detail & Related papers (2024-08-20T02:44:45Z) - Towards Efficient Vision-Language Tuning: More Information Density, More Generalizability [73.34532767873785]
We propose the concept of "Information Density" (ID) to indicate whether a matrix strongly belongs to certain feature spaces.
We introduce the Dense Information Prompt (DIP) to enhance information density to improve generalization.
DIP significantly reduces the number of tunable parameters and the requisite storage space, making it particularly advantageous in resource-constrained settings.
arXiv Detail & Related papers (2023-12-17T20:42:43Z) - Do Compressed LLMs Forget Knowledge? An Experimental Study with Practical Implications [63.29358103217275]
Compressing Large Language Models (LLMs) often leads to reduced performance, especially for knowledge-intensive tasks.
We propose two conjectures on the nature of the damage: one is that certain knowledge is forgotten (or erased) after compression.
We introduce a variant called Inference-time Dynamic Prompting (IDP) that can effectively increase prompt diversity without incurring any inference overhead.
arXiv Detail & Related papers (2023-10-02T03:12:06Z) - Revisit Visual Representation in Analytics Taxonomy: A Compression Perspective [69.99087941471882]
We study the problem of supporting multiple machine vision analytics tasks with the compressed visual representation.
By utilizing the intrinsic transferability among different tasks, our framework successfully constructs compact and expressive representations at low bit-rates.
In order to impose compactness in the representations, we propose a codebook-based hyperprior.
arXiv Detail & Related papers (2021-06-16T01:44:32Z)