Prompt Compression for Large Language Models: A Survey
- URL: http://arxiv.org/abs/2410.12388v2
- Date: Thu, 17 Oct 2024 04:09:09 GMT
- Title: Prompt Compression for Large Language Models: A Survey
- Authors: Zongqian Li, Yinhong Liu, Yixuan Su, Nigel Collier
- Abstract summary: This survey provides an overview of prompt compression techniques, categorized into hard prompt methods and soft prompt methods.
We also examine the downstream adaptations of various prompt compression techniques.
- Score: 31.578484271031908
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Leveraging large language models (LLMs) for complex natural language tasks typically requires long-form prompts to convey detailed requirements and information, which results in increased memory usage and inference costs. To mitigate these challenges, multiple efficient methods have been proposed, with prompt compression gaining significant research interest. This survey provides an overview of prompt compression techniques, categorized into hard prompt methods and soft prompt methods. First, the technical approaches of these methods are compared, followed by an exploration of various ways to understand their mechanisms, including the perspectives of attention optimization, Parameter-Efficient Fine-Tuning (PEFT), modality integration, and new synthetic language. We also examine the downstream adaptations of various prompt compression techniques. Finally, the limitations of current prompt compression methods are analyzed, and several future directions are outlined, such as optimizing the compression encoder, combining hard and soft prompt methods, and leveraging insights from multimodality.
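The survey's central split is between hard prompt methods, which output a shorter natural-language prompt (e.g., by dropping uninformative tokens), and soft prompt methods, which condense a long context into a few continuous vectors prepended to the model's input. The sketch below is a minimal, hypothetical illustration of these two interfaces rather than any specific published method: the token-importance heuristic is a toy stand-in for the perplexity/entropy scorers used by real hard-prompt compressors, and the soft-prompt module is untrained and shown only for its input/output shapes (assuming PyTorch is available).

```python
# Minimal sketch (not from the survey) contrasting the two families it categorizes.
# Assumptions: the importance score below is a toy frequency/length heuristic standing in
# for a real perplexity or entropy scorer, and SoftPromptCompressor is an untrained module
# shown only to illustrate the interface of soft prompt compression.
from collections import Counter
import torch
import torch.nn as nn

def hard_compress(prompt: str, keep_ratio: float = 0.5) -> str:
    """Hard prompt compression: drop low-information tokens, keep natural language."""
    tokens = prompt.split()
    counts = Counter(t.lower() for t in tokens)
    # Toy importance score: rarer and longer tokens are assumed more informative.
    scores = [(len(t) / counts[t.lower()], i) for i, t in enumerate(tokens)]
    k = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(i for _, i in sorted(scores, reverse=True)[:k])
    return " ".join(tokens[i] for i in keep)

class SoftPromptCompressor(nn.Module):
    """Soft prompt compression: map a long context to k continuous 'soft' tokens."""
    def __init__(self, d_model: int = 64, num_soft_tokens: int = 4):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_soft_tokens, d_model))
        self.attn = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)

    def forward(self, context_embeddings: torch.Tensor) -> torch.Tensor:
        # context_embeddings: (batch, seq_len, d_model) -> (batch, k, d_model)
        q = self.queries.unsqueeze(0).expand(context_embeddings.size(0), -1, -1)
        compressed, _ = self.attn(q, context_embeddings, context_embeddings)
        return compressed  # would be prepended to the LLM's input embeddings downstream

if __name__ == "__main__":
    print(hard_compress("Please carefully read the long document below and answer the question", 0.5))
    soft = SoftPromptCompressor()
    print(soft(torch.randn(1, 128, 64)).shape)  # torch.Size([1, 4, 64])
```

In practice, a compressed natural-language prompt can be sent to any black-box LLM, whereas soft tokens require access to the model's embedding layer, which is one reason the survey treats the two families separately.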
Related papers
- DAC: A Dynamic Attention-aware Approach for Task-Agnostic Prompt Compression [63.83422894663496]
We propose a dynamic attention-aware approach for task-agnostic prompt compression (DAC). This approach effectively integrates entropy and attention information, dynamically sensing entropy shifts during compression to achieve fine-grained prompt compression. Extensive experiments across various domains, including LongBench, GSM8K, and BBH, show that DAC consistently yields robust and substantial improvements.
arXiv Detail & Related papers (2025-07-16T06:16:06Z) - ONLY: One-Layer Intervention Sufficiently Mitigates Hallucinations in Large Vision-Language Models [67.75439511654078]
Large Vision-Language Models (LVLMs) have introduced a new paradigm for understanding and reasoning about image input through textual responses. They face the persistent challenge of hallucination, which introduces practical weaknesses and raises concerns about their reliable deployment in real-world applications. We propose ONLY, a training-free decoding approach that requires only a single query and a one-layer intervention during decoding, enabling efficient real-time deployment.
arXiv Detail & Related papers (2025-07-01T16:01:08Z) - Efficient Reasoning Models: A Survey [52.96232442322824]
This survey aims to provide a comprehensive overview of recent advances in efficient reasoning.
It categorizes existing works into three key directions: (1) shorter - compressing lengthy CoTs into concise yet effective reasoning chains; (2) smaller - developing compact language models with strong reasoning capabilities; and (3) faster - designing efficient decoding strategies to accelerate inference.
arXiv Detail & Related papers (2025-04-15T06:28:00Z) - Understanding and Improving Information Preservation in Prompt Compression for LLMs [10.912320980464571]
In information-intensive tasks, the prompt length can grow rapidly, leading to increased computational requirements, performance degradation, and induced biases from irrelevant or redundant information.
We propose a holistic evaluation framework that allows for in-depth analysis of prompt compression methods.
arXiv Detail & Related papers (2025-03-24T20:06:11Z) - Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor [16.830389144259584]
Task-agnostic Prompt Compression (TPC) is a novel framework that generalizes compression across tasks and domains without requiring input questions or templates.
TPC generates a context-relevant task description using a task descriptor trained on a curated dataset of context and query pairs.
We introduce 3 model sizes (Base, Large, and Huge), where the largest model outperforms the existing state-of-the-art methods on LongBench and ZeroSCROLLS benchmarks.
arXiv Detail & Related papers (2025-02-19T02:16:29Z) - BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression [91.23933111083389]
BRIEF (Bridging Retrieval and Inference through Evidence Fusion) is a lightweight approach that performs query-aware multi-hop reasoning.
Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
arXiv Detail & Related papers (2024-10-20T04:24:16Z) - Selection-p: Self-Supervised Task-Agnostic Prompt Compression for Faithfulness and Transferability [67.77534983324229]
In this paper, we investigate the ability of Large Language Models to develop a unified compression method that discretizes uninformative tokens.
Experiments show Selection-p achieves state-of-the-art performance across numerous classification tasks.
It exhibits superior transferability to different models compared to prior work.
arXiv Detail & Related papers (2024-10-15T17:05:25Z) - Perception Compressor: A training-free prompt compression method in long context scenarios [17.720102137585503]
Perception Compressor is a training-free prompt compression method for large language models.
It outperforms existing methods by a large margin, achieving state-of-the-art performance.
arXiv Detail & Related papers (2024-09-28T07:13:33Z) - Fundamental Limits of Prompt Compression: A Rate-Distortion Framework for Black-Box Language Models [21.025001473355996]
We formalize the problem of prompt compression for large language models (LLMs).
We present a framework to unify token-level prompt compression methods which create hard prompts for black-box models.
We show that there is a large gap between the performance of current prompt compression methods and the optimal strategy; a schematic rate-distortion objective in this spirit is sketched after this list.
arXiv Detail & Related papers (2024-07-22T09:40:13Z) - Adapting LLMs for Efficient Context Processing through Soft Prompt Compression [1.1550486371582305]
This article presents a framework that tailors large language models for streamlined context processing.
Our method, SoftPromptComp, combines natural language prompts with dynamically generated soft prompts to form a concise yet semantically rich representation of long contexts.
We show that the framework markedly reduces computational overhead and improves LLM performance across various benchmarks.
arXiv Detail & Related papers (2024-04-07T15:44:20Z) - Efficient Prompting Methods for Large Language Models: A Survey [50.171011917404485]
Prompting has become a mainstream paradigm for adapting large language models (LLMs) to specific natural language processing tasks.
However, this approach brings an additional computational burden from model inference, as well as human effort to guide and control the behavior of LLMs.
We present the basic concepts of prompting, review the advances for efficient prompting, and highlight future research directions.
arXiv Detail & Related papers (2024-04-01T12:19:08Z) - A Survey on Transformer Compression [84.18094368700379]
The Transformer plays a vital role in natural language processing (NLP) and computer vision (CV).
Model compression methods reduce the memory and computational cost of Transformer models.
This survey provides a comprehensive review of recent compression methods, with a specific focus on their application to Transformer-based models.
arXiv Detail & Related papers (2024-02-05T12:16:28Z) - LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models [22.06402870816756]
Large language models (LLMs) have been applied in various applications due to their astonishing capabilities.
This paper presents LLMLingua, a coarse-to-fine prompt compression method that involves a budget controller to maintain semantic integrity.
We show that the proposed approach yields state-of-the-art performance and allows for up to 20x compression with little performance loss.
arXiv Detail & Related papers (2023-10-09T14:10:21Z) - Learning Accurate Performance Predictors for Ultrafast Automated Model Compression [86.22294249097203]
We propose an ultrafast automated model compression framework called SeerNet for flexible network deployment.
Our method achieves competitive accuracy-complexity trade-offs with significant reduction of the search cost.
arXiv Detail & Related papers (2023-04-13T10:52:49Z) - Revisiting Offline Compression: Going Beyond Factorization-based Methods for Transformer Language Models [7.542276054279341]
Transformer language models achieve outstanding results in many natural language processing (NLP) tasks.
Their enormous size often makes them impractical on memory-constrained devices, requiring practitioners to compress them to smaller networks.
In this paper, we explore offline compression methods, meaning computationally-cheap approaches that do not require further fine-tuning of the compressed model.
arXiv Detail & Related papers (2023-02-08T13:36:06Z) - Learning End-to-End Lossy Image Compression: A Benchmark [90.35363142246806]
We first conduct a comprehensive literature survey of learned image compression methods.
We describe milestones in cutting-edge learned image-compression methods, review a broad range of existing works, and provide insights into their historical development routes.
By introducing a coarse-to-fine hyperprior model for entropy estimation and signal reconstruction, we achieve improved rate-distortion performance.
arXiv Detail & Related papers (2020-02-10T13:13:43Z)
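The rate-distortion entry above views prompt compression as a trade-off between how short the compressed prompt is (rate) and how much the black-box model's behavior changes (distortion). The following is a schematic objective written under that framing; the symbols and constraints are illustrative assumptions, not the paper's exact definitions.

```latex
% Schematic distortion-rate view of prompt compression (illustrative only).
% x: original prompt; \tilde{x}: compressed prompt produced by a compressor q;
% M(\cdot \mid p): the black-box LLM's output distribution given prompt p;
% d(\cdot, \cdot): a distortion measure, e.g., a divergence between outputs or a task loss.
\[
  D(R) \;=\; \min_{\substack{q(\tilde{x} \mid x)\,: \\ \mathbb{E}[\,|\tilde{x}|\,] \,\le\, R \cdot \mathbb{E}[\,|x|\,]}}
  \;\; \mathbb{E}_{x,\; \tilde{x} \sim q(\cdot \mid x)}
  \Big[ d\big( M(\cdot \mid x),\, M(\cdot \mid \tilde{x}) \big) \Big]
\]
% The "large gap" noted in that entry is the difference, at a fixed rate R, between the
% distortion achieved by existing compressors and this optimal frontier D(R).
```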
This list is automatically generated from the titles and abstracts of the papers on this site.
The site does not guarantee the quality of the listed information and is not responsible for any consequences arising from its use.