CompactPrompt: A Unified Pipeline for Prompt Data Compression in LLM Workflows
- URL: http://arxiv.org/abs/2510.18043v1
- Date: Mon, 20 Oct 2025 19:31:11 GMT
- Title: CompactPrompt: A Unified Pipeline for Prompt Data Compression in LLM Workflows
- Authors: Joong Ho Choi, Jiayang Zhao, Jeel Shah, Ritvika Sonawane, Vedant Singh, Avani Appalla, Will Flanagan, Filipe Condessa,
- Abstract summary: Large Language Models (LLMs) deliver powerful reasoning and generation capabilities but incur substantial run-time costs. We introduce CompactPrompt, an end-to-end pipeline that merges hard prompt compression with lightweight file-level data compression.
- Score: 0.9275065651255189
- License: http://creativecommons.org/licenses/by-sa/4.0/
- Abstract: Large Language Models (LLMs) deliver powerful reasoning and generation capabilities but incur substantial run-time costs when operating in agentic workflows that chain together lengthy prompts and process rich data streams. We introduce CompactPrompt, an end-to-end pipeline that merges hard prompt compression with lightweight file-level data compression. CompactPrompt first prunes low-information tokens from prompts using self-information scoring and dependency-based phrase grouping. In parallel, it applies n-gram abbreviation to recurrent textual patterns in attached documents and uniform quantization to numerical columns, yielding compact yet semantically faithful representations. Integrated into standard LLM agents, CompactPrompt reduces total token usage and inference cost by up to 60% on benchmark datasets such as TAT-QA and FinQA while preserving output quality (less than a 5% accuracy drop for Claude-3.5-Sonnet and GPT-4.1-Mini). CompactPrompt helps visualize real-time compression decisions and quantify cost-performance trade-offs, laying the groundwork for leaner generative AI pipelines.
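The three compression steps named in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: all function names are hypothetical, a smoothed unigram frequency estimate stands in for the language-model probabilities that self-information scoring would normally use, dependency-based phrase grouping is omitted, and the `@k` abbreviation codes and 8-bit default are illustrative choices only.

```python
import math
from collections import Counter


def self_information(tokens, reference_counts):
    """Score each token as -log2 p(token) under a smoothed unigram model.

    Assumption: unigram counts stand in for real language-model probabilities.
    """
    total = sum(reference_counts.values())
    vocab = len(reference_counts) + 1  # one extra slot for unseen tokens
    return [
        -math.log2((reference_counts.get(tok, 0) + 1) / (total + vocab))
        for tok in tokens
    ]


def prune_low_information(tokens, reference_counts, keep_ratio=0.5):
    """Keep the highest-scoring tokens, preserving their original order."""
    scores = self_information(tokens, reference_counts)
    n_keep = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    keep = set(ranked[:n_keep])
    return [tok for i, tok in enumerate(tokens) if i in keep]


def abbreviate_ngrams(words, n=2, min_count=2):
    """Replace word n-grams seen at least min_count times with short codes.

    Returns the abbreviated word list and a legend mapping code -> phrase.
    """
    counts = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    legend = {}
    for gram, count in counts.most_common():
        if count < min_count:
            break
        code = f"@{len(legend)}"
        out, i, hits = [], 0, 0
        while i < len(words):
            if tuple(words[i:i + n]) == gram:
                out.append(code)
                i += n
                hits += 1
            else:
                out.append(words[i])
                i += 1
        # Earlier replacements may have consumed this n-gram; re-check the hits.
        if hits >= min_count:
            words = out
            legend[code] = " ".join(gram)
    return words, legend


def quantize_uniform(values, bits=8):
    """Map floats to integer codes on an evenly spaced grid over [min, max]."""
    lo, hi = min(values), max(values)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels if hi > lo else 1.0
    codes = [round((v - lo) / step) for v in values]
    return codes, lo, step


def dequantize(codes, lo, step):
    """Recover approximate values; the error is at most about step / 2."""
    return [lo + c * step for c in codes]
```

In this sketch the legend returned by `abbreviate_ngrams` would have to travel with the compressed document so the model can expand the codes, and the `(lo, step)` pair plays the same role for a quantized numerical column.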
Related papers
- Detecting Overflow in Compressed Token Representations for Retrieval-Augmented Generation [49.48204107529758]
We define token overflow as a regime in which compressed representations no longer contain sufficient information to answer a given query. In this paper, we find that query-agnostic saturation statistics reliably separate compressed from uncompressed token representations. Lightweight probing classifiers over both query and context xRAG representations detect overflow with 0.72 AUC-ROC on average. These results advance from query-independent diagnostics to query-aware detectors, enabling low-cost pre-LLM gating to mitigate compression-induced errors.
arXiv Detail & Related papers (2026-02-12T18:15:08Z)
- Context Compression via Explicit Information Transmission [25.078241611630585]
Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches. We propose ComprExIT, a lightweight framework that formulates soft compression into a new paradigm.
arXiv Detail & Related papers (2026-02-03T17:44:12Z)
- Hierarchical Token Prepending: Enhancing Information Flow in Decoder-based LLM Embeddings [52.49524240846879]
We propose Hierarchical Token Prepending (HTP) to mitigate attention-level compression and readout-level over-squashing. HTP partitions the input into blocks and prepends block-level summary tokens to subsequent blocks, creating pathways for backward information flow. As a simple, architecture-agnostic method, HTP enhances both zero-shot and finetuned models, offering a scalable route to superior long-document embeddings.
arXiv Detail & Related papers (2025-11-18T19:37:40Z)
- Cmprsr: Abstractive Token-Level Question-Agnostic Prompt Compressor [36.57824786347272]
We present the first comprehensive LLM-as-a-compressor benchmark spanning 25 open- and closed-source models. We improve the performance of the best overall vanilla compressor with Textgrad-based compression meta-prompt optimization. We call the resulting model Cmprsr and demonstrate its superiority over both extractive and vanilla abstractive compression.
arXiv Detail & Related papers (2025-11-15T16:28:03Z)
- CompLLM: Compression for Long Context Q&A [47.90063873976842]
We introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%.
arXiv Detail & Related papers (2025-09-23T16:49:43Z)
- Task-agnostic Prompt Compression with Context-aware Sentence Embedding and Reward-guided Task Descriptor [16.830389144259584]
Task-agnostic Prompt Compression (TPC) is a novel framework that generalizes compression across tasks and domains without requiring input questions or templates. TPC generates a context-relevant task description using a task descriptor trained on a curated dataset of context and query pairs. We introduce 3 model sizes (Base, Large, and Huge), where the largest model outperforms the existing state-of-the-art methods on LongBench and ZeroSCROLLS benchmarks.
arXiv Detail & Related papers (2025-02-19T02:16:29Z)
- ICPC: In-context Prompt Compression with Faster Inference [0.0]
We propose ICPC (In-context Prompt Compression), a novel and scalable prompt compression method that adaptively reduces the prompt length. The key idea of ICPC is to calculate the probability of each word appearing in the prompt using encoders and calculate information carried by each word through the information function. Empirically, we demonstrate that ICPC can effectively compress long texts of different categories and thus achieve better performance and speed on different types of NLP tasks.
arXiv Detail & Related papers (2025-01-03T03:46:51Z)
- BRIEF: Bridging Retrieval and Inference for Multi-hop Reasoning via Compression [91.23933111083389]
Retrieval-augmented generation (RAG) can supplement large language models (LLMs) by integrating external knowledge. This paper presents BRIEF, a lightweight approach that performs query-aware multi-hop reasoning. Based on our synthetic data built entirely by open-source models, BRIEF generates more concise summaries.
arXiv Detail & Related papers (2024-10-20T04:24:16Z)
- Lightweight Correlation-Aware Table Compression [58.50312417249682]
Virtual is a framework that integrates seamlessly with existing open formats.
Experiments on data-gov datasets show that Virtual reduces file sizes by up to 40% compared to Apache Parquet.
arXiv Detail & Related papers (2024-10-17T22:28:07Z)
- LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression [43.048684907893104]
This paper focuses on task-agnostic prompt compression for better generalizability and efficiency.
We formulate prompt compression as a token classification problem to guarantee the faithfulness of the compressed prompt to the original one.
Our approach leads to lower latency by explicitly learning the compression objective with smaller models such as XLM-RoBERTa-large and mBERT.
arXiv Detail & Related papers (2024-03-19T17:59:56Z)
- RECOMP: Improving Retrieval-Augmented LMs with Compression and Selective Augmentation [61.53695868960846]
We propose compressing retrieved documents into textual summaries prior to in-context integration.
This not only reduces the computational costs but also relieves the burden of LMs to identify relevant information in long retrieved documents.
We show that our compressors trained for one LM can transfer to other LMs on the language modeling task and provide summaries largely faithful to the retrieved documents.
arXiv Detail & Related papers (2023-10-06T17:55:36Z)
This list is automatically generated from the titles and abstracts of the papers in this site.