CompLLM: Compression for Long Context Q&A
- URL: http://arxiv.org/abs/2509.19228v1
- Date: Tue, 23 Sep 2025 16:49:43 GMT
- Title: CompLLM: Compression for Long Context Q&A
- Authors: Gabriele Berton, Jayakrishnan Unnikrishnan, Son Tran, Mubarak Shah,
- Abstract summary: We introduce CompLLM, a soft compression technique designed for practical deployment.<n>Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently.<n>Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%.
- Score: 47.90063873976842
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Large Language Models (LLMs) face significant computational challenges when processing long contexts due to the quadratic complexity of self-attention. While soft context compression methods, which map input text to smaller latent representations, have shown promise, their real-world adoption is limited. Existing techniques typically compress the context as a single unit, which leads to quadratic compression complexity and an inability to reuse computations across queries with overlapping contexts. In this work, we introduce CompLLM, a soft compression technique designed for practical deployment. Instead of processing the context holistically, CompLLM divides it into segments and compresses each one independently. This simple design choice yields three critical properties: efficiency, as the compression step scales linearly with the context length; scalability, enabling models trained on short sequences (e.g., 1k tokens) to generalize to contexts of 100k tokens; and reusability, allowing compressed segments to be cached and reused across different queries. Our experiments show that with a 2x compression rate, at high context lengths CompLLM speeds up Time To First Token (TTFT) by up to 4x and reduces the KV cache size by 50%. Furthermore, CompLLM achieves performance comparable to that obtained with the uncompressed context, and even surpasses it on very long sequences, demonstrating its effectiveness and practical utility.
Related papers
- Cognitive Chunking for Soft Prompts: Accelerating Compressor Learning via Block-wise Causal Masking [28.492055407384495]
Long contexts increase inference latency, as the computational cost of self-attention grows quadratically with sequence length.<n>Existing methods typically compress the entire context indiscriminately into a set of memory tokens.<n>We propose Parallelized Iterative Compression (PIC), which restricts the receptive field of memory tokens to sequential local chunks.
arXiv Detail & Related papers (2026-02-15T03:58:13Z) - Simple Context Compression: Mean-Pooling and Multi-Ratio Training [12.049015994907629]
We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture.<n>We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios.<n>Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios.
arXiv Detail & Related papers (2025-10-23T17:57:23Z) - Compressing Many-Shots in In-Context Learning [61.231471139896506]
We study an approach to improve the memory and computational efficiency of ICL inference by compressing the many-shot prompts.<n>We first show that existing prompt compression methods are ineffective for many-shot compression.<n>We propose MemCom, a layer-wise compression method.
arXiv Detail & Related papers (2025-10-17T16:57:42Z) - Autoencoding-Free Context Compression for LLMs via Contextual Semantic Anchors [43.02557489472655]
Current context compression methods rely on autoencoding tasks to train context-agnostic compression tokens to compress contextual semantics.<n>We propose Semantic-Anchor Compression (SAC), a novel method that shifts from autoencoding task based compression to an architecture that is equipped with this compression capability.<n>SAC consistently outperforms existing context compression methods across various compression ratios.
arXiv Detail & Related papers (2025-10-10T01:42:14Z) - UniGist: Towards General and Hardware-aligned Sequence-level Long Context Compression [86.33995240043936]
UniGist is a sequence-level long-context compression framework for large language models.<n>It efficiently preserves context information by replacing raw tokens with special compression tokens (gists) in a fine-grained manner.<n>Our scheme also supports flexible inference by allowing the actual removal of compressed tokens, resulting in real-time memory savings.
arXiv Detail & Related papers (2025-09-19T08:47:37Z) - Compactor: Calibrated Query-Agnostic KV Cache Compression with Approximate Leverage Scores [37.41699761967978]
KV cache is often the dominant resource bottleneck in real-world deployments.<n>We present Compactor, a parameter-free, query-agnostic KV compression strategy.<n>We show that Compactor achieves full KV performance on Longbench while reducing the KV memory burden by 63%, on average.
arXiv Detail & Related papers (2025-07-10T20:03:35Z) - R1-Compress: Long Chain-of-Thought Compression via Chunk Compression and Search [61.4807238517108]
Chain-of-Thought (CoT) reasoning enhances large language models (LLMs) by enabling step-by-step problem-solving.<n>CoT's extension to Long-CoT introduces substantial computational overhead due to increased token length.<n>We propose R1-Compress, a two-stage chunk-level compression framework that preserves both local information and coherence.
arXiv Detail & Related papers (2025-05-22T16:06:59Z) - KV-Distill: Nearly Lossless Learnable Context Compression for LLMs [37.0803484148612]
We introduce KV-Distill, a Transformer compression framework that distills long context KV caches into significantly shorter representations.<n> KV-Distill can be trained as a parameter-efficient adaptor for pretrained models.<n>It can be fine-tuned on domain-specific contexts to reduce lengths by up to 99% while preserving downstream performance.
arXiv Detail & Related papers (2025-03-13T13:15:28Z) - ChunkKV: Semantic-Preserving KV Cache Compression for Efficient Long-Context LLM Inference [61.412894960600205]
Large Language Models (LLMs) require significant GPU memory when processing long texts.<n>ChunkKV reimagines KV cache compression by treating semantic chunks as basic compression units.<n>Result: ChunkKV outperforms state-of-the-art methods by up to 8.7% in precision.
arXiv Detail & Related papers (2025-02-01T03:49:47Z) - Style-Compress: An LLM-Based Prompt Compression Framework Considering Task-Specific Styles [49.65811277223873]
Style-Compress is a lightweight framework that adapts a smaller language model to compress prompts for a larger model on a new task without additional training.
Our approach iteratively generates and selects effective compressed prompts as task-specific demonstrations through style variation and in-context learning.
Style-Compress outperforms two baseline compression models in four tasks: original prompt reconstruction, text summarization, multi-hop QA, and CoT reasoning.
arXiv Detail & Related papers (2024-10-17T21:35:49Z) - In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compress the long input contexts of Transformer-based large language models (LLMs)
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
arXiv Detail & Related papers (2024-06-19T15:14:55Z) - Long Context Compression with Activation Beacon [22.054232261437186]
Activation Beacon is a plug-in module for transformer-based LLMs.
It targets effective, efficient, and flexible compression of long contexts.
It achieves a 2x acceleration in inference time and an 8x reduction of memory costs for KV cache.
arXiv Detail & Related papers (2024-01-07T11:57:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.