Dynamic Context Compression for Efficient RAG
- URL: http://arxiv.org/abs/2507.22931v2
- Date: Thu, 28 Aug 2025 16:42:39 GMT
- Title: Dynamic Context Compression for Efficient RAG
- Authors: Shuyu Guo, Zhaochun Ren,
- Abstract summary: Retrieval-augmented generation incurs significant inference costs due to lengthy retrieved contexts. Existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity.
- Score: 23.75730930953087
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge but incurs significant inference costs due to lengthy retrieved contexts. While context compression mitigates this issue, existing methods apply fixed compression rates, over-compressing simple queries or under-compressing complex ones. We propose Adaptive Context Compression for RAG (ACC-RAG), a framework that dynamically adjusts compression rates based on input complexity, optimizing inference efficiency without sacrificing accuracy. ACC-RAG combines a hierarchical compressor (for multi-granular embeddings) with a context selector to retain minimal sufficient information, akin to human skimming. Evaluated on Wikipedia and five QA datasets, ACC-RAG outperforms fixed-rate methods and achieves over 4 times faster inference than standard RAG while maintaining or improving accuracy.
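The adaptive-rate idea in the abstract can be sketched as follows. The complexity heuristic, rate table, and token-striding compressor below are illustrative stand-ins only; the paper's actual components are a learned hierarchical compressor and context selector.

```python
# Minimal sketch of adaptive context compression: pick a compression
# rate per query instead of a fixed one. All heuristics here are
# hypothetical, not ACC-RAG's learned modules.

def query_complexity(query: str) -> float:
    """Toy proxy for input complexity: longer, multi-clause queries
    score higher. ACC-RAG learns this signal; we only approximate it."""
    tokens = query.split()
    clauses = query.count(",") + query.count(" and ") + 1
    return min(1.0, len(tokens) / 30 + 0.1 * clauses)

def pick_compression_rate(query: str) -> int:
    """Map complexity to a rate: simple queries are compressed
    aggressively, complex ones conservatively."""
    c = query_complexity(query)
    if c < 0.3:
        return 16   # aggressive: 16x fewer context tokens
    if c < 0.6:
        return 8
    return 4        # conservative: keep more context

def compress_context(context_tokens: list[str], rate: int) -> list[str]:
    """Stand-in for the hierarchical compressor: keep every rate-th
    token. The real method produces multi-granular embeddings instead."""
    return context_tokens[::rate]

tokens = [f"tok{i}" for i in range(64)]
rate = pick_compression_rate("What year was X founded?")  # simple -> 16
compressed = compress_context(tokens, rate)               # 64 tokens -> 4
```

The point of the sketch is only the control flow: the rate is decided per input, so easy queries cost far fewer context tokens at decoding time.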
Related papers
- Arbitrary Ratio Feature Compression via Next Token Prediction [52.10426317889982]
The Arbitrary Ratio Feature Compression (ARFC) framework supports any compression ratio with a single model. ARFC is an auto-regressive model that performs compression via next-token prediction. A MoS module refines the compressed tokens by utilizing multiple compression results, and ERGC is integrated into the training process to preserve semantic and structural relationships during compression.
arXiv Detail & Related papers (2026-02-12T02:38:57Z)
- ATACompressor: Adaptive Task-Aware Compression for Efficient Long-Context Processing in LLMs [28.55805086141996]
We propose the Adaptive Task-Aware Compressor (ATACompressor), which adjusts compression based on the specific requirements of a task. ATACompressor employs a selective encoder that compresses only the task-relevant portions of long contexts, ensuring that essential information is preserved while unnecessary content is reduced. We evaluate ATACompressor on three QA datasets (HotpotQA, MSMARCO, and SQuAD), showing that it outperforms existing methods in terms of both compression efficiency and task performance.
arXiv Detail & Related papers (2026-02-03T07:53:29Z)
- Rethinking Soft Compression in Retrieval-Augmented Generation: A Query-Conditioned Selector Perspective [21.41673002861847]
Retrieval-Augmented Generation (RAG) effectively grounds Large Language Models (LLMs) with external knowledge. Recent research on soft context compression aims to address the resulting inference cost by encoding long documents into compact embeddings. We introduce SeleCom, a selector-based soft compression framework for RAG that redefines the encoder's role as a query-conditioned information selector.
arXiv Detail & Related papers (2026-01-25T09:06:24Z)
- Rethinking Autoregressive Models for Lossless Image Compression via Hierarchical Parallelism and Progressive Adaptation [75.58269386927076]
Autoregressive (AR) models are often dismissed as impractical for lossless image compression due to prohibitive computational cost. This work rethinks that paradigm, introducing a framework built on hierarchical parallelism and progressive adaptation. Experiments on diverse datasets (natural, satellite, medical) validate that the method achieves new state-of-the-art compression.
arXiv Detail & Related papers (2025-11-14T06:27:58Z)
- AttnComp: Attention-Guided Adaptive Context Compression for Retrieval-Augmented Generation [27.480791258325066]
We introduce AttnComp, an adaptive, efficient, and context-aware compression framework. AttnComp employs a Top-P compression algorithm to retain the minimal set of documents. In addition to compression, AttnComp estimates response confidence by assessing the overall relevance of the retrieved content.
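The Top-P step described here can be sketched as a nucleus-style cut over document relevance scores. In AttnComp those scores come from the generator's attention; this standalone sketch simply takes them as given inputs.

```python
# Hedged sketch of Top-P document selection: keep the smallest set of
# retrieved documents whose normalized relevance mass reaches p. The
# scores would be attention-derived in AttnComp; here they are assumed.

def top_p_select(scores: list[float], p: float = 0.75) -> list[int]:
    """Return indices of the minimal prefix of documents (sorted by
    score, descending) whose normalized cumulative score reaches p."""
    total = sum(scores)
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += scores[i] / total
        if mass >= p:
            break
    return kept

# Four retrieved documents with illustrative relevance scores.
kept = top_p_select([0.5, 0.1, 0.3, 0.1], p=0.75)  # keeps docs 0 and 2
```

Because the cut is on cumulative mass rather than a fixed count, an easy query with one dominant document keeps a single passage, while a diffuse query keeps several.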
arXiv Detail & Related papers (2025-09-22T08:18:50Z)
- REFRAG: Rethinking RAG based Decoding [67.4862300145604]
REFRAG is an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization.
arXiv Detail & Related papers (2025-09-01T03:31:44Z)
- CORE-RAG: Lossless Compression for Retrieval-Augmented LLMs via Reinforcement Learning [22.93037884068796]
Retrieval-Augmented Generation (RAG) has emerged as a promising approach to enhance the timeliness of knowledge updates and the factual accuracy of responses in large language models. Existing approaches to document compression tailored for RAG often degrade task performance. We propose CORE, a novel method for lossless context compression in RAG.
arXiv Detail & Related papers (2025-08-24T12:21:50Z)
- SARA: Selective and Adaptive Retrieval-augmented Generation with Context Compression [28.043964124611026]
We propose SARA, a unified RAG framework that balances local precision and global knowledge coverage under tight context budgets. SARA combines natural-language text snippets with semantic compression vectors to jointly enhance context efficiency and answer correctness.
arXiv Detail & Related papers (2025-07-08T03:29:09Z)
- KG-Infused RAG: Augmenting Corpus-Based RAG with External Knowledge Graphs [66.35046942874737]
KG-Infused RAG is a framework that integrates KGs into RAG systems to implement spreading activation. KG-Infused RAG retrieves KG facts, expands the query accordingly, and enhances generation by combining corpus passages with structured facts.
arXiv Detail & Related papers (2025-06-11T09:20:02Z)
- ECoRAG: Evidentiality-guided Compression for Long Context RAG [22.842546956145064]
We propose the Evidentiality-guided RAG (ECoRAG) framework. ECoRAG improves performance by compressing retrieved documents based on evidentiality. It is also highly cost-efficient, as it not only reduces latency but also minimizes token usage.
arXiv Detail & Related papers (2025-06-05T15:43:49Z)
- Sentinel: Attention Probing of Proxy Models for LLM Context Compression with an Understanding Perspective [29.50363211934763]
Retrieval-augmented generation (RAG) enhances large language models with external context, but retrieved passages are often lengthy, noisy, or exceed input limits. We propose Sentinel, a lightweight sentence-level compression framework that reframes context filtering as an attention-based understanding task.
arXiv Detail & Related papers (2025-05-29T09:24:12Z)
- MOOSComp: Improving Lightweight Long-Context Compressor via Mitigating Over-Smoothing and Incorporating Outlier Scores [5.893964327109089]
MOOSComp is a token-classification-based long-context compression method. We introduce outlier scores to preserve rare but critical tokens that are prone to being discarded in task-agnostic compression. Our method obtains a speedup of 3.3x at a 4x compression ratio on a resource-constrained mobile device.
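The outlier-score idea can be sketched as blending a token classifier's keep-probability with an outlier score, so rare but critical tokens survive even when the classifier alone would drop them. The weights and scores below are made-up illustrations, not MOOSComp's trained values.

```python
# Hedged sketch: token keeping via a blend of classifier probability
# and an outlier (rarity) score. All numbers are hypothetical.

def keep_tokens(tokens, keep_probs, outlier_scores,
                threshold=0.5, alpha=0.3):
    """Keep a token if its blended score clears the threshold.
    alpha weights the outlier score against the classifier output."""
    kept = []
    for tok, p, o in zip(tokens, keep_probs, outlier_scores):
        if (1 - alpha) * p + alpha * o >= threshold:
            kept.append(tok)
    return kept

# The classifier alone (threshold 0.5) would drop every token here;
# high outlier scores rescue the rare entity and the year.
tokens = ["the", "Westphalia", "treaty", "1648"]
keep_probs = [0.2, 0.4, 0.3, 0.45]  # classifier keep-probabilities
outliers = [0.0, 0.9, 0.1, 0.8]     # rarity-based outlier scores
print(keep_tokens(tokens, keep_probs, outliers))  # ['Westphalia', '1648']
```

The blend is what matters: a task-agnostic classifier tends to under-weight rare tokens, and the additive outlier term counteracts exactly that bias.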
arXiv Detail & Related papers (2025-04-23T15:02:53Z)
- Long Context In-Context Compression by Getting to the Gist of Gisting [50.24627831994713]
We demonstrate that gisting struggles with longer contexts, with significant performance drops even at minimal compression rates. GistPool is an in-context compression method that requires no architectural modification to the decoder transformer. It preserves the simplicity of gisting while significantly boosting its performance on long context compression tasks.
arXiv Detail & Related papers (2025-04-11T19:23:31Z)
- Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control [52.405085773954596]
Retrieval-Augmented Generation has emerged as a powerful approach to mitigating large language model hallucinations. Existing RAG frameworks often apply retrieval indiscriminately, leading to inefficiencies such as over-retrieving. We introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off.
arXiv Detail & Related papers (2025-02-17T18:56:20Z)
- EXIT: Context-Aware Extractive Compression for Enhancing Retrieval-Augmented Generation [8.757777529568383]
Current RAG systems often struggle when retrieval models fail to rank the most relevant documents. We introduce EXIT, an extractive context compression framework. Our evaluations show that EXIT consistently surpasses existing compression methods.
arXiv Detail & Related papers (2024-12-17T05:38:27Z)
- SFR-RAG: Towards Contextually Faithful LLMs [57.666165819196486]
Retrieval Augmented Generation (RAG) is a paradigm that integrates external contextual information with large language models (LLMs) to enhance factual accuracy and relevance.
We introduce SFR-RAG, a small LLM that is instruction-tuned with an emphasis on context-grounded generation and hallucination minimization.
We also present ConBench, a new evaluation framework compiling multiple popular and diverse RAG benchmarks.
arXiv Detail & Related papers (2024-09-16T01:08:18Z)
- In-Context Former: Lightning-fast Compressing Context for Large Language Model [48.831304302467004]
In this paper, we propose a new approach to compressing the long input contexts of Transformer-based large language models (LLMs).
We use the cross-attention mechanism and a small number of learnable digest tokens to condense information from the contextual word embeddings.
Experimental results indicate that our method requires only 1/32 of the floating-point operations of the baseline during compression and improves processing speed by 68 to 112 times.
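The digest-token mechanism can be illustrated with a bare-bones cross-attention step: m digest vectors attend over n context embeddings and come out as m condensed vectors. This plain-Python sketch omits the learned projections and multi-head structure of the actual model.

```python
# Hedged sketch of condensing context via cross-attention onto a small
# set of digest tokens. Vectors and dimensions are illustrative only.
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attend(digest, context):
    """Each digest token (query) attends over all context embeddings
    (keys/values) and becomes their attention-weighted average, so
    n context vectors are condensed into len(digest) vectors."""
    out = []
    for q in digest:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context))
                    for d in range(len(q))])
    return out

# 8 context embeddings (dim 2) condensed into 2 digest vectors.
context = [[math.sin(i), math.cos(i)] for i in range(8)]
digest = [[1.0, 0.0], [0.0, 1.0]]
compressed = cross_attend(digest, context)
```

The cost saving comes from the shape change: downstream attention runs over the handful of digest outputs instead of the full context, which is where the reported FLOP reduction would originate.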
arXiv Detail & Related papers (2024-06-19T15:14:55Z)
- Learning Accurate Performance Predictors for Ultrafast Automated Model Compression [86.22294249097203]
We propose an ultrafast automated model compression framework called SeerNet for flexible network deployment.
Our method achieves competitive accuracy-complexity trade-offs with significant reduction of the search cost.
arXiv Detail & Related papers (2023-04-13T10:52:49Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.