AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2503.10720v1
- Date: Thu, 13 Mar 2025 08:22:28 GMT
- Title: AttentionRAG: Attention-Guided Context Pruning in Retrieval-Augmented Generation
- Authors: Yixiong Fang, Tianran Sun, Yuling Shi, Xiaodong Gu,
- Abstract summary: We propose AttentionRAG, an attention-guided context pruning method for RAG systems.<n>The core idea of AttentionRAG lies in its attention focus mechanism, which reformulates RAG queries into a next-token prediction paradigm.<n>Experiments on LongBench and Babilong benchmarks show that AttentionRAG achieves up to 6.3$times$ context compression while outperforming LLMLingua methods by around 10% in key metrics.
- Score: 3.735352744532723
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: While RAG demonstrates remarkable capabilities in LLM applications, its effectiveness is hindered by the ever-increasing length of retrieved contexts, which introduces information redundancy and substantial computational overhead. Existing context pruning methods, such as LLMLingua, lack contextual awareness and offer limited flexibility in controlling compression rates, often resulting in either insufficient pruning or excessive information loss. In this paper, we propose AttentionRAG, an attention-guided context pruning method for RAG systems. The core idea of AttentionRAG lies in its attention focus mechanism, which reformulates RAG queries into a next-token prediction paradigm. This mechanism isolates the query's semantic focus to a single token, enabling precise and efficient attention calculation between queries and retrieved contexts. Extensive experiments on LongBench and Babilong benchmarks show that AttentionRAG achieves up to 6.3$\times$ context compression while outperforming LLMLingua methods by around 10\% in key metrics.
Related papers
- Hallucination Detection in LLMs via Topological Divergence on Attention Graphs [64.74977204942199]
Hallucination, i.e., generating factually incorrect content, remains a critical challenge for large language models.
We introduce TOHA, a TOpology-based HAllucination detector in the RAG setting.
arXiv Detail & Related papers (2025-04-14T10:06:27Z) - CausalRAG: Integrating Causal Graphs into Retrieval-Augmented Generation [11.265999775635823]
CausalRAG is a novel framework that incorporates causal graphs into the retrieval process.
By constructing and tracing causal relationships, CausalRAG preserves contextual continuity and improves retrieval precision.
Our findings suggest that grounding retrieval in causal reasoning provides a promising approach to knowledge-intensive tasks.
arXiv Detail & Related papers (2025-03-25T17:43:08Z) - DeepRAG: Thinking to Retrieval Step by Step for Large Language Models [92.87532210660456]
We propose DeepRAG, a framework that models retrieval-augmented reasoning as a Markov Decision Process (MDP)<n>By iteratively decomposing queries, DeepRAG dynamically determines whether to retrieve external knowledge or rely on parametric reasoning at each step.<n> Experiments show that DeepRAG improves retrieval efficiency while improving answer accuracy by 21.99%, demonstrating its effectiveness in optimizing retrieval-augmented reasoning.
arXiv Detail & Related papers (2025-02-03T08:22:45Z) - Chain-of-Retrieval Augmented Generation [72.06205327186069]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer.<n>Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
arXiv Detail & Related papers (2025-01-24T09:12:52Z) - Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks [11.053340674721005]
Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources.
This paper proposes an alternative paradigm, cache-augmented generation (CAG) that bypasses real-time retrieval.
arXiv Detail & Related papers (2024-12-20T06:58:32Z) - Context Awareness Gate For Retrieval Augmented Generation [2.749898166276854]
Retrieval Augmented Generation (RAG) has emerged as a widely adopted approach to mitigate the limitations of large language models (LLMs) in answering domain-specific questions.<n>Previous research has predominantly focused on improving the accuracy and quality of retrieved data chunks to enhance the overall performance of the generation pipeline.<n>We investigate the impact of retrieving irrelevant information in open-domain question answering, highlighting its significant detrimental effect on the quality of LLM outputs.
arXiv Detail & Related papers (2024-11-25T06:48:38Z) - PEAR: Position-Embedding-Agnostic Attention Re-weighting Enhances Retrieval-Augmented Generation with Zero Inference Overhead [24.611413814466978]
Large language models (LLMs) enhanced with retrieval-augmented generation (RAG) have introduced a new paradigm for web search.
Existing methods to enhance context awareness are often inefficient, incurring time or memory overhead during inference.
We propose Position-Embedding-Agnostic attention Re-weighting (PEAR) which enhances the context awareness of LLMs with zero inference overhead.
arXiv Detail & Related papers (2024-09-29T15:40:54Z) - Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting [68.90949377014742]
Speculative RAG is a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM.<n>Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts.<n>It notably enhances accuracy by up to 12.97% while reducing latency by 50.83% compared to conventional RAG systems on PubHealth.
arXiv Detail & Related papers (2024-07-11T06:50:19Z) - Compressing Long Context for Enhancing RAG with AMR-based Concept Distillation [17.156915103545728]
Large Language Models (LLMs) have made significant strides in information acquisition.
Retrieval Augmented Generation (RAG) addresses this limitation by incorporating external, non-parametric knowledge.
We propose a novel concept-based RAG framework with the Abstract Representation (AMR)-based concept distillation algorithm.
arXiv Detail & Related papers (2024-05-06T00:18:43Z) - Fortify the Shortest Stave in Attention: Enhancing Context Awareness of Large Language Models for Effective Tool Use [74.72150542395487]
An inherent waveform pattern in the attention allocation of large language models (LLMs) significantly affects their performance in tasks demanding a high degree of context awareness.
To address this issue, we propose a novel inference method named Attention Buckets.
arXiv Detail & Related papers (2023-12-07T17:24:51Z) - Self-RAG: Learning to Retrieve, Generate, and Critique through
Self-Reflection [74.51523859064802]
We introduce a new framework called Self-Reflective Retrieval-Augmented Generation (Self-RAG)
Self-RAG enhances an LM's quality and factuality through retrieval and self-reflection.
It significantly outperforms state-of-the-art LLMs and retrieval-augmented models on a diverse set of tasks.
arXiv Detail & Related papers (2023-10-17T18:18:32Z) - Generation-Augmented Retrieval for Open-domain Question Answering [134.27768711201202]
Generation-Augmented Retrieval (GAR) for answering open-domain questions.
We show that generating diverse contexts for a query is beneficial as fusing their results consistently yields better retrieval accuracy.
GAR achieves state-of-the-art performance on Natural Questions and TriviaQA datasets under the extractive QA setup when equipped with an extractive reader.
arXiv Detail & Related papers (2020-09-17T23:08:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.