Related papers: RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse

URL: http://arxiv.org/abs/2511.03475v1
Date: Wed, 05 Nov 2025 13:59:01 GMT
Title: RAGBoost: Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse
Authors: Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, Luo Mai
Abstract summary: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse.
Score: 39.76548092849437
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with retrieved context but often suffers from downgraded prefill performance as modern applications demand longer and more complex inputs. Existing caching techniques either preserve accuracy with low cache reuse or improve reuse at the cost of degraded reasoning quality. We present RAGBoost, an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse. RAGBoost detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse, while lightweight contextual hints maintain reasoning fidelity. It integrates seamlessly with existing LLM inference engines and improves their prefill performance by 1.5-3X over state-of-the-art methods, while preserving or even enhancing reasoning accuracy across diverse RAG and agentic AI workloads. Our code is released at: https://github.com/Edinburgh-AgenticAI/RAGBoost.
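
The abstract's idea of ordering and de-duplicating overlapping retrieved items so that concurrent sessions share a longer cacheable prefix can be illustrated with a minimal sketch. This is not RAGBoost's actual algorithm: the function names, the frequency-based ordering heuristic, and the document IDs below are all assumptions made for illustration, and the "contextual hints" that the abstract says maintain reasoning fidelity after reordering are not modeled.

```python
from collections import Counter


def order_for_reuse(sessions):
    """Hypothetical helper: de-duplicate each session's retrieved doc IDs and
    place documents shared by more sessions first, so prompts built from them
    share a longer common prefix (which a prefix KV cache could reuse)."""
    # Count in how many sessions each document appears.
    freq = Counter(doc for docs in sessions for doc in set(docs))
    ordered = []
    for docs in sessions:
        unique = list(dict.fromkeys(docs))  # drop repeats, keep first occurrence
        # Most widely shared documents first; doc ID breaks ties deterministically.
        unique.sort(key=lambda d: (-freq[d], d))
        ordered.append(unique)
    return ordered


def shared_prefix_len(a, b):
    """Number of leading items that two ordered contexts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


# Two sessions whose retrieval results overlap on d1 and d2 (made-up IDs).
sessions = [["d3", "d1", "d2", "d1"], ["d2", "d4", "d1"]]
before = shared_prefix_len(sessions[0], sessions[1])
reordered = order_for_reuse(sessions)
after = shared_prefix_len(reordered[0], reordered[1])
print(before, after, reordered)
```

Reordering discards the retriever's relevance ranking, which is presumably why the abstract pairs reuse with hints that preserve reasoning quality; a fuller sketch would need to carry the original ranking along with each reordered context.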

Related papers

AMA: Adaptive Memory via Multi-Agent Collaboration [54.490349689939166]
We propose Adaptive Memory via Multi-Agent Collaboration (AMA), a novel framework that leverages coordinated agents to manage memory across multiple granularities. AMA significantly outperforms state-of-the-art baselines while reducing token consumption by approximately 80% compared to full-context methods.
arXiv Detail & Related papers (2026-01-28T08:09:49Z)
TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework [62.66056331998838]
TeaRAG is a token-efficient agentic RAG framework capable of compressing both retrieval content and reasoning steps. Our reward function evaluates the knowledge sufficiency by a knowledge matching mechanism, while penalizing excessive reasoning steps.
arXiv Detail & Related papers (2025-11-07T16:08:34Z)
REFRAG: Rethinking RAG based Decoding [67.4862300145604]
REFRAG is an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization.
arXiv Detail & Related papers (2025-09-01T03:31:44Z)
Accelerating Adaptive Retrieval Augmented Generation via Instruction-Driven Representation Reduction of Retrieval Overlaps [16.84310001807895]
This paper introduces a model-agnostic approach that can be applied to A-RAG methods. Specifically, we use cache access and parallel generation to speed up the prefilling and decoding stages respectively.
arXiv Detail & Related papers (2025-05-19T05:39:38Z)
Relevance Isn't All You Need: Scaling RAG Systems With Inference-Time Compute Via Multi-Criteria Reranking [0.0]
We show that in standard RAG pipelines, maximizing for context relevance alone can degrade downstream response quality. We introduce "RErank BEyond reLevance (REBEL)", which enables RAG systems to scale with inference-time compute.
arXiv Detail & Related papers (2025-03-14T00:19:39Z)
Leveraging Approximate Caching for Faster Retrieval-Augmented Generation [6.674782158041247]
We introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries. Instead of treating each query independently, Proximity reuses previously retrieved documents when similar queries appear. Our experiments demonstrate that Proximity with our LSH scheme and a realistically-skewed MedRAG workload reduces database calls by 77.2% while maintaining database recall and test accuracy.
arXiv Detail & Related papers (2025-03-07T15:54:04Z)
Fast or Better? Balancing Accuracy and Cost in Retrieval-Augmented Generation with Flexible User Control [52.405085773954596]
Retrieval-Augmented Generation has emerged as a powerful approach to mitigate large language model hallucinations. Existing RAG frameworks often apply retrieval indiscriminately, leading to inefficiencies such as over-retrieving. We introduce a novel user-controllable RAG framework that enables dynamic adjustment of the accuracy-cost trade-off.
arXiv Detail & Related papers (2025-02-17T18:56:20Z)
Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks [11.053340674721005]
Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. This paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval.
arXiv Detail & Related papers (2024-12-20T06:58:32Z)
MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation [60.04380907045708]
Retrieval-Augmented Generation (RAG) is considered a promising strategy to address this problem. We propose MemoRAG, a novel RAG framework empowered by global memory-augmented retrieval. MemoRAG achieves superior performance across a variety of long-context evaluation tasks.
arXiv Detail & Related papers (2024-09-09T13:20:31Z)
RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation [11.321659218769598]
Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks. RAGCache organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
arXiv Detail & Related papers (2024-04-18T18:32:30Z)
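
The "knowledge tree" mentioned in the RAGCache summary above can be pictured as a prefix tree keyed by the sequence of retrieved document IDs. The sketch below is only an illustration under that assumption: the class and method names are hypothetical, the cached states are placeholder strings rather than real key-value tensors, and the GPU/host memory hierarchy and eviction policy are not modeled.

```python
class KnowledgeTreeNode:
    """One node per document in a retrieved sequence (hypothetical structure)."""

    def __init__(self):
        self.children = {}        # doc ID -> child node
        self.cached_state = None  # placeholder for the cached intermediate state


class KnowledgeTree:
    """Prefix tree over ordered doc-ID sequences; a lookup returns the states
    for the longest already-cached prefix of a new request."""

    def __init__(self):
        self.root = KnowledgeTreeNode()

    def insert(self, doc_ids, states):
        # states[i] stands for the state computed after doc_ids[0..i].
        node = self.root
        for doc_id, state in zip(doc_ids, states):
            node = node.children.setdefault(doc_id, KnowledgeTreeNode())
            node.cached_state = state

    def longest_cached_prefix(self, doc_ids):
        node = self.root
        hits = []
        for doc_id in doc_ids:
            child = node.children.get(doc_id)
            if child is None or child.cached_state is None:
                break  # the remaining documents would need a fresh prefill
            hits.append(child.cached_state)
            node = child
        return hits


tree = KnowledgeTree()
tree.insert(["a", "b", "c"], ["kv_a", "kv_ab", "kv_abc"])
reused = tree.longest_cached_prefix(["a", "b", "d"])
missed = tree.longest_cached_prefix(["b", "a"])
print(reused, missed)
```

Because the tree is order-sensitive, a request for ["b", "a"] gets no reuse from a cached ["a", "b", ...] path, which illustrates why the order of retrieved documents affects cache hit rates.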

This list is automatically generated from the titles and abstracts of the papers on this site.