RAGO: Systematic Performance Optimization for Retrieval-Augmented Generation Serving
- URL: http://arxiv.org/abs/2503.14649v2
- Date: Fri, 21 Mar 2025 17:51:53 GMT
- Authors: Wenqi Jiang, Suvinay Subramanian, Cat Graves, Gustavo Alonso, Amir Yazdanbakhsh, Vidushi Dadu,
- Abstract summary: Retrieval-augmented generation (RAG) is emerging as a popular approach for reliable LLM serving. RAGSchema is a structured abstraction that captures the wide range of RAG algorithms. RAGO is a system optimization framework for efficient RAG serving.
- Score: 9.962031642362813
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Retrieval-augmented generation (RAG), which combines large language models (LLMs) with retrievals from external knowledge databases, is emerging as a popular approach for reliable LLM serving. However, efficient RAG serving remains an open challenge due to the rapid emergence of many RAG variants and the substantial differences in workload characteristics across them. In this paper, we make three fundamental contributions to advancing RAG serving. First, we introduce RAGSchema, a structured abstraction that captures the wide range of RAG algorithms, serving as a foundation for performance optimization. Second, we analyze several representative RAG workloads with distinct RAGSchema, revealing significant performance variability across these workloads. Third, to address this variability and meet diverse performance requirements, we propose RAGO (Retrieval-Augmented Generation Optimizer), a system optimization framework for efficient RAG serving. Our evaluation shows that RAGO achieves up to a 2x increase in QPS per chip and a 55% reduction in time-to-first-token latency compared to RAG systems built on LLM-system extensions.
Related papers
- RAGRouter-Bench: A Dataset and Benchmark for Adaptive RAG Routing [37.7721677767453]
We introduce RAGRouter-Bench, the first dataset and benchmark designed for adaptive RAG routing. RAGRouter-Bench revisits retrieval from a query-corpus compatibility perspective and standardizes five representative RAG paradigms for systematic evaluation. Experiments with DeepSeek-V3 and LLaMA-3.1-8B demonstrate that no single RAG paradigm is universally optimal.
arXiv Detail & Related papers (2026-01-30T20:38:11Z)
- Predict the Retrieval! Test time adaptation for Retrieval Augmented Generation [66.36556189794526]
TTARAG is a test-time adaptation method that dynamically updates the language model's parameters during inference to improve RAG system performance in specialized domains. Our method introduces a simple yet effective approach where the model learns to predict retrieved content, enabling automatic parameter adjustment to the target domain.
arXiv Detail & Related papers (2026-01-16T17:07:01Z)
- RAGPulse: An Open-Source RAG Workload Trace to Optimize RAG Serving Systems [10.189392948536446]
This paper introduces RAGPulse, an open-source RAG workload trace dataset. This dataset was collected from a university-wide Q&A system serving more than 40,000 students and faculty members since April 2024. Our analysis reveals that real-world RAG workloads exhibit significant temporal locality and a highly skewed hot-document access pattern.
arXiv Detail & Related papers (2025-11-17T05:06:47Z)
- RAG-Stack: Co-Optimizing RAG Quality and Performance From the Vector Database Perspective [3.385836913732549]
Retrieval-augmented generation (RAG) has emerged as one of the most prominent applications of vector databases. We present RAG-Stack, a three-pillar blueprint for quality-performance co-optimization in RAG systems.
arXiv Detail & Related papers (2025-10-23T07:35:19Z)
- DecEx-RAG: Boosting Agentic Retrieval-Augmented Generation with Decision and Execution Optimization via Process Supervision [50.89715397781075]
Agentic Retrieval-Augmented Generation (Agentic RAG) enhances the processing capability for complex tasks. We propose DecEx-RAG, which models RAG as a Markov Decision Process (MDP) incorporating decision-making and execution. We show that DecEx-RAG achieves an average absolute performance improvement of 6.2% across six datasets.
arXiv Detail & Related papers (2025-10-07T08:49:22Z)
- REFRAG: Rethinking RAG based Decoding [67.4862300145604]
REFRAG is an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization.
arXiv Detail & Related papers (2025-09-01T03:31:44Z)
- Improving End-to-End Training of Retrieval-Augmented Generation Models via Joint Stochastic Approximation [9.493788719707835]
Retrieval-augmented generation (RAG) has become a widely recognized paradigm to combine parametric memory with non-parametric memories. A major challenge in end-to-end optimization of the RAG model is that marginalization over relevant passages is required. In this paper, we propose and develop joint stochastic approximation (JSA) based end-to-end training of RAG. The JSA algorithm is an extension of the EM (expectation-maximization) algorithm and is particularly powerful in estimating latent variable models.
arXiv Detail & Related papers (2025-08-25T16:17:16Z)
- PRGB Benchmark: A Robust Placeholder-Assisted Algorithm for Benchmarking Retrieval-Augmented Generation [15.230902967865925]
Retrieval-Augmented Generation (RAG) enhances large language models (LLMs) by integrating external knowledge. Current benchmarks emphasize broad aspects such as noise robustness, but lack a systematic and granular evaluation framework on document utilization. Our benchmark provides a reproducible framework for developing more reliable and efficient RAG systems.
arXiv Detail & Related papers (2025-07-23T16:14:08Z)
- LTRR: Learning To Rank Retrievers for LLMs [53.285436927963865]
We show that routing-based RAG systems can outperform the best single-retriever-based systems. Performance gains are especially pronounced in models trained with the Answer Correctness (AC) metric. As part of the SIGIR 2025 LiveRAG challenge, our submitted system demonstrated the practical viability of our approach.
arXiv Detail & Related papers (2025-06-16T17:53:18Z)
- Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization [64.33914369424494]
RoleRAG is a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. We introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state.
arXiv Detail & Related papers (2025-05-21T12:25:12Z)
- An Analysis of Hyper-Parameter Optimization Methods for Retrieval Augmented Generation [6.98773220458697]
We present a comprehensive study involving 5 HPO algorithms over 5 datasets from diverse domains. Our study explores the largest HPO search space considered to date, with three evaluation metrics as optimization targets. Analysis of the results shows that RAG HPO can be done efficiently, either greedily or with random search, and that it significantly boosts RAG performance for all datasets.
arXiv Detail & Related papers (2025-05-06T11:47:52Z)
- Direct Retrieval-augmented Optimization: Synergizing Knowledge Selection and Language Models [83.8639566087953]
We propose a direct retrieval-augmented optimization framework, named DRO, that enables end-to-end training of two key components. DRO alternates between two phases: (i) document permutation estimation and (ii) re-weighted, progressively improving RAG components. Our theoretical analysis reveals that DRO is analogous to policy-gradient methods in reinforcement learning.
arXiv Detail & Related papers (2025-05-05T23:54:53Z)
- OpenRAG: Optimizing RAG End-to-End via In-Context Retrieval Learning [13.181087031343619]
We introduce OpenRAG, a RAG framework that is optimized end-to-end by tuning the retriever to capture in-context relevance. Experiments across a wide range of tasks demonstrate that OpenRAG, by tuning a retriever end-to-end, leads to a consistent improvement of 4.0% over the original retriever.
arXiv Detail & Related papers (2025-03-11T13:04:05Z)
- RAG-Gym: Optimizing Reasoning and Search Agents with Process Supervision [43.50113345998687]
We introduce RAG-Gym, a unified optimization framework that enhances information-seeking agents through fine-grained process supervision at each search step. We also propose ReSearch, a novel agent architecture that synergizes answer reasoning and search query generation within the RAG-Gym framework.
arXiv Detail & Related papers (2025-02-19T18:56:03Z)
- Improving Retrieval-Augmented Generation through Multi-Agent Reinforcement Learning [51.54046200512198]
Retrieval-augmented generation (RAG) is extensively utilized to incorporate external, current knowledge into large language models. A standard RAG pipeline may comprise several components, such as query rewriting, document retrieval, document filtering, and answer generation. To overcome these challenges, we propose treating the RAG pipeline as a multi-agent cooperative task, with each component regarded as an RL agent.
arXiv Detail & Related papers (2025-01-25T14:24:50Z)
- Chain-of-Retrieval Augmented Generation [72.06205327186069]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer.
Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
- RAGServe: Fast Quality-Aware RAG Systems with Configuration Adaptation [9.50826652108988]
RAG (Retrieval Augmented Generation) allows large language models to generate better responses with external knowledge. This paper presents RAGServe, the first RAG system that jointly schedules queries and adapts the key RAG configurations of each query.
arXiv Detail & Related papers (2024-12-13T20:39:30Z)
- Toward Optimal Search and Retrieval for RAG [39.69494982983534]
Retrieval-augmented generation (RAG) is a promising method for addressing some of the memory-related challenges associated with Large Language Models (LLMs).
Here, we work towards the goal of understanding how retrievers can be optimized for RAG pipelines for common tasks such as Question Answering (QA).
arXiv Detail & Related papers (2024-11-11T22:06:51Z)
- RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards [78.74923079748521]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources. Current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. We propose a Differentiable Data Rewards (DDR) method, which trains RAG systems by aligning data preferences between different RAG modules.
arXiv Detail & Related papers (2024-10-17T12:53:29Z)
- EasyRAG: Efficient Retrieval-Augmented Generation Framework for Automated Network Operations [24.142649256624082]
This paper presents EasyRAG, a simple, lightweight, and efficient retrieval-augmented generation framework for automated network operations.
Our framework has three advantages. The first is accurate question answering.
The second is simple deployment. Our method primarily consists of BM25 retrieval and BGE-reranker reranking, requiring no fine-tuning of any models, occupying minimal VRAM, easy to deploy, and highly scalable.
The last one is efficient inference. We designed an efficient inference acceleration scheme for the entire coarse ranking, reranking, and generation process that significantly reduces the inference latency of RAG while maintaining a
arXiv Detail & Related papers (2024-10-14T09:17:43Z)
- RAG Foundry: A Framework for Enhancing LLMs for Retrieval Augmented Generation [8.377398103067508]
We introduce RAG Foundry, an open-source framework for augmenting large language models for RAG use cases.
RAG Foundry integrates data creation, training, inference and evaluation into a single workflow.
We demonstrate the framework's effectiveness by augmenting and fine-tuning Llama-3 and Phi-3 models with diverse RAG configurations.
arXiv Detail & Related papers (2024-08-05T15:16:24Z)
- Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting [68.90949377014742]
Speculative RAG is a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM. Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts. It notably enhances accuracy by up to 12.97% while reducing latency by 50.83% compared to conventional RAG systems on PubHealth.
arXiv Detail & Related papers (2024-07-11T06:50:19Z)
- FlashRAG: A Modular Toolkit for Efficient Retrieval-Augmented Generation Research [70.6584488911715]
Retrieval-augmented generation (RAG) has attracted considerable research attention. Existing RAG toolkits are often heavy and inflexible, failing to meet the customization needs of researchers. Our toolkit has implemented 16 advanced RAG methods and gathered and organized 38 benchmark datasets.
arXiv Detail & Related papers (2024-05-22T12:12:40Z)
- RAGGED: Towards Informed Design of Retrieval Augmented Generation Systems [51.171355532527365]
Retrieval-augmented generation (RAG) can significantly improve the performance of language models (LMs).
RAGGED is a framework for analyzing RAG configurations across various document-based question answering tasks.
arXiv Detail & Related papers (2024-03-14T02:26:31Z)