TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
- URL: http://arxiv.org/abs/2502.20969v1
- Date: Fri, 28 Feb 2025 11:32:22 GMT
- Title: TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval
- Authors: Chien-Yu Lin, Keisuke Kamahori, Yiyu Liu, Xiaoxiang Shi, Madhav Kashyap, Yile Gu, Rulin Shao, Zihao Ye, Kan Zhu, Stephanie Wang, Arvind Krishnamurthy, Rohan Kadekodi, Luis Ceze, Baris Kasikci
- Abstract summary: Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments. We propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements.
- Score: 10.268774281394261
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments, especially when limited GPU memory is available. To address these challenges, we propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements. The core innovation of TeleRAG is lookahead retrieval, a prefetching mechanism that anticipates required data and transfers it from CPU to GPU in parallel with LLM generation. By leveraging the modularity of RAG pipelines, the inverted file index (IVF) search algorithm and similarities between queries, TeleRAG optimally overlaps data movement and computation. Experimental results show that TeleRAG reduces end-to-end RAG inference latency by up to 1.72x on average compared to state-of-the-art systems, enabling faster, more memory-efficient deployments of advanced RAG applications.
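A minimal, self-contained sketch of the lookahead-retrieval idea described in the abstract, assuming the final query embedding stays close to a pre-generation draft; all names (`predict_clusters`, `gpu_cache`, etc.) are illustrative stand-ins rather than TeleRAG's actual API, and a plain dictionary plays the role of GPU memory:

```python
import threading
import numpy as np

# Sketch: while the LLM is still generating, prefetch the IVF clusters that a
# draft of the retrieval query already points to, so the actual vector search
# mostly hits (simulated) GPU memory instead of waiting on CPU->GPU transfers.

rng = np.random.default_rng(0)
DIM, N_CLUSTERS = 64, 32
centroids = rng.normal(size=(N_CLUSTERS, DIM))
cpu_store = {c: rng.normal(size=(100, DIM)) for c in range(N_CLUSTERS)}  # IVF lists on CPU
gpu_cache = {}  # stands in for GPU-resident cluster copies

def predict_clusters(qvec, nprobe=4):
    # Rank IVF centroids by inner-product similarity to the query embedding.
    return np.argsort(-(centroids @ qvec))[:nprobe]

def prefetch(cluster_ids):
    # "CPU -> GPU" copy of the predicted clusters, run in the background.
    for cid in cluster_ids:
        gpu_cache.setdefault(int(cid), cpu_store[int(cid)].copy())

def search_ivf(qvec, nprobe=4):
    hits = []
    for cid in predict_clusters(qvec, nprobe):
        vecs = gpu_cache.get(int(cid), cpu_store[int(cid)])  # prefetched => fast path
        best = int(np.argmax(vecs @ qvec))
        hits.append((int(cid), best, float(vecs[best] @ qvec)))
    return sorted(hits, key=lambda h: -h[2])

draft = rng.normal(size=DIM)                       # embedding of the pre-generation draft
worker = threading.Thread(target=prefetch, args=(predict_clusters(draft),))
worker.start()                                     # overlaps with LLM generation
final_query = draft + 0.1 * rng.normal(size=DIM)   # stand-in for the generated query
worker.join()
print(search_ivf(final_query)[:3])
```

Because the draft and final queries usually probe overlapping clusters (the "similarities between queries" the abstract leverages), most of the transfer cost is hidden behind generation; a real system would use CUDA streams rather than a Python thread.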
Related papers
- Efficient Distributed Retrieval-Augmented Generation for Enhancing Language Model Performance [34.695803671702606]
Small language models (SLMs) support efficient deployments on resource-constrained edge devices, but their limited capacity compromises inference performance.
Retrieval-augmented generation (RAG) is a promising solution to enhance model performance by integrating external databases, without requiring intensive on-device model retraining.
We propose DRAGON, a distributed RAG framework that enhances on-device SLMs with both general and personal knowledge without the risk of leaking private documents.
arXiv Detail & Related papers (2025-04-15T13:53:08Z)
- An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline [0.6445605125467574]
Retrieval Augmented Generation (RAG) systems enhance response quality by integrating Large Language Models (LLMs) with vector databases.
Existing optimizations for vector search and LLM serving have largely been developed in isolation.
This paper introduces VectorLiteRAG, an optimized vector index partitioning mechanism designed for RAG systems.
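The summary does not spell out the partitioning policy; one plausible reading, sketched below with illustrative names rather than VectorLiteRAG's actual algorithm, is a frequency-based split that keeps hot IVF clusters on the GPU under a memory budget:

```python
import numpy as np

def partition_clusters(hit_counts, cluster_bytes, gpu_budget_bytes):
    """Greedily place the most frequently probed IVF clusters on the GPU
    until the memory budget is exhausted; the cold tail stays on the CPU."""
    order = np.argsort(-np.asarray(hit_counts))  # hottest clusters first
    gpu_set, used = set(), 0.0
    for cid in order:
        if used + cluster_bytes[cid] <= gpu_budget_bytes:
            gpu_set.add(int(cid))
            used += cluster_bytes[cid]
    return gpu_set

hit_counts = [500, 20, 300, 5, 120]        # per-cluster probe frequency from a query log
cluster_bytes = [4e6, 6e6, 3e6, 8e6, 2e6]  # size of each IVF posting list
print(partition_clusters(hit_counts, cluster_bytes, gpu_budget_bytes=8e6))  # {0, 2}
```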
arXiv Detail & Related papers (2025-04-11T19:18:41Z)
- RGL: A Graph-Centric, Modular Framework for Efficient Retrieval-Augmented Generation on Graphs [58.10503898336799]
We introduce the RAG-on-Graphs Library (RGL), a modular framework that seamlessly integrates the complete RAG pipeline.
RGL addresses key challenges by supporting a variety of graph formats and integrating optimized implementations for essential components.
Our evaluations demonstrate that RGL not only accelerates the prototyping process but also enhances the performance and applicability of graph-based RAG systems.
arXiv Detail & Related papers (2025-03-25T03:21:48Z)
- Chain-of-Retrieval Augmented Generation [72.06205327186069]
This paper introduces an approach for training o1-like RAG models that retrieve and reason over relevant information step by step before generating the final answer. Our proposed method, CoRAG, allows the model to dynamically reformulate the query based on the evolving state.
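A schematic of that retrieve-and-reason loop, with `llm` and `retriever` as hypothetical callables rather than CoRAG's actual interfaces:

```python
def chain_of_retrieval(question, llm, retriever, max_hops=4):
    """Retrieve, reason over the partial evidence, and let the model
    reformulate the query before the next hop (schematic only)."""
    query, evidence = question, []
    for _ in range(max_hops):
        evidence.extend(retriever(query))
        step = llm(
            f"Question: {question}\nEvidence so far: {evidence}\n"
            "Reply DONE if answerable, otherwise propose a follow-up query."
        )
        if step.strip().startswith("DONE"):
            break
        query = step  # the query is reformulated from the evolving state
    return llm(f"Question: {question}\nEvidence: {evidence}\nFinal answer:")
```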
arXiv Detail & Related papers (2025-01-24T09:12:52Z)
- EdgeRAG: Online-Indexed RAG for Edge Devices [1.740992908651449]
We propose EdgeRAG, which addresses the memory constraint by pruning embeddings within clusters and generating embeddings on-demand during retrieval. Results on the BEIR suite show that EdgeRAG offers significant latency reduction over the baseline IVF index.
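The compute-for-memory trade described here can be sketched in a few lines; `embed` below is a stand-in for a real embedding model, and the pruning rule is illustrative rather than EdgeRAG's:

```python
import numpy as np

def embed(texts):
    # Stand-in embedding model: deterministic within a process run.
    seed = abs(hash(tuple(texts))) % 2**32
    return np.random.default_rng(seed).normal(size=(len(texts), 32))

class OnDemandCluster:
    """An IVF cluster that may drop its stored embeddings and re-embed
    its documents at query time, trading compute for memory."""

    def __init__(self, docs, keep_embeddings):
        self.docs = docs
        self.embs = embed(docs) if keep_embeddings else None  # pruned if False

    def search(self, qvec, k=2):
        embs = self.embs if self.embs is not None else embed(self.docs)  # on-demand
        top = np.argsort(-(embs @ qvec))[:k]
        return [self.docs[i] for i in top]

cluster = OnDemandCluster(["doc a", "doc b", "doc c"], keep_embeddings=False)
print(cluster.search(np.random.default_rng(1).normal(size=32)))
```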
arXiv Detail & Related papers (2024-12-30T15:46:53Z)
- Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks [11.053340674721005]
Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources. This paper proposes an alternative paradigm, cache-augmented generation (CAG), that bypasses real-time retrieval.
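In code, CAG reduces to encoding the corpus into the model's KV cache once and decoding queries against it; a minimal sketch using the Hugging Face transformers API (the model choice and prompt format are illustrative, and recent library versions may require copying the cache object per query):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")  # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

# 1) Preload the knowledge into the KV cache once: the "cache-augmented" step.
corpus = "Background: TeleRAG prefetches IVF clusters while the LLM generates."
with torch.no_grad():
    kv = model(**tok(corpus, return_tensors="pt"), use_cache=True).past_key_values

# 2) Answer against the cached context -- no retrieval step at query time.
ids = tok("\nQ: What does TeleRAG prefetch?\nA:", return_tensors="pt").input_ids
past, answer = kv, []
with torch.no_grad():
    for _ in range(20):  # greedy decoding, reusing the preloaded cache
        out = model(input_ids=ids, past_key_values=past, use_cache=True)
        past, ids = out.past_key_values, out.logits[:, -1:].argmax(-1)
        answer.append(ids.item())
print(tok.decode(answer))
```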
arXiv Detail & Related papers (2024-12-20T06:58:32Z)
- Accelerating Retrieval-Augmented Generation [15.179354005559338]
Retrieval-Augmented Generation (RAG) involves augmenting large language models with information retrieved from an external knowledge source, such as the web. The paper introduces IKS, a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators.
arXiv Detail & Related papers (2024-12-14T06:47:56Z)
- RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards [78.74923079748521]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs).
Current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge.
We propose a Differentiable Data Rewards (DDR) method, which trains RAG systems by aligning data preferences between different RAG modules.
arXiv Detail & Related papers (2024-10-17T12:53:29Z)
- MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation [60.04380907045708]
Retrieval-Augmented Generation (RAG) is considered a promising strategy for long-context processing.
We propose MemoRAG, a novel RAG framework empowered by global memory-augmented retrieval.
MemoRAG achieves superior performances across a variety of long-context evaluation tasks.
arXiv Detail & Related papers (2024-09-09T13:20:31Z)
- RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation [54.707460684650584]
Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention, yet still face challenges such as hallucinations and outdated knowledge.
Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval Augmented Generation (RAG).
RAGLAB is a modular and research-oriented open-source library that reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms.
arXiv Detail & Related papers (2024-08-21T07:20:48Z)
- PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design [16.76965926088238]
PipeRAG is a novel algorithm-system co-design approach to reduce generation latency and enhance generation quality.
Our evaluation shows that PipeRAG achieves up to a 2.6x speedup in end-to-end generation latency while improving generation quality.
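One way to picture the co-design is a two-stage pipeline that launches retrieval for the next decoding window while the current one generates, accepting slightly stale queries; `llm_decode_window` and `retrieve` below are hypothetical stand-ins, not PipeRAG's interfaces:

```python
from concurrent.futures import ThreadPoolExecutor

def piperag_generate(prompt, llm_decode_window, retrieve, windows=4):
    """Overlap retrieval for window i+1 with decoding of window i.
    The prefetch query is slightly stale: it omits the window being decoded."""
    text = prompt
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(retrieve, text)      # prefetch for the first window
        for _ in range(windows):
            docs = pending.result()                # evidence for this window
            pending = pool.submit(retrieve, text)  # overlap: next window's retrieval
            text += llm_decode_window(text, docs)  # decode while retrieval runs
    return text                                    # last prefetch is simply discarded

out = piperag_generate(
    "Q: why overlap retrieval and decoding? ",
    llm_decode_window=lambda text, docs: f"[decoded with {len(docs)} docs] ",
    retrieve=lambda query: [query[-16:]],
)
print(out)
```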
arXiv Detail & Related papers (2024-03-08T21:09:20Z)
- Communication-Efficient Graph Neural Networks with Probabilistic Neighborhood Expansion Analysis and Caching [59.8522166385372]
Training and inference with graph neural networks (GNNs) on massive graphs has been actively studied since the inception of GNNs.
This paper is concerned with minibatch training and inference with GNNs that employ node-wise sampling in distributed settings.
We present SALIENT++, which extends the prior state-of-the-art SALIENT system to work with partitioned feature data.
arXiv Detail & Related papers (2023-05-04T21:04:01Z)
- Real-Time GPU-Accelerated Machine Learning Based Multiuser Detection for 5G and Beyond [70.81551587109833]
Nonlinear beamforming filters can significantly outperform linear approaches in stationary scenarios with massive connectivity.
One of the main challenges comes from the real-time implementation of these algorithms.
This paper explores the acceleration of APSM-based algorithms through massive parallelization.
arXiv Detail & Related papers (2022-01-13T15:20:45Z)
This list is automatically generated from the titles and abstracts of the papers on this site.
This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences of its use.