HedraRAG: Coordinating LLM Generation and Database Retrieval in Heterogeneous RAG Serving
- URL: http://arxiv.org/abs/2507.09138v1
- Date: Sat, 12 Jul 2025 04:42:43 GMT
- Title: HedraRAG: Coordinating LLM Generation and Database Retrieval in Heterogeneous RAG Serving
- Authors: Zhengding Hu, Vibha Murthy, Zaifeng Pan, Wanlu Li, Xiaoyi Fang, Yufei Ding, Yuke Wang
- Abstract summary: HedraRAG is a runtime system built on a graph-based abstraction that exposes optimization opportunities across stage-level parallelism, intra-request similarity, and inter-request skewness. The resulting execution plans are mapped onto hybrid CPU-GPU pipelines to improve resource utilization and reduce latency.
- Score: 10.130938079844121
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: This paper addresses emerging system-level challenges in heterogeneous retrieval-augmented generation (RAG) serving, where complex multi-stage workflows and diverse request patterns complicate efficient execution. We present HedraRAG, a runtime system built on a graph-based abstraction that exposes optimization opportunities across stage-level parallelism, intra-request similarity, and inter-request skewness. These opportunities are realized through dynamic graph transformations, such as node splitting, reordering, edge addition, and dependency rewiring, applied to wavefronts of subgraphs spanning concurrent requests. The resulting execution plans are mapped onto hybrid CPU-GPU pipelines to improve resource utilization and reduce latency. Evaluations across a wide range of RAG workflows demonstrate speedups exceeding 1.5x and reaching up to 5x over existing frameworks, showcasing the effectiveness of coordinated generation and retrieval in serving environments.
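The node-splitting transformation and the wavefront notion mentioned in the abstract can be illustrated with a toy dependency graph. The following is a minimal sketch under stated assumptions, not HedraRAG's implementation: the function names `split_node` and `wavefronts`, the stage names such as `r1.retrieve`, and the `#i` suffix for sub-stages are all hypothetical, and a wavefront is taken here to mean simply the set of stages whose dependencies are already satisfied.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def split_node(deps, node, parts):
    """Return a new dependency map where `node` becomes a chain of sub-nodes.

    `deps` maps each stage to the set of stages it depends on.
    The sub-node names (`node#0`, `node#1`, ...) are illustrative only.
    """
    subs = [f"{node}#{i}" for i in range(parts)]
    new = {}
    for n, pre in deps.items():
        if n == node:
            continue
        # Stages that waited on `node` now wait on its last sub-node.
        new[n] = {subs[-1] if p == node else p for p in pre}
    new[subs[0]] = set(deps[node])  # first sub-node inherits the original deps
    for a, b in zip(subs, subs[1:]):
        new[b] = {a}  # each later sub-node follows the previous one
    return new


def wavefronts(deps):
    """Group stages into successive sets whose dependencies are all done."""
    ts = TopologicalSorter(deps)
    ts.prepare()
    fronts = []
    while ts.is_active():
        ready = sorted(ts.get_ready())
        fronts.append(ready)
        ts.done(*ready)
    return fronts


# Two concurrent requests, each a retrieve -> generate workflow.
deps = {
    "r1.retrieve": set(),
    "r1.generate": {"r1.retrieve"},
    "r2.retrieve": set(),
    "r2.generate": {"r2.retrieve"},
}
before = wavefronts(deps)
after = wavefronts(split_node(deps, "r1.retrieve", 2))
```

In this toy example, splitting the first request's retrieval into two sub-stages puts `r1.retrieve#1` and `r2.generate` in the same wavefront, so a retrieval sub-stage and a generation stage from different requests become candidates for concurrent execution. How a real system maps such stages to CPU and GPU resources is outside the scope of this sketch.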
Related papers
- HAWK: A Hierarchical Workflow Framework for Multi-Agent Collaboration [3.2588674134593942]
Multi-agent systems face persistent challenges in cross-platform interoperability, dynamic task scheduling, and efficient resource sharing. We propose Hierarchical Agent (Hawk), a modular framework comprising five layers (User, Operator, Agent, Resource) and supported by sixteen standardized interfaces. Hawk delivers an end-to-end pipeline covering task parsing, workflow orchestration, intelligent scheduling, resource invocation, and data synchronization.
arXiv Detail & Related papers (2025-07-05T15:03:53Z) - EraRAG: Efficient and Incremental Retrieval Augmented Generation for Growing Corpora [20.890240791042302]
Graph-based Retrieval-Augmented Generation (Graph-RAG) enhances large language models (LLMs) by structuring retrieval over an external corpus. We introduce EraRAG, a novel multi-layered Graph-RAG framework that supports efficient and scalable dynamic updates. Our method leverages hyperplane-based Locality-Sensitive Hashing (LSH) to partition and organize the original corpus into hierarchical graph structures.
arXiv Detail & Related papers (2025-06-26T03:01:33Z) - Single LLM, Multiple Roles: A Unified Retrieval-Augmented Generation Framework Using Role-Specific Token Optimization [64.33914369424494]
RoleRAG is a unified RAG framework that achieves efficient multi-task processing through role-specific token optimization. RoleRAG comprises six modules, each handling a specific sub-task within the RAG process. We introduce a query graph to represent the decomposition of the query, which can be dynamically resolved according to the decomposing state.
arXiv Detail & Related papers (2025-05-21T12:25:12Z) - RGL: A Graph-Centric, Modular Framework for Efficient Retrieval-Augmented Generation on Graphs [58.10503898336799]
We introduce the RAG-on-Graphs Library (RGL), a modular framework that seamlessly integrates the complete RAG pipeline. RGL addresses key challenges by supporting a variety of graph formats and integrating optimized implementations for essential components. Our evaluations demonstrate that RGL not only accelerates the prototyping process but also enhances the performance and applicability of graph-based RAG systems.
arXiv Detail & Related papers (2025-03-25T03:21:48Z) - TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval [10.268774281394261]
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments. We propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements.
arXiv Detail & Related papers (2025-02-28T11:32:22Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - Intelligent Hybrid Resource Allocation in MEC-assisted RAN Slicing Network [72.2456220035229]
We aim to maximize the SSR for heterogeneous service demands in the cooperative MEC-assisted RAN slicing system.
We propose a recurrent graph reinforcement learning (RGRL) algorithm to intelligently learn the optimal hybrid RA policy.
arXiv Detail & Related papers (2024-05-02T01:36:13Z) - PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System Co-design [16.76965926088238]
PipeRAG is a novel algorithm-system co-design approach to reduce generation latency and enhance generation quality.
Our evaluation shows that PipeRAG achieves up to 2.6x speedup in end-to-end generation latency while improving generation quality.
arXiv Detail & Related papers (2024-03-08T21:09:20Z) - T-GAE: Transferable Graph Autoencoder for Network Alignment [79.89704126746204]
T-GAE is a graph autoencoder framework that leverages transferability and stability of GNNs to achieve efficient network alignment without retraining.
Our experiments demonstrate that T-GAE outperforms the state-of-the-art optimization method and the best GNN approach by up to 38.7% and 50.8%, respectively.
arXiv Detail & Related papers (2023-10-05T02:58:29Z) - Gradient Coding with Dynamic Clustering for Straggler-Tolerant Distributed Learning [55.052517095437]
Gradient descent (GD) is widely employed to parallelize the learning task by distributing the dataset across multiple workers.
A significant performance bottleneck for the per-iteration completion time in distributed synchronous GD is straggling workers.
Coded distributed techniques have been introduced recently to mitigate stragglers and to speed up GD iterations by assigning redundant computations to workers.
We propose a novel dynamic GC scheme, which assigns redundant data to workers to acquire the flexibility to choose from among a set of possible codes depending on the past straggling behavior.
arXiv Detail & Related papers (2021-03-01T18:51:29Z) - Phase Retrieval using Expectation Consistent Signal Recovery Algorithm based on Hypernetwork [73.94896986868146]
Phase retrieval is an important component in modern computational imaging systems.
Recent advances in deep learning have opened up a new possibility for robust and fast PR.
We develop a novel framework for deep unfolding to overcome the existing limitations.
arXiv Detail & Related papers (2021-01-12T08:36:23Z)
This list is automatically generated from the titles and abstracts of the papers on this site.