PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System
Co-design
- URL: http://arxiv.org/abs/2403.05676v1
- Date: Fri, 8 Mar 2024 21:09:20 GMT
- Title: PipeRAG: Fast Retrieval-Augmented Generation via Algorithm-System
Co-design
- Authors: Wenqi Jiang, Shuai Zhang, Boran Han, Jie Wang, Bernie Wang, Tim Kraska
- Abstract summary: PipeRAG is a novel algorithm-system co-design approach to reduce generation latency and enhance generation quality.
Our evaluation shows that PipeRAG achieves up to 2.6$\times$ speedup in end-to-end generation latency while improving generation quality.
- Score: 16.76965926088238
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Retrieval-augmented generation (RAG) can enhance the generation quality of
large language models (LLMs) by incorporating external token databases.
However, retrievals from large databases can constitute a substantial portion
of the overall generation time, particularly when retrievals are periodically
performed to align the retrieved content with the latest states of generation.
In this paper, we introduce PipeRAG, a novel algorithm-system co-design
approach to reduce generation latency and enhance generation quality. PipeRAG
integrates (1) pipeline parallelism to enable concurrent retrieval and
generation processes, (2) flexible retrieval intervals to maximize the
efficiency of pipeline parallelism, and (3) a performance model to
automatically balance retrieval quality and latency based on the generation
states and underlying hardware. Our evaluation shows that, by combining the
three aforementioned methods, PipeRAG achieves up to 2.6$\times$ speedup in
end-to-end generation latency while improving generation quality. These
promising results showcase the effectiveness of co-designing algorithms with
underlying systems, paving the way for the adoption of PipeRAG in future RAG
systems.
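The abstract's core idea, overlapping retrieval with token generation so the decoder never stalls on a database lookup, can be sketched in a few lines. The sketch below is an illustration only, not PipeRAG's implementation: `retrieve`, `generate_token`, the simulated timings, and the fixed `interval` are all placeholder assumptions.

```python
import queue
import threading
import time

# Placeholder stand-ins for a real vector-database lookup and an LLM
# decoding step; names, signatures, and timings are assumptions for
# illustration, not PipeRAG's actual API.
def retrieve(query_tokens):
    time.sleep(0.05)  # simulated database search latency
    return f"docs-for-{query_tokens[-1]}"

def generate_token(step, context):
    time.sleep(0.01)  # simulated per-token decoding latency
    return f"tok{step}"

def _prefetch(snapshot, out_q):
    # Runs on a background thread so retrieval overlaps decoding.
    out_q.put(retrieve(snapshot))

def pipelined_generation(n_tokens, interval=4):
    """Overlap retrieval with decoding instead of stalling the decoder.

    Every `interval` tokens a retrieval is launched on a background
    thread using a snapshot of the tokens generated so far (a slightly
    stale query), while the main loop keeps emitting tokens and picks
    up the retrieved context whenever it becomes available.
    """
    tokens, context, pending = ["<s>"], None, None
    for step in range(n_tokens):
        # Consume a finished retrieval without blocking the decoder.
        if pending is not None and not pending[1].empty():
            context = pending[1].get()
            pending = None
        # Launch the next retrieval concurrently with decoding.
        if step % interval == 0 and pending is None:
            q = queue.Queue(maxsize=1)
            t = threading.Thread(target=_prefetch,
                                 args=(list(tokens), q), daemon=True)
            t.start()
            pending = (t, q)
        tokens.append(generate_token(step, context))
    return tokens

tokens = pipelined_generation(8)
# tokens holds the start symbol plus 8 generated tokens (9 entries).
```

In the paper's framing, a performance model would tune the retrieval interval and search cost at run time based on generation state and hardware; here `interval` is a fixed constant for brevity.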
Related papers
- StreamRL: Scalable, Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation [55.75008325187133]
Reinforcement learning (RL) has become the core post-training technique for large language models (LLMs).
StreamRL is designed with disaggregation from first principles to address two types of performance bottlenecks.
Experiments show that StreamRL improves throughput by up to 2.66x compared to existing state-of-the-art systems.
arXiv Detail & Related papers (2025-04-22T14:19:06Z) - HyperRAG: Enhancing Quality-Efficiency Tradeoffs in Retrieval-Augmented Generation with Reranker KV-Cache Reuse [7.521340060861743]
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing the performance of large language models (LLMs).
We propose HyperRAG, a system that optimizes the trade-off between quality and efficiency in RAG pipelines by leveraging KV-cache reuse for efficient reranker inference.
We show that HyperRAG achieves a 2-3x throughput improvement with decoder-only rerankers while also delivering higher downstream performance than a traditional RAG service.
arXiv Detail & Related papers (2025-04-03T17:08:42Z) - RGL: A Graph-Centric, Modular Framework for Efficient Retrieval-Augmented Generation on Graphs [58.10503898336799]
We introduce the RAG-on-Graphs Library (RGL), a modular framework that seamlessly integrates the complete RAG pipeline.
RGL addresses key challenges by supporting a variety of graph formats and integrating optimized implementations for essential components.
Our evaluations demonstrate that RGL not only accelerates the prototyping process but also enhances the performance and applicability of graph-based RAG systems.
arXiv Detail & Related papers (2025-03-25T03:21:48Z) - TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval [10.268774281394261]
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage.
Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments.
We propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements.
arXiv Detail & Related papers (2025-02-28T11:32:22Z) - FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method can achieve 1.45 - 9.39x speedup compared to baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z) - ChronoGAN: Supervised and Embedded Generative Adversarial Networks for Time Series Generation [0.9374652839580181]
We introduce a robust framework aimed at addressing and mitigating common issues in time series generation.
This framework integrates the benefits of an Autoencoder-generated embedding space with the adversarial training dynamics of GANs.
We introduce an early generation algorithm and an improved neural network architecture to enhance stability and ensure effective generalization across both short and long time series.
arXiv Detail & Related papers (2024-09-21T04:51:35Z) - Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection [28.15184715270483]
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility.
We propose a novel paradigm named Sparse RAG, which seeks to cut costs through sparsity.
Sparse RAG encodes retrieved documents in parallel, which eliminates latency introduced by long-range attention of retrieved documents.
arXiv Detail & Related papers (2024-05-25T11:10:04Z) - ALTO: An Efficient Network Orchestrator for Compound AI Systems [20.880866765513066]
ALTO is a network orchestrator for efficiently serving compound AI systems such as pipelines of language models.
As language models produce outputs token by token, ALTO exposes opportunities to stream intermediate outputs between stages when possible.
We highlight two new challenges of correctness and load balancing which emerge when streaming intermediate data across distributed pipeline stage instances.
arXiv Detail & Related papers (2024-03-07T08:30:26Z) - End-to-End Latency Optimization of Multi-view 3D Reconstruction for
Disaster Response [3.471012855429593]
Multi-view Stereo (MVS) based 3D reconstruction applications are exceedingly time consuming, especially when run on computationally constrained mobile edge devices.
In this paper, we aim to design a latency optimized MVS algorithm pipeline, with the objective to best balance the end-to-end latency and reconstruction quality.
arXiv Detail & Related papers (2023-04-04T03:04:44Z) - FormerTime: Hierarchical Multi-Scale Representations for Multivariate
Time Series Classification [53.55504611255664]
FormerTime is a hierarchical representation model for improving the classification capacity for the multivariate time series classification task.
It exhibits three aspects of merits: (1) learning hierarchical multi-scale representations from time series data, (2) inheriting the strength of both transformers and convolutional networks, and (3) tackling the efficiency challenges incurred by the self-attention mechanism.
arXiv Detail & Related papers (2023-02-20T07:46:14Z) - Towards Generating Real-World Time Series Data [52.51620668470388]
We propose a novel generative framework for time series data generation - RTSGAN.
RTSGAN learns an encoder-decoder module which provides a mapping between a time series instance and a fixed-dimension latent vector.
To generate time series with missing values, we further equip RTSGAN with an observation embedding layer and a decide-and-generate decoder.
arXiv Detail & Related papers (2021-11-16T11:31:37Z) - Deep Cellular Recurrent Network for Efficient Analysis of Time-Series
Data with Spatial Information [52.635997570873194]
This work proposes a novel deep cellular recurrent neural network (DCRNN) architecture to process complex multi-dimensional time series data with spatial information.
The proposed architecture achieves state-of-the-art performance while utilizing substantially less trainable parameters when compared to comparable methods in the literature.
arXiv Detail & Related papers (2021-01-12T20:08:18Z) - Phase Retrieval using Expectation Consistent Signal Recovery Algorithm
based on Hypernetwork [73.94896986868146]
Phase retrieval is an important component in modern computational imaging systems.
Recent advances in deep learning have opened up a new possibility for robust and fast PR.
We develop a novel framework for deep unfolding to overcome the existing limitations.
arXiv Detail & Related papers (2021-01-12T08:36:23Z) - Hybrid Backpropagation Parallel Reservoir Networks [8.944918753413827]
We propose a novel hybrid network, which combines the effectiveness of learning random temporal features of reservoirs with the readout power of a deep neural network with batch normalization.
We demonstrate that our new network outperforms LSTMs and GRUs, including multi-layer "deep" versions of these networks.
We also show that the inclusion of a novel meta-ring structure, which we call HBP-ESN M-Ring, achieves similar performance to one large reservoir while decreasing the memory required by an order of magnitude.
arXiv Detail & Related papers (2020-10-27T21:03:35Z) - Recent Developments Combining Ensemble Smoother and Deep Generative
Networks for Facies History Matching [58.720142291102135]
This research project focuses on the use of autoencoders networks to construct a continuous parameterization for facies models.
We benchmark seven different formulations, including VAE, generative adversarial network (GAN), Wasserstein GAN, variational auto-encoding GAN, principal component analysis (PCA) with cycle GAN, PCA with transfer style network and VAE with style loss.
arXiv Detail & Related papers (2020-05-08T21:32:42Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.