An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline
- URL: http://arxiv.org/abs/2504.08930v1
- Date: Fri, 11 Apr 2025 19:18:41 GMT
- Title: An Adaptive Vector Index Partitioning Scheme for Low-Latency RAG Pipeline
- Authors: Junkyum Kim, Divya Mahajan
- Abstract summary: Retrieval Augmented Generation (RAG) systems enhance response quality by integrating Large Language Models (LLMs) with vector databases. Existing optimizations for vector search and LLM serving have largely been developed in isolation. This paper introduces VectorLiteRAG, an optimized vector index partitioning mechanism designed for RAG systems.
- Score: 0.6445605125467574
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Retrieval Augmented Generation (RAG) systems enhance response quality by integrating Large Language Models (LLMs) with vector databases, enabling external knowledge retrieval to support language model reasoning. While RAG enables efficient question answering with smaller LLMs, existing optimizations for vector search and LLM serving have largely been developed in isolation. As a result, their integration often leads to suboptimal end-to-end performance. ... This paper introduces VectorLiteRAG, an optimized vector index partitioning mechanism designed for RAG systems that enhances responsiveness by jointly optimizing vector search and LLM serving across the CPU and GPU. A key challenge is determining which indices, and how much of the vector index, should reside on the GPU, and adjusting LLM batch sizes to balance the pipeline for lower Time-To-First-Token (TTFT) while meeting user-defined Service-Level Objectives (SLOs). To address this, we leverage the insight that cluster access in vector databases exhibits access skew, where a subset of clusters is queried significantly more frequently than others. VectorLiteRAG exploits this property through an optimized memory distribution strategy, dynamically allocating the minimum number of vector indices corresponding to frequently accessed clusters onto the GPU HBM to ensure a balanced pipeline with the LLM for high responsiveness. This adaptive partitioning scheme is guided by a statistical model that informs memory allocation and workload distribution. Our evaluation demonstrates that VectorLiteRAG improves vector search responsiveness by 2x and significantly reduces end-to-end TTFT in RAG systems by intelligently balancing memory resources between vector search and LLM execution.
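The partitioning idea can be illustrated with a short sketch. The Python snippet below is not the authors' implementation: the per-cluster statistics, the `gpu_budget_bytes` cap, and the `target_hit_rate` threshold are assumed placeholders for the paper's statistical model and SLO-driven balancing. It simply ranks IVF clusters by observed query frequency and pins the smallest hot set that fits the HBM budget or covers most of the query traffic.

```python
# Hypothetical sketch of skew-aware index placement (not the paper's code).
# Assumptions: per-cluster access counts and byte sizes are observable;
# `gpu_budget_bytes` and `target_hit_rate` are invented parameters.

def choose_gpu_clusters(access_counts, cluster_bytes, gpu_budget_bytes, target_hit_rate=0.9):
    """Return the smallest set of 'hot' cluster ids to pin in GPU HBM.

    access_counts: dict cluster_id -> number of queries routed to that cluster
    cluster_bytes: dict cluster_id -> size of that cluster's vectors in bytes
    """
    total = sum(access_counts.values()) or 1
    # Rank clusters by how often queries touch them (exploit access skew).
    ranked = sorted(access_counts, key=access_counts.get, reverse=True)

    chosen, covered, used = [], 0.0, 0
    for cid in ranked:
        size = cluster_bytes[cid]
        if used + size > gpu_budget_bytes:
            break                      # HBM budget exhausted
        chosen.append(cid)
        used += size
        covered += access_counts[cid] / total
        if covered >= target_hit_rate:
            break                      # enough of the query traffic is GPU-resident
    return chosen, covered, used


# Toy usage: a skewed workload where a few clusters absorb most queries.
counts = {0: 5000, 1: 3000, 2: 400, 3: 50, 4: 10}
sizes = {cid: 64 * 1024 * 1024 for cid in counts}    # 64 MiB per cluster (assumed)
hot, hit_rate, used = choose_gpu_clusters(counts, sizes, gpu_budget_bytes=200 * 1024 * 1024)
print(hot, round(hit_rate, 2), used)
```

In the real system the stopping point would be set so that vector search and LLM serving stay balanced under the SLO, rather than by a fixed hit-rate threshold.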
Related papers
- TeleRAG: Efficient Retrieval-Augmented Generation Inference with Lookahead Retrieval [10.268774281394261]
Retrieval-augmented generation (RAG) extends large language models (LLMs) with external data sources to enhance factual correctness and domain coverage. Modern RAG pipelines rely on large datastores, leading to system challenges in latency-sensitive deployments. We propose TeleRAG, an efficient inference system that reduces RAG latency with minimal GPU memory requirements.
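The summary does not spell out how lookahead retrieval works. As a loosely hedged illustration only, the sketch below overlaps retrieval for the next request with generation of the current answer using a thread pool; `retrieve` and `generate_answer` are invented placeholders, and TeleRAG's actual lookahead (which the abstract ties to GPU memory savings) likely operates at a finer granularity.

```python
# Hedged illustration of overlapping retrieval with LLM work (not TeleRAG itself).
from concurrent.futures import ThreadPoolExecutor
import time

def retrieve(query):
    time.sleep(0.05)                     # stand-in for vector search latency
    return [f"doc for: {query}"]

def generate_answer(query, docs):
    time.sleep(0.05)                     # stand-in for LLM decoding
    return f"answer({query}, {len(docs)} docs)"

def rag_with_lookahead(queries):
    """Start retrieval for the next query while the current answer is generated."""
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(retrieve, queries[0])
        for i, q in enumerate(queries):
            docs = pending.result()
            if i + 1 < len(queries):
                # Lookahead: the next retrieval overlaps with this generation step.
                pending = pool.submit(retrieve, queries[i + 1])
            results.append(generate_answer(q, docs))
    return results

print(rag_with_lookahead(["q1", "q2", "q3"]))
```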
arXiv Detail & Related papers (2025-02-28T11:32:22Z) - LLM-based Optimization of Compound AI Systems: A Survey [64.39860384538338]
In a compound AI system, components such as an LLM call, a retriever, a code interpreter, or tools are interconnected.
Recent advancements enable end-to-end optimization of these parameters using an LLM.
This paper presents a survey of the principles and emerging trends in LLM-based optimization of compound AI systems.
arXiv Detail & Related papers (2024-10-21T18:06:25Z) - RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards [78.74923079748521]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs) by retrieving knowledge from external resources. Current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge. We propose a Differentiable Data Rewards (DDR) method, which trains RAG systems by aligning data preferences between different RAG modules.
arXiv Detail & Related papers (2024-10-17T12:53:29Z) - RetrievalAttention: Accelerating Long-Context LLM Inference via Vector Retrieval [24.472784635757016]
RetrievalAttention is a training-free approach to both accelerate attention computation and reduce GPU memory consumption. We show that RetrievalAttention achieves near full attention accuracy while only requiring access to 1-3% of the data.
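A minimal sketch of the underlying idea, attending only over a retrieved subset of the cached keys and values, is given below; an exact top-k search stands in for the approximate nearest-neighbor index and the sizes are arbitrary, so this is an illustration rather than RetrievalAttention itself.

```python
# Sketch of attention restricted to retrieved keys (not RetrievalAttention's code).
import numpy as np

def sparse_attention(query, keys, values, top_k):
    """Attend only over the top_k keys most similar to the query."""
    scores = keys @ query                           # similarity of every cached key to the query
    idx = np.argpartition(scores, -top_k)[-top_k:]  # retrieve a small subset (ANN stand-in)
    sel = scores[idx] / np.sqrt(query.shape[0])
    weights = np.exp(sel - sel.max())
    weights /= weights.sum()
    return weights @ values[idx]                    # weighted sum over retrieved values only

rng = np.random.default_rng(0)
d, n = 64, 100_000                                  # 100k cached tokens, 64-dim head (assumed)
keys, values = rng.standard_normal((n, d)), rng.standard_normal((n, d))
query = rng.standard_normal(d)
out = sparse_attention(query, keys, values, top_k=n // 100)   # touch ~1% of the cache
print(out.shape)
```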
arXiv Detail & Related papers (2024-09-16T17:59:52Z) - Speculative RAG: Enhancing Retrieval Augmented Generation through Drafting [68.90949377014742]
Speculative RAG is a framework that leverages a larger generalist LM to efficiently verify multiple RAG drafts produced in parallel by a smaller, distilled specialist LM.
Our method accelerates RAG by delegating drafting to the smaller specialist LM, with the larger generalist LM performing a single verification pass over the drafts.
It notably enhances accuracy by up to 12.97% while reducing latency by 50.83% compared to conventional RAG systems on PubHealth.
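The draft-then-verify flow can be sketched as follows; `draft_lm`, `verifier_score`, and the per-draft document subsets are hypothetical stand-ins for the small specialist LM, the large generalist LM's verification pass, and whatever inputs each draft actually conditions on in the paper.

```python
# Hedged sketch of draft-then-verify RAG (placeholder models, not the paper's code).
from concurrent.futures import ThreadPoolExecutor

def draft_lm(question, doc_subset):
    # Small, cheap model: produce one candidate answer from one group of retrieved docs.
    return f"draft from {doc_subset}"

def verifier_score(question, draft):
    # Large model: a single scoring/verification pass per draft.
    return len(draft)        # toy score; a real system would use the generalist LM

def speculative_rag(question, doc_subsets):
    with ThreadPoolExecutor() as pool:
        drafts = list(pool.map(lambda s: draft_lm(question, s), doc_subsets))
    scores = [verifier_score(question, d) for d in drafts]
    return drafts[max(range(len(drafts)), key=scores.__getitem__)]

print(speculative_rag("What causes X?", [["doc1", "doc2"], ["doc3"], ["doc4", "doc5"]]))
```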
arXiv Detail & Related papers (2024-07-11T06:50:19Z) - LLM-Vectorizer: LLM-based Verified Loop Vectorizer [12.048697450464935]
Large language models (LLMs) can generate vectorized code from scalar programs that process individual array elements.
LLMs are capable of producing high performance vectorized code with run-time speedup ranging from 1.1x to 9.4x.
Our approach is able to verify 38.2% of vectorizations as correct on the TSVC benchmark dataset.
arXiv Detail & Related papers (2024-06-07T07:04:26Z) - Accelerating Inference of Retrieval-Augmented Generation via Sparse Context Selection [28.15184715270483]
Large language models (LLMs) augmented with retrieval exhibit robust performance and extensive versatility.
We propose a novel paradigm named Sparse RAG, which seeks to cut costs through sparsity.
Sparse RAG encodes retrieved documents in parallel, which eliminates latency introduced by long-range attention of retrieved documents.
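Read literally, the summary suggests each retrieved document is encoded independently and only a sparse subset informs decoding. The sketch below is a hedged approximation of that flow with placeholder encoding and relevance functions; it is not Sparse RAG's implementation.

```python
# Rough sketch of "encode documents independently, then attend sparsely"
# (placeholder encoder/scorer, not Sparse RAG's code).
from concurrent.futures import ThreadPoolExecutor

def encode_doc(doc):
    # Each retrieved document is encoded on its own, so no attention spans two documents.
    return {"doc": doc, "kv_cache": f"kv({doc})"}

def relevance(query, encoded):
    return sum(word in encoded["doc"] for word in query.split())   # toy relevance score

def sparse_rag_context(query, docs, keep=2):
    with ThreadPoolExecutor() as pool:
        encoded = list(pool.map(encode_doc, docs))                 # parallel, independent encoding
    encoded.sort(key=lambda e: relevance(query, e), reverse=True)
    return [e["kv_cache"] for e in encoded[:keep]]                 # decode over a sparse subset only

print(sparse_rag_context("vector index latency", ["vector index docs", "latency study", "unrelated"]))
```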
arXiv Detail & Related papers (2024-05-25T11:10:04Z) - Multimodal Learned Sparse Retrieval with Probabilistic Expansion Control [66.78146440275093]
Learned sparse retrieval (LSR) is a family of neural methods that encode queries and documents into sparse lexical vectors.
We explore the application of LSR to the multi-modal domain, with a focus on text-image retrieval.
Current approaches like LexLIP and STAIR require complex multi-step training on massive datasets.
Our proposed approach efficiently transforms dense vectors from a frozen dense model into sparse lexical vectors.
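One common way to realize such a transformation is to project the frozen dense vector onto the vocabulary and keep a bounded number of positively weighted terms. The snippet below is illustrative only: the random projection weights, the activation, and the top-k cutoff used here as "expansion control" are assumptions, not the paper's design.

```python
# Illustrative dense-to-sparse lexical projection (not the paper's model).
import numpy as np

rng = np.random.default_rng(0)
DIM, VOCAB = 512, 30_000
projection = rng.standard_normal((DIM, VOCAB)) * 0.02     # trained in a real system

def to_sparse_lexical(dense_vec, max_terms=64):
    logits = dense_vec @ projection                        # score every vocabulary term
    weights = np.log1p(np.maximum(logits, 0.0))            # keep only positive evidence
    keep = np.argsort(weights)[-max_terms:]                # cap expansion to max_terms terms
    return {int(t): float(weights[t]) for t in keep if weights[t] > 0}

dense = rng.standard_normal(DIM)                           # e.g., a frozen image/text embedding
sparse = to_sparse_lexical(dense)
print(len(sparse), max(sparse.values()))
```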
arXiv Detail & Related papers (2024-02-27T14:21:56Z) - Chameleon: a Heterogeneous and Disaggregated Accelerator System for Retrieval-Augmented Language Models [20.286113681831814]
A Retrieval-Augmented Language Model (RALM) combines a large language model (LLM) with a vector database to retrieve context-specific knowledge. We propose Chameleon, a heterogeneous accelerator system integrating both LLM and vector search accelerators.
arXiv Detail & Related papers (2023-10-15T20:57:25Z) - Inference with Reference: Lossless Acceleration of Large Language Models [97.04200102556551]
LLMA is an accelerator to speed up Large Language Model (LLM) inference with references.
It is motivated by the observation that there are abundant identical text spans between an LLM's decoding output and reference text that is available in many real-world scenarios.
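A simplified sketch of reference-based copying follows; `lm_next_token` is a hypothetical greedy model call, and a real system would verify an entire copied span in one batched forward pass rather than a Python loop, so treat this as an illustration of the control flow only.

```python
# Simplified sketch of reference-based drafting and verification (not LLMA's code).

def propose_from_reference(output, reference, match_len=2, copy_len=4):
    """If the last tokens of the output appear in the reference, copy what follows."""
    suffix = output[-match_len:]
    for i in range(len(reference) - match_len):
        if reference[i:i + match_len] == suffix:
            return reference[i + match_len:i + match_len + copy_len]
    return []

def decode_with_reference(prompt, reference, lm_next_token, max_tokens=20):
    output = list(prompt)
    while len(output) - len(prompt) < max_tokens:
        draft = propose_from_reference(output, reference)
        accepted = 0
        for tok in draft:                         # verify copied tokens against the model
            if lm_next_token(output) != tok:
                break
            output.append(tok)
            accepted += 1
        if accepted == 0:
            output.append(lm_next_token(output))  # fall back to normal decoding
    return output[len(prompt):]

# Toy model whose true continuation is exactly the reference, so copied spans
# are accepted and several tokens are emitted per verification step.
prompt = ["context:"]
reference = "the cat sat on the mat".split()
toy_lm = lambda out: reference[len(out) - len(prompt)]
print(decode_with_reference(prompt, reference, toy_lm, max_tokens=6))
```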
arXiv Detail & Related papers (2023-04-10T09:55:14Z) - NumS: Scalable Array Programming for the Cloud [82.827921577004]
We present NumS, an array programming library that optimizes NumPy-like expressions on task-based distributed systems.
This is achieved through a novel scheduler called Load Simulated Hierarchical Scheduling (LSHS).
We show that LSHS enhances performance on Ray by decreasing network load by a factor of 2x, requiring 4x less memory, and reducing execution time by 10x on the logistic regression problem.
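The summary does not detail LSHS, so the sketch below is only a guess at the flavor of load-simulated placement: each block operation is assigned to the node whose simulated memory and network load stays lowest after placement. The node and operation structures are invented.

```python
# Loose illustration of load-simulated placement (a guess at the flavor of LSHS,
# not NumS's implementation; node/operation structures are invented).

def place_block_op(op, nodes, loads, block_locations):
    """Pick the node minimizing simulated memory + network load after placement."""
    best, best_cost = None, float("inf")
    for node in nodes:
        # Network load: every input block not already on this node must be transferred.
        net = sum(size for blk, size in op["inputs"].items()
                  if block_locations.get(blk) != node)
        mem = op["output_size"]
        cost = max(loads[node]["mem"] + mem, loads[node]["net"] + net)
        if cost < best_cost:
            best, best_cost = node, cost
    loads[best]["mem"] += op["output_size"]
    loads[best]["net"] += sum(size for blk, size in op["inputs"].items()
                              if block_locations.get(blk) != best)
    block_locations[op["output"]] = best
    return best

nodes = ["n0", "n1"]
loads = {n: {"mem": 0, "net": 0} for n in nodes}
locations = {"A0": "n0", "B0": "n1"}
op = {"inputs": {"A0": 10, "B0": 10}, "output": "C0", "output_size": 10}
print(place_block_op(op, nodes, loads, locations))
```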
arXiv Detail & Related papers (2022-06-28T20:13:40Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed content (including all information) and is not responsible for any consequences.