RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2404.12457v2
- Date: Thu, 25 Apr 2024 06:47:57 GMT
- Title: RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
- Authors: Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin
- Abstract summary: Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks.
RAGCache organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy.
RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
- Score: 11.321659218769598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
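The knowledge-tree idea is easy to picture in code. Below is a minimal, illustrative Python sketch (all names are ours, not the authors' implementation): the KV states of retrieved documents are organized as a prefix tree keyed by document order, so requests that share a document prefix reuse the same cached state and only compute the suffix.

```python
# Illustrative sketch of the knowledge-tree cache described in the abstract.
# KV states here are opaque placeholders; a real system stores paged GPU/host
# tensors and a replacement policy moves nodes between the two tiers.
import time

class TreeNode:
    def __init__(self, doc_id=None):
        self.doc_id = doc_id      # document that extends the prefix at this node
        self.children = {}        # doc_id -> TreeNode
        self.kv_state = None      # cached KV tensors for the whole prefix
        self.tier = None          # "gpu" or "host"
        self.hits = 0
        self.last_access = 0.0

class KnowledgeTree:
    def __init__(self):
        self.root = TreeNode()

    def longest_cached_prefix(self, doc_ids):
        """Return the deepest cached node along doc_ids and its depth."""
        node, best, best_depth = self.root, None, 0
        for i, d in enumerate(doc_ids):
            node = node.children.get(d)
            if node is None:
                break
            if node.kv_state is not None:
                node.hits += 1
                node.last_access = time.monotonic()
                best, best_depth = node, i + 1
        return best, best_depth   # prefill only doc_ids[best_depth:]

    def insert(self, doc_ids, kv_state, tier="gpu"):
        node = self.root
        for d in doc_ids:
            node = node.children.setdefault(d, TreeNode(d))
        node.kv_state, node.tier = kv_state, tier
```

A replacement policy in the paper's spirit would weigh hit frequency against recomputation cost when demoting nodes from GPU to host memory before discarding them.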
Related papers
- Shared Disk KV Cache Management for Efficient Multi-Instance Inference in RAG-Powered LLMs [5.02504911036896]
Recent large language models (LLMs) face increasing inference latency as input context length and model size grow.
This paper proposes a method to reduce TTFT by leveraging a disk-based key-value (KV) cache to lessen the computational burden during the prefill stage.
We also introduce a disk-based shared KV cache management system, called Shared RAG-DCache, for multi-instance LLM RAG service environments.
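A minimal sketch of the disk-backed chunk-KV idea follows (our illustration, assuming PyTorch tensors and a caller-supplied prefill function; the actual Shared RAG-DCache adds multi-instance sharing and background prefetching):

```python
# Hedged sketch: persist precomputed per-chunk KV tensors to disk so the
# prefill stage can load them instead of recomputing attention states.
import hashlib
from pathlib import Path
import torch

CACHE_DIR = Path("/tmp/kv_cache")        # assumed location, not the paper's
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def _path(chunk_text: str) -> Path:
    return CACHE_DIR / (hashlib.sha256(chunk_text.encode()).hexdigest() + ".pt")

def load_or_compute_kv(chunk_text, compute_kv_fn):
    """compute_kv_fn(chunk_text) runs prefill and returns the chunk's KV tensors."""
    p = _path(chunk_text)
    if p.exists():
        return torch.load(p)             # disk hit: prefill compute is skipped
    kv = compute_kv_fn(chunk_text)
    torch.save(kv, p)                    # make the result visible to other instances
    return kv
```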
arXiv Detail & Related papers (2025-04-16T04:59:18Z)
- QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation [84.91431271257437]
Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation.
DiTs come with significant drawbacks, including increased computational and memory costs.
We propose QuantCache, a novel training-free inference acceleration framework.
arXiv Detail & Related papers (2025-03-09T10:31:51Z)
- Leveraging Approximate Caching for Faster Retrieval-Augmented Generation [1.3450852784287828]
Retrieval-augmented generation (RAG) enhances the reliability of large language model (LLM) answers by integrating external knowledge.
RAG increases end-to-end inference time because searching large vector databases for relevant documents is computationally expensive.
We introduce Proximity, an approximate key-value cache that optimizes the RAG workflow by leveraging similarities in user queries.
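A toy version of such an approximate cache, with cosine similarity over query embeddings and a hand-tuned hit threshold (both our assumptions):

```python
# Sketch: reuse cached retrieval results when a new query embedding is
# close enough to a previously seen one.
import numpy as np

class ApproxQueryCache:
    def __init__(self, threshold=0.95):
        self.threshold = threshold       # similarity needed to count as a hit
        self.embs, self.results = [], []

    def get(self, q_emb):
        if not self.embs:
            return None
        mat = np.stack(self.embs)
        sims = mat @ q_emb / (np.linalg.norm(mat, axis=1) * np.linalg.norm(q_emb))
        i = int(np.argmax(sims))
        return self.results[i] if sims[i] >= self.threshold else None

    def put(self, q_emb, retrieved_docs):
        self.embs.append(q_emb)
        self.results.append(retrieved_docs)
```

On a hit, the expensive vector-database lookup is skipped entirely; the threshold trades answer quality against hit rate.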
arXiv Detail & Related papers (2025-03-07T15:54:04Z)
- Detecting Knowledge Boundary of Vision Large Language Models by Sampling-Based Inference [78.08901120841833]
We propose a method to detect the knowledge boundary of Visual Large Language Models (VLLMs).
We show that our method successfully depicts a VLLM's knowledge boundary based on which we are able to reduce indiscriminate retrieval while maintaining or improving the performance.
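One simple way to act on a detected boundary, sketched under our own assumptions (self-agreement across stochastic samples as the confidence signal):

```python
# Sketch: sample several answers; low agreement suggests the question lies
# outside the model's knowledge, so retrieval should be triggered.
from collections import Counter

def needs_retrieval(sample_fn, question, n=8, agree_threshold=0.6):
    """sample_fn(question) returns one sampled answer string (temperature > 0)."""
    answers = [sample_fn(question) for _ in range(n)]
    top = Counter(answers).most_common(1)[0][1]
    return top / n < agree_threshold     # unstable answers -> retrieve
```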
arXiv Detail & Related papers (2025-02-25T09:32:08Z)
- Cache-Craft: Managing Chunk-Caches for Efficient Retrieval-Augmented Generation [14.842469293627271]
Cache-Craft is a system for managing and reusing precomputed KVs corresponding to text chunks.
We show how to identify chunk-caches that are reusable, how to efficiently perform a small fraction of recomputation to fix the cache, and how to efficiently store and evict chunk-caches in the hardware.
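A hedged sketch of that workflow; the recompute-a-prefix heuristic and the compute_kv_fn signature are illustrative assumptions, not Cache-Craft's actual interface:

```python
# Sketch: reuse per-chunk KVs across requests, re-prefilling only a small
# fraction of each cached chunk to patch position/cross-chunk effects.
import hashlib

class ChunkCache:
    def __init__(self, recompute_frac=0.1):
        self.store = {}                       # chunk hash -> KV tensors
        self.recompute_frac = recompute_frac  # fraction of tokens re-prefilled

    def fetch(self, chunk_text, n_tokens, compute_kv_fn):
        h = hashlib.sha256(chunk_text.encode()).hexdigest()
        if h in self.store:                   # reusable chunk-cache found
            k = max(1, int(n_tokens * self.recompute_frac))
            # hypothetical signature: patch the first k tokens of the cache
            return compute_kv_fn(chunk_text, only_first=k, reuse=self.store[h])
        kv = compute_kv_fn(chunk_text)        # miss: full prefill
        self.store[h] = kv
        return kv
```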
arXiv Detail & Related papers (2025-02-05T14:12:33Z)
- Don't Do RAG: When Cache-Augmented Generation is All You Need for Knowledge Tasks [11.053340674721005]
Retrieval-augmented generation (RAG) has gained traction as a powerful approach for enhancing language models by integrating external knowledge sources.
This paper proposes an alternative paradigm, cache-augmented generation (CAG), which bypasses real-time retrieval.
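The core mechanic is straightforward to demonstrate with Hugging Face transformers (the model choice and greedy decoding below are our assumptions): prefill the knowledge once, then answer every query from the cached state with no retrieval step.

```python
# Sketch of cache-augmented generation: the corpus is prefilled into a KV
# cache once; each query then decodes against a copy of that cache.
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # small stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

knowledge = "Paris is the capital of France."          # entire corpus in context
k_ids = tok(knowledge, return_tensors="pt").input_ids
with torch.no_grad():
    knowledge_past = model(k_ids, use_cache=True).past_key_values  # prefill once

def answer(query, max_new_tokens=16):
    past = copy.deepcopy(knowledge_past)   # keep the shared prefill intact
    ids = tok(query, return_tensors="pt").input_ids
    out_tokens = []
    with torch.no_grad():
        for _ in range(max_new_tokens):
            out = model(ids, past_key_values=past, use_cache=True)
            past = out.past_key_values
            ids = out.logits[:, -1].argmax(-1, keepdim=True)  # greedy next token
            out_tokens.append(ids)
    return tok.decode(torch.cat(out_tokens, dim=-1)[0])
```

This only pays off when the knowledge fits in the model's context window, which is the trade-off CAG makes against retrieval.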
arXiv Detail & Related papers (2024-12-20T06:58:32Z)
- Accelerating Retrieval-Augmented Generation [15.179354005559338]
Retrieval-Augmented Generation (RAG) involves augmenting large language models with information retrieved from an external knowledge source, such as the web.
IKS (Intelligent Knowledge Store) is a type-2 CXL device that implements a scale-out near-memory acceleration architecture with a novel cache-coherent interface between the host CPU and near-memory accelerators.
arXiv Detail & Related papers (2024-12-14T06:47:56Z)
- Toward Optimal Search and Retrieval for RAG [39.69494982983534]
Retrieval-augmented generation (RAG) is a promising method for addressing some of the memory-related challenges associated with Large Language Models (LLMs).
Here, we work towards the goal of understanding how retrievers can be optimized for RAG pipelines for common tasks such as Question Answering (QA).
arXiv Detail & Related papers (2024-11-11T22:06:51Z)
- RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards [78.74923079748521]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs).
Current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge.
We propose a Differentiable Data Rewards (DDR) method, which trains RAG systems by aligning data preferences between different RAG modules.
arXiv Detail & Related papers (2024-10-17T12:53:29Z)
- MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery [24.38640001674072]
Retrieval-Augmented Generation (RAG) leverages retrieval tools to access external databases.
Existing RAG systems are primarily effective for straightforward question-answering tasks.
We propose MemoRAG, a novel retrieval-augmented generation paradigm empowered by long-term memory.
arXiv Detail & Related papers (2024-09-09T13:20:31Z)
- RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation [54.707460684650584]
Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention.
Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval Augmented Generation (RAG).
RAGLAB is a modular and research-oriented open-source library that reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms.
arXiv Detail & Related papers (2024-08-21T07:20:48Z)
- Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we use frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
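A numpy sketch of importance-driven merging under our assumptions (importance scores are given, and each pruned entry is averaged into the nearest kept position; Elastic Cache's actual merging policy differs in detail):

```python
# Sketch: keep the KV entries of the most important tokens and fold each
# pruned entry into its nearest kept neighbor instead of dropping it.
import numpy as np

def merge_kv(keys, values, importance, keep):
    """keys/values: (T, d) arrays; importance: (T,); keep: entries retained."""
    order = np.argsort(importance)[::-1]
    kept = np.sort(order[:keep])                # anchor positions, in order
    k_out, v_out = keys[kept].copy(), values[kept].copy()
    counts = np.ones(keep)
    for t in order[keep:]:                      # merge the pruned entries
        j = int(np.argmin(np.abs(kept - t)))    # nearest anchor position
        k_out[j] += keys[t]; v_out[j] += values[t]; counts[j] += 1
    return k_out / counts[:, None], v_out / counts[:, None]
```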
arXiv Detail & Related papers (2024-07-25T15:29:05Z)
- PQCache: Product Quantization-based KVCache for Long Context LLM Inference [27.523568511043273]
Key-Value Cache (KVCache) is a crucial component in Large Language Models (LLMs).
Current methods selectively determine suitable keys and values for self-attention in LLMs to address its growing memory and latency overhead in long-context inference.
We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency.
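The PQ building block itself ships with Faiss; the sketch below shows only the compression step (PQCache's system-level scheduling and PQ-assisted top-k token selection are not reproduced):

```python
# Sketch: compress cached key vectors with Faiss product quantization.
import numpy as np
import faiss

d, M, nbits = 128, 8, 8                  # key dim, sub-vectors, bits per code
keys = np.random.rand(10000, d).astype("float32")   # stand-in cached keys

pq = faiss.ProductQuantizer(d, M, nbits)
pq.train(keys)                           # learn one codebook per sub-vector
codes = pq.compute_codes(keys)           # 8 bytes per key instead of 512
approx = pq.decode(codes)                # approximate reconstruction
err = np.linalg.norm(keys - approx) / np.linalg.norm(keys)
print(f"{keys.nbytes} -> {codes.nbytes} bytes, relative error {err:.3f}")
```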
arXiv Detail & Related papers (2024-07-01T13:05:42Z)
- CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache that considerably reduces its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
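A simplified eviction policy in this spirit, under our assumption that the query vectors of a recent window are available:

```python
# Sketch: evict KV entries that no recent query attended to strongly.
import numpy as np

def evict_unattended(keys, values, recent_queries, keep_ratio=0.3):
    """keys/values: (T, d); recent_queries: (R, d) from a sliding window."""
    scores = recent_queries @ keys.T / np.sqrt(keys.shape[1])
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)      # softmax over positions
    importance = attn.max(axis=0)                # strongest recent attention
    keep = max(1, int(len(keys) * keep_ratio))
    kept = np.sort(np.argsort(importance)[::-1][:keep])
    return keys[kept], values[kept], kept        # kept indices for bookkeeping
```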
arXiv Detail & Related papers (2024-04-24T16:11:54Z)
- HistAlign: Improving Context Dependency in Language Generation by Aligning with History [96.35214682008701]
Language models (LMs) can generate hallucinations and incoherent outputs, which highlights their weak context dependency.
Cache-LMs, which augment LMs with a memory of recent history, can increase context dependency.
We present HistAlign, a new training approach to ensure good cache alignment.
arXiv Detail & Related papers (2023-05-08T15:34:56Z)
- Accelerating Deep Learning Classification with Error-controlled Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm that we name approximate-key caching.
While approximate cache hits alleviate DL inference workload and increase system throughput, they introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
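A toy approximate-key cache over a classic LRU, with grid quantization standing in for the paper's approximate key function (the cell size is the knob that controls the approximation error):

```python
# Sketch: quantize feature vectors into coarse keys so similar inputs hit
# the same LRU entry, trading a bounded approximation error for hit rate.
from collections import OrderedDict
import numpy as np

class ApproxKeyLRU:
    def __init__(self, capacity=1024, cell=0.25):
        self.capacity, self.cell = capacity, cell
        self.store = OrderedDict()

    def _key(self, x):
        return tuple(np.round(np.asarray(x) / self.cell).astype(int))

    def get(self, x):
        k = self._key(x)
        if k in self.store:
            self.store.move_to_end(k)            # refresh LRU position
            return self.store[k]
        return None                              # miss: run the DL classifier

    def put(self, x, label):
        k = self._key(x)
        self.store[k] = label
        self.store.move_to_end(k)
        if len(self.store) > self.capacity:
            self.store.popitem(last=False)       # evict least recently used
```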
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
This list is automatically generated from the titles and abstracts of the papers on this site.