RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
- URL: http://arxiv.org/abs/2404.12457v2
- Date: Thu, 25 Apr 2024 06:47:57 GMT
- Title: RAGCache: Efficient Knowledge Caching for Retrieval-Augmented Generation
- Authors: Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, Xin Jin,
- Abstract summary: Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks.
RAGCache organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy.
RAGCache reduces the time to first token (TTTF) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
- Score: 11.321659218769598
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Retrieval-Augmented Generation (RAG) has shown significant improvements in various natural language processing tasks by integrating the strengths of large language models (LLMs) and external knowledge databases. However, RAG introduces long sequence generation and leads to high computation and memory costs. We propose RAGCache, a novel multilevel dynamic caching system tailored for RAG. Our analysis benchmarks current RAG systems, pinpointing the performance bottleneck (i.e., long sequence due to knowledge injection) and optimization opportunities (i.e., caching knowledge's intermediate states). Based on these insights, we design RAGCache, which organizes the intermediate states of retrieved knowledge in a knowledge tree and caches them in the GPU and host memory hierarchy. RAGCache proposes a replacement policy that is aware of LLM inference characteristics and RAG retrieval patterns. It also dynamically overlaps the retrieval and inference steps to minimize the end-to-end latency. We implement RAGCache and evaluate it on vLLM, a state-of-the-art LLM inference system and Faiss, a state-of-the-art vector database. The experimental results show that RAGCache reduces the time to first token (TTFT) by up to 4x and improves the throughput by up to 2.1x compared to vLLM integrated with Faiss.
Related papers
- RAG-DDR: Optimizing Retrieval-Augmented Generation Using Differentiable Data Rewards [78.74923079748521]
Retrieval-Augmented Generation (RAG) has proven its effectiveness in mitigating hallucinations in Large Language Models (LLMs)
Current approaches use instruction tuning to optimize LLMs, improving their ability to utilize retrieved knowledge.
We propose a Differentiable Data Rewards ( DDR) method, which trains RAG systems by aligning data preferences between different RAG modules.
arXiv Detail & Related papers (2024-10-17T12:53:29Z) - MemoRAG: Moving towards Next-Gen RAG Via Memory-Inspired Knowledge Discovery [24.38640001674072]
Retrieval-Augmented Generation (RAG) leverages retrieval tools to access external databases.
Existing RAG systems are primarily effective for straightforward question-answering tasks.
We propose MemoRAG, a novel retrieval-augmented generation paradigm empowered by long-term memory.
arXiv Detail & Related papers (2024-09-09T13:20:31Z) - RAGLAB: A Modular and Research-Oriented Unified Framework for Retrieval-Augmented Generation [54.707460684650584]
Large Language Models (LLMs) demonstrate human-level capabilities in dialogue, reasoning, and knowledge retention.
Current research addresses this bottleneck by equipping LLMs with external knowledge, a technique known as Retrieval Augmented Generation (RAG)
RAGLAB is a modular and research-oriented open-source library that reproduces 6 existing algorithms and provides a comprehensive ecosystem for investigating RAG algorithms.
arXiv Detail & Related papers (2024-08-21T07:20:48Z) - Efficient Inference of Vision Instruction-Following Models with Elastic Cache [76.44955111634545]
We introduce Elastic Cache, a novel strategy for efficient deployment of instruction-following large vision-language models.
We propose an importance-driven cache merging strategy to prune redundancy caches.
For instruction encoding, we utilize the frequency to evaluate the importance of caches.
Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation.
arXiv Detail & Related papers (2024-07-25T15:29:05Z) - PQCache: Product Quantization-based KVCache for Long Context LLM Inference [27.523568511043273]
Key-Value Cache (KVCache) is a crucial component in Large Language Models (LLMs)
Current methods selectively determine suitable keys and values for self-attention in LLMs to address the issue.
We propose PQCache, which employs Product Quantization (PQ) to manage KVCache, maintaining model quality while ensuring low serving latency.
arXiv Detail & Related papers (2024-07-01T13:05:42Z) - CORM: Cache Optimization with Recent Message for Large Language Model Inference [57.109354287786154]
We introduce an innovative method for optimizing the KV cache, which considerably minimizes its memory footprint.
CORM, a KV cache eviction policy, dynamically retains essential key-value pairs for inference without the need for model fine-tuning.
Our validation shows that CORM reduces the inference memory usage of KV cache by up to 70% with negligible performance degradation across six tasks in LongBench.
arXiv Detail & Related papers (2024-04-24T16:11:54Z) - HistAlign: Improving Context Dependency in Language Generation by
Aligning with History [96.35214682008701]
Language models (LMs) can generate hallucinations and incoherent outputs, which highlights their weak context dependency.
Cache-LMs, which augment LMs with a memory of recent history, can increase context dependency.
We present HistAlign, a new training approach to ensure good cache alignment.
arXiv Detail & Related papers (2023-05-08T15:34:56Z) - Optimizing L1 cache for embedded systems through grammatical evolution [1.9371782627708491]
Grammatical Evolution (GE) is able to efficiently find the best cache configurations for a given set of benchmark applications.
Our proposal is able to find cache configurations that obtain an average improvement of $62%$ versus a real world baseline configuration.
arXiv Detail & Related papers (2023-03-06T18:10:00Z) - Accelerating Deep Learning Classification with Error-controlled
Approximate-key Caching [72.50506500576746]
We propose a novel caching paradigm, that we named approximate-key caching.
While approximate cache hits alleviate DL inference workload and increase the system throughput, they however introduce an approximation error.
We analytically model our caching system performance for classic LRU and ideal caches, we perform a trace-driven evaluation of the expected performance, and we compare the benefits of our proposed approach with the state-of-the-art similarity caching.
arXiv Detail & Related papers (2021-12-13T13:49:11Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.