Fugu-MT paper translation (summary): From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation

Paper summary: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation

arxiv url: http://arxiv.org/abs/2601.12904v1
Date: Mon, 19 Jan 2026 09:59:39 GMT
Status: translation complete
Last updated in system: 2026-01-21 22:47:22.848279
Title: From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation
Title (reference translation): From Prefix Cache to Fusion RAG Cache: Accelerating LLM Inference in Retrieval-Augmented Generation
Authors: Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxing Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jinqi Tang, Yaochen Han, Zhiyuan Ai, Xianglin Chen, Yongwei Wu, Congfeng Jiang
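Abstract summary: Retrieval-Augmented Generation enhances large language models by integrating external knowledge. Existing solutions aim to accelerate RAG by reusing the preprocessed KV cache of retrieved chunks. The paper proposes FusionRAG, a novel inference framework that optimizes both the preprocessing and reprocessing stages of RAG.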
Reference score (independently computed attention metric): 11.929816243824561
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Retrieval-Augmented Generation enhances Large Language Models by integrating external knowledge, which reduces hallucinations but increases prompt length. This increase leads to higher computational costs and longer Time to First Token (TTFT). To mitigate this issue, existing solutions aim to reuse the preprocessed KV cache of each retrieved chunk to accelerate RAG. However, the lack of cross-chunk contextual information leads to a significant drop in generation quality, leaving the potential benefits of KV cache reuse largely unfulfilled. The challenge lies in how to reuse the precomputed KV cache of chunks while preserving generation quality. We propose FusionRAG, a novel inference framework that optimizes both the preprocessing and reprocessing stages of RAG. In the offline preprocessing stage, we embed information from other related text chunks into each chunk, while in the online reprocessing stage, we recompute the KV cache for tokens that the model focuses on. As a result, we achieve a better trade-off between generation quality and efficiency. According to our experiments, FusionRAG significantly improves generation quality at the same recomputation ratio compared to previous state-of-the-art solutions. By recomputing fewer than 15% of the tokens, FusionRAG achieves up to 70% higher normalized F1 scores than baselines and reduces TTFT by 2.66x-9.39x compared to Full Attention.
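Abstract (reference translation): Retrieval-Augmented Generation enhances large language models by integrating external knowledge, which reduces hallucinations but lengthens the prompt. The longer prompt raises computational cost and increases Time to First Token (TTFT). To mitigate this, existing solutions reuse the preprocessed KV cache of each retrieved chunk to speed up RAG. However, the lack of cross-chunk contextual information causes a significant drop in generation quality, so the potential benefit of KV cache reuse remains largely unrealized. The challenge is how to reuse the precomputed KV cache of chunks while preserving generation quality. The paper proposes FusionRAG, a novel inference framework that optimizes both the preprocessing and reprocessing stages of RAG. In the offline preprocessing stage, information from other related text chunks is embedded into each chunk; in the online reprocessing stage, the KV cache is recomputed for the tokens the model focuses on. The result is a better trade-off between generation quality and efficiency. According to the experiments, FusionRAG significantly improves generation quality over previous state-of-the-art solutions at the same recomputation ratio. By recomputing fewer than 15% of the tokens, FusionRAG achieves up to 70% higher normalized F1 scores than baselines and reduces TTFT by 2.66x-9.39x compared to Full Attention.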
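The online reprocessing stage described in the abstract recomputes the KV cache for only a small share of tokens. The abstract does not say how those tokens are chosen, so the following Python sketch is purely illustrative: it assumes each cached token already has a scalar importance score (for example, the attention mass it receives from the query) and keeps the highest-scoring tokens within a recomputation budget. The function name `select_tokens_to_recompute`, the `ratio` parameter, and the example scores are hypothetical and are not taken from the paper.

```python
import math


def select_tokens_to_recompute(attention_scores, ratio=0.15):
    """Pick the token positions whose KV entries would be recomputed.

    Assumption (not from the paper): one importance score per cached token,
    and a simple top-k rule under a budget of at most `ratio` of all tokens.
    """
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be in [0, 1]")

    # Round the budget down so the recomputed share never exceeds `ratio`.
    budget = math.floor(len(attention_scores) * ratio)

    # Rank positions from highest to lowest score.
    ranked = sorted(
        range(len(attention_scores)),
        key=lambda i: attention_scores[i],
        reverse=True,
    )

    # Return the chosen positions in document order.
    return sorted(ranked[:budget])


# Hypothetical scores for eight cached tokens.
scores = [0.02, 0.40, 0.05, 0.01, 0.25, 0.04, 0.03, 0.20]
print(select_tokens_to_recompute(scores, ratio=0.25))  # prints [1, 4]
```

With a 25% budget over eight tokens, the sketch selects the two highest-scoring positions; every other position would keep its precomputed KV entries.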

Related papers