Fugu-MT Paper Translation (Summary): REFRAG: Rethinking RAG based Decoding

Paper summary: REFRAG: Rethinking RAG based Decoding

arxiv url: http://arxiv.org/abs/2509.01092v1
Date: Mon, 01 Sep 2025 03:31:44 GMT
Status: Translation complete
Last updated in system: 2025-09-04 15:17:03.537788
Title: REFRAG: Rethinking RAG based Decoding
Title (reference translation): REFRAG: Rethinking RAG-Based Decoding
Authors: Xiaoqiang Lin, Aritra Ghosh, Bryan Kian Hsiang Low, Anshumali Shrivastava, Vijai Mohan
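Abstract summary: REFRAG is an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. The paper rigorously validates REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization.
Reference score (independently calculated attention score): 67.4862300145604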
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Large Language Models (LLMs) have demonstrated remarkable capabilities in leveraging extensive external knowledge to enhance responses in multi-turn and agentic applications, such as retrieval-augmented generation (RAG). However, processing long-context inputs introduces significant system latency and demands substantial memory for the key-value cache, resulting in reduced throughput and a fundamental trade-off between knowledge enrichment and system efficiency. While minimizing latency for long-context inputs is a primary objective for LLMs, we contend that RAG requires specialized consideration. In RAG, much of the LLM context consists of concatenated passages from retrieval, with only a small subset directly relevant to the query. These passages often exhibit low semantic similarity due to diversity or deduplication during re-ranking, leading to block-diagonal attention patterns that differ from those in standard LLM generation tasks. Based on this observation, we argue that most computations over the RAG context during decoding are unnecessary and can be eliminated with minimal impact on performance. To this end, we propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85× time-to-first-token acceleration (a 3.75× improvement over previous work) without loss in perplexity. In addition, our optimization framework for large context enables REFRAG to extend the context size of LLMs by 16×. We provide rigorous validation of REFRAG across diverse long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.
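Abstract (reference translation): Large language models (LLMs) have shown a remarkable ability to leverage extensive external knowledge to improve responses in multi-turn and agentic applications such as retrieval-augmented generation (RAG). However, processing long-context inputs increases system latency and requires substantial memory for the key-value cache, which reduces throughput and creates a fundamental trade-off between knowledge enrichment and system efficiency. Although minimizing latency for long-context inputs is a primary goal for LLMs, we argue that RAG requires special consideration. In RAG, most of the LLM context consists of concatenated passages from retrieval, and only a small subset is directly relevant to the query. These passages have low semantic similarity because of diversity or deduplication during re-ranking, which produces block-diagonal attention patterns unlike those in standard LLM generation tasks. From this observation, we argue that most of the computation over the RAG context during decoding is unnecessary and can be eliminated with minimal impact on performance. We therefore propose REFRAG, an efficient decoding framework that compresses, senses, and expands to improve latency in RAG applications. By exploiting the sparsity structure, we demonstrate a 30.85× time-to-first-token acceleration (a 3.75× improvement over previous work) without loss in perplexity. In addition, our optimization framework for large contexts allows REFRAG to extend the context size of LLMs by 16×. We provide rigorous validation of REFRAG across a variety of long-context tasks, including RAG, multi-turn conversations, and long document summarization, spanning a wide range of datasets. Experimental results confirm that REFRAG delivers substantial speedup with no loss in accuracy compared to LLaMA models and other state-of-the-art baselines across various context sizes.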
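
The abstract's remark about block-diagonal attention can be made concrete with a toy count. The sketch below is not REFRAG's method; it only assumes, for illustration, that the context is `num_passages` equal-length passages and that attention mass stays entirely inside each passage. The function names and the passage sizes are hypothetical. It computes what share of causal (query, key) pairs falls inside the diagonal blocks.

```python
def causal_pairs(num_tokens: int) -> int:
    """Count (query, key) pairs under causal attention over num_tokens tokens."""
    return num_tokens * (num_tokens + 1) // 2


def block_diagonal_fraction(num_passages: int, tokens_per_passage: int) -> float:
    """Share of causal attention pairs that stay inside a single passage.

    Illustrative assumption: passages have equal length and never attend
    to each other, i.e. the attention pattern is exactly block-diagonal.
    """
    full = causal_pairs(num_passages * tokens_per_passage)
    within = num_passages * causal_pairs(tokens_per_passage)
    return within / full


if __name__ == "__main__":
    # Hypothetical sizes: 128-token passages, varying passage counts.
    for passages in (4, 16, 64):
        share = block_diagonal_fraction(passages, 128)
        print(f"{passages} passages: {share:.4f} of pairs are within-passage")
```

Under these assumptions the within-passage share shrinks roughly as one over the number of passages, which is one way to read the claim that most computation over a RAG context contributes little.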

Related papers