Fugu-MT Paper Translation (Abstract): CacheClip: Accelerating RAG with Effective KV Cache Reuse

Paper overview: CacheClip: Accelerating RAG with Effective KV Cache Reuse

arxiv url: http://arxiv.org/abs/2510.10129v1
Date: Sat, 11 Oct 2025 09:28:26 GMT
Status: Translation complete
Last updated in system: 2025-10-14 18:06:29.79969
Title: CacheClip: Accelerating RAG with Effective KV Cache Reuse
Title (reference translation): CacheClip: Accelerating RAG through Effective KV Cache Reuse
Authors: Bin Yang, Qiuyu Leng, Jun Zeng, Zhenhua Wu
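Abstract summary: CacheClip is a novel framework that achieves both fast TTFT and high generation quality. The method integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, (2) shared prefixes to eliminate redundant attention sinks, and (3) a grouping strategy to maintain local coherence.
Reference score (independently computed attention measure): 8.016679032026824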
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Retrieval-Augmented Generation (RAG) systems suffer from severe time-to-first-token (TTFT) bottlenecks due to long input sequences. Existing KV cache reuse methods face a fundamental trade-off: prefix caching requires identical prefixes that rarely occur in RAG scenarios, while direct precomputation sacrifices quality due to missing inter-chunk attention and repeated attention sinks. Recent methods such as APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper presents CacheClip, a novel framework that achieves both fast TTFT and high generation quality. Our key insight is that small auxiliary LLMs exhibit last-layer attention distributions similar to those of primary LLMs (the target models for generation), enabling efficient identification of the tokens critical for restoring inter-chunk attention and thereby significantly improving response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, where the auxiliary model is fine-tuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) a grouping strategy to maintain local coherence during partial KV cache updates. Experiments show that CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with recomp% = 20%). Meanwhile, CacheClip accelerates LLM inference by up to 1.92x in prefill time, providing a practical solution to the efficiency-quality trade-off in RAG systems.
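Abstract (reference translation): Retrieval-Augmented Generation (RAG) systems suffer from time-to-first-token (TTFT) bottlenecks caused by long input sequences. Prefix caching requires identical prefixes, which rarely occur in RAG scenarios, while direct precomputation sacrifices quality because inter-chunk attention is missing and attention sinks are repeated. Recent methods such as APE and CacheBlend partially address these issues but remain inadequate for robust RAG applications. This paper proposes CacheClip, a novel framework that achieves both fast TTFT and high generation quality. The key insight is that small auxiliary LLMs exhibit last-layer attention distributions similar to those of primary LLMs (the target models for generation), which enables efficient identification of the tokens critical for restoring inter-chunk attention and significantly improves response quality on cross-chunk reasoning tasks. CacheClip integrates three techniques: (1) auxiliary-model-guided token selection for selective KV cache recomputation, with the auxiliary model fine-tuned to improve selection accuracy, (2) shared prefixes to eliminate redundant attention sinks, and (3) a grouping strategy to maintain local coherence during partial KV cache updates. In experiments, CacheClip retains up to 94.8% and 85.0% of full-attention performance on NIAH and LongBench, outperforming APE and CacheBlend by 25.2% and 35.1% on NIAH (with recomp% = 20%). CacheClip also accelerates LLM inference by up to 1.92x in prefill time, offering a practical solution to the efficiency-quality trade-off in RAG systems.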
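
The abstract describes selecting a fraction of tokens for KV cache recomputation and choosing them in groups to preserve local coherence. The following is a minimal illustrative sketch of that idea only, not the authors' implementation. The function name `select_recompute_groups`, the use of summed per-token scores as group importance, and the `group_size` parameter are all assumptions made for illustration; the scores stand in for whatever importance signal the auxiliary model would provide.

```python
import numpy as np


def select_recompute_groups(scores, recomp_ratio=0.2, group_size=4):
    """Return sorted token indices to recompute, chosen in contiguous groups.

    scores: 1-D sequence of per-token importance values. Hypothetically these
    could be attention weights from an auxiliary model's last layer; this
    sketch only requires that larger means more important.
    recomp_ratio: approximate fraction of groups to recompute (assumption).
    group_size: number of adjacent tokens per group (assumption).
    """
    scores = np.asarray(scores, dtype=float)
    n = scores.shape[0]
    n_groups = -(-n // group_size)  # ceiling division

    # Pad with zeros so the scores reshape into whole groups. A partial
    # final group is therefore scored only on its real tokens.
    padded = np.zeros(n_groups * group_size)
    padded[:n] = scores
    group_scores = padded.reshape(n_groups, group_size).sum(axis=1)

    # Keep at least one group; otherwise roughly recomp_ratio of all groups.
    k = max(1, int(round(recomp_ratio * n_groups)))
    top_groups = np.sort(np.argsort(-group_scores, kind="stable")[:k])

    # Expand each selected group back into its token indices.
    idx = (top_groups[:, None] * group_size + np.arange(group_size)).ravel()
    return idx[idx < n]  # drop indices that fall in the padding
```

For example, with 20 tokens whose highest scores sit at positions 4 to 7, `select_recompute_groups(scores, recomp_ratio=0.2, group_size=4)` returns those four adjacent indices rather than isolated tokens scattered across the context. Selecting whole groups is one simple way to keep neighboring tokens consistent when only part of the cache is refreshed.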

Related papers