Fugu-MT paper translation (abstract): EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

Paper overview: EpiCache: Episodic KV Cache Management for Long Conversational Question Answering

arxiv url: http://arxiv.org/abs/2509.17396v1
Date: Mon, 22 Sep 2025 06:56:35 GMT
Status: translation complete
Last updated in system: 2025-09-23 18:58:16.244336
Title: EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
Title (reference translation): EpiCache: Episodic KV Cache Management for Long Conversational Question Answering
Authors: Minsoo Kim, Arnav Kundu, Han-Byul Kim, Richa Dixit, Minsik Cho,
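Abstract summary: The paper introduces EpiCache, a training-free KV cache management framework for long conversational question answering. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x.
Reference score (independently computed attention score): 15.288494370436469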
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Recent advances in large language models (LLMs) have extended context lengths, enabling assistants to sustain long histories for coherent, personalized responses. This ability, however, hinges on Key-Value (KV) caching, whose memory grows linearly with dialogue length and quickly dominates under strict resource constraints. An active line of research for reducing this overhead is KV cache compression, which seeks to limit cache size while preserving accuracy. Yet existing methods face two major limitations: (i) evicting entries after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, leading to degraded accuracy in multi-turn conversations. We introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and applies episode-specific KV cache eviction. We further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, thereby enabling efficient multi-turn interaction under strict resource constraints.
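Abstract (reference translation): Recent progress in large language models (LLMs) has extended context lengths, allowing assistants to maintain long histories for coherent, personalized responses. This ability depends on Key-Value (KV) caching, however, whose memory grows linearly with dialogue length and quickly becomes dominant under strict resource constraints. One active line of research for reducing this overhead is KV cache compression, which aims to limit cache size while preserving accuracy. Existing methods nevertheless have two major limitations: (i) evicting entries after full-context prefill causes unbounded peak memory, and (ii) query-dependent eviction narrows the cache to a single query, which degrades accuracy in multi-turn conversations. This paper introduces EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context through episodic KV compression, which clusters the conversation history into coherent episodes and applies episode-specific KV cache eviction. The authors further design an adaptive layer-wise budget allocation strategy that measures each layer's sensitivity to eviction and distributes the memory budget across layers accordingly. Across three LongConvQA benchmarks, EpiCache improves accuracy by up to 40% over recent baselines, sustains near-full KV accuracy under 4-6x compression, and reduces latency and memory by up to 2.4x and 3.5x, enabling efficient multi-turn interaction under strict resource constraints.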
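
The abstract contrasts eviction after full-context prefill (unbounded peak memory) with block-wise prefill (bounded cache growth). The following is a minimal toy sketch of that contrast, not the authors' implementation. The function name `blockwise_prefill`, the per-token importance scores, and the "keep the top-scoring entries" rule are all assumptions made for illustration; the abstract does not specify how EpiCache scores entries beyond saying that eviction is episode-specific.

```python
def blockwise_prefill(token_scores, block_size, budget):
    """Toy simulation of block-wise prefill under a fixed cache budget.

    token_scores: hypothetical importance score per token (an assumption;
                  a real system would derive these from the model).
    Returns (kept_token_indices, peak_cache_size).
    """
    cache = []  # list of (token_index, score) pairs currently cached
    peak = 0
    for start in range(0, len(token_scores), block_size):
        # Prefill one block: its entries are appended to the cache.
        block = token_scores[start:start + block_size]
        cache.extend(enumerate(block, start=start))
        peak = max(peak, len(cache))
        # Evict immediately after each block, so the cache never holds
        # more than budget + block_size entries at any time.
        if len(cache) > budget:
            top = sorted(cache, key=lambda e: e[1], reverse=True)[:budget]
            cache = sorted(top)  # restore positional order
    return [index for index, _ in cache], peak


scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
kept, peak = blockwise_prefill(scores, block_size=2, budget=3)
print(kept, peak)  # [1, 3, 5] 5
```

In this toy run the peak cache size is 5 entries (budget 3 plus one block of 2), whereas prefilling all 8 tokens before evicting would peak at 8. The gap widens as the history grows, because the block-wise peak depends only on the budget and block size, not on the conversation length.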

Related papers