Fugu-MT Paper Translation (Abstract): Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Paper overview: Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

arxiv url: http://arxiv.org/abs/2512.03324v1
Date: Wed, 03 Dec 2025 00:20:35 GMT
Status: Translation complete
Last updated in system: 2025-12-04 20:02:55.050239
Title: Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Title (reference translation): Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs
Authors: Ngoc Bui, Shubham Sharma, Simran Lamba, Saumitra Mishra, Rex Ying
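Abstract summary: This paper proposes a method that learns each token's intrinsic importance at creation time via a lightweight retention gate. The method consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. In some settings it even surpasses full-cache models, indicating that selective retention can act as a form of regularization.
Reference score (independently computed attention metric): 26.951325519894525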
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
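Abstract (reference translation): Memory and computation remain the core bottlenecks of long-horizon LLM inference, owing to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, and heuristic KV eviction, either incur high orchestration costs or depend on unreliable attention-based proxies of importance. The authors propose TRIM-KV, a new approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time and reflects the token's long-term utility for a specific layer and head. When the memory budget is exceeded, tokens with low scores are evicted, so that the cache always holds the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss; it requires only fine-tuning of the gates and adds negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench, SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Notably, it surpasses full-cache models in some settings, showing that selective retention can act as a form of regularization that suppresses noise from uninformative tokens. Qualitative analyses show that the learned retention scores align with human intuition and naturally recover heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores offer insight into layer- and head-specific roles, suggesting a new path toward LLM interpretability.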
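
The eviction rule described in the abstract (a per-token retention score that decays over time, with the lowest-scoring tokens dropped once the memory budget is exceeded) can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not state the decay form, so the rule `retention ** age` below is an assumption, and the names `CachedToken`, `decayed_score`, `evict_to_budget`, and `simulate` are hypothetical. The sketch also covers a single cache, whereas the paper scores tokens per layer and head, and the gate outputs are supplied by hand rather than learned.

```python
from dataclasses import dataclass


@dataclass
class CachedToken:
    position: int     # decoding step at which the token was created
    retention: float  # hypothetical gate output in (0, 1), fixed at creation


def decayed_score(token: CachedToken, now: int) -> float:
    """Assumed decay rule: retention ** age (the abstract only says the score decays)."""
    return token.retention ** (now - token.position)


def evict_to_budget(cache: list[CachedToken], now: int, budget: int) -> list[CachedToken]:
    """Keep the `budget` tokens with the highest decayed score, in positional order."""
    if len(cache) <= budget:
        return cache
    ranked = sorted(cache, key=lambda t: decayed_score(t, now), reverse=True)
    return sorted(ranked[:budget], key=lambda t: t.position)


def simulate(retentions: list[float], budget: int) -> list[int]:
    """Append one token per step, evict when over budget, return surviving positions."""
    cache: list[CachedToken] = []
    for step, r in enumerate(retentions):
        cache.append(CachedToken(position=step, retention=r))
        cache = evict_to_budget(cache, now=step, budget=budget)
    return [t.position for t in cache]


if __name__ == "__main__":
    # Token 0 has a high retention value and survives; mid-sequence tokens fade out.
    print(simulate([0.99, 0.5, 0.5, 0.9, 0.5], budget=2))
```

Under this assumed decay rule the newest token always has score 1 and is therefore kept, while a token with retention close to 1 persists from the start of the sequence. That loosely mirrors the sliding-window and sink-token behaviour the abstract says the learned scores recover, although here it follows from the hand-picked values rather than from training.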

