Fugu-MT Paper Translation (Abstract): ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

Paper summary: ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs

arxiv url: http://arxiv.org/abs/2603.08727v1
Date: Thu, 19 Feb 2026 16:24:08 GMT
Status: Translation complete
Last updated in system: 2026-03-15 16:38:22.500556
Title: ARKV: Adaptive and Resource-Efficient KV Cache Management under Limited Memory Budget for Long-Context Inference in LLMs
Title (reference translation): ARKV: Adaptive and resource-efficient KV cache management under a limited memory budget for long-context inference in LLMs
Authors: Jianlong Lei, Shashikant Ilager
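Abstract summary: Large language models (LLMs) are increasingly deployed in scenarios that require ultra-long-context reasoning. Existing memory reduction techniques, such as eviction and quantization, often rely on static heuristics. The paper proposes ARKV, a lightweight and adaptive framework that dynamically assigns precision levels to cached tokens.
Reference score (independently computed attention measure): 1.1267872663780352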
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Large Language Models (LLMs) are increasingly deployed in scenarios demanding ultra-long context reasoning, such as agentic workflows and deep research understanding. However, long-context inference is constrained by the KV cache, a transient memory structure that grows linearly with sequence length and batch size, quickly dominating GPU memory usage. Existing memory reduction techniques, including eviction and quantization, often rely on static heuristics and suffer from degraded quality under tight budgets. In this paper, we propose ARKV, a lightweight and adaptive framework that dynamically allocates precision levels to cached tokens based on per-layer attention dynamics and token-level importance. During a short prefill phase, ARKV estimates the original quantization (OQ) ratio of each layer by computing statistical scores such as attention entropy, variance and kurtosis. During decoding, tokens are assigned to one of three states, Original (full precision), Quantization (low precision), or Eviction, according to a fast heavy-hitter scoring strategy. Our experiments on LLaMA3 and Qwen3 models across diverse long- and short-context tasks demonstrate that ARKV preserves ~97% of baseline accuracy on long-context benchmarks while reducing KV memory usage by 4x, with minimal throughput loss. On short-context tasks, ARKV matches full-precision baselines; on GSM8K math reasoning, it significantly outperforms uniform quantization. These results highlight the practical viability of ARKV for scalable LLM deployment, offering fine-grained, data-driven memory control without retraining or architectural modifications. The source code and artifacts can be found in: https://github.com/Large-scale-Sustainable-Computing-LSC/ARKV
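Abstract (reference translation): Large language models (LLMs) are increasingly deployed in scenarios that require ultra-long-context reasoning, such as agentic workflows and deep research understanding. Long-context inference, however, is limited by the KV cache, a transient memory structure that grows linearly with sequence length and batch size and quickly comes to dominate GPU memory usage. Existing memory reduction techniques, such as eviction and quantization, often rely on static heuristics and suffer degraded quality under tight budgets. This paper proposes ARKV, a lightweight and adaptive framework that dynamically assigns precision levels to cached tokens based on per-layer attention dynamics and token-level importance. During a short prefill phase, ARKV estimates the original quantization (OQ) ratio of each layer by computing statistical scores such as attention entropy, variance, and kurtosis. During decoding, each token is assigned to one of three states, Original (full precision), Quantization (low precision), or Eviction, according to a fast heavy-hitter scoring strategy. Experiments with LLaMA3 and Qwen3 models on a range of long-context and short-context tasks show that ARKV preserves about 97% of baseline accuracy on long-context benchmarks while reducing KV memory usage by 4x, with minimal throughput loss. On short-context tasks ARKV matches the full-precision baselines, and on GSM8K math reasoning it significantly outperforms uniform quantization. These results highlight the practicality of ARKV for scalable LLM deployment, providing fine-grained, data-driven memory control without retraining or architectural changes. The source code and artifacts are available at: https://github.com/Large-scale-Sustainable-Computing-LSC/ARKV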
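The two steps named in the abstract, per-layer attention statistics and the three-state token assignment, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names `attention_stats` and `assign_states`, the reading of the OQ ratio as the share of retained tokens kept at full precision, and the budget counted in token slots are all assumptions made for illustration. The abstract does not say how the statistics are combined into an OQ ratio, so that mapping is left out.

```python
import math
from enum import Enum


class State(Enum):
    ORIGINAL = "original"    # kept at full precision
    QUANTIZED = "quantized"  # kept at low precision
    EVICTED = "evicted"      # dropped from the cache


def attention_stats(probs):
    """Return (entropy, variance, excess kurtosis) of one attention distribution."""
    n = len(probs)
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    mean = sum(probs) / n
    var = sum((p - mean) ** 2 for p in probs) / n
    if var == 0:
        # Perfectly uniform attention: higher moments are undefined, report 0.
        return entropy, 0.0, 0.0
    kurt = sum((p - mean) ** 4 for p in probs) / (n * var ** 2) - 3.0
    return entropy, var, kurt


def assign_states(scores, budget, oq_ratio):
    """Assign each token a State from its importance score.

    Assumptions (hypothetical, not from the paper): `scores` are accumulated
    attention scores per token (a heavy-hitter style signal), `budget` is the
    number of tokens to retain, and `oq_ratio` is the fraction of retained
    tokens stored at full precision.
    """
    keep = min(budget, len(scores))
    n_orig = round(keep * oq_ratio)
    # Token indices ordered from most to least important.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    states = [State.EVICTED] * len(scores)
    for rank, idx in enumerate(ranked[:keep]):
        states[idx] = State.ORIGINAL if rank < n_orig else State.QUANTIZED
    return states
```

Calling `assign_states(scores, budget=4, oq_ratio=0.5)` on six tokens keeps the two highest-scoring tokens at full precision, quantizes the next two, and evicts the rest. A real system would also account for the smaller footprint of quantized entries when spending the memory budget, which this sketch ignores.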

Related papers