Fugu-MT Paper Translation (Abstract): One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

Paper summary: One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache

arxiv url: http://arxiv.org/abs/2603.04411v1
Date: Tue, 03 Feb 2026 13:20:36 GMT
Status: Translation complete
Last updated in system: 2026-03-09 01:20:08.188681
Title: One Size Does Not Fit All: Token-Wise Adaptive Compression for KV Cache
Authors: Liming Lu, Kaixi Qiu, Jiayu Zhou, Jushi Kai, Haoyan Zhang, Huanyu Wang, Jingwen Leng, Ziwei He, Zhouhan Lin
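Abstract summary: We propose DynaKV, a novel post-training framework for low-rank KV cache compression. Our method consistently outperforms existing state-of-the-art compression techniques. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.
Reference score (attention measure computed independently by this site): 38.49582847975703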
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Despite the remarkable progress of Large Language Models (LLMs), the escalating memory footprint of the Key-Value (KV) cache remains a critical bottleneck for efficient inference. While dimensionality reduction offers a promising compression avenue, existing approaches typically either necessitate prohibitively expensive pre-training from scratch or suffer from severe performance deterioration under high compression regimes. In this work, we propose DynaKV, a novel post-training framework for low-rank KV cache compression. To the best of our knowledge, DynaKV is the first method to dynamically allocate compression rates to individual tokens according to their semantic meaning, which allows it to achieve better fidelity at aggressive compression ratios. Extensive experiments demonstrate that our method consistently outperforms existing state-of-the-art compression techniques, achieving significant memory reduction while maintaining competitive generation quality. Furthermore, our approach is orthogonal to sequence-level pruning methods. When integrated with SnapKV, DynaKV retains only 6% of the KV cache while maintaining 94% of the baseline performance on the LongBench benchmark.
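The abstract's two quantitative ideas, per-token compression rates and orthogonality to sequence-level pruning, can be illustrated with simple memory accounting. The sketch below is not DynaKV's algorithm: the abstract does not say how importance is scored, how ranks are assigned, or how the 6% figure splits between pruning and dimensionality reduction. The function names, the proportional allocation rule, and all numbers are assumptions chosen for illustration.

```python
def allocate_ranks(scores, avg_rank, head_dim):
    """Hypothetical rule: share a rank budget across tokens in proportion
    to an importance score, clamped to [1, head_dim]. Because of rounding
    and clamping, the total only approximately matches the budget."""
    budget = avg_rank * len(scores)
    total = sum(scores)
    return [min(head_dim, max(1, round(budget * s / total))) for s in scores]


def retained_fraction(token_ranks, head_dim, seq_len):
    """Fraction of the uncompressed cache that is kept when each surviving
    token stores `rank` of its `head_dim` dimensions. Tokens removed by a
    sequence-level pruner are simply absent from `token_ranks`."""
    if len(token_ranks) > seq_len:
        raise ValueError("more surviving tokens than sequence positions")
    return sum(token_ranks) / (head_dim * seq_len)


# Made-up importance scores for four surviving tokens.
ranks = allocate_ranks([1, 1, 2, 4], avg_rank=16, head_dim=64)

# Dimension compression alone: all four tokens of a 4-token sequence survive.
dim_only = retained_fraction(ranks, head_dim=64, seq_len=4)

# Combined with pruning: the same four tokens are the survivors of an
# 8-token sequence, so half the tokens were dropped first.
combined = retained_fraction(ranks, head_dim=64, seq_len=8)

print(ranks, dim_only, combined)
```

In this toy setting the combined retention equals the token-keep ratio times the dimension-keep ratio, which is why stacking a sequence-level pruner such as SnapKV on top of low-rank compression can reach small overall budgets.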
