Fugu-MT Paper Translation (Abstract): PatternKV: Flattening KV Representation Expands Quantization Headroom

Paper summary: PatternKV: Flattening KV Representation Expands Quantization Headroom

arxiv url: http://arxiv.org/abs/2510.05176v1
Date: Sun, 05 Oct 2025 12:09:14 GMT
Status: Translation complete
Last updated in system: 2025-10-08 17:57:07.888338
Title: PatternKV: Flattening KV Representation Expands Quantization Headroom
Title (reference translation): PatternKV: Flattening the KV Representation Expands Quantization Headroom
Authors: Ji Zhang, Yiwei Li, Shaoxiong Feng, Peiwen Yuan, Xinglin Wang, Jiayi Shi, Yueqi Zhang, Chuyi Tan, Boyuan Pan, Yao Hu, Kan Li
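Abstract summary: The KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply because the native KV distribution lacks flatness. The paper shows that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities.
Reference score (independently computed attention metric): 37.83913102876393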
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: KV cache in autoregressive LLMs eliminates redundant recomputation but has emerged as the dominant memory and bandwidth bottleneck during inference, notably with long contexts and test-time scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply as the native KV distribution lacks flatness and thus maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but fails to flatten the overall distribution, leaving performance fragile under low-bit settings. In this work, we show that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities. Building on these insights, we propose PatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, thereby improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.4x while supporting 1.25x larger batches.
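Abstract (reference translation): The KV cache in autoregressive LLMs eliminates redundant recomputation, but it has emerged as the dominant memory and bandwidth bottleneck during inference, particularly with long contexts and test-time scaling. KV quantization is a key lever for reducing cache cost, but accuracy drops sharply because the native KV distribution lacks flatness and therefore maintains a wide quantization range. Prior work focuses on isolating outliers, which caps their error but does not flatten the overall distribution, leaving performance fragile under low-bit settings. This paper shows that the K cache maintains a stable structure that evolves gradually with context, while the V cache carries latent semantic regularities. Building on these insights, the authors propose PatternKV, a pattern-aligned residual quantization scheme. It mines representative pattern vectors online, aligns each KV vector to its nearest pattern, and quantizes only the residual. This reshaping of the KV distribution flattens the quantization target and narrows its range, improving the fidelity of low-bit KV quantization. Across long-context and test-time scaling settings on multiple backbones, PatternKV delivers consistent 2-bit gains, with a 0.08% average 4-bit drop relative to FP16, improves test-time scaling accuracy by 10% on average, and raises throughput by 1.4x while supporting 1.25x larger batches.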
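
The idea of aligning each vector to its nearest pattern and quantizing only the residual can be illustrated with a small sketch. This is not the authors' implementation: the function names (`quantize`, `nearest_pattern`, `residual_quantize`), the per-vector min-max quantizer, the Euclidean nearest-pattern rule, and the synthetic data are all assumptions made for illustration, and the patterns are fixed in advance here rather than mined online as the abstract describes.

```python
import random


def quantize(vec, bits):
    """Uniform min-max quantization of one vector; returns dequantized values."""
    lo, hi = min(vec), max(vec)
    levels = (1 << bits) - 1
    if hi == lo:
        return list(vec)
    scale = (hi - lo) / levels
    return [lo + round((x - lo) / scale) * scale for x in vec]


def nearest_pattern(vec, patterns):
    """Index of the pattern with the smallest squared Euclidean distance to vec."""
    return min(
        range(len(patterns)),
        key=lambda i: sum((a - b) ** 2 for a, b in zip(vec, patterns[i])),
    )


def residual_quantize(vec, patterns, bits):
    """Subtract the nearest pattern, quantize the residual, then add the pattern back."""
    idx = nearest_pattern(vec, patterns)
    residual = [a - b for a, b in zip(vec, patterns[idx])]
    dequantized = quantize(residual, bits)
    return [p + r for p, r in zip(patterns[idx], dequantized)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


random.seed(0)
dim = 64

# Hypothetical patterns with large per-channel offsets (assumed, not mined online).
patterns = [[random.uniform(-4, 4) for _ in range(dim)] for _ in range(2)]

# Synthetic stand-ins for KV vectors: one pattern plus small Gaussian noise.
vectors = [
    [p + random.gauss(0, 0.1) for p in random.choice(patterns)]
    for _ in range(32)
]

# Average 2-bit reconstruction error with and without pattern alignment.
direct_err = sum(mse(v, quantize(v, 2)) for v in vectors) / len(vectors)
pattern_err = sum(
    mse(v, residual_quantize(v, patterns, 2)) for v in vectors
) / len(vectors)

print("direct 2-bit MSE:  ", direct_err)
print("residual 2-bit MSE:", pattern_err)
```

On this synthetic data the residual has a much narrower range than the raw vector, so the same 2-bit budget yields a finer step size. That mirrors the abstract's claim in a toy setting only and says nothing about the reported benchmark numbers.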

Related papers