Fugu-MT Paper Translation (Abstract): Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

Paper overview: Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation

arxiv url: http://arxiv.org/abs/2606.09864v1
Date: Mon, 01 Jun 2026 02:02:20 GMT
Status: Translation complete
Last updated in system: 2026-06-10 15:40:57.969251
Title: Alignment Collapse Under KV Cache Quantization: Diagnosis and Mitigation
Title (reference translation): Alignment Collapse Caused by KV Cache Quantization: Diagnosis and Mitigation
Authors: Bruce Changlong Xu, Adarsh Kumarappan, Mu Zhou
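Abstract summary: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory. This study examines alignment preservation under KV cache quantization. Low-bit quantization can silently destroy safety alignment.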
Reference score (independently computed measure of attention): 6.129872931808218
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, yet existing evaluations solely focus on measuring perplexity and accuracy without assessing the safety impact. In this study, we explore alignment preservation under KV cache quantization. Across eleven instruction-tuned models (3.8B-72B) and five benchmarks (1,894 prompts), we find that low-bit quantization can silently destroy safety alignment: Mistral-7B loses 15.2% of its refusals at only 1.03x perplexity, and no universal safe bit-width exists, with sharp model-specific phase transitions invisible to standard metrics. We identify that the root cause is geometric: safety features occupy a low-dimensional activation subspace 10^2-10^3x more vulnerable to quantization noise than the full representation space perplexity averages over. Inspired by this observation, we propose Per-Channel Reduction (PCR), a diagnostic that classifies each model into one of three mechanistic failure modes: outlier-crushes-safety, where safety lives in non-outlier channels collaterally damaged by outlier-driven scale factors; outlier-as-safety, where safety overlaps outlier channels and finer granularity cannot rescue it; and multi-layer dilution, where safety is distributed across many layers and per-layer fixes fail. PCR predicts the correct mitigation direction on all nine primary models and one held-out model from an independent family using 20 calibration prompts. PCR generalizes across unseen prompts, models, and production quantizers, including KIVI with up to 97.2% recovery, succeeding where attention-based allocation methods fail. The resulting training-free protocol, requiring approximately 35 GPU-minutes, recovers up to 97% of lost alignment at minimal memory overhead, addressing vulnerabilities confirmed in production vLLM serving with FP8 KV cache on NVIDIA GPUs.
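Abstract (reference translation): Key-value (KV) cache quantization is widely used to reduce Large Language Model (LLM) inference memory, but existing evaluations focus only on measuring perplexity and accuracy without assessing the safety impact. This study examines alignment preservation under KV cache quantization. Across eleven instruction-tuned models (3.8B-72B) and five benchmarks (1,894 prompts), low-bit quantization is found to be capable of silently destroying safety alignment. The root cause is geometric: safety features occupy a low-dimensional activation subspace that is 10^2-10^3x more vulnerable to quantization noise than the full representation space over which perplexity averages. The study therefore proposes Per-Channel Reduction (PCR), a diagnostic that classifies each model into one of three mechanistic failure modes. Using 20 calibration prompts, PCR predicts the correct mitigation direction on all nine primary models and one held-out model. PCR generalizes across unseen prompts, models, and production quantizers, including KIVI with up to 97.2% recovery, succeeding where attention-based allocation methods fail. The training-free protocol requires approximately 35 GPU-minutes and recovers up to 97% of lost alignment at minimal memory overhead, addressing vulnerabilities confirmed in production vLLM serving with FP8 KV cache on NVIDIA GPUs.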
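
The "outlier-crushes-safety" failure mode named in the abstract turns on one mechanism: when a quantization scale factor is set by a large-magnitude outlier channel, small-magnitude channels that share that scale can be rounded away. The toy sketch below illustrates only that mechanism. It is not the paper's PCR diagnostic. The symmetric absmax 4-bit quantizer, the channel magnitudes, and all function names are assumptions chosen for illustration.

```python
import math
import random

def absmax_scale(values, bits=4):
    """Scale that maps the largest magnitude onto the top signed integer level."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed 4-bit
    return max(abs(v) for v in values) / qmax

def fake_quantize(values, scale, bits=4):
    """Round each value to the integer grid, clamp, and map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]

def relative_error(original, restored):
    """sqrt(squared error / squared signal); 1.0 means the signal was erased."""
    num = sum((a - b) * (a - b) for a, b in zip(original, restored))
    den = sum(a * a for a in original)
    return math.sqrt(num / den)

random.seed(0)
tokens = 256

# Hypothetical toy cache with two channels (magnitudes are made up):
# one large "outlier" channel and one small channel standing in for
# a feature that lives outside the outlier channels.
outlier_channel = [random.gauss(0.0, 50.0) for _ in range(tokens)]
small_channel = [random.gauss(0.0, 0.5) for _ in range(tokens)]

# Coarse granularity: one scale shared by both channels, so the
# outlier channel alone determines the step size.
shared_scale = absmax_scale(outlier_channel + small_channel)
err_shared = relative_error(
    small_channel, fake_quantize(small_channel, shared_scale)
)

# Finer granularity: the small channel gets its own scale.
own_scale = absmax_scale(small_channel)
err_per_channel = relative_error(
    small_channel, fake_quantize(small_channel, own_scale)
)

print(f"small-channel error, shared scale:      {err_shared:.3f}")
print(f"small-channel error, per-channel scale: {err_per_channel:.3f}")
```

Under these assumptions the shared scale should leave the small channel with a relative error close to 1 (it is rounded to zero), while its own scale keeps the error far lower. Measured over both channels together, the shared-scale error stays small because the outlier channel dominates, which is a simplified analogue of damage that an averaged metric does not reveal. In the abstract's other failure mode, outlier-as-safety, finer granularity of this kind is said not to help.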
