Fugu-MT paper translation (abstract): Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

Paper abstract: Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold

arxiv url: http://arxiv.org/abs/2601.21686v1
Date: Thu, 29 Jan 2026 13:19:24 GMT
Status: translation complete
Last updated in system: 2026-01-30 16:22:49.842186
Title: Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold
Title (reference translation): Don't be so Stief! Learning KV Cache low-rank approximation over the Stiefel manifold
Authors: Luca Benfenati, Matteo Risso, Andrea Vannozzi, Ahmet Caner Yüzügüler, Lukas Cavigelli, Enrico Macii, Daniele Jahier Pagliari, Alessio Burrello
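Abstract summary: StiefAttention is a KV-cache compression method that learns orthonormal projection bases by directly minimizing output reconstruction error. It outperforms EigenAttention by 11.9 points on C4 perplexity and by 5.4% on 0-shot MMLU accuracy, with lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.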
Reference score (independently computed attention score): 7.162701793686856
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Key-value (KV) caching enables fast autoregressive decoding, but at long contexts it becomes a dominant bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress cached keys and values by projecting per-head matrices to a lower rank, storing only the projections in HBM. However, existing post-training approaches typically fit these projections using SVD-style proxy objectives, which may poorly reflect end-to-end reconstruction after softmax, value mixing, and subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing decoder-layer output reconstruction error. StiefAttention additionally precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by 11.9 points on C4 perplexity and by 5.4% on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.
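Abstract (reference translation): Key-value (KV) caching enables fast autoregressive decoding, but at long contexts it becomes a major bottleneck in High Bandwidth Memory (HBM) capacity and bandwidth. A common mitigation is to compress the cached keys and values by projecting the per-head matrices to a lower rank and storing only the projections in HBM. However, existing post-training approaches typically fit these projections with SVD-style proxy objectives, which may not adequately reflect end-to-end reconstruction after softmax, value mixing, and the subsequent decoder-layer transformations. For these reasons, we introduce StiefAttention, a post-training KV-cache compression method that learns orthonormal projection bases by directly minimizing the decoder-layer output reconstruction error. In addition, StiefAttention precomputes, for each layer, an error-rank profile over candidate ranks, enabling flexible layer-wise rank allocation under a user-specified error budget. Notably, on Llama3-8B under the same conditions, StiefAttention outperforms EigenAttention by 11.9 points on C4 perplexity and by 5.4% on 0-shot MMLU accuracy at iso-compression, yielding lower relative error and higher cosine similarity with respect to the original decoder-layer outputs.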
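Illustrative sketch (not the authors' code): the abstract describes projecting per-head key/value matrices onto a low-rank orthonormal basis and choosing a rank per layer from an error-rank profile. The NumPy snippet below is a minimal sketch of those two ideas under stated assumptions. It fits the basis with a plain SVD, which is the kind of proxy objective the abstract contrasts StiefAttention against, and it does not implement the learned decoder-layer output objective. The function names (`svd_basis`, `pick_ranks`, and so on) and the "smallest rank whose error fits the budget" rule are hypothetical and chosen only for illustration.

```python
import numpy as np


def svd_basis(K, rank):
    """Return a (d x rank) matrix with orthonormal columns.

    Assumption: the basis is taken from the top right singular vectors of
    the cached matrix K (n_tokens x d). This is an SVD-style proxy fit,
    not the output-error objective described in the abstract.
    """
    _, _, vt = np.linalg.svd(K, full_matrices=False)
    return vt[:rank].T


def compress(K, P):
    """Project K onto the basis; only this (n_tokens x rank) array is cached."""
    return K @ P


def reconstruct(K_c, P):
    """Map the cached projections back to the original head dimension."""
    return K_c @ P.T


def relative_error(K, K_hat):
    """Frobenius-norm relative reconstruction error."""
    return np.linalg.norm(K - K_hat) / np.linalg.norm(K)


def pick_ranks(profiles, budget):
    """Pick one rank per layer from precomputed error-rank profiles.

    profiles: list with one dict {rank: error} per layer.
    Hypothetical rule: take the smallest candidate rank whose error is
    within the budget, falling back to the largest candidate otherwise.
    """
    chosen = []
    for profile in profiles:
        within = [r for r, e in sorted(profile.items()) if e <= budget]
        chosen.append(within[0] if within else max(profile))
    return chosen


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K = rng.standard_normal((64, 16))  # stand-in for one head's cached keys
    profile = {}
    for r in (4, 8, 16):
        P = svd_basis(K, r)
        profile[r] = relative_error(K, reconstruct(compress(K, P), P))
    print(profile)
    print(pick_ranks([profile], budget=0.5))
```

In this sketch the cache shrinks from `n_tokens x d` to `n_tokens x rank` values per head, and the profile dictionary plays the role of the per-layer error-rank profile; the paper's actual error metric and allocation procedure may differ.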

Related papers