Fugu-MT Paper Translation (Abstract): CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

Paper overview: CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention

arxiv url: http://arxiv.org/abs/2603.17946v1
Date: Wed, 18 Mar 2026 17:18:35 GMT
Status: Translation complete
System update date: 2026-03-19 18:32:57.847275
Title: CARE: Covariance-Aware and Rank-Enhanced Decomposition for Enabling Multi-Head Latent Attention
Title (reference translation): CARE: Covariance-Aware and Rank-Enhanced Decomposition for Multi-Head Latent Attention
Authors: Zhongzhu Zhou, Fengxiang Bie, Ziyan Chen, Zhenyu Zhang, Yibo Yang, Junxiong Wang, Ben Athiwaratkun, Xiaoxia Wu, Shuaiwen Leon Song
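Abstract summary: We propose CARE, a covariance-aware, rank-enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format.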
Reference score (independently computed attention score): 35.44699837487632
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, making it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. Our method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With a brief post-SVD healing fine-tune, we fully recover the original model's accuracy.
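Abstract (reference translation): Converting pretrained attention modules such as grouped-query attention (GQA) into multi-head latent attention (MLA) can improve expressivity without increasing KV-cache cost, which makes it attractive for efficient inference. However, many practical conversion baselines rely on weight-only low-rank approximations (e.g., SVD-style initializations) and uniform rank allocation. They focus on minimizing the difference between weight matrices rather than on how those weights affect input activations, ignore the covariance structure of activations, and enforce a uniform rank across layers, causing activation drift and degraded attention fidelity. To address these issues, we propose CARE, a Covariance-Aware, Rank-Enhanced MLA conversion pipeline under a fixed KV width. CARE introduces three key steps: (i) activation-preserving factorization, which aligns the approximation with the actual input activations rather than just the weights; (ii) adjusted-rank allocation, which spreads a fixed KV budget across layers by giving more capacity to layers that need it most; and (iii) KV-parity mapping, which reparameterizes the converted K and V to fit the MLA format while keeping the KV-cache size unchanged. The proposed method outperforms a uniform-rank SVD baseline on Qwen3-4B/30B-A3B-Instruct-2507 and Llama-3.1-8B/70B-Instruct, reducing one-shot perplexity by up to 215x and improving mean accuracy by up to 1.70x at matched KV budgets. With a brief post-SVD healing fine-tune, the original model's accuracy is fully recovered.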
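To make the contrast between weight-only and activation-preserving factorization concrete, here is a minimal NumPy sketch. It is not the CARE algorithm itself, whose details the abstract does not give. It shows only the standard covariance-whitened low-rank approximation: for a projection `y = x @ W` and calibration activations `X`, the rank-`r` matrix minimizing `||X @ W - X @ W_hat||_F` is obtained by truncating the SVD of `L.T @ W`, where `L @ L.T` is the Cholesky factorization of the activation covariance. The function names, the `y = x @ W` convention, the ridge term `eps`, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def weight_only_lowrank(W, r):
    """Best rank-r approximation of W in Frobenius norm (truncated SVD)."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]


def activation_aware_lowrank(W, X, r, eps=1e-6):
    """Rank-r W_hat that (up to a small ridge) minimizes ||X @ W - X @ W_hat||_F.

    With C = X^T X / n = L L^T, the activation error satisfies
    ||X @ D||_F^2 = n * ||L^T @ D||_F^2, so we truncate L^T @ W and map back.
    """
    d = W.shape[0]
    C = X.T @ X / X.shape[0]                      # input second-moment matrix (d x d)
    C = C + eps * np.trace(C) / d * np.eye(d)     # hypothetical ridge for invertibility
    L = np.linalg.cholesky(C)                     # C = L @ L.T
    M_r = weight_only_lowrank(L.T @ W, r)         # truncate in the whitened space
    return np.linalg.solve(L.T, M_r)              # undo the whitening


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_in, d_out, n, r = 32, 16, 2048, 4
    # Synthetic anisotropic activations: some input directions carry far more energy.
    X = rng.standard_normal((n, d_in)) * np.linspace(3.0, 0.1, d_in)
    W = rng.standard_normal((d_in, d_out))

    err_svd = np.linalg.norm(X @ W - X @ weight_only_lowrank(W, r))
    err_act = np.linalg.norm(X @ W - X @ activation_aware_lowrank(W, X, r))
    print(f"weight-only SVD activation error:    {err_svd:.3f}")
    print(f"covariance-aware activation error:   {err_act:.3f}")
```

Both functions return a matrix of rank at most `r`, so in a conversion setting they would use the same latent width. The difference is only in which error they minimize: distance between weight matrices, or distance between the outputs those weights produce on representative inputs.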


Related papers