Fugu-MT Paper Translation (Abstract): Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads

Paper abstract: Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads

arxiv url: http://arxiv.org/abs/2510.16807v1
Date: Sun, 19 Oct 2025 12:17:42 GMT
Status: Translation complete
Last updated in system: 2025-10-25 00:56:39.157757
Title: Improving Model Representation and Reducing KV Cache via Skip Connections with First Value Heads
Title (reference translation): Improving model representation and reducing the KV cache through skip connections using first value heads
Authors: Zhoutong Wu, Yuan Zhang, Yiming Dong, Chenheng Zhang, Cong Fang, Kun Yuan, Zhouchen Lin
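Abstract summary: SkipV1Former is a Transformer that uses skip connections from the first layer's Value heads to strengthen representation and reduce the KV cache. The authors show that SkipV1Former delivers consistent reductions of approximately 25% in KV cache. When combined with YOCO, it cuts KV cache size by nearly 50% while improving performance.
Reference score (attention score computed independently by this site): 47.05385031325841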
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer models have driven breakthroughs across various language tasks through their strong capability to learn rich contextual representations. Scaling them to improve representation, however, often demands substantial memory and compute costs, such as the Key-Value (KV) cache used during auto-regressive decoding. Skip connections offer a promising way to improve representation without bloating resource usage, yet most prior works either improve expressivity while leaving KV costs unchanged, or reduce memory at the cost of weaker representation. In this work, we propose SkipV1Former, a Transformer variant that uses skip connections from the first layer's Value heads to strengthen model representation and reduce KV cache. Specifically, from the second block onward, each layer reuses half of its Value heads from the very first layer, while computing the other half as usual, cutting Value projections and V cache by nearly 50%. Theoretically, we show that routing uncompressed first-layer Values into deeper layers restores information lost to compression and accelerates the model's implicit mesa-optimization, a key pattern of Transformers in auto-regressive tasks. Empirically, across different model scales, SkipV1Former delivers consistent reductions of approximately 25% in KV cache while improving perplexity relative to standard Multi-Head Attention (MHA) Transformers and some advanced variants. Moreover, we propose a recipe for uptraining existing MHA Transformer checkpoints to SkipV1Former with only 10-15% additional compute. Finally, SkipV1Former can be seamlessly combined with advanced methods like Group-Query Attention and Multi-Latent Attention to achieve further KV cache savings and performance improvement. When combined with YOCO, it cuts KV cache size by nearly 50% while still improving performance.
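Abstract (reference translation): Transformer models have achieved breakthroughs on a wide range of language tasks thanks to their strong ability to learn rich contextual representations. Scaling them up to improve representation, however, often requires considerable memory and compute, for example for the Key-Value (KV) cache used during auto-regressive decoding. Skip connections are a promising way to improve representation without inflating resource usage, but most earlier work either improves expressivity while leaving KV costs unchanged, or reduces memory at the price of weaker representation. This work proposes SkipV1Former, a Transformer that uses skip connections from the first layer's Value heads to strengthen model representation and reduce the KV cache. Specifically, from the second block onward, each layer reuses half of its Value heads from the first layer and computes the other half as usual, which cuts Value projections and the V cache by nearly 50%. Theoretically, routing uncompressed first-layer Values into deeper layers restores information lost to compression and accelerates implicit mesa-optimization, a key pattern of Transformers in auto-regressive tasks. Empirically, across different model scales, SkipV1Former achieves consistent reductions of approximately 25% in KV cache while improving perplexity compared with standard Multi-Head Attention (MHA) Transformers and some advanced variants. The authors also propose a recipe for uptraining existing MHA Transformer checkpoints to SkipV1Former with only 10-15% additional compute. Finally, SkipV1Former can be seamlessly combined with advanced methods such as Group-Query Attention and Multi-Latent Attention for further KV cache savings and performance improvements. Combined with YOCO, it reduces KV cache size by nearly 50% while still improving performance.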
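Illustrative sketch (not from the paper): the abstract describes the mechanism only at a high level, so the NumPy code below is one possible reading of it rather than the authors' implementation. The function names (`skipv1_attention`, `kv_cache_fraction`), the choice to reuse the first half of layer 1's Value heads, the omission of the output projection, and the per-layer cache accounting are all assumptions made for illustration. It assumes an even number of heads.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax; exp(-inf) evaluates to 0 for masked entries.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def split_heads(x, n_heads):
    # (seq, n_heads * d_head) -> (n_heads, seq, d_head)
    seq, width = x.shape
    return x.reshape(seq, n_heads, width // n_heads).transpose(1, 0, 2)


def skipv1_attention(x, wq, wk, wv, n_heads, v_first=None):
    """Causal multi-head attention for one layer (hypothetical helper).

    Layer 1: call with v_first=None; wv projects all n_heads Value heads.
    Layer >= 2: pass layer 1's Value heads as v_first; wv projects only
    n_heads // 2 heads and the remaining half is copied from v_first.
    Returns (attention output, Value heads used by this layer).
    """
    seq = x.shape[0]
    q = split_heads(x @ wq, n_heads)
    k = split_heads(x @ wk, n_heads)
    if v_first is None:
        v = split_heads(x @ wv, n_heads)
    else:
        half = n_heads // 2
        v_own = split_heads(x @ wv, half)
        # Assumption: the reused heads are the first `half` heads of layer 1.
        v = np.concatenate([v_first[:half], v_own], axis=0)
    d_head = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)
    # Causal mask: position i may not attend to positions j > i.
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    out = softmax(scores) @ v  # (n_heads, seq, d_head)
    return out.transpose(1, 0, 2).reshape(seq, -1), v


def kv_cache_fraction(n_layers):
    """Per-token KV cache size relative to standard MHA (assumed accounting).

    Keys are cached in full in every layer, Values in full in layer 1,
    and only the newly computed half of the Values in each later layer.
    """
    baseline = 2.0 * n_layers
    skipv1 = n_layers + 1.0 + 0.5 * (n_layers - 1)
    return skipv1 / baseline
```

Under this accounting the ratio is 0.75 + 0.25 / n_layers, so `kv_cache_fraction(12)` is about 0.771, in line with the "approximately 25%" reduction quoted in the abstract. In a second layer, `wv` would have shape `(d_model, d_model // 2)` and `v_first` would be the Value heads returned by the first call.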


Related papers