Fugu-MT Paper Translation (Abstract): Mixture-of-Depths Attention

Paper overview: Mixture-of-Depths Attention

arxiv url: http://arxiv.org/abs/2603.15619v1
Date: Mon, 16 Mar 2026 17:59:55 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.730621
Title: Mixture-of-Depths Attention
Title (reference translation): Mixture-of-Depths Attention (深度混合注意)
Authors: Lianghui Zhu, Yuxin Fang, Bencheng Liao, Shijie Wang, Tianheng Cheng, Zilong Huang, Chen Chen, Lai Wei, Yutao Zeng, Ya Wang, Yi Lin, Yu Li, Xinggang Wang
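Abstract summary: Scaling depth is a key driver for large language models (LLMs). We introduce mixture-of-depths attention (MoDA). MoDA allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers.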
Reference score (independently computed attention metric): 65.80640499676542
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling depth is a key driver for large language models (LLMs). Yet, as LLMs become deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, making them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism that allows each attention head to attend to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We further describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns, achieving 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models demonstrate that MoDA consistently outperforms strong baselines. Notably, it improves average perplexity by 0.2 across 10 validation benchmarks and increases average performance by 2.11% on 10 downstream tasks, with a negligible 3.7% FLOPs computational overhead. We also find that combining MoDA with post-norm yields better performance than using it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is released at https://github.com/hustvl/MoDA .
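Abstract (reference translation): Scaling depth is a key driver for large language models (LLMs). However, as LLMs grow deeper, they often suffer from signal degradation: informative features formed in shallow layers are gradually diluted by repeated residual updates, which makes them harder to recover in deeper layers. We introduce mixture-of-depths attention (MoDA), a mechanism in which each attention head attends to sequence KV pairs at the current layer and depth KV pairs from preceding layers. We also describe a hardware-efficient algorithm for MoDA that resolves non-contiguous memory-access patterns and achieves 97.3% of FlashAttention-2's efficiency at a sequence length of 64K. Experiments on 1.5B-parameter models show that MoDA consistently outperforms strong baselines. In particular, it improves average perplexity by 0.2 across 10 validation benchmarks and raises average performance by 2.11% on 10 downstream tasks, at a negligible computational overhead of 3.7% in FLOPs. We also find that combining MoDA with post-norm performs better than combining it with pre-norm. These results suggest that MoDA is a promising primitive for depth scaling. Code is available at https://github.com/hustvl/MoDA .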
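
The attention pattern described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the linked repository for that); it is a minimal single-head, single-query example under stated assumptions. The abstract only says that a head attends to both sequence KV pairs and depth KV pairs. The choice here to concatenate both sets and normalize them with one joint softmax, the layout of the depth KV pairs, and the function name `moda_head` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def moda_head(q, seq_k, seq_v, depth_k, depth_v):
    """Hypothetical single-query attention over sequence and depth KV pairs.

    q        : query vector of one token at the current layer
    seq_k/v  : keys/values of the visible tokens at the current layer
    depth_k/v: keys/values taken from preceding layers (assumed layout)

    Assumption: both KV sets share one softmax. The abstract does not
    state how the two sets are combined.
    """
    keys = seq_k + depth_k
    values = seq_v + depth_v
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, k) * scale for k in keys])
    dim = len(values[0])
    out = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return out, weights


# Toy example: two sequence KV pairs plus one depth KV pair.
q = [1.0, 0.0]
seq_k = [[1.0, 0.0], [0.0, 1.0]]
seq_v = [[1.0, 0.0], [0.0, 1.0]]
depth_k = [[1.0, 0.0]]
depth_v = [[0.5, 0.5]]

out, weights = moda_head(q, seq_k, seq_v, depth_k, depth_v)
baseline_out, baseline_weights = moda_head(q, seq_k, seq_v, [], [])
```

With empty depth lists the function reduces to ordinary scaled dot-product attention for one query, which is why `baseline_out` serves as a reference point. In this sketch the depth KV pairs simply compete with the sequence KV pairs for attention weight.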

Related papers