Fugu-MT Paper Translation (Abstract): When Does Sparsity Mitigate the Curse of Depth in LLMs

Paper summary: When Does Sparsity Mitigate the Curse of Depth in LLMs

arxiv url: http://arxiv.org/abs/2603.15389v1
Date: Mon, 16 Mar 2026 15:04:16 GMT
Status: Translation complete
Last updated in system: 2026-03-17 18:28:58.522667
Title: When Does Sparsity Mitigate the Curse of Depth in LLMs
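Authors: Dilxat Muhtar, Xinyuan Song, Sebastian Pokutta, Max Zimmer, Nico Pelleriti, Thomas Hofmann, Shiwei Liu
Abstract summary: This work shows that sparsity acts as a regulator of variance propagation and thereby improves depth utilization. The results identify sparsity as a key mechanism for more efficient depth scaling in large language models.
Reference score (independently computed attention score): 53.137717161619484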
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent work has demonstrated the curse of depth in large language models (LLMs), where later layers contribute less to learning and representation than earlier layers. Such under-utilization is linked to the accumulated growth of variance in Pre-Layer Normalization, which can push deep blocks toward near-identity behavior. In this paper, we demonstrate that sparsity, beyond enabling efficiency, acts as a regulator of variance propagation and thereby improves depth utilization. Our investigation covers two sources of sparsity: (i) implicit sparsity, which emerges from training and data conditions, including weight sparsity induced by weight decay and attention sparsity induced by long-context inputs; and (ii) explicit sparsity, which is enforced by architectural design, including key/value-sharing sparsity in Grouped-Query Attention and expert-activation sparsity in Mixture-of-Experts. Our claim is thoroughly supported by controlled depth-scaling experiments and targeted layer-effectiveness interventions. Across settings, we observe a consistent relationship: sparsity improves layer utilization by reducing output variance and promoting functional differentiation. We eventually distill our findings into a practical rule-of-thumb recipe for training depth-effective LLMs, yielding a notable 4.6% accuracy improvement on downstream tasks. Our results reveal sparsity, arising naturally from standard design choices, as a key yet previously overlooked mechanism for effective depth scaling in LLMs. Code is available at https://github.com/pUmpKin-Co/SparsityAndCoD.
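The variance-accumulation mechanism mentioned in the abstract can be illustrated with a toy residual stack. The sketch below is not the authors' code or experimental setup: the function names, the random linear "blocks", and the unrescaled random weight mask (`keep_prob`) are all assumptions made for illustration. It only shows that, in a Pre-LN residual stream, each block adds roughly constant-variance output to a growing stream, and that zeroing a fraction of the weights lowers the variance each block adds.

```python
import numpy as np


def pre_layer_norm(x, eps=1e-6):
    # LayerNorm without learned gain/bias: zero mean, unit variance per token.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


def residual_variance_per_layer(depth, width, keep_prob, seed=0, tokens=256):
    """Track residual-stream variance through a toy Pre-LN stack.

    Hypothetical setup: each block is one random linear map whose weights
    are kept with probability keep_prob (no rescaling after masking).
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((tokens, width))
    variances = []
    for _ in range(depth):
        w = rng.standard_normal((width, width)) / np.sqrt(width)
        mask = rng.random((width, width)) < keep_prob  # random weight sparsity
        # Pre-LN residual update: x <- x + f(LN(x))
        x = x + pre_layer_norm(x) @ (w * mask)
        variances.append(float(x.var()))
    return variances


if __name__ == "__main__":
    dense = residual_variance_per_layer(depth=32, width=64, keep_prob=1.0)
    sparse = residual_variance_per_layer(depth=32, width=64, keep_prob=0.25)
    print("dense  final variance:", round(dense[-1], 2))
    print("sparse final variance:", round(sparse[-1], 2))
```

In this toy model the dense stream's variance grows with depth while the normalized block input stays at unit variance, so a deep block's update becomes small relative to the stream it is added to. The slower growth in the masked run follows directly from the unrescaled mask and says nothing about the paper's measured depth utilization or accuracy results.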


Related papers