Fugu-MT Paper Translation (Abstract): Dynamic sparsity in tree-structured feed-forward layers at scale

Paper overview: Dynamic sparsity in tree-structured feed-forward layers at scale

arxiv url: http://arxiv.org/abs/2604.08565v1
Date: Wed, 18 Mar 2026 09:57:36 GMT
Status: Translation complete
Last updated in system: 2026-04-19 19:09:11.425804
Title: Dynamic sparsity in tree-structured feed-forward layers at scale
Title (reference translation): Dynamic sparsity in large-scale tree-structured feed-forward layers
Authors: Reza Sedghi, Robin Schiewer, Anand Subramoney, David Kappel
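Abstract summary: We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures. We demonstrate for the first time that this form of conditional sparsity can be applied to autoregressive language modeling and downstream question answering.
Reference score (attention metric computed by this site): 0.869928033942254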
License: http://creativecommons.org/licenses/by/4.0/
Abstract: At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, motivating sparse alternatives to dense MLP blocks. We study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures, enabling conditional computation via hard hierarchical routing without a separate router network. We demonstrate for the first time that this form of tree-structured conditional sparsity can be applied for autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and its scalability beyond 1B parameters. Despite activating fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze training dynamics and identify an emergent auto-pruning effect: the interaction of hard routing with asymmetric nonlinearities progressively deactivates unused paths, yielding partial conversion of dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, our work demonstrates that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.
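Abstract (reference translation): At typical context lengths, the feed-forward MLP block accounts for a large share of a transformer's compute budget, which motivates sparse alternatives to dense MLP blocks. In this work, we study sparse, tree-structured feed-forward layers as drop-in replacements for MLP blocks in deep transformer architectures; they realize conditional computation through hard hierarchical routing, without a separate router network. We show that this tree-structured conditional sparsity can be applied to autoregressive language modeling and downstream question answering, including zero- and few-shot settings, and that it scales beyond 1B parameters. Although they activate fewer than 5% of the feed-forward block's units per token, our models match dense baselines under controlled training and fine-tuning protocols. We further analyze the training dynamics and identify an emergent auto-pruning effect: the interaction between asymmetric nonlinearities and hard routing gradually deactivates unused paths, partially converting dynamic routing into static structural sparsity. We show that simple architectural choices can modulate this behavior and recover balanced trees without auxiliary losses. Overall, we show that tree-structured feed-forward layers provide a scalable and controllable mechanism for sparsifying large transformer models.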

Related papers