Fugu-MT Paper Translation (Abstract): Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

Paper overview: Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping

arxiv url: http://arxiv.org/abs/2603.23998v1
Date: Wed, 25 Mar 2026 06:55:33 GMT
Status: Translation complete
Last updated in system: 2026-03-26 21:06:11.171045
Title: Sparse Growing Transformer: Training-Time Sparse Depth Allocation via Progressive Attention Looping
Authors: Yao Chen, Yilong Chen, Yinqi Yang, Junyuan Shang, Zhenyu Zhang, Zefeng Zhang, Shuaiyi Nie, Shuohuan Wang, Yu Sun, Hua Wu, HaiFeng Wang, Tingwen Liu
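Abstract summary: Existing approaches to increasing the effective depth of Transformers rely on parameter reuse. This paper introduces the Sparse Growing Transformer (SGT), a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers.
Reference score (attention metric computed independently by this site): 43.89065405956364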
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Existing approaches to increasing the effective depth of Transformers predominantly rely on parameter reuse, extending computation through recursive execution. Under this paradigm, the network structure remains static along the training timeline, and additional computational depth is uniformly assigned to entire blocks at the parameter level. This rigidity across training time and parameter space leads to substantial computational redundancy during training. In contrast, we argue that depth allocation during training should not be a static preset, but rather a progressively growing structural process. Our systematic analysis reveals a deep-to-shallow maturation trajectory across layers, where high-entropy attention heads play a crucial role in semantic integration. Motivated by this observation, we introduce the Sparse Growing Transformer (SGT). SGT is a training-time sparse depth allocation framework that progressively extends recurrence from deeper to shallower layers via targeted attention looping on informative heads. This mechanism induces structural sparsity by selectively increasing depth only for a small subset of parameters as training evolves. Extensive experiments across multiple parameter scales demonstrate that SGT consistently outperforms training-time static block-level looping baselines under comparable settings, while reducing the additional training FLOPs overhead from approximately 16-20% to only 1-3% relative to a standard Transformer backbone.
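The abstract names two ingredients: picking out high-entropy attention heads, and extending looping from deeper to shallower layers as training proceeds. The sketch below illustrates both ideas in plain Python. It is not the authors' implementation; the abstract does not specify the selection rule or the growth schedule, so the function names, the top-k selection, and the linear schedule (`attention_entropy`, `top_entropy_heads`, `looped_layers`, `max_looped`) are all assumptions made for illustration.

```python
import math


def attention_entropy(rows):
    """Mean Shannon entropy (in nats) over one head's attention rows.

    Each row is one query's attention distribution over keys and is
    assumed to be non-negative and to sum to 1.
    """
    total = 0.0
    for row in rows:
        total += -sum(p * math.log(p) for p in row if p > 0.0)
    return total / len(rows)


def top_entropy_heads(head_rows, k):
    """Indices of the k heads with the highest mean attention entropy.

    Assumption: "informative heads" are approximated here by a simple
    top-k ranking on entropy; the paper may use a different criterion.
    """
    scores = [attention_entropy(rows) for rows in head_rows]
    ranked = sorted(range(len(scores)), key=lambda h: scores[h], reverse=True)
    return sorted(ranked[:k])


def looped_layers(step, total_steps, num_layers, max_looped):
    """Hypothetical linear deep-to-shallow growth schedule.

    Looping is enabled on the deepest layer first and extended toward
    shallower layers as training progresses, up to max_looped layers.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    n = min(max_looped, int(progress * (max_looped + 1)))
    return list(range(num_layers - n, num_layers))


# Toy attention maps for three heads, one query row each.
heads = [
    [[1.0, 0.0, 0.0, 0.0]],      # peaked: entropy 0
    [[0.25, 0.25, 0.25, 0.25]],  # uniform: entropy log(4)
    [[0.7, 0.1, 0.1, 0.1]],      # in between
]
selected = top_entropy_heads(heads, k=2)
schedule = [looped_layers(s, 1000, num_layers=12, max_looped=4)
            for s in (0, 500, 1000)]
```

In this toy setup the uniform and intermediate heads are selected, and the set of looped layers grows from none, to the two deepest layers, to the four deepest layers. Only the selected heads in the currently looped layers would receive extra recurrent passes, which is one way to read the abstract's claim that added depth touches only a small subset of parameters.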

