Fugu-MT Paper Translation (Abstract): On the Residual Scaling of Looped Transformers: Stability and Transferability

Paper abstract: On the Residual Scaling of Looped Transformers: Stability and Transferability

arxiv url: http://arxiv.org/abs/2606.18524v1
Date: Tue, 16 Jun 2026 22:39:13 GMT
Status: Translation complete
System update date: 2026-06-18 17:16:50.919905
Title: On the Residual Scaling of Looped Transformers: Stability and Transferability
Title (reference translation): On the Residual Scaling of Looped Transformers: Stability and Transferability
Authors: Shaowen Wang, Bingrui Li, Ge Zhang, Wenhao Huang, Shen Yan, Jian Li,
Abstract summary: Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
Reference score (independently computed attention metric): 31.27468588849646
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. We show that this is insufficient for looped architectures: weight sharing makes residual updates correlated across iterations, requiring the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), we derive a factored parameterization $\varepsilon = λ/(N\!\sqrt{L})$ that separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. A key consequence is that the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers confirm that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.
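Abstract (reference translation): Looped (weight-tied) Transformers apply a shared residual block $N$ times ($h \leftarrow h + \varepsilon\,f(h)$, with the same $f$ at each step), increasing effective depth without adding parameters. Prior depth-scaling analyses prescribe $\varepsilon = 1/\!\sqrt{L}$ for depth-$L$ residual networks. This is insufficient for looped architectures: weight sharing correlates the residual updates across iterations, which requires the stronger scaling $\varepsilon = 1/N$. For multi-layer blocks ($L$ unique layers looped $N$ times), a factored parameterization $\varepsilon = λ/(N\!\sqrt{L})$ separates the two sources of growth: $1/N$ controls the within-layer loop correlation, and $1/\!\sqrt{L}$ controls the across-layer variance. As a result, the optimal learning rate depends only on the number of unique layers $L$, not on the loop count $N$, enabling direct hyperparameter transfer from small to large $N$ without retuning. Experiments on looped Transformers show that $1/N$ scaling improves trainability and yields better loss than $1/\!\sqrt{N}$ scaling across loop counts.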
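
The update rule and the factored scale in the abstract can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the names `residual_scale` and `looped_forward` are hypothetical, and the toy block $f(h) = h$ is chosen here as an extreme case of updates that are perfectly correlated across iterations. In that toy case the state after $N$ loops is $(1+\varepsilon)^N$, which stays below $e$ for $\varepsilon = 1/N$ but grows quickly for $\varepsilon = 1/\!\sqrt{N}$.

```python
import math


def residual_scale(num_loops, num_unique_layers=1, lam=1.0):
    """Factored scale from the abstract: eps = lam / (N * sqrt(L)).

    num_loops is N, num_unique_layers is L, lam is the constant lambda.
    (Function and argument names are assumptions made for this sketch.)
    """
    if num_loops < 1 or num_unique_layers < 1:
        raise ValueError("N and L must be positive integers")
    return lam / (num_loops * math.sqrt(num_unique_layers))


def looped_forward(h, block_fns, num_loops, eps):
    """Apply the same list of residual functions num_loops times.

    Each step performs h <- h + eps * f(h); reusing block_fns on every
    loop iteration stands in for weight sharing.
    """
    for _ in range(num_loops):
        for f in block_fns:
            h = h + eps * f(h)
    return h


if __name__ == "__main__":
    # Toy block (assumption): f(h) = h, so every update points the same way.
    toy_block = [lambda h: h]
    for n in (4, 16, 64):
        with_1_over_n = looped_forward(1.0, toy_block, n, residual_scale(n))
        with_1_over_sqrt_n = looped_forward(1.0, toy_block, n, 1.0 / math.sqrt(n))
        print(n, with_1_over_n, with_1_over_sqrt_n)
```

Because `residual_scale` depends on $N$ only through the $1/N$ factor, doubling the loop count halves the per-step update. The toy model says nothing about learning rates, so it does not by itself demonstrate the hyperparameter transfer result stated in the abstract.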

Related papers