Fugu-MT Paper Translation (Abstract): Simply Stabilizing the Loop via Fully Looped Transformer

Paper summary: Simply Stabilizing the Loop via Fully Looped Transformer

arxiv url: http://arxiv.org/abs/2605.18797v1
Date: Mon, 11 May 2026 07:21:53 GMT
Status: Translation complete
Last updated in system: 2026-05-20 21:37:32.335229
Title: Simply Stabilizing the Loop via Fully Looped Transformer
Title (reference translation): Stabilizing the Loop with a Fully Looped Transformer
Authors: Rao Fu, Zixuan Yang, Jiankun Zhang, Jing Ma, Hechang Chen, Yu Li, Yi Chang
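Abstract summary: Looped Transformer suffers from training instability as the number of loop iterations increases. Experiments show that the Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets.
Reference score (independently computed attention score): 41.240805541680395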
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Scaling model performance typically requires increasing model size. Looped Transformer offers a compelling alternative by iteratively reusing the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance and test-time compute. However, Looped Transformer still suffers from training instability when the number of loop iterations increases. Our analysis reveals that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, we propose the Fully Looped Transformer, which introduces two parameter-free modifications: (1) Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, enabling the Fully Looped Transformer to be trained stably up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Overall, our experiments demonstrate that Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets by varying loop iterations at inference.
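Abstract (reference translation): Scaling model performance usually requires increasing model size. Looped Transformer offers an attractive alternative: it iteratively reuses the same Transformer blocks, trading additional computation for improved performance without increasing parameter count or context length. Because the number of loop iterations can be adjusted at inference, it also provides a natural mechanism for balancing performance against test-time compute. However, Looped Transformer suffers from training instability as the number of loop iterations increases. The analysis shows that this instability stems from two sources: gradient oscillation and residual explosion. To address these two problems, the authors propose two parameter-free modifications: (1) a Fully Looped Architecture, which distributes inter-loop signals across all layers to mitigate residual explosion; and (2) Attention Injection, which reuses the existing attention block to suppress gradient oscillation. These modifications stabilize training dynamics, allowing the Fully Looped Transformer to be trained stably for up to 12 loop iterations, whereas other baseline looped models collapse in this regime. In milder settings where Looped Transformer does not collapse, Fully Looped Transformer still improves average downstream-task performance by up to 13.2%. Experiments show that the Fully Looped Transformer improves training stability, enhances downstream performance, and provides preliminary adaptability under different test-time compute budgets.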
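
The looping idea described in the abstract (reusing the same blocks for several iterations, so compute grows while the parameter count stays fixed) can be illustrated with a minimal sketch. This is not the paper's Fully Looped Transformer: the abstract does not specify how the Fully Looped Architecture or Attention Injection are implemented, so neither is modeled here. The toy residual block standing in for a Transformer block, and the names `make_block`, `apply_block`, `looped_forward`, and `count_params`, are all hypothetical and chosen only for illustration.

```python
import math
import random


def make_block(dim, seed=0):
    """Create one toy residual block (a stand-in for a Transformer block).

    Assumption: a two-matrix residual update replaces attention and MLP.
    """
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(dim)
    w1 = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
    w2 = [[rng.gauss(0.0, scale) for _ in range(dim)] for _ in range(dim)]
    return {"w1": w1, "w2": w2}


def matvec(w, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def apply_block(block, x):
    """Residual update: x + W2 * tanh(W1 * x)."""
    hidden = [math.tanh(v) for v in matvec(block["w1"], x)]
    update = matvec(block["w2"], hidden)
    return [xi + ui for xi, ui in zip(x, update)]


def looped_forward(blocks, x, n_loops):
    """Run the same stack of blocks n_loops times, reusing its weights."""
    x = list(x)
    for _ in range(n_loops):
        for block in blocks:
            x = apply_block(block, x)
    return x


def count_params(blocks):
    """Count weights in the stack; this does not depend on n_loops."""
    return sum(len(row) for b in blocks for w in b.values() for row in w)


if __name__ == "__main__":
    blocks = [make_block(4, seed=s) for s in range(2)]
    x = [1.0, 0.0, -1.0, 0.5]
    for n_loops in (1, 4, 12):
        out = looped_forward(blocks, x, n_loops)
        norm = math.sqrt(sum(v * v for v in out))
        print(n_loops, count_params(blocks), round(norm, 4))
```

Because `n_loops` is an argument to the forward pass and not part of the weights, it can be changed at inference without retraining or adding parameters, which is the compute-versus-performance knob the abstract refers to.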

Related papers