Fugu-MT paper translation (abstract): On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

Paper overview: On the Convergence of Gradient Descent on Learning Transformers with Residual Connections

arxiv url: http://arxiv.org/abs/2506.05249v1
Date: Thu, 05 Jun 2025 17:10:22 GMT
Status: translation complete
Last updated in system: 2025-06-06 21:53:49.83959
Title: On the Convergence of Gradient Descent on Learning Transformers with Residual Connections
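Title (reference translation): On the Convergence of Gradient Descent in Learning Transformers with Residual Connections
Authors: Zhen Qin, Jinxin Zhou, Zhihui Zhu
Abstract summary: This work analyzes the convergence behavior of a structurally complete single-layer Transformer comprising self-attention, a feedforward network, and residual connections. Residual connections help ameliorate the ill-conditioning of the attention layer's output matrix, an issue that stems from the low-rank structure imposed by the softmax operation.
Reference score (independently computed attention score): 26.02176724426513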
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer models have emerged as fundamental tools across various scientific and engineering disciplines, owing to their outstanding performance in diverse applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in understanding their training dynamics. Existing research predominantly examines isolated components--such as self-attention mechanisms and feedforward networks--without thoroughly investigating the interdependencies between these components, especially when residual connections are present. In this paper, we aim to bridge this gap by analyzing the convergence behavior of a structurally complete yet single-layer Transformer, comprising self-attention, a feedforward network, and residual connections. We demonstrate that, under appropriate initialization, gradient descent exhibits a linear convergence rate, where the convergence speed is determined by the minimum and maximum singular values of the output matrix from the attention layer. Moreover, our analysis reveals that residual connections serve to ameliorate the ill-conditioning of this output matrix, an issue stemming from the low-rank structure imposed by the softmax operation, thereby promoting enhanced optimization stability. We also extend our theoretical findings to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Empirical results corroborate our theoretical insights, illustrating the beneficial role of residual connections in promoting convergence stability.
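Abstract (reference translation): Transformer models have emerged as fundamental tools across various scientific and engineering fields because of their outstanding performance in a wide range of applications. Despite this empirical success, the theoretical foundations of Transformers remain relatively underdeveloped, particularly in the understanding of their training dynamics. Existing work mainly examines isolated components (such as self-attention mechanisms and feedforward networks) without thoroughly investigating the interdependencies between these components, especially when residual connections are present. This paper aims to close that gap by analyzing the convergence behavior of a structurally complete single-layer Transformer comprising self-attention, a feedforward network, and residual connections. Under appropriate initialization, gradient descent exhibits a linear convergence rate, and the convergence speed is determined by the minimum and maximum singular values of the output matrix of the attention layer. The analysis further shows that residual connections help ameliorate the ill-conditioning of this output matrix, an issue that stems from the low-rank structure imposed by the softmax operation, and thereby improve optimization stability. The theoretical results are also extended to a multi-layer Transformer architecture, confirming the linear convergence rate of gradient descent under suitable initialization. Experimental results support the theoretical insights and illustrate the beneficial role of residual connections in promoting convergence stability.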
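
The conditioning claim in the abstract can be illustrated with a small numerical sketch. The NumPy code below is not the paper's experiment: the toy sizes, the random data, the single attention head, the omission of the feedforward network, and the small query/key initialization scale (chosen so that the softmax weights are nearly uniform and the attention output is close to rank one) are all assumptions made here for illustration. It compares the ratio of the largest to the smallest singular value of the attention output with and without a skip connection.

```python
import numpy as np


def softmax_rows(scores):
    """Row-wise softmax, shifted by the row max for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=1, keepdims=True)


def attention_output(X, W_q, W_k, W_v, residual):
    """Single-head self-attention output, optionally with a skip connection."""
    d = X.shape[1]
    A = softmax_rows((X @ W_q) @ (X @ W_k).T / np.sqrt(d))
    out = A @ X @ W_v
    return X + out if residual else out


def condition_number(M):
    """Ratio of the largest to the smallest singular value of M."""
    s = np.linalg.svd(M, compute_uv=False)  # singular values, descending
    return s[0] / s[-1]


rng = np.random.default_rng(0)
n_tokens, d_model = 16, 8   # hypothetical toy sizes, not from the paper
init_scale = 0.01           # assumed small init -> near-uniform attention

X = rng.standard_normal((n_tokens, d_model))
W_q = init_scale * rng.standard_normal((d_model, d_model))
W_k = init_scale * rng.standard_normal((d_model, d_model))
W_v = rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)

cond_plain = condition_number(attention_output(X, W_q, W_k, W_v, residual=False))
cond_resid = condition_number(attention_output(X, W_q, W_k, W_v, residual=True))

print(f"condition number without residual: {cond_plain:.3e}")
print(f"condition number with residual:    {cond_resid:.3e}")
```

In this toy setting the output without the skip connection is dominated by a single direction, so its smallest singular value is tiny, while adding the input back keeps the matrix close to the full-rank input. Whether the same gap appears at other scales or initializations is not something this sketch establishes.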

Related papers