Fugu-MT Paper Translation (Abstract): Finite-Time Analysis of Gradient Descent for Shallow Transformers

Paper overview: Finite-Time Analysis of Gradient Descent for Shallow Transformers

arxiv url: http://arxiv.org/abs/2601.16514v1
Date: Fri, 23 Jan 2026 07:28:17 GMT
Status: Translation complete
Last updated in system: 2026-01-26 14:27:27.590447
Title: Finite-Time Analysis of Gradient Descent for Shallow Transformers
Title (reference translation): Finite-Time Analysis of Gradient Descent for Shallow Transformers
Authors: Enes Arda, Semih Cayci, Atilla Eryilmaz,
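Abstract summary: This work studies why Transformers perform so well, a question made difficult by their non-convex optimization landscape. To keep the full context, the Transformer's memory requirement grows with the sequence length.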
Reference score (independently computed attention measure): 16.566605776410068
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Understanding why Transformers perform so well remains challenging due to their non-convex optimization landscape. In this work, we analyze a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. Our analysis reveals two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts sharply with recurrent architectures, where the optimization error can grow exponentially with $T$. The trade-off is memory: to keep the full context, the Transformer's memory requirement grows with the sequence length. We validate our theoretical results numerically in a teacher-student setting and confirm the predicted scaling laws for Transformers.
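Abstract (reference translation): Understanding why Transformers work well remains difficult because of their non-convex optimization landscape. This work analyzes a shallow Transformer with $m$ independent heads trained by projected gradient descent in the kernel regime. The analysis yields two main findings: (i) the width required for nonasymptotic guarantees scales only logarithmically with the sample size $n$, and (ii) the optimization error is independent of the sequence length $T$. This contrasts with recurrent architectures, where the optimization error can grow exponentially with $T$. To keep the full context, the Transformer's memory requirement grows with the sequence length. The theoretical results are validated numerically in a teacher-student setting, confirming the predicted scaling laws for Transformers.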
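
The abstract mentions training by projected gradient descent. As a rough illustration of that general technique only, the sketch below takes a gradient step and then projects the parameters back onto a Euclidean ball around their initial value. The choice of projection set, the toy quadratic objective, and all names (`project_onto_ball`, `projected_gradient_descent`, `target`, `radius`) are assumptions made for this example and are not taken from the paper.

```python
import math


def project_onto_ball(w, center, radius):
    """Project w onto the Euclidean ball of the given radius around center."""
    diff = [wi - ci for wi, ci in zip(w, center)]
    norm = math.sqrt(sum(d * d for d in diff))
    if norm <= radius:
        return list(w)  # already inside the ball, nothing to do
    scale = radius / norm
    return [ci + scale * d for ci, d in zip(center, diff)]


def projected_gradient_descent(grad, w0, radius, step_size, num_steps):
    """Alternate a gradient step with a projection onto the ball around w0."""
    w = list(w0)
    for _ in range(num_steps):
        g = grad(w)
        w = [wi - step_size * gi for wi, gi in zip(w, g)]
        w = project_onto_ball(w, w0, radius)
    return w


# Hypothetical toy objective: f(w) = 0.5 * ||w - target||^2, so grad f(w) = w - target.
target = [3.0, 4.0]


def grad(w):
    return [wi - ti for wi, ti in zip(w, target)]


w_final = projected_gradient_descent(
    grad, w0=[0.0, 0.0], radius=1.0, step_size=0.5, num_steps=50
)
print(w_final)
```

In this toy problem the unconstrained minimizer `target` lies outside the unit ball, so the iterates settle on the boundary point closest to it, approximately `[0.6, 0.8]`. The paper's actual model, loss, and projection set may differ from this sketch.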


Related papers