Fugu-MT Paper Translation (Abstract): Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy

Paper abstract: Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy

arxiv url: http://arxiv.org/abs/2603.13552v1
Date: Fri, 13 Mar 2026 19:42:12 GMT
Status: translation complete
Last updated in system: 2026-03-17 16:19:35.267756
Title: Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy
Title (reference translation): Ghosts of Softmax: Complex Singularities That Limit Safe Step Sizes in Cross-Entropy
Authors: Piyush Sao
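Abstract summary: Analyses of cross-entropy training rely on local Taylor models of the loss to predict whether a proposed step will decrease the objective. The paper derives closed-form expressions under logit linearization along the proposed update direction. Normalizing by $ρ_a$ shrinks the onset-threshold spread from standard deviation $0.992$ to $0.164$.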
Reference score (independently computed attention score): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Optimization analyses for cross-entropy training rely on local Taylor models of the loss to predict whether a proposed step will decrease the objective. These surrogates are reliable only inside the Taylor convergence radius of the true loss along the update direction. That radius is set not by real-line curvature alone but by the nearest complex singularity. For cross-entropy, the softmax partition function $F=\sum_j \exp(z_j)$ has complex zeros -- ``ghosts of softmax'' -- that induce logarithmic singularities in the loss and cap this radius. To make this geometry usable, we derive closed-form expressions under logit linearization along the proposed update direction. In the binary case, the exact radius is $ρ^*=\sqrt{δ^2+π^2}/Δ_a$. In the multiclass case, we obtain the lower bound $ρ_a=π/Δ_a$, where $Δ_a=\max_k a_k-\min_k a_k$ is the spread of directional logit derivatives $a_k=\nabla z_k\cdot v$. This bound costs one Jacobian-vector product and reveals what makes a step fragile: samples that are both near a decision flip and highly sensitive to the proposed direction tighten the radius. The normalized step size $r=τ/ρ_a$ separates safe from dangerous updates. Across six tested architectures and multiple step directions, no model fails for $r<1$, yet collapse appears once $r\ge 1$. Temperature scaling confirms the mechanism: normalizing by $ρ_a$ shrinks the onset-threshold spread from standard deviation $0.992$ to $0.164$. A controller that enforces $τ\leρ_a$ survives learning-rate spikes up to $10{,}000\times$ in our tests, where gradient clipping still collapses. Together, these results identify a geometric constraint on cross-entropy optimization that operates through Taylor convergence rather than Hessian curvature.
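The binary-case radius quoted in the abstract can be checked with a short worked derivation. This is a sketch under stated assumptions, not the paper's own proof: the abstract does not define $δ$, so it is taken here to be the logit gap $z_1 - z_2$ at the current point, and the logits are assumed linear in the step size $t$ (the "logit linearization" the abstract mentions).

```latex
% Assumptions (not from the abstract): \delta = z_1 - z_2, and z_k(t) = z_k + a_k t.
\begin{align*}
F(t) &= e^{z_1 + a_1 t} + e^{z_2 + a_2 t}
      = e^{z_2 + a_2 t}\left(1 + e^{\delta + (a_1 - a_2)\,t}\right), \\
F(t) = 0
  &\iff \delta + (a_1 - a_2)\,t = i(2m+1)\pi, \quad m \in \mathbb{Z}, \\
t_m &= \frac{-\delta + i(2m+1)\pi}{a_1 - a_2}, \qquad
|t_m| = \frac{\sqrt{\delta^2 + (2m+1)^2\pi^2}}{|a_1 - a_2|}, \\
\rho^* &= \min_m |t_m| = \frac{\sqrt{\delta^2 + \pi^2}}{\Delta_a}
  \;\ge\; \frac{\pi}{\Delta_a}, \qquad \Delta_a = |a_1 - a_2| .
\end{align*}
```

The nearest zeros are at $m=0$ and $m=-1$, and the inequality in the last line shows why $π/Δ_a$ is a lower bound on the binary radius, with equality when the two logits are tied ($δ=0$).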
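Abstract (reference translation): Optimization analyses for cross-entropy training rely on local Taylor models of the loss to predict whether a proposed step will decrease the objective. These surrogates are reliable only within the Taylor convergence radius of the true loss along the update direction. That radius is determined not by real-line curvature alone but by the nearest complex singularity. For cross-entropy, the softmax partition function $F=\sum_j \exp(z_j)$ has complex zeros -- ``ghosts of softmax'' -- which induce logarithmic singularities in the loss and cap this radius. To make this geometry usable, closed-form expressions are derived under logit linearization along the proposed update direction. In the binary case, the exact radius is $ρ^*=\sqrt{δ^2+π^2}/Δ_a$. In the multiclass case, the lower bound $ρ_a=π/Δ_a$ is obtained, where $Δ_a=\max_k a_k-\min_k a_k$ is the spread of the directional logit derivatives $a_k=\nabla z_k\cdot v$. This bound costs one Jacobian-vector product and shows what makes a step fragile: samples that are close to a decision flip and highly sensitive to the proposed direction tighten the radius. The normalized step size $r=τ/ρ_a$ separates safe updates from dangerous ones. Across six tested architectures and multiple step directions, no model fails for $r<1$, but collapse appears once $r\ge 1$. Normalizing by $ρ_a$ shrinks the onset-threshold spread from standard deviation $0.992$ to $0.164$. A controller that enforces $τ\leρ_a$ survives learning-rate spikes of up to $10{,}000\times$ in the authors' tests, where gradient clipping still collapses. These results identify a geometric constraint on cross-entropy optimization that acts through Taylor convergence rather than Hessian curvature.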
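The quantities named in the abstract ($a_k$, $Δ_a$, $ρ_a$, $r=τ/ρ_a$, and a step cap $τ\leρ_a$) can be illustrated with a small NumPy sketch. Everything below is hypothetical scaffolding rather than the paper's implementation: the function names are invented for illustration, the model is assumed to be a linear softmax classifier so that the Jacobian-vector product is simply `dW @ x`, $δ$ is assumed to be the logit gap between the two classes, and taking the minimum radius over the samples in a batch is an assumption about how per-sample radii would be aggregated.

```python
import numpy as np


def directional_logit_derivatives(dW, x):
    """a_k = grad(z_k) . v for linear logits z = W @ x along direction dW.

    Assumption: a linear model, so the JVP is exactly dW @ x.
    """
    return np.asarray(dW, dtype=float) @ np.asarray(x, dtype=float)


def logit_derivative_spread(a):
    """Delta_a = max_k a_k - min_k a_k for one sample."""
    a = np.asarray(a, dtype=float)
    return float(a.max() - a.min())


def radius_lower_bound(a):
    """rho_a = pi / Delta_a; infinite if every logit moves at the same rate."""
    spread = logit_derivative_spread(a)
    return np.inf if spread == 0.0 else float(np.pi / spread)


def binary_radius(delta, a):
    """rho* = sqrt(delta^2 + pi^2) / Delta_a for two classes.

    Assumption: delta is the logit gap z_1 - z_2 at the current point.
    """
    spread = logit_derivative_spread(a)
    return np.inf if spread == 0.0 else float(np.hypot(delta, np.pi) / spread)


def capped_step(tau, a_batch):
    """Return (tau clipped to rho_a, normalized step r = tau / rho_a).

    Assumption: the batch radius is the smallest per-sample rho_a.
    """
    rho = min(radius_lower_bound(a) for a in a_batch)
    return min(tau, rho), tau / rho


if __name__ == "__main__":
    # Hypothetical 3-class example: direction dW applied to one input x.
    dW = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 0.0]])
    x = np.array([1.0, 2.0])
    a = directional_logit_derivatives(dW, x)  # one a_k per class
    tau_safe, r = capped_step(3.0, [a])
    print("rho_a =", radius_lower_bound(a), "r =", r, "capped tau =", tau_safe)
```

In an autodiff framework the exact `dW @ x` shortcut would be replaced by a single Jacobian-vector product through the network, which is the one-JVP cost the abstract refers to.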

