Fugu-MT Paper Translation (Abstract): The Affine Divergence: Aligning Activation Updates Beyond Normalisation

Paper summary: The Affine Divergence: Aligning Activation Updates Beyond Normalisation

arxiv url: http://arxiv.org/abs/2512.22247v1
Date: Wed, 24 Dec 2025 00:31:22 GMT
Status: Translation complete
Last updated in system: 2025-12-30 22:37:29.92933
Title: The Affine Divergence: Aligning Activation Updates Beyond Normalisation
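Title (reference translation): Affine Divergence: Activation Updates Beyond Normalisation
Authors: George Bird
Abstract summary: A systematic mismatch exists between the mathematically ideal and the effective activation updates during gradient descent. It is argued that normalisers are better decomposed into activation-function-like maps with parameterised scaling, which aids the prioritisation of representations during optimisation. This constitutes a theoretical-principled approach that yields several new, empirically validated functions and raises questions about the affine + nonlinear approach to model creation.
Reference score (independently computed attention score): 0.0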
License: http://creativecommons.org/licenses/by/4.0/
Abstract: A systematic mismatch exists between mathematically ideal and effective activation updates during gradient descent. As intended, parameters update in their direction of steepest descent. However, activations are argued to constitute a more directly impactful quantity to prioritise in optimisation, as they are closer to the loss in the computational graph and carry sample-dependent information through the network. Yet their propagated updates do not take the optimal steepest-descent step. These quantities exhibit non-ideal sample-wise scaling across affine, convolutional, and attention layers. Solutions to correct for this are trivial and, entirely incidentally, derive normalisation from first principles despite motivational independence. Consequently, such considerations offer a fresh and conceptual reframe of normalisation's action, with auxiliary experiments bolstering this mechanistically. Moreover, this analysis makes clear a second possibility: a solution that is functionally distinct from modern normalisations, without scale-invariance, yet remains empirically successful, outperforming conventional normalisers across several tests. This is presented as an alternative to the affine map. This generalises to convolution via a new functional form, "PatchNorm", a compositionally inseparable normaliser. Together, these provide an alternative mechanistic framework that adds to, and counters some of, the discussion of normalisation. Further, it is argued that normalisers are better decomposed into activation-function-like maps with parameterised scaling, thereby aiding the prioritisation of representations during optimisation. Overall, this constitutes a theoretical-principled approach that yields several new functions that are empirically validated and raises questions about the affine + nonlinear approach to model creation.
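Abstract (reference translation): A systematic mismatch exists between the mathematically ideal and the effective activation updates during gradient descent. As intended, parameters update in their direction of steepest descent. However, activations are argued to be a more directly impactful quantity to prioritise in optimisation, because they sit closer to the loss in the computational graph and carry sample-dependent information through the network. Yet their propagated updates do not take the optimal steepest-descent step. These quantities exhibit non-ideal sample-wise scaling across affine, convolutional, and attention layers. Solutions that correct for this are trivial and, entirely incidentally, derive normalisation from first principles despite being independently motivated. Consequently, these considerations offer a fresh conceptual reframing of how normalisation acts, and auxiliary experiments support this mechanism. Moreover, the analysis makes a second possibility clear: a solution that is functionally distinct from modern normalisations and lacks scale-invariance, yet is empirically successful and outperforms conventional normalisers across several tests. This is presented as an alternative to the affine map. It generalises to convolution via a new functional form, "PatchNorm", a compositionally inseparable normaliser. Together, these provide an alternative mechanistic framework that adds to, and counters some of, the discussion of normalisation. Further, it is argued that normalisers are better decomposed into activation-function-like maps with parameterised scaling, which aids the prioritisation of representations during optimisation. Overall, this constitutes a theoretical-principled approach that yields several new, empirically validated functions and raises questions about the affine + nonlinear approach to model creation.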
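
Illustration (not from the paper): the sample-wise scaling the abstract mentions can be seen in the simplest case. The sketch below assumes a single affine layer z = Wx + b with a bias and one plain single-sample SGD step; the function name `affine_activation_update` and all values are hypothetical. Under these assumptions the parameter updates are ΔW = -η g xᵀ and Δb = -η g, where g = ∂L/∂z, so the activation moves by Δz = -η g (‖x‖² + 1) rather than by the step -η g. The paper's own formulation and proposed corrections may differ.

```python
import numpy as np

def affine_activation_update(W, b, x, g, lr):
    """Change in z = W @ x + b after one single-sample SGD step on W and b.

    g is the loss gradient with respect to z for this sample (assumed given).
    """
    dW = -lr * np.outer(g, x)   # steepest-descent step on the weights
    db = -lr * g                # steepest-descent step on the bias
    z_before = W @ x + b
    z_after = (W + dW) @ x + (b + db)
    return z_after - z_before

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
b = rng.normal(size=3)
g = rng.normal(size=3)
lr = 0.01

for x in (0.1 * np.ones(4), 10.0 * np.ones(4)):
    dz = affine_activation_update(W, b, x, g, lr)
    # Ratio of the realised activation change to the step -lr * g.
    # Every entry equals ||x||^2 + 1, so it varies from sample to sample.
    print(dz / (-lr * g), x @ x + 1.0)
```

The same gradient g therefore produces a much larger activation change for a large-norm input than for a small-norm one, which is one way to read the "non-ideal sample-wise scaling" described in the abstract.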

Related papers