Fugu-MT Paper Translation (Abstract): Scaling Bidirectional Spans and Span Violations in Attention Mechanism

Paper overview: Scaling Bidirectional Spans and Span Violations in Attention Mechanism

arxiv url: http://arxiv.org/abs/2512.13033v1
Date: Mon, 15 Dec 2025 07:03:24 GMT
Status: Translation complete
Last updated in system: 2025-12-16 17:54:56.560387
Title: Scaling Bidirectional Spans and Span Violations in Attention Mechanism
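Title (reference translation): Scaling of Bidirectional Spans and Span Violations in the Attention Mechanism
Authors: Jongwook Kim, Sangheon Yun, Sukjin Yoon
Abstract summary: The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling. This work proposes an optimization framework that uses an asymmetric projection to decompose the backward-pass gradients into parallel spans. The authors show that selectively scaling these components, focusing primarily on the $0^{th}$ order bidirectional parallel spans, yields the most effective learning signal.
Reference score (independently computed attention level): 5.755498052202004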
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The canonical $O(N^2)$ Transformer remains the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that leverages an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations, while keeping the canonical forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong theoretical evidence: the standard attention gradient is suboptimal. We demonstrate that selectively scaling these components, focusing primarily on $0^{th}$ order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset, and using a crude configuration, this method achieved a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential gains on larger datasets and deeper training regimes.
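Abstract (reference translation): The standard $O(N^2)$ Transformer is the empirical performance frontier in sequence modeling, and its training can be further optimized by addressing geometric inefficiency. We propose an optimization framework that uses an asymmetric projection to decompose the backward-pass gradients into parallel spans and orthogonal violations while keeping the standard forward-pass $QKV$ structure intact. Through consistent experimental validation across various decomposition and projection setups, we provide strong theoretical evidence that the standard attention gradient is suboptimal. We show that selectively scaling these components, focusing primarily on the $0^{th}$ order bidirectional parallel spans, yields the most effective learning signal. On the limited WikiText-2 dataset with a crude configuration, this method achieved a $0.56\%$ reduction in validation loss, confirming the framework's fundamental validity and suggesting significant potential on larger datasets and deeper training regimes.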
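
The abstract speaks of decomposing a gradient into a parallel span and an orthogonal violation and then scaling the pieces selectively. The following is a minimal generic sketch of that idea only, assuming the "span" is the row space of some reference matrix. It is not the paper's asymmetric projection or its $0^{th}$ order bidirectional spans, which the abstract does not specify. The function names and the scale factors `alpha` and `beta` are hypothetical, introduced purely for illustration.

```python
import numpy as np


def split_parallel_orthogonal(grad, basis):
    """Split grad (n x d) into the part lying in the row space of basis (k x d)
    and the orthogonal residual. Illustrative only, not the paper's method."""
    # pinv(B) @ B is the orthogonal projector onto the row space of B (d x d).
    projector = np.linalg.pinv(basis) @ basis
    parallel = grad @ projector        # component inside the span
    orthogonal = grad - parallel       # component violating the span
    return parallel, orthogonal


def rescale_gradient(grad, basis, alpha=1.0, beta=1.0):
    """Recombine the two components with hypothetical scale factors.
    alpha = beta = 1 recovers the unmodified gradient."""
    parallel, orthogonal = split_parallel_orthogonal(grad, basis)
    return alpha * parallel + beta * orthogonal


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    grad = rng.normal(size=(4, 6))     # stand-in for a backward-pass gradient
    basis = rng.normal(size=(2, 6))    # stand-in for the matrix defining the span
    # Emphasize the parallel part and damp the orthogonal part.
    scaled = rescale_gradient(grad, basis, alpha=1.5, beta=0.5)
    print(scaled.shape)
```

Because the two components sum back to the original gradient, any choice of `alpha` and `beta` other than 1 changes only the backward signal, which matches the abstract's statement that the forward-pass $QKV$ structure is left intact.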

Related papers