Fugu-MT Paper Translation (Abstract): Why Do Accumulated Transformations Extrapolate?

Paper overview: Why Do Accumulated Transformations Extrapolate?

arxiv url: http://arxiv.org/abs/2606.24975v1
Date: Tue, 23 Jun 2026 12:08:30 GMT
Status: Translation complete
Last updated in system: 2026-06-25 17:05:30.077762
Title: Why Do Accumulated Transformations Extrapolate?
Authors: Mahesh Godavarti
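Abstract summary: PaTH Attention showed that replacing RoPE's position-indexed rotations with accumulated data-dependent reflections yields strong length extrapolation. We study a simpler variant that keeps RoPE's block-diagonal SO(2) rotations but replaces the position-indexed angles with accumulated token-dependent ones. We prove that the result extends to accumulated transformations satisfying certain regularity conditions.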
Reference score (site-computed interest level): 0.0
License: http://creativecommons.org/licenses/by/4.0/
Abstract: PaTH Attention showed that replacing RoPE's position-indexed rotations with accumulated data-dependent Householder reflections yields strong length extrapolation, though performance degrades at extreme context lengths. We ask whether this depends on Householder-specific structure or reflects a general property of accumulated transformations along source-to-query paths. We study a simpler variant keeping RoPE's block-diagonal SO(2) rotations but replacing position-indexed angles with accumulated token-dependent ones. It shows the same pattern: improved extrapolation then degradation at long contexts. We prove the result extends to accumulated orthogonal transformations satisfying certain regularity conditions: their products become incoherent after finitely many steps, suppressing attention to distant tokens. Accumulated rotations of queries and keys create a finite mixing window independent of context length; per-token suppression learned in training transfers unchanged to any evaluation length, and high-dimensional concentration produces a score gap suppressing far tokens while near-route transport preserves the target signal. Conversely, a lower bound shows accumulated rotations must eventually degrade: as the far set grows, no rotations preserve the near signal without explicit far-mass control. For SO(2) rotations, rotating values too makes residual far contributions combine incoherently, extending the range. Controlled experiments support these predictions: random accumulated rotations substantially improve extrapolation over RoPE, learned token-dependent rotations maintain near-training-length perplexity far beyond the training context, and rotating values helps over queries and keys alone. Rotation-only models still degrade at extreme lengths, while ALiBi stays length-stable, consistent with the need for far-mass control.
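The difference between position-indexed and accumulated token-dependent angles can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names (`rotate_pairs`, `rope_angles`, `accumulated_angles`, `attention_scores`), the pairing of consecutive dimensions into 2D blocks, and the use of a plain cumulative sum over per-token angles are all assumptions made for illustration, since the abstract does not give the exact parameterization.

```python
import numpy as np

def rotate_pairs(x, angles):
    """Apply block-diagonal SO(2) rotations to x.

    x:      (seq_len, dim), dim even; consecutive pairs form 2D blocks
            (an assumed convention, implementations differ).
    angles: (seq_len, dim // 2), one angle per position and block.
    """
    x1, x2 = x[:, 0::2], x[:, 1::2]
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_angles(seq_len, dim, base=10000.0):
    """Position-indexed angles: angle[t, b] = t * freq[b] (standard RoPE)."""
    freqs = base ** (-np.arange(dim // 2) / (dim // 2))
    return np.arange(seq_len)[:, None] * freqs[None, :]

def accumulated_angles(token_angles):
    """Accumulated token-dependent angles: running sum of per-token angles.

    token_angles: (seq_len, dim // 2). In a trained model these would be
    predicted from each token; here they are just an input array.
    """
    return np.cumsum(token_angles, axis=0)

def attention_scores(q, k, angles):
    """Causal pre-softmax scores after rotating queries and keys."""
    q_rot = rotate_pairs(q, angles)
    k_rot = rotate_pairs(k, angles)
    scores = q_rot @ k_rot.T / np.sqrt(q.shape[1])
    causal = np.tril(np.ones((len(q), len(q)), dtype=bool))
    return np.where(causal, scores, -np.inf)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq_len, dim = 6, 8
    q = rng.normal(size=(seq_len, dim))
    k = rng.normal(size=(seq_len, dim))
    # Random angles stand in for learned token-dependent ones.
    token_angles = rng.uniform(-np.pi, np.pi, size=(seq_len, dim // 2))
    print(attention_scores(q, k, rope_angles(seq_len, dim)).shape)
    print(attention_scores(q, k, accumulated_angles(token_angles)).shape)
```

In this sketch the score between query `i` and key `j` depends only on the angles of tokens `j+1` through `i`, that is, on the rotation accumulated along the source-to-query path. With `rope_angles` the same quantity reduces to a function of the offset `i - j`.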

List of related papers