Fugu-MT Paper Translation (Abstract): Residual Stream Duality in Modern Transformer Architectures

Paper overview: Residual Stream Duality in Modern Transformer Architectures

arxiv url: http://arxiv.org/abs/2603.16039v1
Date: Tue, 17 Mar 2026 00:56:29 GMT
Status: Translation complete
Last updated in system: 2026-03-18 17:42:07.055616
Title: Residual Stream Duality in Modern Transformer Architectures
Authors: Yifan Zhang
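Abstract summary: Recent work has made clear that the residual pathway is not mere optimization plumbing but part of the model's representational machinery. We argue that the cleanest way to organize this design space is through a two-axis view of the Transformer.
Reference score (site-computed attention metric): 9.910562011343009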
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recent work has made clear that the residual pathway is not mere optimization plumbing; it is part of the model's representational machinery. We agree, but argue that the cleanest way to organize this design space is through a two-axis view of the Transformer. A decoder evolves information along two ordered dimensions: sequence position and layer depth. Self-attention already provides adaptive mixing along the sequence axis, whereas the residual stream usually performs fixed addition along the depth axis. If we fix a token position and treat layer index as the ordered variable, then a causal depth-wise residual attention read is exactly the same local operator as causal short sliding-window attention (ShortSWA), except written over depth rather than over sequence. This is the core residual stream duality behind Transformer$^2$. This perspective also clarifies the recent literature. ELC-BERT and DenseFormer already show that learned aggregation over depth can outperform uniform residual accumulation, while Vertical Attention, DeepCrossAttention (DCA), MUDDFormer, and Attention Residuals move further toward explicit attention-based routing over earlier layers. The key point, however, is that operator-level duality does not imply systems-level symmetry. For large-scale autoregressive models, sequence-axis ShortSWA is usually the more hardware-friendly placement because it reuses token-side sliding-window kernels, KV-cache layouts, and chunked execution. If the goal is instead to change the shortcut itself, Deep Delta Learning (DDL) is the cleaner intervention because it modifies the residual operator directly rather than adding a separate cross-layer retrieval path. Our recommendation is therefore simple: use DDL when the shortcut is the object of interest, and use sequence-axis ShortSWA when the goal is local adaptive mixing.
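The duality stated in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: it assumes plain softmax attention in which the hidden states themselves serve as queries, keys, and values (no learned projections), and the names `causal_window_attention`, `sequence_axis_read`, `depth_axis_read`, and `swap_axes` are hypothetical. The only point is that one causal sliding-window operator serves both axes, depending on whether the layer or the token position is held fixed.

```python
import math
import random


def causal_window_attention(queries, keys, values, window):
    """Causal sliding-window softmax attention along one ordered axis.

    Each argument is a list of n vectors (lists of floats). Step i attends
    only to steps max(0, i - window + 1) .. i along that axis.
    """
    dim = len(queries[0])
    outputs = []
    for i in range(len(queries)):
        start = max(0, i - window + 1)
        scores = [
            sum(q * k for q, k in zip(queries[i], keys[j])) / math.sqrt(dim)
            for j in range(start, i + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out = [0.0] * len(values[0])
        for w, j in zip(weights, range(start, i + 1)):
            for c, v in enumerate(values[j]):
                out[c] += (w / total) * v
        outputs.append(out)
    return outputs


def sequence_axis_read(states, layer, window):
    """Fix a layer; the ordered variable is token position (ShortSWA-style)."""
    row = states[layer]
    # Assumption: raw hidden states act as queries, keys, and values.
    return causal_window_attention(row, row, row, window)


def depth_axis_read(states, position, window):
    """Fix a token position; the ordered variable is layer index."""
    column = [states[layer][position] for layer in range(len(states))]
    return causal_window_attention(column, column, column, window)


def swap_axes(states):
    """Turn states[layer][position] into states[position][layer]."""
    return [list(column) for column in zip(*states)]


if __name__ == "__main__":
    random.seed(0)
    num_layers, num_tokens, dim = 4, 6, 3
    # states[layer][position] is the hidden vector at that grid cell.
    states = [
        [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]
        for _ in range(num_layers)
    ]
    depth = depth_axis_read(states, position=2, window=3)
    same_operator = sequence_axis_read(swap_axes(states), layer=2, window=3)
    print(depth == same_operator)
```

In this sketch, reading over depth at a fixed position is the sequence-axis read applied to the grid with its two axes swapped, which is the operator-level equivalence. It says nothing about the systems-level asymmetry the abstract emphasizes (kernel reuse, KV-cache layout, chunked execution), which a toy list-based implementation cannot capture.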
