Fugu-MT Paper Translation (Abstract): Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers

Paper abstract: Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers

arxiv url: http://arxiv.org/abs/2604.23681v1
Date: Sun, 26 Apr 2026 12:43:12 GMT
Status: Translation complete
Last updated in system: 2026-04-28 17:12:07.497366
Title: Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers
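Title (reference translation): Rank, Head-Channel Non-Identifiability, and Symmetry Breaking: A Precise Analysis of Representational Collapse in Transformers
Authors: Giansalvo Cirrincione
Abstract summary: According to a widely cited result by Dong et al. (2021), Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer rapid rank collapse. The paper shows that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding.
Reference score (independently computed interest score): 0.0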
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A widely cited result by Dong et al. (2021) showed that Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. We show that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The widespread claim that LN "plays no role" is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without contribution from the MLP. The MLP's irreplaceable function is different: generating feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal. A constructive partial remedy is proposed: a position-gated output projection (PG-OP) at parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature -- rank collapse in depth, in width, head-channel non-identifiability, and entropy collapse -- are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer's forward pass.
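The rank-collapse behaviour attributed to Dong et al. (2021) in the abstract can be illustrated with a toy sketch. It is not the paper's construction: it assumes a single head with identity query, key and value weights, no skip connection and no MLP, and the `spread` measure and the three example tokens are hypothetical choices made only for illustration. Under those assumptions each output token is a strictly positive weighted average of the input tokens, so the gap between tokens can only shrink with depth.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention_only_layer(tokens):
    """One self-attention layer with identity Q/K/V weights (toy assumption),
    with no skip connection and no MLP."""
    d = len(tokens[0])
    scale = math.sqrt(d)
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in tokens]
        weights = softmax(scores)
        # Each new token is a convex combination of the old tokens.
        out.append([sum(w * t[j] for w, t in zip(weights, tokens)) for j in range(d)])
    return out


def spread(tokens):
    """Largest per-coordinate gap between any two tokens (0 means all identical)."""
    d = len(tokens[0])
    return max(max(t[j] for t in tokens) - min(t[j] for t in tokens) for j in range(d))


# Hypothetical example tokens (n = 3, d = 2).
tokens = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
spreads = [spread(tokens)]
for _ in range(10):
    tokens = attention_only_layer(tokens)
    spreads.append(spread(tokens))

for depth, s in enumerate(spreads):
    print(f"layer {depth:2d}: spread = {s:.6f}")
```

The printed spread shrinks layer by layer, which is the toy analogue of all token representations converging to a single vector. The sketch says nothing about the residual or layer-normalisation results, which depend on generic weight matrices.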
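Abstract (reference translation): According to a widely cited result by Dong et al. (2021), Transformers built from self-attention alone, without skip connections or feed-forward layers, suffer from rapid rank collapse: all token representations converge to a single direction. The proposed remedy was the MLP. The paper shows that this picture, while correct in the regime studied by Dong, is incomplete in ways that matter for architectural understanding. Three results are established. First, layer normalisation is precisely affine-rank-neutral: it preserves the affine rank of the token representation set exactly. The claim that LN "plays no role" is imprecise; the correct statement is sharper. Second, residual connections generically obstruct rank collapse in real Transformers such as BERT-base, in a measure-theoretic sense, without any contribution from the MLP. The MLP's irreplaceable function is different: it generates feature directions outside the linear span of the original token embeddings, which no stack of attention layers can produce. Third, a phenomenon distinct from rank collapse is identified: head-channel non-identifiability. After multi-head attention sums the per-head outputs through the output projection, individual contributions cannot be canonically attributed to a specific head; n(H-1)d_k degrees of freedom per layer remain ambiguous when recovering a single head from the mixed signal. The MLP cannot remedy this because it acts on the post-summation signal. A constructive partial remedy is proposed: a position-gated output projection (PG-OP) with a parameter overhead below 1.6% of the standard output projection. The four collapse phenomena identified in the literature (rank collapse in depth, rank collapse in width, head-channel non-identifiability, and entropy collapse) are unified under a symmetry-breaking framework, each corresponding to a distinct symmetry of the Transformer's forward pass.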
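The summation ambiguity behind head-channel non-identifiability can be sketched in a few lines. This is a minimal illustration, not the paper's analysis: the helper names `ambiguous_dof` and `mix`, the integer toy vectors, and the sequence length n = 128 are assumptions, while H = 12 and d_k = 64 are the usual BERT-base head count and per-head width. The first function evaluates the n(H-1)d_k count quoted in the abstract. The rest shows that two different sets of per-head contributions can produce exactly the same summed signal, so the sum alone does not determine which head contributed what.

```python
def ambiguous_dof(n, num_heads, d_k):
    """Count quoted in the abstract: n * (H - 1) * d_k per layer."""
    return n * (num_heads - 1) * d_k


def mix(contributions):
    """Sum already-projected per-head contributions coordinate-wise."""
    return [sum(vals) for vals in zip(*contributions)]


# Assumed sizes: n = 128 tokens, H = 12 heads, d_k = 64 (BERT-base-like).
print(ambiguous_dof(128, 12, 64))  # 128 * 11 * 64 = 90112

# Hypothetical per-head contributions for one token (two heads, d = 3).
heads_a = [[3, -1, 2], [0, 4, -2]]

# Move an arbitrary vector from head 2 to head 1: the sum is unchanged.
shift = [5, -7, 1]
heads_b = [
    [a + s for a, s in zip(heads_a[0], shift)],
    [a - s for a, s in zip(heads_a[1], shift)],
]

print(mix(heads_a), mix(heads_b))  # both print [3, 3, 0]
```

Integers are used so that the two sums compare exactly. Any downstream module that sees only the summed signal, as the MLP does, receives the same input in both cases.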

Related papers