Fugu-MT Paper Translation (Abstract): Multi-Head Attention Is a Multi-Player Game

Paper summary: Multi-Head Attention Is a Multi-Player Game

arxiv url: http://arxiv.org/abs/2602.00861v1
Date: Sat, 31 Jan 2026 18:49:49 GMT
Status: translation complete
Last updated in system: 2026-02-03 19:28:33.438788
Title: Multi-Head Attention Is a Multi-Player Game
Title (reference translation): Multi-Head Attention Is a Multi-Player Game
Authors: Kushal Chakrabarti, Nirmal Balachundar
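Abstract summary: Cross-entropy training induces an implicit potential game among attention heads. Gradient descent converges to Nash equilibria with potentially unbounded inefficiency. The authors instantiate a corrective regularizer as GAME-LoRA, combining Barlow Twins decorrelation with log-determinant coordination pressure.
Reference score (independently computed attention score): 0.0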
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern transformer attention is internally multi-agent (heads compete and coordinate), yet we train it as if it were a monolithic optimizer. We formalize this gap: cross-entropy training induces an implicit potential game among heads, and gradient descent converges to Nash equilibria with potentially unbounded inefficiency due to unpriced externalities (redundancy, correlated errors). Our main result bounds the Price of Anarchy by $\Gamma(G)$, the off-diagonal mass of a head interaction matrix capturing weight and gradient coupling. Under mild smoothness assumptions, we prove that both excess hallucination probability and excess head redundancy scale with PoA, unifying two distinct failure modes into a single mechanism. The bound is prescriptive: regularization that reduces $\Gamma(G)$ provably tightens PoA. We instantiate this as GAME-LoRA, combining Barlow Twins decorrelation with log-determinant coordination pressure. Experiments validate the theory: $\Gamma(G)$ predicts hallucination ($p < 0.05$), emergent coalitions exhibit selective coordination, and GAME-LoRA achieves up to 18% hallucination reduction (8% average) with no knowledge degradation, a Pareto improvement inaccessible to methods ignoring the game structure.
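Abstract (reference translation): Attention in modern transformers is internally multi-agent, with heads competing and coordinating, yet it is trained as if it were a monolithic optimizer. Cross-entropy training induces an implicit potential game among heads, and gradient descent converges to Nash equilibria with potentially unbounded inefficiency due to unpriced externalities (redundancy, correlated errors). Our main result bounds the Price of Anarchy by $\Gamma(G)$, the off-diagonal mass of a head interaction matrix that captures weight and gradient coupling. Under mild smoothness assumptions, we prove that excess hallucination probability and excess head redundancy both scale with PoA, unifying two distinct failure modes into a single mechanism. The bound is prescriptive: regularization that reduces $\Gamma(G)$ provably tightens PoA. We instantiate this as GAME-LoRA, combining Barlow Twins decorrelation with log-determinant coordination pressure. Experiments validate the theory: $\Gamma(G)$ predicts hallucination ($p < 0.05$), emergent coalitions exhibit selective coordination, and GAME-LoRA achieves up to 18% hallucination reduction (8% average) with no knowledge degradation.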
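
The abstract names three quantities without giving formulas: the off-diagonal mass of a head interaction matrix, a Barlow Twins-style decorrelation term, and a log-determinant term. The following is a minimal NumPy sketch of what such quantities could look like, not the paper's implementation. The abstract does not define $G$ (it is said to capture weight and gradient coupling), so a plain correlation matrix over per-head scalar features is used here as a stand-in. All function names, the choice of absolute-value sum for "mass", and the sign of the log-determinant term are assumptions made for illustration.

```python
import numpy as np


def head_correlation(head_outputs):
    """Correlation matrix between heads.

    head_outputs: array of shape (num_heads, num_samples), one scalar
    feature per head per sample (a hypothetical stand-in for whatever
    the paper uses to build its interaction matrix G).
    """
    z = head_outputs - head_outputs.mean(axis=1, keepdims=True)
    z = z / (z.std(axis=1, keepdims=True) + 1e-8)
    return z @ z.T / head_outputs.shape[1]


def off_diagonal_mass(G):
    """Sum of absolute off-diagonal entries (assumed reading of 'mass')."""
    off = G - np.diag(np.diag(G))
    return np.abs(off).sum()


def decorrelation_penalty(C):
    """Barlow Twins-style redundancy term: squared off-diagonal entries."""
    off = C - np.diag(np.diag(C))
    return (off ** 2).sum()


def redundancy_logdet(C, eps=1e-6):
    """Negative log-determinant of a correlation matrix.

    Near zero when heads are decorrelated, large when heads are
    nearly redundant. The sign convention is an assumption.
    """
    _, logdet = np.linalg.slogdet(C + eps * np.eye(C.shape[0]))
    return -logdet


# Toy comparison: four independent heads vs. the same heads with
# head 1 replaced by a noisy copy of head 0.
rng = np.random.default_rng(0)
base = rng.normal(size=(4, 512))
redundant = base.copy()
redundant[1] = base[0] + 0.1 * rng.normal(size=512)

for name, feats in [("independent", base), ("redundant", redundant)]:
    C = head_correlation(feats)
    print(name, off_diagonal_mass(C), decorrelation_penalty(C), redundancy_logdet(C))
```

In this toy setup all three quantities grow when one head duplicates another, which is the kind of redundancy the abstract describes as an unpriced externality. How the paper weights and combines such terms in GAME-LoRA is not stated in the abstract.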

Related papers