Fugu-MT 論文翻訳(概要): On the Computational Hardness of Transformers

論文の概要: On the Computational Hardness of Transformers

arxiv url: http://arxiv.org/abs/2603.11332v1
Date: Wed, 11 Mar 2026 21:48:43 GMT
ステータス: 翻訳完了
システム内更新日: 2026-03-13 14:46:25.665039
Title: On the Computational Hardness of Transformers
Title（参考訳）: 変圧器の計算硬度について
Authors: Barna Saha, Yinzhan Xu, Christopher Ye, Hantao Yu,
Abstract要約: この結果から, トランスフォーマーの効率は, 独立評価値のLH$よりも高いことがわかった。小さな埋め込み方式では、$LH$アテンションヘッドは別々に$LHN2 + o(1)$時間を必要とする。大規模な埋め込み方式では、$LHN+ o(1)$算術演算を使用して別々に$LH$アテンションヘッドを計算することができる。
参考スコア（独自算出の注目度）: 14.73362105392153
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The transformer has revolutionized modern AI across language, vision, and beyond. It consists of $L$ layers, each running $H$ attention heads in parallel and feeding the combined output to the subsequent layer. In attention, the input consists of $N$ tokens, each a vector of dimension $m$. The attention mechanism involves multiplying three $N \times m$ matrices, applying softmax to an intermediate product. Several recent works have advanced our understanding of the complexity of attention. Known algorithms for transformers compute each attention head independently. This raises a fundamental question that has recurred throughout TCS under the guise of ``direct sum'' problems: can multiple instances of the same problem be solved more efficiently than solving each instance separately? Many answers to this question, both positive and negative, have arisen in fields spanning communication complexity and algorithm design. Thus, we ask whether transformers can be computed more efficiently than $LH$ independent evaluations of attention. In this paper, we resolve this question in the negative, and give the first non-trivial computational lower bounds for multi-head multi-layer transformers. In the small embedding regime ($m = N^{o(1)}$), computing $LH$ attention heads separately takes $LHN^{2 + o(1)}$ time. We establish that this is essentially optimal under SETH. In the large embedding regime ($m = N$), one can compute $LH$ attention heads separately using $LHN^{ω+ o(1)}$ arithmetic operations (plus exponents), where $ω$ is the matrix multiplication exponent. We establish that this is optimal, by showing that $LHN^{ω- o(1)}$ arithmetic operations are necessary when $ω> 2$. Our lower bound in the large embedding regime relies on a novel application of the Baur-Strassen theorem, a powerful algorithmic tool underpinning the famous backpropagation algorithm.
Abstract（参考訳）: トランスフォーマーは、言語、ビジョン、そしてそれ以上にわたって、現代のAIに革命をもたらした。これは$L$レイヤで構成され、それぞれが$H$アテンションヘッドを並列に実行し、組み合わせた出力を後続のレイヤに供給する。注意点として、入力は$N$トークンで構成され、それぞれが次元$m$のベクトルである。注意機構は、3つの$N \times m$行列を乗算し、中間積にソフトマックスを適用する。いくつかの最近の研究により、注意の複雑さに対する理解が深まった。トランスの既知のアルゴリズムは、それぞれのアテンションヘッドを独立に計算する。これは 'direct sum'' 問題の下でTCS全体で繰り返された根本的な疑問を提起する。同じ問題の複数のインスタンスは、各インスタンスを別々に解決するよりも、より効率的に解決できるのか? この問題に対する多くの答えは、正と負の両方で、コミュニケーションの複雑さとアルゴリズム設計にまたがる分野に現れている。したがって、変換器を$LH$の独立評価よりも効率的に計算できるかどうかを問う。本稿では、この問題を負の形で解決し、マルチヘッド多層トランスにおける最初の非自明な計算下界を与える。小さな埋め込みレジーム(m = N^{o(1)}$)では、$LH$アテンションヘッドは別々に$LHN^{2 + o(1)}$時間を必要とする。これはSETHの下では本質的に最適である。大規模な埋め込みレジーム(m = N$)では、$LHN^{ω+ o(1)}$算術演算(+指数)で別々に$LH$アテンションヘッドを計算できる。 LHN^{ω- o(1)}$算術演算が$ω> 2$のときに必要であることを示すことにより、これが最適であることを示す。我々の大規模な埋め込み体制における下位境界は、有名なバックプロパゲーションアルゴリズムを支える強力なアルゴリズムツールであるバウル・ストラッセンの定理の新たな応用に依存している。

論文の概要: On the Computational Hardness of Transformers

関連論文リスト