Fugu-MT 論文翻訳(概要): Fast Attention Requires Bounded Entries

論文の概要: Fast Attention Requires Bounded Entries

arxiv url: http://arxiv.org/abs/2302.13214v2
Date: Tue, 9 May 2023 20:03:04 GMT
ステータス: 翻訳完了
システム内更新日: 2023-05-11 16:39:50.153982
Title: Fast Attention Requires Bounded Entries
Title（参考訳）: 注意を急ぐには境界のあるエントリが必要です
Authors: Josh Alman, Zhao Song
Abstract要約: 内部製品注意計算はTransformer, GPT-1, BERT, GPT-2, GPT-3, ChatGPTなどの大規模言語モデルを訓練するための基本的なタスクである。行列を暗黙的に$A$とすることで、より高速なアルゴリズムが可能かどうかを検討する。このことは、入力行列がより小さいエントリを持つ場合、注意計算の方がはるかに効率的である、実際に観察された現象の理論的な説明を与える。
参考スコア（独自算出の注目度）: 19.17278873525312
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: In modern machine learning, inner product attention computation is a fundamental task for training large language models such as Transformer, GPT-1, BERT, GPT-2, GPT-3 and ChatGPT. Formally, in this problem, one is given as input three matrices $Q, K, V \in [-B,B]^{n \times d}$, and the goal is to construct the matrix $\mathrm{Att}(Q,K,V) := \mathrm{diag}(A {\bf 1}_n)^{-1} A V \in \mathbb{R}^{n \times d}$, where $A = \exp(QK^\top/d)$ is the `attention matrix', and $\exp$ is applied entry-wise. Straightforward methods for this problem explicitly compute the $n \times n$ attention matrix $A$, and hence require time $\Omega(n^2)$ even when $d = n^{o(1)}$ is small. In this paper, we investigate whether faster algorithms are possible by implicitly making use of the matrix $A$. We present two results, showing that there is a sharp transition at $B = \Theta(\sqrt{\log n})$. $\bullet$ If $d = O(\log n)$ and $B = o(\sqrt{\log n})$, there is an $n^{1+o(1)}$ time algorithm to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error. $\bullet$ If $d = O(\log n)$ and $B = \Theta (\sqrt{\log n})$, assuming the Strong Exponential Time Hypothesis from fine-grained complexity theory, it is impossible to approximate $\mathrm{Att}(Q,K,V)$ up to $1/\mathrm{poly}(n)$ additive error in truly subquadratic time $n^{2 - \Omega(1)}$. This gives a theoretical explanation for the phenomenon observed in practice that attention computation is much more efficient when the input matrices have smaller entries.
Abstract（参考訳）: 現代の機械学習では、内部製品注意計算はTransformer, GPT-1, BERT, GPT-2, GPT-3, ChatGPTなどの大規模言語モデルを訓練するための基本的なタスクである。形式的には、この問題では、入力 3 つの行列 $q, k, v \in [-b,b]^{n \times d}$ として与えられるが、目的は行列 $\mathrm{att}(q,k,v) := \mathrm{diag}(a {\bf 1}_n)^{-1} a v \in \mathbb{r}^{n \times d}$ を構成することである。この問題のストラテフォワード法は、明示的に$n \times n$ attention matrix $A$を計算し、$d = n^{o(1)}$が小さい場合でも、時間$\Omega(n^2)$を必要とする。本稿では,行列の$A$を暗黙的に利用することで,より高速なアルゴリズムが可能かどうかを検討する。 2つの結果を示し、$b = \theta(\sqrt{\log n})$ の鋭い遷移が存在することを示した。 $\bullet$ if $d = o(\log n)$ and $b = o(\sqrt{\log n})$, $n^{1+o(1)}$ timeアルゴリズムは$\mathrm{att}(q,k,v)$を1/\mathrm{poly}(n)$に近似する。 $\bullet$ if $d = o(\log n)$ and $b = \theta (\sqrt{\log n})$ きめ細かな複雑性理論から強い指数時間仮説を仮定すると、$\mathrm{att}(q,k,v)$を1/\mathrm{poly}(n)$ 真の準次時間$n^{2 - \omega(1)}$ で近似することは不可能である。これは、実際に観測される現象の理論的な説明であり、入力行列がより小さいエントリを持つ場合、注意計算はずっと効率的である。

関連論文リスト

When Can We Solve the Weighted Low Rank Approximation Problem in Truly Subquadratic Time? [22.047262762274414]
目標は、2つの低ランク行列 (U, V in mathbbRn times k$) を見つけることで、$| W circ (U Vtop - A) |_F2$ のコストを最小化することである。この研究では、もし$A$ と $W$ が高密度であっても、ほぼ線形な$n1+o(1) の時間で重み付けされた低ランク近似問題を解くことを望んでおり、ある体制が存在することを示す。
論文参考訳（メタデータ） (2025-02-24T07:18:24Z)
The Communication Complexity of Approximating Matrix Rank [50.6867896228563]
この問題は通信複雑性のランダム化を$Omega(frac1kcdot n2log|mathbbF|)$とする。アプリケーションとして、$k$パスを持つ任意のストリーミングアルゴリズムに対して、$Omega(frac1kcdot n2log|mathbbF|)$スペースローバウンドを得る。
論文参考訳（メタデータ） (2024-10-26T06:21:42Z)
LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions [54.54897832889028]
任意の$K$に対して、$n$とは独立に「普遍集合」$Uサブセット[n]$が存在し、任意の$Q$と任意の行$i$に対して、大きな注目スコアが$A_i,j$ in row $i$ of $A$は全て$jin U$を持つことを示す。我々は、視覚変換器のスキームの利点を実証的に示し、トレーニング中に我々の普遍的なセットを使用する新しいモデルのトレーニング方法を示した。
論文参考訳（メタデータ） (2024-10-07T19:47:13Z)
Optimal Sketching for Residual Error Estimation for Matrix and Vector Norms [50.15964512954274]
線形スケッチを用いた行列とベクトルノルムの残差誤差推定問題について検討する。これは、前作とほぼ同じスケッチサイズと精度で、経験的にかなり有利であることを示す。また、スパースリカバリ問題に対して$Omega(k2/pn1-2/p)$低いバウンダリを示し、これは$mathrmpoly(log n)$ factorまで厳密である。
論文参考訳（メタデータ） (2024-08-16T02:33:07Z)
Hardness of Low Rank Approximation of Entrywise Transformed Matrix Products [9.661328973620531]
自然言語処理における高速アルゴリズムにインスパイアされ、エントリ変換された設定における低階近似について研究する。我々は、平坦なスパースベクトルのレバレッジスコアの低境界に依存するStrong Exponential Time hypothesis (SETH) から、新しい還元を与える。我々の低階アルゴリズムは行列ベクトルに依存しているため、我々の下限は、小さな行列に対してさえも$f(UV)W$は$Omega(n2-o(1))$時間を必要とすることを示すために拡張される。
論文参考訳（メタデータ） (2023-11-03T14:56:24Z)
How to Capture Higher-order Correlations? Generalizing Matrix Softmax Attention to Kronecker Computation [12.853829771559916]
本稿では,三重相関を捉える注意の一般化について検討する。この一般化は、変圧器では不可能であった三重結合の検出に関する問題を解くことができる。構築, アルゴリズム, 下位境界が自然に高次テンソルや相関に一般化されることが示される。
論文参考訳（メタデータ） (2023-10-06T07:42:39Z)
A Fast Optimization View: Reformulating Single Layer Attention in LLM Based on Tensor and SVM Trick, and Solving It in Matrix Multiplication Time [7.613259578185218]
我々は、一層注意ネットワーク目的関数 $L(X,Y) の証明可能な保証を提供することに注力する。多層LCMネットワークでは、mathbbRn×d2$の行列$Bを層の出力と見なすことができる。損失関数をトレーニングする反復アルゴリズムを$L(X,Y)$ up $epsilon$で、$widetildeO( (cal T_mathrmmat(n,d) + dで実行される。
論文参考訳（メタデータ） (2023-09-14T04:23:40Z)
Fast $(1+\varepsilon)$-Approximation Algorithms for Binary Matrix Factorization [54.29685789885059]
本稿では, 2次行列分解(BMF)問題に対する効率的な$(1+varepsilon)$-approximationアルゴリズムを提案する。目標は、低ランク因子の積として$mathbfA$を近似することである。我々の手法はBMF問題の他の一般的な変種に一般化する。
論文参考訳（メタデータ） (2023-06-02T18:55:27Z)
Randomized and Deterministic Attention Sparsification Algorithms for Over-parameterized Feature Dimension [18.57735939471469]
我々は注意問題のスパシフィケーションを考慮する。超大規模特徴量の場合、文の長さをほぼ線形に縮めることができる。
論文参考訳（メタデータ） (2023-04-10T05:52:38Z)
Learning a Latent Simplex in Input-Sparsity Time [58.30321592603066]
我々は、$AinmathbbRdtimes n$へのアクセスを考えると、潜入$k$-vertex simplex $KsubsetmathbbRdtimes n$を学習する問題を考える。実行時間における$k$への依存は、トップ$k$特異値の質量が$a$であるという自然な仮定から不要であることを示す。
論文参考訳（メタデータ） (2021-05-17T16:40:48Z)
Algorithms and Hardness for Linear Algebra on Geometric Graphs [14.822517769254352]
グリーンガードとロークリンの有名な高速多重極法における次元$dの指数的依存は改善できないことを示す。これは高速多重極法について証明された最初の公式な制限である。
論文参考訳（メタデータ） (2020-11-04T18:35:02Z)
The Average-Case Time Complexity of Certifying the Restricted Isometry Property [66.65353643599899]
圧縮センシングにおいて、100万倍のN$センシング行列上の制限等尺性(RIP)はスパースベクトルの効率的な再構成を保証する。 Mtimes N$ matrices with i.d.$mathcalN(0,1/M)$ entry。
論文参考訳（メタデータ） (2020-05-22T16:55:01Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。