Fugu-MT 論文翻訳(概要): LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions

論文の概要: LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions

arxiv url: http://arxiv.org/abs/2410.05462v1
Date: Mon, 7 Oct 2024 19:47:13 GMT
ステータス: 翻訳完了
システム内更新日: 2024-11-01 18:37:46.417067
Title: LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions
Title（参考訳）: LevAttention: 重心注意のための時間、空間、ストリーミング効率のアルゴリズム
Authors: Ravindran Kannan, Chiranjib Bhattacharyya, Praneeth Kacham, David P. Woodruff,
Abstract要約: 任意の$K$に対して、$n$とは独立に「普遍集合」$Uサブセット[n]$が存在し、任意の$Q$と任意の行$i$に対して、大きな注目スコアが$A_i,j$ in row $i$ of $A$は全て$jin U$を持つことを示す。我々は、視覚変換器のスキームの利点を実証的に示し、トレーニング中に我々の普遍的なセットを使用する新しいモデルのトレーニング方法を示した。
参考スコア（独自算出の注目度）: 54.54897832889028
License: http://creativecommons.org/licenses/by-sa/4.0/
Abstract: A central problem related to transformers can be stated as follows: given two $n \times d$ matrices $Q$ and $K$, and a non-negative function $f$, define the matrix $A$ as follows: (1) apply the function $f$ to each entry of the $n \times n$ matrix $Q K^T$, and then (2) normalize each of the row sums of $A$ to be equal to $1$. The matrix $A$ can be computed in $O(n^2 d)$ time assuming $f$ can be applied to a number in constant time, but the quadratic dependence on $n$ is prohibitive in applications where it corresponds to long context lengths. For a large class of functions $f$, we show how to find all the ``large attention scores", i.e., entries of $A$ which are at least a positive value $\varepsilon$, in time with linear dependence on $n$ (i.e., $n \cdot \textrm{poly}(d/\varepsilon)$) for a positive parameter $\varepsilon > 0$. Our class of functions include all functions $f$ of the form $f(x) = |x|^p$, as explored recently in transformer models. Using recently developed tools from randomized numerical linear algebra, we prove that for any $K$, there is a ``universal set" $U \subset [n]$ of size independent of $n$, such that for any $Q$ and any row $i$, the large attention scores $A_{i,j}$ in row $i$ of $A$ all have $j \in U$. We also find $U$ in $n \cdot \textrm{poly}(d/\varepsilon)$ time. Notably, we (1) make no assumptions on the data, (2) our workspace does not grow with $n$, and (3) our algorithms can be computed in streaming and parallel settings. We call the attention mechanism that uses only the subset of keys in the universal set as LevAttention since our algorithm to identify the universal set $U$ is based on leverage scores. We empirically show the benefits of our scheme for vision transformers, showing how to train new models that use our universal set while training as well, showing that our model is able to consistently select ``important keys'' during training.
Abstract（参考訳）: 2つの$n \times d$ matrices $Q$ と $K$ が与えられ、非負の関数 $f$ が定義される: (1) 関数 $f$ を $n \times n$ matrix $Q K^T$ の各エントリに適用し、(2) を正規化して$A$ の行和を 1$ に等しいものとする。行列 $A$ は $O(n^2 d)$ time で計算でき、$f$ が定数時間で数に適用できると仮定できるが、長い文脈長に対応するアプリケーションでは $n$ に対する二次的依存は禁じられる。例えば、少なくとも正の値である$A$のエントリを$n$(つまり、$n \cdot \textrm{poly}(d/\varepsilon)$)の線形依存に間に合わせると、$A$は$\varepsilon > 0$となる。我々の関数のクラスは、最近トランスフォーマーモデルで調べられたように、$f(x) = |x|^p$ という形のすべての関数 $f$ を含む。ランダム化された数値線型代数から最近開発されたツールを用いて、任意の$K$に対して、$U \subset [n]$が$n$とは独立な大きさであること、すなわち任意の$Q$と任意の行$i$に対して、大きな注意スコアが$A_{i,j}$の行$i$の$A$が$j \in U$であることを示す。また、$U$ in $n \cdot \textrm{poly}(d/\varepsilon)$ time も見つける。特に、(1)データに対する仮定を一切行わず、(2)ワークスペースは$n$で成長せず、(3)アルゴリズムはストリーミングと並列設定で計算できる。我々は、普遍集合のキーのサブセットのみを LevAttention と呼びます。我々は、視覚変換器のスキームの利点を実証的に示し、トレーニング中にユニバーサルセットを使用する新しいモデルをトレーニングする方法を示し、トレーニング中に「重要キー」を一貫して選択できることを示します。

論文の概要: LevAttention: Time, Space, and Streaming Efficient Algorithm for Heavy Attentions

関連論文リスト