Fugu-MT paper translation (abstract): Provably learning a multi-head attention layer

Paper overview: Provably learning a multi-head attention layer

arxiv url: http://arxiv.org/abs/2402.04084v1
Date: Tue, 6 Feb 2024 15:39:09 GMT
Status: Translation complete
Last updated in system: 2024-02-07 14:19:39.286054
Title: Provably learning a multi-head attention layer
Title (reference translation): Provably learning a multi-head attention layer
Authors: Sitan Chen, Yuanzhi Li
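Abstract summary: The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. This work initiates the study of provably learning a multi-head attention layer from random examples. It shows that, in the worst case, exponential dependence on $m$ is unavoidable.
Reference score (independently computed attention measure): 55.2904547651831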
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The multi-head attention layer is one of the key components of the transformer architecture that sets it apart from traditional feed-forward models. Given a sequence length $k$, attention matrices $\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_m\in\mathbb{R}^{d\times d}$, and projection matrices $\mathbf{W}_1,\ldots,\mathbf{W}_m\in\mathbb{R}^{d\times d}$, the corresponding multi-head attention layer $F: \mathbb{R}^{k\times d}\to \mathbb{R}^{k\times d}$ transforms length-$k$ sequences of $d$-dimensional tokens $\mathbf{X}\in\mathbb{R}^{k\times d}$ via $F(\mathbf{X}) \triangleq \sum^m_{i=1} \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$. In this work, we initiate the study of provably learning a multi-head attention layer from random examples and give the first nontrivial upper and lower bounds for this problem:
- Provided $\{\mathbf{W}_i, \mathbf{\Theta}_i\}$ satisfy certain non-degeneracy conditions, we give a $(dk)^{O(m^3)}$-time algorithm that learns $F$ to small error given random labeled examples drawn uniformly from $\{\pm 1\}^{k\times d}$.
- We prove computational lower bounds showing that in the worst case, exponential dependence on $m$ is unavoidable.
We focus on Boolean $\mathbf{X}$ to mimic the discrete nature of tokens in large language models, though our techniques naturally extend to standard continuous settings, e.g. Gaussian. Our algorithm, which is centered around using examples to sculpt a convex body containing the unknown parameters, is a significant departure from existing provable algorithms for learning feedforward networks, which predominantly exploit algebraic and rotation invariance properties of the Gaussian distribution. In contrast, our analysis is more flexible as it primarily relies on various upper and lower tail bounds for the input distribution and "slices" thereof.
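Abstract (reference translation): The multi-head attention layer is one of the key components of the transformer architecture that distinguishes it from traditional feed-forward models. Given a sequence length $k$, attention matrices $\mathbf{\Theta}_1,\ldots,\mathbf{\Theta}_m\in\mathbb{R}^{d\times d}$, and projection matrices $\mathbf{W}_1,\ldots,\mathbf{W}_m\in\mathbb{R}^{d\times d}$, the corresponding multi-head attention layer $F: \mathbb{R}^{k\times d}\to \mathbb{R}^{k\times d}$ transforms length-$k$ sequences of $d$-dimensional tokens $\mathbf{X}\in\mathbb{R}^{k\times d}$ via $F(\mathbf{X}) \triangleq \sum^m_{i=1} \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$. This work initiates the study of provably learning a multi-head attention layer from random examples and gives the first nontrivial upper and lower bounds for this problem:
- Provided that $\{\mathbf{W}_i, \mathbf{\Theta}_i\}$ satisfy certain non-degeneracy conditions, it gives a $(dk)^{O(m^3)}$-time algorithm that learns $F$ to small error from random labeled examples drawn uniformly from $\{\pm 1\}^{k\times d}$.
- It proves computational lower bounds showing that, in the worst case, exponential dependence on $m$ is unavoidable.
The focus is on Boolean $\mathbf{X}$, to mimic the discrete nature of tokens in large language models, although the techniques extend naturally to standard continuous settings (e.g., Gaussian). The proposed algorithm, which is centered on using examples to sculpt a convex body containing the unknown parameters, departs significantly from existing provable algorithms for learning feed-forward networks, which mainly exploit algebraic and rotation-invariance properties of the Gaussian distribution. In contrast, the analysis here is more flexible because it relies primarily on various upper and lower tail bounds for the input distribution and "slices" of it.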
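
The formula for $F$ in the abstract can be made concrete with a small forward-pass sketch. The code below is illustrative only and is not the paper's learning algorithm: it evaluates $F(\mathbf{X}) = \sum_{i=1}^m \mathrm{softmax}(\mathbf{X}\mathbf{\Theta}_i\mathbf{X}^\top)\mathbf{X}\mathbf{W}_i$ and draws one labeled example with $\mathbf{X}$ uniform over $\{\pm 1\}^{k\times d}$. It assumes the softmax is applied row-wise, as is standard for attention; the abstract does not spell this out. All function names, and the Gaussian initialization of the parameters in the usage lines, are hypothetical choices made for the example.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def softmax_rows(A):
    """Row-wise softmax (assumed convention), shifted by the row max for stability."""
    out = []
    for row in A:
        top = max(row)
        exps = [math.exp(v - top) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def multi_head_attention(X, thetas, ws):
    """Evaluate F(X) = sum_i softmax(X Theta_i X^T) X W_i for a k x d input X."""
    k, d = len(X), len(X[0])
    Xt = transpose(X)
    out = [[0.0] * d for _ in range(k)]
    for theta, w in zip(thetas, ws):
        scores = matmul(matmul(X, theta), Xt)              # k x k attention scores
        head = matmul(matmul(softmax_rows(scores), X), w)  # k x d output of one head
        out = [[o + h for o, h in zip(orow, hrow)] for orow, hrow in zip(out, head)]
    return out


def random_sign_matrix(k, d, rng):
    """Draw X uniformly from {-1, +1}^(k x d)."""
    return [[rng.choice((-1, 1)) for _ in range(d)] for _ in range(k)]


def random_gaussian_matrix(d, rng):
    """Hypothetical d x d parameter matrix; the scaling is an arbitrary choice."""
    return [[rng.gauss(0.0, 1.0) / math.sqrt(d) for _ in range(d)] for _ in range(d)]


def sample_labeled_example(k, d, thetas, ws, rng):
    """Return one pair (X, F(X)) with X uniform over sign matrices."""
    X = random_sign_matrix(k, d, rng)
    return X, multi_head_attention(X, thetas, ws)


# Usage: m = 2 heads acting on length-3 sequences of 4-dimensional tokens.
rng = random.Random(0)
k, d, m = 3, 4, 2
thetas = [random_gaussian_matrix(d, rng) for _ in range(m)]
ws = [random_gaussian_matrix(d, rng) for _ in range(m)]
X, Y = sample_labeled_example(k, d, thetas, ws, rng)
```

A learner in this setting would see many such pairs `(X, Y)` and would have to recover a function close to $F$ without access to `thetas` or `ws`.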

Related papers