Fugu-MT Paper Translation (Abstract): $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

Paper overview: $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers

arxiv url: http://arxiv.org/abs/2006.04862v2
Date: Sat, 19 Dec 2020 07:16:15 GMT
Status: Translation complete
Last updated in system: 2022-11-24 00:58:50.635495
Title: $O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers
Title (reference translation): $O(n)$ connections are expressive enough: universal approximability of sparse Transformers
Authors: Chulhee Yun, Yin-Wen Chang, Srinadh Bhojanapalli, Ankit Singh Rawat, Sashank J. Reddi, Sanjiv Kumar
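Abstract summary: We show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections. We also present experiments comparing different patterns and levels of sparsity on standard NLP tasks.
Reference score (independently computed attention score): 71.31712741938837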
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Recently, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from quadratic computational cost in the input sequence length $n$ to compute pairwise attention in each layer. This has prompted recent research into sparse Transformers that sparsify the connections in the attention layers. While empirically promising for long sequences, fundamental questions remain unanswered: Can sparse Transformers approximate any arbitrary sequence-to-sequence function, similar to their dense counterparts? How does the sparsity pattern and the sparsity level affect their performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. We propose sufficient conditions under which we prove that a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, our results show that sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections. Lastly, we present experiments comparing different patterns/levels of sparsity on standard NLP tasks.
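Abstract (reference translation): In recent years, Transformer networks have redefined the state of the art in many NLP tasks. However, these models suffer from a computational cost that is quadratic in the input sequence length $n$, because pairwise attention is computed in each layer. This has prompted recent research into sparse Transformers, which sparsify the connections in the attention layers. While these models are empirically promising for long sequences, fundamental questions remain unanswered. Can sparse Transformers approximate arbitrary sequence-to-sequence functions, as their dense counterparts can? How do the sparsity pattern and the sparsity level affect performance? In this paper, we address these questions and provide a unifying framework that captures existing sparse attention models. We propose sufficient conditions under which a sparse attention model can universally approximate any sequence-to-sequence function. Surprisingly, sparse Transformers with only $O(n)$ connections per attention layer can approximate the same function class as the dense model with $n^2$ connections. Finally, we present experiments comparing different patterns and levels of sparsity on standard NLP tasks.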
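
The gap between $O(n)$ and $n^2$ connections mentioned in the abstract can be made concrete with a toy pattern. The following is a minimal sketch under stated assumptions, not the construction or the proof technique from the paper: the pattern (a local window plus a single global token at position 0) and the helper names `sparse_pattern`, `count_connections`, and `hops_to_full_reach` are hypothetical and chosen only for illustration. It counts the connections in one attention layer and checks how many stacked layers are needed before every token has indirectly seen every other token.

```python
def sparse_pattern(n, window=1):
    """Return attend[k], the set of positions that token k attends to.

    Hypothetical pattern for illustration: a local window around each
    token plus one global token at position 0.
    """
    attend = []
    for k in range(n):
        local = set(range(max(0, k - window), min(n, k + window + 1)))
        local.add(0)  # every token attends to the global token
        attend.append(local)
    attend[0] = set(range(n))  # the global token attends to every position
    return attend


def count_connections(attend):
    """Total number of (query, key) pairs in one attention layer."""
    return sum(len(keys) for keys in attend)


def hops_to_full_reach(attend):
    """Number of stacked layers until every token has seen all tokens."""
    n = len(attend)
    reach = [set(keys) for keys in attend]  # what each token sees after 1 layer
    hops = 1
    while any(len(r) < n for r in reach):
        # After one more layer, token k sees everything its keys had seen.
        new_reach = [set().union(*(reach[j] for j in attend[k])) for k in range(n)]
        if new_reach == reach:
            raise ValueError("pattern never connects all tokens")
        reach = new_reach
        hops += 1
    return hops


if __name__ == "__main__":
    for n in (8, 64, 1024):
        pattern = sparse_pattern(n)
        print(n, count_connections(pattern), n * n, hops_to_full_reach(pattern))
```

For this particular pattern with `window=1`, the connection count grows linearly in `n` while the dense count `n * n` grows quadratically, and two layers are enough for information to flow between any pair of tokens once the sequence is longer than the window. Whether a given pattern also satisfies the sufficient conditions proposed in the paper is a separate question that this sketch does not check.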

Related papers

When Do Transformers Outperform Feedforward and Recurrent Networks? A Statistical Perspective [22.831594980764216]
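We prove that feedforward and recurrent neural networks have higher sample complexity than Transformers. The proposed sparse retrieval model exhibits a natural hierarchy of sample complexity across these architectures.
Reference translation of paper (metadata) (2025-03-14T10:30:42Z)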
Exact Sequence Classification with Hardmax Transformers [0.0]
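We prove that hardmax attention Transformers perfectly classify datasets of $N$ labeled sequences in $\mathbb{R}^d$, $d \geq 2$. Specifically, given $N$ sequences of arbitrary length in $\mathbb{R}^d$, we construct a Transformer with $\mathcal{O}(N)$ blocks and $\mathcal{O}(Nd)$ parameters that perfectly classifies this dataset.
Reference translation of paper (metadata) (2025-02-04T12:31:00Z)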
On the Role of Depth and Looping for In-Context Learning with Task Diversity [69.4145579827826]
We study in-context learning for linear regression with diverse tasks. We show that the multilayer Transformer is not robust even to distributional shifts as small as $O(e^{-L})$ in Wasserstein distance.
Reference translation of paper (metadata) (2024-10-29T03:27:56Z)
Unveiling Induction Heads: Provable Training Dynamics and Feature Learning in Transformers [54.20763128054692]
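We study how a two-layer Transformer is trained to perform ICL on $n$-gram Markov chain data. We prove that gradient flow on the cross-entropy ICL loss converges to a limiting model.
Reference translation of paper (metadata) (2024-09-09T18:10:26Z)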
Mixture-of-Depths: Dynamically allocating compute in transformer-based language models [8.774705201394916]
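Transformer-based language models spread FLOPs uniformly across input sequences. We show that Transformers can instead dynamically allocate FLOPs to specific positions in a sequence.
Reference translation of paper (metadata) (2024-04-02T19:28:11Z)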
Most Likely Sequence Generation for $n$-Grams, Transformers, HMMs, and Markov Chains, by Using Rollout Algorithms [3.014160809522789]
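We consider Transformers with an $n$-gram structure, such as the one underlying ChatGPT. The Transformer provides next-word probabilities, which can be used to generate word sequences. Based on these probabilities, we consider methods for computing highly likely word sequences.
Reference translation of paper (metadata) (2024-03-19T19:58:46Z)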
Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences [1.5484595752241124]
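We propose Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$. The hierarchical approach groups queries, keys, and values into $\mathcal{O}(\log n)$ levels of resolution. We find empirically that the Fast Multipole Transformer performs much better than other efficient Transformers in terms of memory size and accuracy.
Reference translation of paper (metadata) (2023-10-18T13:40:41Z)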
Transformers Learn Shortcuts to Automata [52.015990420075944]
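Low-depth Transformers can compute any finite-state automaton. We show that a Transformer with $O(\log T)$ layers can exactly replicate the automaton on input sequences of length $T$. We further investigate the brittleness of these solutions and propose potential mitigations.
Reference translation of paper (metadata) (2022-10-19T17:45:48Z)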
What Dense Graph Do You Need for Self-Attention? [73.82686008622596]
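We propose Hypercube Transformer, a sparse Transformer that models token interactions in a hypercube and shows results comparable to or better than the vanilla Transformer. Experiments on tasks requiring various sequence lengths validate the graph function.
Reference translation of paper (metadata) (2022-05-27T14:36:55Z)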
Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
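We propose Combiner, which provides full attention capability in each attention head while keeping computational complexity low. We show that most sparse attention patterns used in existing sparse Transformers can inspire the design of such a factorization for full attention. Experimental evaluation on both autoregressive and bidirectional sequence tasks demonstrates the effectiveness of this approach.
Reference translation of paper (metadata) (2021-07-12T22:43:11Z)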

The related papers list is generated automatically from the titles and abstracts of the papers on this site.
