Fugu-MT 論文翻訳(概要): Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences

論文の概要: Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences

arxiv url: http://arxiv.org/abs/2310.11960v1
Date: Wed, 18 Oct 2023 13:40:41 GMT
ステータス: 翻訳完了
システム内更新日: 2023-10-19 16:27:28.620380
Title: Fast Multipole Attention: A Divide-and-Conquer Attention Mechanism for Long Sequences
Title（参考訳）: 高速多極型アテンション:長周期の分極型アテンション機構
Authors: Yanming Kang, Giang Tran, Hans De Sterck
Abstract要約: 我々は、長さ$n$のシーケンスに対する注意の時間とメモリの複雑さを低減するために、分割・参照戦略を利用する新しい注意機構であるFast Multipole Attentionを提案する。階層的なアプローチは、クエリ、キー、値を$mathcalO(log n)$の解像度レベルにグループ化する。我々は,高速多極変換器がメモリサイズや精度の点で,他の効率的な変換器よりもはるかに優れていることを実証的に見出した。
参考スコア（独自算出の注目度）: 1.7403133838762448
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Transformer-based models have achieved state-of-the-art performance in many areas. However, the quadratic complexity of self-attention with respect to the input length hinders the applicability of Transformer-based models to long sequences. To address this, we present Fast Multipole Attention, a new attention mechanism that uses a divide-and-conquer strategy to reduce the time and memory complexity of attention for sequences of length $n$ from $\mathcal{O}(n^2)$ to $\mathcal{O}(n \log n)$ or $O(n)$, while retaining a global receptive field. The hierarchical approach groups queries, keys, and values into $\mathcal{O}( \log n)$ levels of resolution, where groups at greater distances are increasingly larger in size and the weights to compute group quantities are learned. As such, the interaction between tokens far from each other is considered in lower resolution in an efficient hierarchical manner. The overall complexity of Fast Multipole Attention is $\mathcal{O}(n)$ or $\mathcal{O}(n \log n)$, depending on whether the queries are down-sampled or not. This multi-level divide-and-conquer strategy is inspired by fast summation methods from $n$-body physics and the Fast Multipole Method. We perform evaluation on autoregressive and bidirectional language modeling tasks and compare our Fast Multipole Attention model with other efficient attention variants on medium-size datasets. We find empirically that the Fast Multipole Transformer performs much better than other efficient transformers in terms of memory size and accuracy. The Fast Multipole Attention mechanism has the potential to empower large language models with much greater sequence lengths, taking the full context into account in an efficient, naturally hierarchical manner during training and when generating long sequences.
Abstract（参考訳）: トランスフォーマーベースのモデルは、多くの分野で最先端のパフォーマンスを達成した。しかし、入力長に対する自己着脱の二次的複雑さは、トランスフォーマベースのモデルを長い列に適用する可能性を妨げる。これを解決するために、Fast Multipole Attentionという新しいアテンションメカニズムを提案する。これは、長さ$n$から$\mathcal{O}(n^2)$から$\mathcal{O}(n \log n)$または$O(n)$へのアテンションの時間とメモリの複雑さを減らし、グローバルな受容場を保持しながら、新しいアテンションメカニズムである。階層的アプローチでは、クエリ、キー、値を$\mathcal{o}( \log n)$の分解レベルにグループ化する。このように、互いに遠く離れたトークン間の相互作用は、効率的な階層的方法で低い解像度で考慮される。 Fast Multipole Attentionの全体的な複雑さは、クエリがダウンサンプリングされているかどうかによって、$\mathcal{O}(n)$または$\mathcal{O}(n \log n)$である。この多値除算戦略は、n$-body 物理学と高速多重極法からの高速和法に触発されたものである。自動回帰および双方向言語モデリングタスクの評価を行い、中規模データセット上での高速多極性注意モデルと他の効率的な注意モデルとの比較を行った。高速マルチポールトランスフォーマーは,メモリサイズや精度において,他の効率的なトランスフォーマーよりもはるかに優れた性能を示す。 Fast Multipole Attentionメカニズムは、トレーニング中や長いシーケンスを生成する際に、完全なコンテキストを効率的で自然に階層的な方法で考慮し、はるかに大きなシーケンス長の言語モデルを強化する可能性がある。

関連論文リスト

Multi-Layer Transformers Gradient Can be Approximated in Almost Linear Time [17.086679273053853]
本研究では,新しい高速近似法により,ほぼ線形時間で勾配を計算することができることを示す。勾配の効率を改善することで、この作業がより効果的なトレーニングと長期コンテキスト言語モデルのデプロイを促進することを期待する。
論文参考訳（メタデータ） (2024-08-23T17:16:43Z)
A Training-free Sub-quadratic Cost Transformer Model Serving Framework With Hierarchically Pruned Attention [43.211427581302715]
大規模言語モデルにおける文脈長を増大させるため,HiP(Hierarchically Pruned Attention)を提案する。 HiPは注意機構の時間的複雑さを$O(T log T)$に減らし、空間的複雑さを$O(T)$に減らし、$T$はシーケンス長である。 HiPは, 劣化を最小限に抑えつつ, プリフィルとデコードの両方のレイテンシとメモリ使用率を著しく低減することを示す。
論文参考訳（メタデータ） (2024-06-14T08:32:45Z)
HyperZ$\cdot$Z$\cdot$W Operator Connects Slow-Fast Networks for Full Context Interaction [0.0]
自己注意機構は、ドット製品ベースのアクティベーションを通じてプログラムされた大きな暗黙の重み行列を利用して、訓練可能なパラメータがほとんどないため、長いシーケンスモデリングを可能にする。本稿では,ネットワークの各層におけるコンテキストの完全な相互作用を実現するために,大きな暗黙のカーネルを用いて残差学習を破棄する可能性について検討する。このモデルにはいくつかの革新的なコンポーネントが組み込まれており、遅いネットワークを更新するための局所的なフィードバックエラー、安定なゼロ平均機能、より高速なトレーニング収束、より少ないモデルパラメータなど、優れた特性を示している。
論文参考訳（メタデータ） (2024-01-31T15:57:21Z)
Efficient Long Sequence Modeling via State Space Augmented Transformer [92.74707853711374]
我々はSPADE($underlinetextbfS$tate sunderlinetextbfP$ace)を提案する。我々は,SPADEの底層にSSMを付加し,他の層に対して効率的な局所的注意法を適用した。 Long Range Arenaベンチマークと言語モデリングタスクの実験結果から,提案手法の有効性が示された。
論文参考訳（メタデータ） (2022-12-15T20:51:27Z)
Transformers Learn Shortcuts to Automata [52.015990420075944]
低深度変換器は任意の有限状態オートマトンを計算できる。我々は,$O(log T)$レイヤを持つ変換器が,長さ$T$の入力シーケンス上で,オートマトンを正確に再現可能であることを示す。さらに、これらの解の脆性について検討し、潜在的な緩和を提案する。
論文参考訳（メタデータ） (2022-10-19T17:45:48Z)
Fastformer: Additive Attention Can Be All You Need [51.79399904527525]
本稿では,加法的注意に基づく効率的なトランスフォーマーモデルであるFastformerを提案する。 Fastformerでは、トークン間のペアワイズインタラクションをモデル化する代わりに、まずグローバルコンテキストをモデル化するために追加アテンションメカニズムを使用します。このように、Fastformerは線形複雑性を伴う効果的なコンテキストモデリングを実現することができる。
論文参考訳（メタデータ） (2021-08-20T09:44:44Z)
Combiner: Full Attention Transformer with Sparse Computation Cost [142.10203598824964]
計算の複雑さを低く保ちつつ、各注目ヘッドにフルアテンション機能を提供するコンバインダを提案する。既存のスパース変圧器で使用されるスパースアテンションパターンのほとんどは、そのような分解設計をフルアテンションに刺激することができることを示す。自己回帰的タスクと双方向シーケンスタスクの両方に関する実験的評価は、このアプローチの有効性を示す。
論文参考訳（メタデータ） (2021-07-12T22:43:11Z)
Stable, Fast and Accurate: Kernelized Attention with Relative Positional Encoding [63.539333383965726]
相対的位置符号化(RPE)を用いた変換器の注意計算を高速化する新しい手法を提案する。相対的な位置符号化がToeplitz行列を形成するという観測に基づいて、Fast Fourier Transform (FFT) を用いて、RPEによるカーネル化された注意を効率的に計算できることを数学的に示す。
論文参考訳（メタデータ） (2021-06-23T17:51:26Z)
$O(n)$ Connections are Expressive Enough: Universal Approximability of Sparse Transformers [71.31712741938837]
注意層ごとに$O(n)$接続しか持たないスパース変換器は、$n2$接続を持つ高密度モデルと同じ関数クラスを近似できることを示す。また、標準NLPタスクにおいて、異なるパターン・レベルの違いを比較検討する。
論文参考訳（メタデータ） (2020-06-08T18:30:12Z)

関連論文リストは本サイト内にある論文のタイトル・アブストラクトから自動的に作成しています。

指定された論文の情報です。
本サイトの運営者は本サイト（すべての情報・翻訳含む）の品質を保証せず、本サイト（すべての情報・翻訳含む）を使用して発生したあらゆる結果について一切の責任を負いません。