Fugu-MT Paper Translation (Abstract): SEA: Sparse Linear Attention with Estimated Attention Mask

Paper overview: SEA: Sparse Linear Attention with Estimated Attention Mask

arxiv url: http://arxiv.org/abs/2310.01777v1
Date: Tue, 3 Oct 2023 03:56:26 GMT
Status: Translation complete
Last updated in system: 2023-10-04 17:48:25.713439
Title: SEA: Sparse Linear Attention with Estimated Attention Mask
Title (reference translation): SEA: Sparse Linear Attention with an Estimated Attention Mask
Authors: Heejun Lee, Jina Kim, Jeffrey Willette, Sung Ju Hwang
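Abstract summary: We propose SEA, sparse linear attention with an estimated attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then creates a sparse approximation to the full attention matrix with top-k selection. SEA maintains an interpretable attention matrix and can use knowledge distillation to lower the complexity of existing pretrained transformers.
Reference score (site-computed interest score): 55.95853565717624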
License: http://creativecommons.org/licenses/by-nc-nd/4.0/
Abstract: The transformer architecture has made breakthroughs in recent years on tasks which require modeling pairwise relationships between sequential elements, as is the case in natural language understanding. However, transformers struggle with long sequences due to the quadratic complexity of the attention operation, and previous research has aimed to lower the complexity by sparsifying or linearly approximating the attention matrix. Yet, these approaches cannot straightforwardly distill knowledge from a teacher's attention matrix, and often require complete retraining from scratch. Furthermore, previous sparse and linear approaches may also lose interpretability if they do not produce full quadratic attention matrices. To address these challenges, we propose SEA: Sparse linear attention with an Estimated Attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, then creates a sparse approximation to the full attention matrix with a top-k selection to perform a sparse attention operation. For language modeling tasks (Wikitext2), previous linear and sparse attention methods show roughly two-fold worse perplexity scores than the quadratic OPT-125M baseline, while SEA achieves an even better perplexity than OPT-125M, using roughly half as much memory as OPT-125M. Moreover, SEA maintains an interpretable attention matrix and can utilize knowledge distillation to lower the complexity of existing pretrained transformers. We believe that our work will have a large practical impact, as it opens the possibility of running large transformers on resource-limited devices with less memory.
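Abstract (reference translation): In recent years the transformer architecture has achieved breakthroughs on tasks that require modeling pairwise relationships between sequential elements, such as natural language understanding. Because of the quadratic complexity of the attention operation, however, transformers struggle with long sequences, and earlier work has tried to reduce this complexity by sparsifying or linearly approximating the attention matrix. These methods cannot directly distill knowledge from a teacher's attention matrix and often require complete retraining from scratch. Earlier sparse and linear approaches may also lose interpretability when they do not produce full quadratic attention matrices. To address these challenges, we propose SEA: sparse linear attention with an estimated attention mask. SEA estimates the attention matrix with linear complexity via kernel-based linear attention, builds a sparse approximation to the full attention matrix with top-k selection, and then performs a sparse attention operation. On a language modeling task (Wikitext2), earlier linear and sparse attention methods show roughly two-fold worse perplexity than the OPT-125M baseline, whereas SEA achieves even better perplexity than OPT-125M while using roughly half of its memory. SEA also maintains an interpretable attention matrix and can use knowledge distillation to reduce the complexity of existing pretrained transformers. We believe this work will have a large practical impact because it makes it possible to run large transformers on resource-limited devices with less memory.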
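The pipeline described in the abstract (estimate attention scores with a kernel feature map, keep the top-k entries per query, then attend only over the kept entries) can be illustrated with a toy sketch. This is not the authors' implementation: the feature map `elu(x) + 1`, the function names, and the use of plain Python lists are assumptions made for illustration. For readability the sketch computes every estimated score for each query, so it does not reproduce the linear complexity of SEA's estimator.

```python
import math

def feature_map(x):
    # elu(x) + 1, a positive feature map (an assumption for this sketch,
    # not necessarily the kernel used in the paper).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def topk_indices(scores, k):
    # Indices of the k largest estimated scores, returned in ascending order.
    order = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(order[:k])

def topk_sparse_attention(Q, K, V, k):
    """Softmax attention restricted to a top-k mask built from estimated scores."""
    d = len(Q[0])
    phi_K = [feature_map(key) for key in K]
    out = []
    for q in Q:
        phi_q = feature_map(q)
        # Step 1: cheap kernel-based estimate of the (unnormalized) scores.
        estimated = [dot(phi_q, pk) for pk in phi_K]
        # Step 2: keep only the k keys with the largest estimated scores.
        keep = topk_indices(estimated, k)
        # Step 3: exact scaled dot-product softmax over the kept keys only.
        logits = [dot(q, K[j]) / math.sqrt(d) for j in keep]
        m = max(logits)
        weights = [math.exp(l - m) for l in logits]
        z = sum(weights)
        out.append([
            sum(weights[i] / z * V[j][c] for i, j in enumerate(keep))
            for c in range(len(V[0]))
        ])
    return out

# Toy inputs (hypothetical values): 2 queries, 3 keys, 3 values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(topk_sparse_attention(Q, K, V, k=1))  # [[1.0, 2.0], [3.0, 4.0]]
```

With `k=1` each query attends only to the key with the highest estimated score, so each output row is simply the matching value row. Larger `k` trades memory and compute for a closer match to full attention.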

Related papers

A Random Matrix Analysis of In-context Memorization for Nonlinear Attention [18.90197287760915]
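We show that nonlinear attention incurs higher memorization error than linear ridge regression on random inputs. The results reveal how nonlinearity and input structure interact to govern the memorization performance of nonlinear attention.
Paper reference translation (metadata) (2025-06-23T13:56:43Z)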
DAM: Dynamic Attention Mask for Long-Context Large Language Model Inference Acceleration [12.172968576254469]
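We propose a dynamic sparse attention mechanism that assigns adaptive masks at the attention-map level. By learning context-aware attention structures, it achieves high alignment with full-attention models and minimizes performance degradation. The approach offers a scalable alternative to full attention and enables practical deployment of large language models.
Paper reference translation (metadata) (2025-06-06T20:24:36Z)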
Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
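We present two key perspectives for understanding and alleviating the limitations of linear attention. First, we prove that linear attention is not injective and tends to assign identical attention weights to different query vectors. Second, we confirm that effective local modeling is essential to the success of softmax attention, and that linear attention falls short in this respect.
Paper reference translation (metadata) (2024-12-09T15:44:22Z)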
Long-Sequence Recommendation Models Need Decoupled Embeddings [49.410906935283585]
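We identify and characterize a neglected deficiency in existing long-sequence recommendation models. A single set of embeddings struggles to learn both attention and representation, which leads to interference between the two processes. We propose the Decoupled Attention and Representation Embeddings (DARE) model, in which two distinct embedding tables are learned separately to fully decouple attention and representation.
Paper reference translation (metadata) (2024-10-03T15:45:15Z)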
Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
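We propose CHELA, which replaces state space models with short-long convolutions. To demonstrate the effectiveness of the proposed method, we conduct experiments on the Long Range Arena benchmark and on language modeling tasks.
Paper reference translation (metadata) (2024-06-12T12:12:38Z)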
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models [20.78813311569383]
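We introduce Lightning Attention-2, the first linear attention implementation that realizes the theoretical computational advantages of linear attention. Specifically, the conventional attention mechanism is applied within each block (intra-block), and the linear attention kernel trick is applied across blocks (inter-block). Various experiments are conducted on different model sizes and sequence lengths.
Paper reference translation (metadata) (2024-01-09T16:27:28Z)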
HyperAttention: Long-context Attention in Near-Linear Time [78.33061530066185]
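We propose HyperAttention, an approximate attention mechanism that addresses the computational challenges posed by the growing complexity of long contexts. Empirically, HyperAttention uses Locality Sensitive Hashing (LSH) to identify large entries and outperforms existing methods. We validate the empirical performance of HyperAttention on a variety of long-context datasets.
Paper reference translation (metadata) (2023-10-09T17:05:25Z)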
DBA: Efficient Transformer with Dynamic Bilinear Low-Rank Attention [53.02648818164273]
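We propose dynamic bilinear low-rank attention (DBA), an efficient and effective attention mechanism. DBA compresses the sequence length with input-sensitive dynamic projection matrices and achieves linear time and space complexity. Experiments on tasks with various sequence-length conditions show that DBA achieves state-of-the-art performance.
Paper reference translation (metadata) (2022-11-24T03:06:36Z)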
Sketching as a Tool for Understanding and Accelerating Self-attention for Long Sequences [52.6022911513076]
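Transformer-based models are inefficient at processing long sequences because of the quadratic space and time complexity of the self-attention module. Linformer and Informer have been proposed to reduce the quadratic complexity to linear (modulo logarithmic factors) via low-dimensional projection and row selection, respectively. Based on a theoretical analysis, we propose Skeinformer to accelerate self-attention and further improve the accuracy of the matrix approximation to self-attention.
Paper reference translation (metadata) (2021-12-10T06:58:05Z)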
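Several entries in the list above (for example Lightning Attention-2) refer to the linear attention kernel trick. As a rough illustration, and not the implementation of any listed paper, the sketch below shows the non-causal form of the trick: instead of building the n-by-n score matrix, it accumulates a small summary of the keys and values once and reuses it for every query. The feature map `elu(x) + 1` and all function names are assumptions chosen for this example.

```python
import math

def phi(x):
    # elu(x) + 1: a positive feature map (assumed for this sketch).
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]

def linear_attention(Q, K, V):
    """Non-causal kernel attention without materializing the n x n matrix."""
    d, dv = len(K[0]), len(V[0])
    # S[a][c] = sum_j phi(k_j)[a] * v_j[c]   (d x dv summary of keys and values)
    # z[a]    = sum_j phi(k_j)[a]            (normalizer summary)
    S = [[0.0] * dv for _ in range(d)]
    z = [0.0] * d
    for k_row, v_row in zip(K, V):
        pk = phi(k_row)
        for a in range(d):
            z[a] += pk[a]
            for c in range(dv):
                S[a][c] += pk[a] * v_row[c]
    out = []
    for q_row in Q:
        pq = phi(q_row)
        norm = sum(pq[a] * z[a] for a in range(d))
        out.append([sum(pq[a] * S[a][c] for a in range(d)) / norm
                    for c in range(dv)])
    return out

def quadratic_reference(Q, K, V):
    """Same kernel attention computed the quadratic way, for comparison."""
    out = []
    for q_row in Q:
        pq = phi(q_row)
        w = [sum(a * b for a, b in zip(pq, phi(k_row))) for k_row in K]
        total = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) / total
                    for c in range(len(V[0]))])
    return out
```

Both functions return the same values up to floating-point rounding. The first one does work proportional to the sequence length times the feature and value dimensions, which is the source of the linear complexity mentioned in these abstracts.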

The related papers list is generated automatically from the titles and abstracts of the papers on this site.

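The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any consequences arising from the use of this site (including all information and translations).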