Fugu-MT Paper Translation (Summary): Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level

arXiv URL: http://arxiv.org/abs/2403.04690v2
Date: Fri, 22 Mar 2024 16:26:40 GMT
Status: Translation complete
Last updated in system: 2024-03-25 21:51:11.365175
Title: Faster Neighborhood Attention: Reducing the O(n^2) Cost of Self Attention at the Threadblock Level
Authors: Ali Hassani, Wen-Mei Hwu, Humphrey Shi
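Abstract summary: Neighborhood attention reduces the cost of self attention by restricting each token's attention to its neighbors. This work shows that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implements it for 1-D and 2-D neighborhood attention. The authors also develop fused neighborhood attention, an adaptation of fused dot-product attention kernels that allows fine-grained control over attention across different spatial axes.
Reference score (independently computed attention score): 30.681204292813998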
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Neighborhood attention reduces the cost of self attention by restricting each token's attention span to its nearest neighbors. This restriction, parameterized by a window size and dilation factor, draws a spectrum of possible attention patterns between linear projection and self attention. Neighborhood attention, and more generally sliding window attention patterns, have long been bounded by infrastructure, particularly in higher-rank spaces (2-D and 3-D), calling for the development of custom kernels, which have been limited in either functionality, or performance, if not both. In this work, we first show that neighborhood attention can be represented as a batched GEMM problem, similar to standard attention, and implement it for 1-D and 2-D neighborhood attention. These kernels on average provide 895% and 272% improvement in full precision latency compared to existing naive kernels for 1-D and 2-D neighborhood attention respectively. We find certain inherent inefficiencies in all unfused neighborhood attention kernels that bound their performance and lower-precision scalability. We also developed fused neighborhood attention; an adaptation of fused dot-product attention kernels that allow fine-grained control over attention across different spatial axes. Known for reducing the quadratic time complexity of self attention to a linear complexity, neighborhood attention can now enjoy a reduced and constant memory footprint, and record-breaking half precision latency. We observe that our fused kernels successfully circumvent some of the unavoidable inefficiencies in unfused implementations. While our unfused GEMM-based kernels only improve half precision performance compared to naive kernels by an average of 496% and 113% in 1-D and 2-D problems respectively, our fused kernels improve naive kernels by an average of 1607% and 581% in 1-D and 2-D problems respectively.
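The restriction described in the abstract (each token attends only to its nearest neighbors, controlled by a window size and a dilation factor) can be illustrated with a naive pure-Python reference. This is only a sketch under stated assumptions, not the paper's GEMM-based or fused kernels: the function names `neighbors_1d` and `neighborhood_attention_1d` are hypothetical, the window is assumed to be odd, and the window is assumed to shift inward at the sequence borders so that every token still sees exactly `window` neighbors.

```python
import math


def neighbors_1d(i, n, window, dilation=1):
    """Indices that token i attends to in a length-n sequence (assumed border rule)."""
    assert window % 2 == 1 and window * dilation <= n
    group = i % dilation                            # a token only sees its own residue class
    pos = i // dilation                             # position of token i inside that class
    size = (n - group + dilation - 1) // dilation   # number of tokens in that class
    # Centre the window on the token, shifting it inward at the edges (assumption).
    start = min(max(pos - window // 2, 0), size - window)
    return [group + (start + t) * dilation for t in range(window)]


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def neighborhood_attention_1d(q, k, v, window, dilation=1):
    """Naive reference: q, k, v are lists of n vectors (lists of floats)."""
    n, dim = len(q), len(q[0])
    scale = 1.0 / math.sqrt(dim)
    out = []
    for i in range(n):
        idx = neighbors_1d(i, n, window, dilation)
        # Scaled dot products against the neighbors only, not all n tokens.
        scores = [scale * sum(a * b for a, b in zip(q[i], k[j])) for j in idx]
        weights = softmax(scores)
        out.append([sum(w * v[j][c] for w, j in zip(weights, idx))
                    for c in range(len(v[0]))])
    return out
```

Each token scores only `window` keys, so the work grows linearly with sequence length for a fixed window. With `window=1` every token attends only to itself and the output equals `v`, which corresponds to the linear projection end of the spectrum; with `window` equal to the sequence length every token sees all tokens, which recovers self attention.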

Related papers

SLA2: Sparse-Linear Attention with Learnable Routing and QAT [86.22100800353991]
Experiments show that SLA2 can reach 97% attention sparsity and an 18.6x attention speedup while preserving generation quality.
Reference translation (metadata) (2026-02-13T07:16:02Z)
Double-P: Hierarchical Top-P Sparse Attention for Long-Context LLMs [45.84463775890072]
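Long-context inference is becoming central to large language models. Top-p sparse attention directly preserves attention mass and provides stronger accuracy guarantees. Existing top-p methods fail to jointly optimize top-p accuracy, selection overhead, and sparse attention cost.
Reference translation (metadata) (2026-02-05T01:37:10Z)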
InfLLM-V2: Dense-Sparse Switchable Attention for Seamless Short-to-Long Adaptation [56.694702609077495]
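Long-sequence processing is a critical capability for modern large language models. InfLLM-V2 is a trainable sparse attention framework that seamlessly adapts models from short to long sequences. In experiments, InfLLM-V2 is 4× faster than dense attention while retaining 98.1% and 99.7% of the performance.
Reference translation (metadata) (2025-09-29T12:08:33Z)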
RAT: Bridging RNN Efficiency and Attention Accuracy in Language Modeling [17.437929000395112]
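We introduce RAT, an intermediate design between recurrence and attention mechanisms. It partitions the input into chunks, applies a simple linear recurrence within each chunk to capture local dependencies, and then performs softmax attention across chunks to model long-range interactions. With a chunk size of 16, the RAT layer improves training speed by 7× on 100K-token sequences and generation speed by 9× at a 4K sequence length.
Reference translation (metadata) (2025-07-06T15:08:49Z)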
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free [81.65559031466452]
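We conduct experiments to investigate variants of gating-augmented softmax attention. We find that a simple modification, applying a head-specific sigmoid gate after Scaled Dot-Product Attention (SDPA), improves performance.
Reference translation (metadata) (2025-05-10T17:15:49Z)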
Bridging the Divide: Reconsidering Softmax and Linear Attention [116.34723260730405]
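We present two key perspectives for understanding and alleviating the limitations of linear attention. First, we prove that linear attention is not injective and tends to assign identical attention weights to different query vectors. Second, we confirm that effective local modeling, which linear attention lacks, is essential to the success of softmax attention.
Reference translation (metadata) (2024-12-09T15:44:22Z)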
S2-Attention: Hardware-Aware Context Sharding Among Attention Heads [49.1454481007861]
Sparse attention selectively attends to a subset of the tokens in the context. It remains unclear whether sparse attention can maintain model quality in today's large language models. This paper presents Sparsely-Sharded (S2) attention, a Triton library that provides kernel optimization for sparse attention customizable at both per-head and per-context-range levels.
Reference translation (metadata) (2024-07-25T00:27:07Z)
Short-Long Convolutions Help Hardware-Efficient Linear Attention to Focus on Long Sequences [60.489682735061415]
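We propose CHELA, which replaces state space models with short convolutions. To demonstrate the effectiveness of the proposed method, we conduct experiments on the Long Range Arena benchmark and on language modeling tasks.
Reference translation (metadata) (2024-06-12T12:12:38Z)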
Lightning Attention-2: A Free Lunch for Handling Unlimited Sequence Lengths in Large Language Models [20.78813311569383]
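We introduce Lightning Attention, the first linear attention implementation that realizes the theoretical computational benefits of linear attention. Specifically, it applies the conventional attention mechanism within blocks and the linear attention kernel trick across blocks. We conduct a range of experiments across different model sizes and sequence lengths.
Reference translation (metadata) (2024-01-09T16:27:28Z)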
RFAConv: Innovating Spatial Attention and Standard Convolutional Operation [7.2646541547165056]
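We propose a new attention mechanism called Receptive-Field Attention (RFA). RFA not only focuses on receptive-field spatial features but also provides effective attention weights for large convolutional kernels. It adds an almost negligible increase in computational cost and parameters while significantly improving network performance.
Reference translation (metadata) (2023-04-06T16:21:56Z)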
UNETR++: Delving into Efficient and Accurate 3D Medical Image Segmentation [93.88170217725805]
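We propose UNETR++, a 3D medical image segmentation approach that offers both high-quality segmentation masks and efficiency in terms of parameters, compute cost, and inference speed. The core of our design is a novel efficient paired attention (EPA) block that efficiently learns spatial and channel-wise discriminative features. Evaluations on five benchmarks (Synapse, BTCV, ACDC, BRaTs, and Decathlon-Lung) show the effectiveness of our contributions in terms of both efficiency and accuracy.
Reference translation (metadata) (2022-12-08T18:59:57Z)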
Rethinking Query-Key Pairwise Interactions in Vision Transformers [5.141895475956681]
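We propose key-only attention, which excludes query-key pairwise interactions and uses a compute-efficient saliency gate to obtain attention weights. We develop LinGlos, a new family of self-attention models that reaches state-of-the-art accuracy in the parameter-limited setting of the ImageNet classification benchmark.
Reference translation (metadata) (2022-07-01T03:36:49Z)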
Towards Joint Intent Detection and Slot Filling via Higher-order Attention [47.78365472691051]
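Intent detection (ID) and slot filling (SF) are two major tasks in spoken language understanding (SLU). We propose a bilinear attention block that exploits both contextual and channel-wise bilinear attention distributions. We show that our approach yields improvements over state-of-the-art approaches.
Reference translation (metadata) (2021-09-18T09:50:23Z)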
Unlocking Pixels for Reinforcement Learning via Implicit Attention [61.666538764049854]
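We make use of a new efficient attention algorithm recently shown to be highly effective for Transformers. This allows attention-based controllers to scale to larger visual inputs and facilitates the use of smaller patches. In addition, we propose an algorithm that approximates softmax attention with hybrid random features.
Reference translation (metadata) (2021-02-08T17:00:26Z)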
AttentionNAS: Spatiotemporal Attention Cell Search for Video Classification [86.64702967379709]
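We propose a new search space for spatiotemporal attention cells, which allows the search algorithm to flexibly explore various design choices in the cell. The discovered attention cells can be seamlessly inserted into existing backbone networks (e.g., I3D or S3D) and improve video classification accuracy by more than 2% on both the Kinetics-600 and MiT datasets.
Reference translation (metadata) (2020-07-23T14:30:05Z)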

The related papers list is generated automatically from the titles and abstracts of papers on this site.

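This page shows information on the specified paper.
The operator of this site does not guarantee the quality of this site (including all information and translations) and accepts no responsibility for any results arising from the use of this site (including all information and translations).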