Fugu-MT paper translation (abstract): FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Paper summary: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

arxiv url: http://arxiv.org/abs/2307.08691v1
Date: Mon, 17 Jul 2023 17:50:36 GMT
Status: translation complete
Last updated in system: 2023-07-18 11:46:03.108856
Title: FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Title (reference translation): FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Authors: Tri Dao
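Abstract summary: FlashAttention exploits the asymmetric GPU memory hierarchy to deliver significant memory savings and a runtime speedup. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. To address these issues, the paper proposes FlashAttention-2, with better work partitioning.
Reference score (independently computed interest score): 11.508362885430133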
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Scaling Transformers to longer sequence lengths has been a major problem in the last several years, promising to improve performance in language modeling and high-resolution image understanding, as well as to unlock new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, as its runtime and memory increase quadratically in the sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to bring significant memory saving (linear instead of quadratic) and runtime speedup (2-4$\times$ compared to optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. We observe that the inefficiency is due to suboptimal work partitioning between different thread blocks and warps on the GPU, causing either low-occupancy or unnecessary shared memory reads/writes. We propose FlashAttention-2, with better work partitioning to address these issues. In particular, we (1) tweak the algorithm to reduce the number of non-matmul FLOPs, (2) parallelize the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distribute the work between warps to reduce communication through shared memory. These yield around 2$\times$ speedup compared to FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and getting close to the efficiency of GEMM operations. We empirically validate that when used end-to-end to train GPT-style models, FlashAttention-2 reaches training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization).
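Abstract (reference translation): Scaling Transformers to longer sequence lengths has been a major problem over the last several years; it promises better performance in language modeling and high-resolution image understanding, and new applications in code, audio, and video generation. The attention layer is the main bottleneck in scaling to longer sequences, because its runtime and memory grow quadratically with sequence length. FlashAttention exploits the asymmetric GPU memory hierarchy to deliver significant memory savings (linear instead of quadratic) and a runtime speedup (2-4$\times$ over optimized baselines), with no approximation. However, FlashAttention is still not nearly as fast as optimized matrix-multiply (GEMM) operations, reaching only 25-40% of the theoretical maximum FLOPs/s. This inefficiency stems from suboptimal work partitioning between different thread blocks and warps on the GPU, which causes either low occupancy or unnecessary shared-memory reads/writes. To address these issues, the paper proposes FlashAttention-2, with better work partitioning. Specifically, it (1) tweaks the algorithm to reduce the number of non-matmul FLOPs, (2) parallelizes the attention computation, even for a single head, across different thread blocks to increase occupancy, and (3) within each thread block, distributes the work between warps to reduce communication through shared memory. Together these yield around a 2$\times$ speedup over FlashAttention, reaching 50-73% of the theoretical maximum FLOPs/s on A100 and approaching the efficiency of GEMM operations. The authors empirically validate that, when used end-to-end to train GPT-style models, FlashAttention-2 reaches a training speed of up to 225 TFLOPs/s per A100 GPU (72% model FLOPs utilization).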
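
The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea it refers to: exact attention computed block by block with a running ("online") softmax, so that the full score matrix is never stored. It also keeps the output accumulator unnormalized inside the loop and divides once at the end, which is one way to reduce non-matmul work. This is an assumption about the kind of tweak meant, not the paper's kernel. All function names (`naive_attention`, `tiled_attention`, `random_matrix`, `max_abs_diff`) and the `block_size` parameter are hypothetical, and the sketch is pure Python on the CPU; it says nothing about thread blocks, warps, or shared memory.

```python
import math
import random


def naive_attention(Q, K, V):
    """Reference: softmax(Q K^T / sqrt(d)) V, building each full score row."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in K]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append([sum(w * v[j] for w, v in zip(weights, V)) / total
                    for j in range(len(V[0]))])
    return out


def tiled_attention(Q, K, V, block_size=4):
    """Same result, but K/V are visited in blocks (hypothetical block_size).

    Per query we keep a running max, a running softmax denominator, and an
    unnormalized output accumulator; the division happens once at the end.
    """
    d = len(Q[0])
    dv = len(V[0])
    out = []
    for q in Q:
        running_max = -math.inf
        running_sum = 0.0
        acc = [0.0] * dv
        for start in range(0, len(K), block_size):
            k_blk = K[start:start + block_size]
            v_blk = V[start:start + block_size]
            scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                      for k in k_blk]
            new_max = max(running_max, max(scores))
            # Rescale earlier partial results to the new max.
            # On the first block this is exp(-inf) == 0.0.
            scale = math.exp(running_max - new_max)
            weights = [math.exp(s - new_max) for s in scores]
            running_sum = running_sum * scale + sum(weights)
            acc = [a * scale + sum(w * v[j] for w, v in zip(weights, v_blk))
                   for j, a in enumerate(acc)]
            running_max = new_max
        # Single normalization per query row, deferred to the end.
        out.append([a / running_sum for a in acc])
    return out


def random_matrix(rows, cols, rng):
    """Small helper for the demo: a rows x cols list of floats in [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def max_abs_diff(A, B):
    """Largest elementwise absolute difference between two matrices."""
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))


if __name__ == "__main__":
    rng = random.Random(0)
    Q = random_matrix(3, 8, rng)
    K = random_matrix(10, 8, rng)
    V = random_matrix(10, 8, rng)
    diff = max_abs_diff(naive_attention(Q, K, V), tiled_attention(Q, K, V))
    print("max abs difference:", diff)
```

The demo compares the blocked version against the naive one on random inputs; the two should agree up to floating-point rounding, which is what "no approximation" means here. The sequence length (10) is deliberately not a multiple of the block size (4) so that a short final block is exercised.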

Related papers

APB: Accelerating Distributed Long-Context Inference by Passing Compressed Context Blocks across GPUs [81.5049387116454]
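We introduce APB, an efficient long-context inference framework. APB uses multi-host approximate attention to increase prefill speed. APB achieves speedups of up to 9.2x, 4.2x, and 1.6x compared with FlashAttn, RingAttn, and StarAttn, respectively.
Paper reference translation (metadata) (2025-02-17T17:59:56Z)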
Breaking the Memory Barrier: Near Infinite Batch Size Scaling for Contrastive Loss [59.835032408496545]
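This paper proposes a tile-based strategy that partitions the contrastive loss computation into arbitrarily small blocks. It also introduces a multi-level tiling strategy to exploit the hierarchical structure of distributed systems. Compared with SOTA memory-efficient solutions, it achieves a two-order-of-magnitude reduction in memory while maintaining comparable speed.
Paper reference translation (metadata) (2024-10-22T17:59:30Z)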
FlashMask: Efficient and Rich Mask Extension of FlashAttention [22.810595298076866]
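FlashMask is an extension of FlashAttention that introduces a column-wise sparse representation of attention masks. By adopting this new representation, FlashMask achieves linear memory complexity $O(N)$, making it suitable for modeling long-context sequences. The paper evaluates FlashMask's performance in LLM fine-tuning and alignment training such as SFT, LoRA, DPO, and RM.
Paper reference translation (metadata) (2024-10-02T09:17:26Z)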
vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large language models (LLMs) are widely used across various domains, processing millions of daily requests.
Paper reference translation (metadata) (2024-07-22T14:37:58Z)
Efficient Video Object Segmentation via Modulated Cross-Attention Memory [123.12273176475863]
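We propose MAVOS, a transformer-based approach that models temporal smoothness without requiring frequent memory expansion. Our MAVOS achieves a J&F score of 63.3% while running at 37 frames per second (FPS) on a single V100 GPU.
Paper reference translation (metadata) (2024-03-26T17:59:58Z)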
A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library [0.7366405857677227]
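We provide an optimized implementation of the forward pass of FlashAttention-2 as a custom fused kernel targeting the NVIDIA Hopper architecture. We observe 20-50% higher FLOPs/s than a version of FlashAttention-2 optimized for the previous-generation NVIDIA Ampere architecture.
Paper reference translation (metadata) (2023-12-19T07:56:25Z)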
DISTFLASHATTN: Distributed Memory-efficient Attention for Long-context LLMs Training [82.06732962485754]
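FlashAttention reduces peak memory usage from quadratic to linear when training transformer-based large language models (LLMs) on a single GPU. This work introduces DISTFLASHATTN, a distributed memory-efficient attention mechanism optimized for long-context LLM training. Compared with the recent Ring Attention and DeepSpeed-Ulysses, it achieves speedups of 1.67x and 1.26-1.88x, respectively.
Paper reference translation (metadata) (2023-10-05T03:47:57Z)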
Simple Hardware-Efficient Long Convolutions for Sequence Modeling [18.3719016967593]
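State space models (SSMs) achieve high performance on long sequence modeling. We study whether a simple alternative can match SSMs in performance and efficiency. We develop FlashButterfly, an IO-aware algorithm that improves the runtime performance of long convolutions.
Paper reference translation (metadata) (2023-02-13T19:19:23Z)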
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness [80.3586155104237]
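FlashAttention is an IO-aware exact attention algorithm for Transformers. It reduces the number of memory reads/writes between GPU high-bandwidth memory (HBM) and GPU on-chip SRAM. FlashAttention and block-sparse FlashAttention enable longer context in Transformers.
Paper reference translation (metadata) (2022-05-27T17:53:09Z)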
Efficient Video Semantic Segmentation with Labels Propagation and Refinement [138.55845680523908]
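This paper tackles the problem of real-time semantic segmentation of high-definition videos using a hybrid GPU/CPU approach. (i) On the CPU, a very fast optical flow method is used to exploit the temporal aspect of the video and propagate semantic information from one frame to the next. On the popular Cityscapes dataset with high-resolution frames (2048 x 1024), the proposed operating points range from 80 to 1000 Hz on a single GPU and CPU.
Paper reference translation (metadata) (2019-12-26T11:45:15Z)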

The related papers list is generated automatically from the titles and abstracts of papers on this site.
