Fugu-MT Paper Translation (Abstract): VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

Paper abstract: VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation

arxiv url: http://arxiv.org/abs/2604.12798v1
Date: Tue, 14 Apr 2026 14:28:50 GMT
Status: Translation complete
Last updated in system: 2026-04-15 19:11:32.49755
Title: VFA: Relieving Vector Operations in Flash Attention with Global Maximum Pre-computation
Authors: Yupeng Sun, Yanzhao Li, Zhiqiang Zou, Bai Du, Zhiyuan Zhang, Hui Dong, Gaoyige Fan, Hui Wang
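Abstract summary: FlashAttention-style online softmax enables exact attention computation with linear memory. The non-matmul components of online softmax can become vector or SIMD limited and dominate latency. This paper proposes Vector Relieved Flash Attention (VFA).
Reference score (independently computed attention metric): 5.279829639786756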
License: http://creativecommons.org/licenses/by/4.0/
Abstract: FlashAttention-style online softmax enables exact attention computation with linear memory by streaming score tiles through on-chip memory and maintaining a running maximum and normalizer. However, as attention kernels approach peak tensor-core/cube-core throughput on modern accelerators, non-matmul components of online softmax -- especially per-tile rowmax and rowsum reductions and rescale chains -- can become vector or SIMD limited and dominate latency. This paper revisits FlashAttention and proposes Vector Relieved Flash Attention (VFA), a hardware-friendly method that reduces rowmax-driven updates of the running maximum while retaining the online-softmax structure. VFA initializes the running maximum via a cheap approximation from key-block representations, reorders key-block traversal to prioritize high-impact sink and local blocks, and freezes the maximum for remaining blocks to avoid repeated reductions and rescaling. We further integrate VFA with block-sparse skipping methods such as BLASST to form Vector Relieved Sparse Attention (VSA), which reduces both block count and per-block overhead. Notably, VFA and VSA completely avoid the conditional rescale operation in the update stage used in FA4.0. Extensive evaluations on benchmarks including MMLU and MATH500, together with attention statistics, verify our design: (i) sink and local reordering stabilizes the running maximum early; (ii) simple Q and K block summaries fail due to intra-block heterogeneity; (iii) m-initialization is required when maxima appear in middle blocks. Overall, VFA and VSA efficiently alleviate online-softmax reduction bottlenecks without performance loss. Compared to the C16V32 baseline, C8V32, C4V32 and C4V16 achieve nearly two times speedup on modern hardware while hitting the vector bottleneck. With upcoming architecture improvements, C4V16 will deliver six times speedup by enhancing exponent capacity.
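The contrast the abstract draws, between a running maximum that is updated per tile and a maximum that is fixed up front, can be illustrated with a small NumPy sketch. This is not the authors' kernel: the function names, the block size, and the use of the exact row maximum as a stand-in for the paper's cheap key-block approximation are all assumptions made for illustration, and the usual 1/sqrt(d) score scaling, sink/local block reordering, and BLASST-style block skipping are omitted.

```python
import numpy as np

def reference_attention(q, k, v):
    """Plain softmax attention over the full score matrix."""
    s = q @ k.T
    p = np.exp(s - s.max(axis=1, keepdims=True))
    return (p / p.sum(axis=1, keepdims=True)) @ v

def online_softmax_attention(q, k, v, block=4):
    """FlashAttention-style streaming: per-tile rowmax, rowsum, and rescale."""
    m = np.full((q.shape[0], 1), -np.inf)    # running row maximum
    l = np.zeros((q.shape[0], 1))            # running normalizer
    acc = np.zeros((q.shape[0], v.shape[1])) # unnormalized output
    for start in range(0, k.shape[0], block):
        s = q @ k[start:start + block].T
        m_new = np.maximum(m, s.max(axis=1, keepdims=True))  # rowmax reduction
        scale = np.exp(m - m_new)                            # rescale factor
        p = np.exp(s - m_new)
        l = l * scale + p.sum(axis=1, keepdims=True)
        acc = acc * scale + p @ v[start:start + block]
        m = m_new
    return acc / l

def frozen_max_attention(q, k, v, m_fixed, block=4):
    """Same streaming loop with a fixed maximum: no rowmax, no rescale.

    m_fixed is a hypothetical pre-computed per-row value. Softmax is invariant
    to the subtracted constant, so any m_fixed that avoids overflow or
    underflow in exp gives the same result up to rounding.
    """
    l = np.zeros((q.shape[0], 1))
    acc = np.zeros((q.shape[0], v.shape[1]))
    for start in range(0, k.shape[0], block):
        s = q @ k[start:start + block].T
        p = np.exp(s - m_fixed)
        l += p.sum(axis=1, keepdims=True)
        acc += p @ v[start:start + block]
    return acc / l

rng = np.random.default_rng(0)
q = rng.standard_normal((3, 8))
k = rng.standard_normal((16, 8))
v = rng.standard_normal((16, 5))
# Assumption: exact row maximum used in place of the paper's approximation.
m_fixed = (q @ k.T).max(axis=1, keepdims=True)
out_online = online_softmax_attention(q, k, v)
out_frozen = frozen_max_attention(q, k, v, m_fixed)
```

In this sketch the frozen-max loop drops the per-tile `max` reduction and the `scale` multiplications, which are the vector-side operations the abstract identifies as the bottleneck. In a real kernel the quality of the pre-computed maximum matters for numerical range, which is presumably why the abstract stresses m-initialization and block reordering.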

List of related papers