Fugu-MT Paper Translation (Abstract): Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

Paper overview: Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants

arxiv url: http://arxiv.org/abs/2511.02043v1
Date: Mon, 03 Nov 2025 20:25:19 GMT
Status: Translation complete
Last updated in system: 2025-11-05 18:47:05.672538
Title: Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
Title (reference translation): Flashlight: PyTorch Compiler Extensions to Accelerate Attention Variants
Authors: Bozhi You, Irene Wang, Zelal Su Mustafaoglu, Abhinav Jangda, Angélica Moreira, Roshan Dathathri, Divya Mahajan, Keshav Pingali
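Abstract summary: We introduce Flashlight, a compiler-native framework within the PyTorch ecosystem. It automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs. The results show that Flashlight produces kernels with performance competitive with or superior to FlexAttention.
Reference score (independently computed interest level): 2.9955129797385482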
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Attention is a fundamental building block of large language models (LLMs), so there have been many efforts to implement it efficiently. For example, FlashAttention leverages tiling and kernel fusion to optimize attention. Recently, a number of variants of attention have been introduced to enhance model quality or efficiency. Supporting them efficiently remains difficult since they usually require specialized kernels or hand-tuned implementations. FlexAttention recently addressed part of this gap by using static programming templates to support FlashAttention-like kernels for a subset of attention variants. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs, without relying on static templates or predefined kernel specializations. Flashlight leverages PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution for diverse attention patterns. Not only does it support all variants expressible in the FlexAttention model but it also handles more general, data-dependent attention formulations that are beyond the capabilities of FlexAttention. Our results show that Flashlight produces kernels with competitive or superior performance to FlexAttention, while offering the flexibility of native PyTorch code, enabling developers to rapidly explore new attention models without sacrificing performance.
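Abstract (reference translation): Attention is a fundamental building block of large language models (LLMs), so many efforts have been made to implement it efficiently. For example, FlashAttention uses tiling and kernel fusion to optimize attention. Recently, a number of attention variants have been introduced to improve model quality or efficiency. Supporting them efficiently remains difficult because they usually require specialized kernels or hand-tuned implementations. FlexAttention addressed part of this gap by using static programming templates to support FlashAttention-like kernels. In this paper, we introduce Flashlight, a compiler-native framework within the PyTorch ecosystem that automatically generates fused, FlashAttention-style kernels for arbitrary attention-based programs without relying on static templates or predefined kernel specializations. Flashlight uses PyTorch's compilation workflow to fuse and tile attention computations transparently, enabling efficient execution of diverse attention patterns. It not only supports all variants expressible in the FlexAttention model but also handles more general, data-dependent attention formulations beyond FlexAttention's capabilities. The results show that Flashlight produces kernels with performance competitive with or superior to FlexAttention while offering the flexibility of native PyTorch code.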
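
The abstract mentions that FlashAttention-style kernels rely on tiling and fusion. As a rough illustration of the tiling idea only, the pure-Python sketch below computes attention for a single query vector in two ways: a naive version that materializes every score before the softmax, and a tiled version that walks over keys in small blocks while keeping a running maximum, running normalizer, and running output (the "online softmax" trick). This is not Flashlight's generated code or API; the function names and the `score_mod` hook (a hypothetical callback that adjusts each score to express an attention variant) are assumptions made for this example.

```python
import math


def _score(q, k, j, scale, score_mod):
    # Scaled dot product, optionally adjusted by a per-key callback
    # (hypothetical hook standing in for an "attention variant").
    s = scale * sum(a * b for a, b in zip(q, k))
    return score_mod(s, j) if score_mod is not None else s


def naive_attention(q, keys, values, score_mod=None):
    """Reference version: build all scores, softmax, then weighted sum."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [_score(q, k, j, scale, score_mod) for j, k in enumerate(keys)]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    return [sum(w * v[d] for w, v in zip(weights, values)) / total
            for d in range(len(values[0]))]


def tiled_attention(q, keys, values, tile=2, score_mod=None):
    """Tiled version: never holds more than one tile of scores at a time."""
    scale = 1.0 / math.sqrt(len(q))
    running_max = -math.inf
    running_sum = 0.0
    acc = [0.0] * len(values[0])
    for start in range(0, len(keys), tile):
        idx = range(start, min(start + tile, len(keys)))
        scores = [_score(q, keys[j], j, scale, score_mod) for j in idx]
        new_max = max(running_max, max(scores))
        if new_max == -math.inf:
            continue  # every key seen so far is masked out
        # Rescale what was accumulated under the old maximum.
        correction = math.exp(running_max - new_max)
        running_sum *= correction
        acc = [a * correction for a in acc]
        for j, s in zip(idx, scores):
            w = math.exp(s - new_max)
            running_sum += w
            acc = [a + w * v for a, v in zip(acc, values[j])]
        running_max = new_max
    return [a / running_sum for a in acc]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -1.0], [2.0, 0.3]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
    # Example variant: a linear positional bias on each score.
    bias = lambda s, j: s - 0.5 * j
    print(naive_attention(q, keys, values, score_mod=bias))
    print(tiled_attention(q, keys, values, tile=2, score_mod=bias))
```

Both functions should agree up to floating-point rounding for any tile size. A real fused kernel applies the same rescaling across blocks of queries and keys on the GPU, which is the part a compiler-based approach would aim to generate automatically.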
