Fugu-MT Paper Translation (Abstract): From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR

Paper overview: From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR

arxiv url: http://arxiv.org/abs/2510.14871v1
Date: Thu, 16 Oct 2025 16:49:05 GMT
Status: Translation complete
Last updated in system: 2025-10-17 21:15:14.955026
Title: From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR
Title (reference translation): From Loop Nests to Silicon: Mapping AI Workloads onto AMD NPUs with MLIR-AIR
Authors: Erwei Wang, Samuel Bayliss, Andra Bisca, Zachary Blair, Sangeeta Chowdhary, Kristof Denolf, Jeff Fifield, Brandon Freiberger, Erika Hunhoff, Phil James-Roxby, Jack Lo, Joseph Melber, Stephen Neuendorffer, Eddie Richter, Andre Rosti, Javier Setoain, Gagandeep Singh, Endri Taka, Pranathi Vasireddy, Zhewen Yu, Niansong Zhang, Jinming Zhuang
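Abstract summary: General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR. We demonstrate MLIR-AIR's capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model.
Reference score (independently calculated attention score): 6.2957456904504525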
License: http://creativecommons.org/licenses/by/4.0/
Abstract: General-purpose compilers abstract away parallelism, locality, and synchronization, limiting their effectiveness on modern spatial architectures. As modern computing architectures increasingly rely on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data to fully exploit such architectures. We introduce MLIR-AIR, a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives allow the compiler to orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. We demonstrate MLIR-AIR's capabilities through two case studies: matrix multiplication and the multi-head attention block from the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations with performance almost identical to state-of-the-art, hand-optimized matrix multiplication written using the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, we demonstrate that the AIR interface supports fused implementations using approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently utilize the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap through compiler-managed scheduling.
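Abstract (reference translation): General-purpose compilers abstract away parallelism, locality, and synchronization, which limits their effectiveness on modern spatial architectures. Because modern computing architectures increasingly depend on fine-grained control over data movement, execution order, and compute placement for performance, compiler infrastructure must provide explicit mechanisms for orchestrating compute and data in order to fully exploit such architectures. MLIR-AIR is a novel, open-source compiler stack built on MLIR that bridges the semantic gap between high-level workloads and fine-grained spatial architectures such as AMD's NPUs. MLIR-AIR defines the AIR dialect, which provides structured representations for asynchronous and hierarchical operations across compute and memory resources. AIR primitives let the compiler orchestrate spatial scheduling, distribute computation across hardware regions, and overlap communication with computation without relying on ad hoc runtime coordination or manual scheduling. The capabilities of MLIR-AIR are shown through two case studies: matrix multiplication and the multi-head attention block of the LLaMA 2 model. For matrix multiplication, MLIR-AIR achieves up to 78.7% compute efficiency and generates implementations whose performance is almost identical to state-of-the-art, hand-optimized matrix multiplication written with the lower-level, close-to-metal MLIR-AIE framework. For multi-head attention, the AIR interface is shown to support fused implementations in approximately 150 lines of code, enabling tractable expression of complex workloads with efficient mapping to spatial hardware. MLIR-AIR transforms high-level structured control flow into spatial programs that efficiently use the compute fabric and memory hierarchy of an NPU, leveraging asynchronous execution, tiling, and communication overlap.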

Related papers