Fugu-MT Paper Translation (Abstract): HipKittens: Fast and Furious AMD Kernels

Paper overview: HipKittens: Fast and Furious AMD Kernels

arxiv url: http://arxiv.org/abs/2511.08083v1
Date: Wed, 12 Nov 2025 01:38:41 GMT
Status: Translation complete
Last updated in system: 2025-11-12 20:17:03.63448
Title: HipKittens: Fast and Furious AMD Kernels
Authors: William Hu, Drew Wadsworth, Sean Siddens, Stanley Winata, Daniel Y. Fu, Ryann Swann, Muhammad Osama, Christopher Ré, Simran Arora
Abstract summary: This paper presents the first detailed study of the programming primitives that lead to performant AMD AI kernels. The findings help pave the way for a single, tile-based software layer for high-performance AI kernels.
Reference score (independently computed attention score): 36.63732085611713
License: http://creativecommons.org/publicdomain/zero/1.0/
Abstract: AMD GPUs offer state-of-the-art compute and memory bandwidth; however, peak-performance AMD kernels are written in raw assembly. To address the difficulty of mapping AI algorithms to hardware, recent work proposes C++-embedded, PyTorch-inspired domain-specific languages like ThunderKittens (TK) to simplify high-performance AI kernel development on NVIDIA hardware. We explore the extent to which such primitives -- for explicit tile-based programming with optimized memory accesses and fine-grained asynchronous execution across workers -- are NVIDIA-specific or general. We provide the first detailed study of the programming primitives that lead to performant AMD AI kernels, and we encapsulate these insights in the HipKittens (HK) programming framework. We find that tile-based abstractions used in prior DSLs generalize to AMD GPUs; however, we need to rethink the algorithms that instantiate these abstractions for AMD. We validate the HK primitives across CDNA3 and CDNA4 AMD platforms. In evaluations, HK kernels compete with AMD's hand-optimized assembly kernels for GEMMs and attention, and consistently outperform compiler baselines. Moreover, assembly is difficult to scale to the breadth of AI workloads; reflecting this, in some settings HK outperforms all available kernel baselines by $1.2-2.4\times$ (e.g., $d=64$ attention, GQA backwards, memory-bound kernels). These findings help pave the way for a single, tile-based software layer for high-performance AI kernels that translates across GPU vendors. HipKittens is released at: https://github.com/HazyResearch/HipKittens.
