Fugu-MT Paper Translation (Abstract): AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Paper overview: AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

arxiv url: http://arxiv.org/abs/2603.21331v1
Date: Sun, 22 Mar 2026 17:15:28 GMT
Status: Translation complete
Last updated in system: 2026-03-24 19:11:39.361452
Title: AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search
Title (reference translation): AutoKernel: Automated GPU Kernel Optimization via Iterative Agent-Driven Search
Authors: Jaber Jaber, Osama Jaber
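Abstract summary: AutoKernel is a framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. On an NVIDIA H100, the authors' Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations.
Reference score (attention score computed independently by this site): 0.0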
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, while beating torch.compile by 2.83x, 3.44x, and 2.94x respectively. In community deployment, an AutoKernel-optimized kernel achieved first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.
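Abstract (reference translation): Writing high-performance GPU kernels is one of the most labor-intensive tasks in machine learning systems engineering. We propose AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by their Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations over hundreds of experiments without human intervention. A five-stage correctness harness covering smoke tests, shape sweeps, numerical stability, determinism verification, and edge-case coverage ensures that every candidate kernel is validated before any speedup is recorded. The system comprises over 9,000 lines of Python, 18 starter kernel implementations across two backends, a six-tier optimization playbook, and integration with the KernelBench benchmark suite. AutoKernel covers nine kernel types spanning the dominant operations in modern transformer architectures. On an NVIDIA H100, our Triton kernels outperform both PyTorch eager and torch.compile (max-autotune) on the majority of tested configurations: 5.29x over eager on RMSNorm, 2.82x on softmax, and 2.21x on cross-entropy, and 2.83x, 3.44x, and 2.94x over torch.compile, respectively. In community deployment, an AutoKernel-optimized kernel took first place on the vectorsum_v2 B200 leaderboard. The full system is available at https://github.com/RightNow-AI/autokernel.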
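The five-stage correctness harness is named in the abstract but not specified there. As a rough illustration of the gating idea (validate first, record a speedup only afterwards), the sketch below checks a candidate softmax against a reference in pure Python. The stage names follow the abstract; the concrete checks, tolerances, and function names (`validate`, `speedup_if_valid`, `naive_softmax`) are assumptions made for this example and do not reflect AutoKernel's actual implementation, which targets Triton and CUDA C++ kernels.

```python
import math
import random


def reference_softmax(xs):
    """Numerically stable reference: subtract the max before exponentiating."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def naive_softmax(xs):
    """Deliberately unstable candidate: overflows for large inputs."""
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def allclose(a, b, tol=1e-9):
    return len(a) == len(b) and all(abs(x - y) <= tol for x, y in zip(a, b))


def validate(candidate, reference=reference_softmax):
    """Run the five stages in order; return (passed, name_of_first_failed_stage)."""
    rng = random.Random(0)  # fixed seed so the sweep is reproducible

    def agrees(xs):
        try:
            return allclose(candidate(xs), reference(xs))
        except (OverflowError, ZeroDivisionError, ValueError):
            return False

    stages = [
        ("smoke", lambda: agrees([0.1, 0.2, 0.3])),
        ("shape_sweep", lambda: all(
            agrees([rng.uniform(-5, 5) for _ in range(n)]) for n in (2, 7, 64, 1000))),
        ("numerical_stability", lambda: agrees([1000.0, 999.0, -1000.0])),
        ("determinism", lambda: candidate([0.5, -0.5, 2.0]) == candidate([0.5, -0.5, 2.0])),
        ("edge_cases", lambda: agrees([3.0]) and agrees([1.0] * 8)),
    ]
    for name, check in stages:
        if not check():
            return False, name
    return True, None


def speedup_if_valid(candidate, baseline_seconds, candidate_seconds):
    """Report a speedup only for candidates that pass every stage; otherwise None."""
    passed, _ = validate(candidate)
    return baseline_seconds / candidate_seconds if passed else None
```

In this sketch the naive candidate passes the smoke test and shape sweep but is rejected at the numerical-stability stage, so no speedup is ever reported for it, however fast it runs.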

Related papers