Fugu-MT Paper Translation (Abstract): GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

Paper abstract: GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs

arxiv url: http://arxiv.org/abs/2512.22147v1
Date: Mon, 15 Dec 2025 07:20:15 GMT
Status: Translation complete
System update date: 2026-01-04 08:45:17.072663
Title: GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs
Authors: Ruifan Chu, Anbang Wang, Xiuxiu Bai, Shuai Liu, Xiaoshe Dong
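Abstract summary: Large language model methods often assume kernels can be compiled and executed cheaply. We present an end-to-end LLM framework with performance feedback that optimizes kernels without building the full application. The framework integrates Automatic Error Repair and Performance Pattern Inheritance to fix faults, preserve correctness, reuse effective tiling/memory/synchronization strategies, and reduce search cost.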
Reference score (independently computed attention score): 5.25288153386589
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: In high-performance computing, hotspot GPU kernels are primary bottlenecks, and expert manual tuning is costly and hard to port. Large language model methods often assume kernels can be compiled and executed cheaply, which fails in large applications where full builds and runs are expensive. We present an end-to-end LLM framework with performance feedback that optimizes kernels without building the full application. From independently extracted hotspot kernels, it automatically completes code into a Minimal Executable Program (MEP), then performs multi-round iterative optimization and evaluation outside the full application. The framework integrates Automatic Error Repair and Performance Pattern Inheritance to fix faults, preserve correctness, reuse effective tiling/memory/synchronization strategies, and reduce search cost. Optimized variants are reintegrated into the original application for validation. We evaluate on NVIDIA GPUs and the Haiguang Deep Computing Unit (DCU) platform (AMD-licensed architecture) using PolyBench, the AMD APP SDK, and hotspot kernels from large-scale supercomputing applications. The method achieves average speedups of 5.05x (PolyBench on NVIDIA), 7.77x (PolyBench on DCU), 1.77x (AMD APP SDK), and 1.25x on three hotspot kernels, surpassing direct LLM optimization. The approach requires no full-source dependencies, offers cross-platform portability, and enables practical, low-cost GPU kernel optimization.
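The multi-round loop described in the abstract (propose a variant, evaluate it in a minimal harness, repair failures, carry forward patterns that helped) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: `Candidate`, `Result`, `optimize`, and the `toy_*` stubs are hypothetical names, the LLM calls and the real build/run step are replaced by plain Python callables, and the runtimes are made up. Reintegration into the original application is not modeled.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Candidate:
    """One kernel variant; `source` stands in for generated kernel code."""
    source: str
    patterns: List[str] = field(default_factory=list)  # e.g. ["tiling"]


@dataclass
class Result:
    ok: bool                 # built and matched the reference output
    runtime_ms: float = 0.0  # measured inside the minimal harness
    error: str = ""          # diagnostic fed back to the repair step


def optimize(
    baseline: Candidate,
    propose: Callable[[Candidate, List[str]], Candidate],
    repair: Callable[[Candidate, str], Candidate],
    evaluate: Callable[[Candidate], Result],
    rounds: int = 3,
    max_repairs: int = 2,
) -> Tuple[Candidate, float]:
    """Return the best variant found and its speedup over the baseline."""
    base = evaluate(baseline)
    if not base.ok:
        raise ValueError("baseline must build and run correctly")
    best, best_ms = baseline, base.runtime_ms
    inherited: List[str] = []  # patterns that improved an earlier round
    for _ in range(rounds):
        cand = propose(best, inherited)
        res = evaluate(cand)
        attempts = 0
        # Error repair: feed the failure back, up to a fixed budget.
        while not res.ok and attempts < max_repairs:
            cand = repair(cand, res.error)
            res = evaluate(cand)
            attempts += 1
        # Keep only correct variants that are faster than the current best.
        if res.ok and res.runtime_ms < best_ms:
            best, best_ms = cand, res.runtime_ms
            for p in cand.patterns:
                if p not in inherited:
                    inherited.append(p)
    return best, base.runtime_ms / best_ms


# --- Toy stand-ins (assumptions) for the LLM and the build/run step ---
KNOWN = ["tiling", "shared_memory", "sync_reduction"]


def toy_evaluate(c: Candidate) -> Result:
    if "BUG" in c.source:
        return Result(ok=False, error="compile error")
    # Synthetic cost model: each applied pattern makes the kernel faster.
    return Result(ok=True, runtime_ms=100.0 / (1 + len(c.patterns)))


def toy_propose(best: Candidate, inherited: List[str]) -> Candidate:
    fresh = [p for p in KNOWN if p not in inherited][:1]
    # Every proposal starts out broken, to exercise the repair path.
    return Candidate(best.source + " BUG", inherited + fresh)


def toy_repair(c: Candidate, error: str) -> Candidate:
    return Candidate(c.source.replace(" BUG", ""), c.patterns)
```

Calling `optimize(Candidate("kernel"), toy_propose, toy_repair, toy_evaluate)` with these stubs returns a variant carrying all three toy patterns and a speedup of 4.0 on the synthetic cost model; that number reflects only the made-up runtimes and says nothing about the results reported in the paper.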


Related papers