Fugu-MT Paper Translation (Abstract): QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

Paper summary: QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation

arxiv url: http://arxiv.org/abs/2511.20100v1
Date: Tue, 25 Nov 2025 09:17:47 GMT
Status: Translation complete
Last updated in system: 2025-11-26 17:37:04.378237
Title: QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
Title (reference translation): QiMeng-Kernel: A Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
Authors: Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, Ling Li
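Abstract summary: Macro Thinking Micro Coding (MTMC) is a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. On KernelBench it achieves near 100% and 70% accuracy at Levels 1-2 and 3, respectively, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs and 2.2x over expert-optimized PyTorch Eager kernels.
Reference score (independently calculated attention level): 41.53673797546332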
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate the entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation codes. To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and intricate implementation details, enabling LLMs for high-performance GPU kernel generation. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves near 100% and 70% accuracy at Levels 1-2 and 3, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs, and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.
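Abstract (reference translation): Developing high-performance GPU kernels is important for AI and scientific computing, but it remains difficult because it depends on expert hand-crafting and suffers from poor portability. LLMs promise automation, yet both general-purpose and finetuned LLMs suffer from two fundamental, conflicting limitations: correctness and efficiency. The main reason is that existing LLM-based approaches generate the entire optimized low-level program directly, which requires exploring an extremely vast space that covers both optimization policies and implementation code. To address the challenge of exploring this intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking uses reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding uses general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding the errors of full-kernel generation. Together, they effectively navigate the vast optimization space and the intricate implementation details, enabling LLMs to generate high-performance GPU kernels. Comprehensive results on widely adopted benchmarks show the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves near 100% and 70% accuracy at Levels 1-2 and 3, respectively, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and 34x speedup.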
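
The division of labor described in the abstract can be pictured as a small propose, implement, verify loop. The following is a minimal sketch, not the paper's implementation. Every name in it (`optimize`, `toy_propose`, `toy_implement`, and so on) is hypothetical. The RL-trained Macro Thinking policy and the Micro Coding LLM are replaced by fixed toy stubs, the runtime multipliers are made-up numbers rather than measurements, and the rule of keeping a candidate only when it is both correct and faster is an assumption.

```python
from typing import Callable, List, Optional, Tuple

# Hypothetical interfaces; none of these come from the paper.
Propose = Callable[[str, List[str]], Optional[str]]  # macro: (kernel, history) -> next step
Implement = Callable[[str, str], str]                # micro: (kernel, step) -> candidate
Check = Callable[[str], bool]                        # correctness test
Measure = Callable[[str], float]                     # runtime in milliseconds


def optimize(kernel: str, propose: Propose, implement: Implement,
             check: Check, measure: Measure,
             max_steps: int = 8) -> Tuple[str, float, List[str]]:
    """Apply one optimization step at a time, keeping only verified gains."""
    history: List[str] = []
    best_ms = measure(kernel)
    for _ in range(max_steps):
        step = propose(kernel, history)          # high-level strategy
        if step is None:
            break
        history.append(step)
        candidate = implement(kernel, step)      # low-level, incremental edit
        if not check(candidate):
            continue                             # keep the last correct kernel
        candidate_ms = measure(candidate)
        if candidate_ms < best_ms:               # assumed acceptance rule
            kernel, best_ms = candidate, candidate_ms
    return kernel, best_ms, history


# Toy stand-ins so the loop can run without a GPU or an LLM.
STEPS = ["tile_loops", "use_shared_memory", "broken_rewrite", "vectorize_loads"]
GAIN = {"tile_loops": 0.8, "use_shared_memory": 0.5, "vectorize_loads": 0.9}  # made up


def toy_propose(kernel: str, history: List[str]) -> Optional[str]:
    return STEPS[len(history)] if len(history) < len(STEPS) else None


def toy_implement(kernel: str, step: str) -> str:
    return kernel + "\n# applied: " + step


def toy_check(kernel: str) -> bool:
    return "broken_rewrite" not in kernel        # simulate one faulty edit


def toy_measure(kernel: str) -> float:
    ms = 10.0                                    # made-up baseline runtime
    for step, factor in GAIN.items():
        if "# applied: " + step in kernel:
            ms *= factor
    return ms


final_kernel, final_ms, tried = optimize(
    "naive_kernel", toy_propose, toy_implement, toy_check, toy_measure)
print(final_kernel)
print(final_ms, tried)
```

In this toy run the faulty `broken_rewrite` step is rejected by the correctness check while the steps before and after it are kept, which is the intuition behind implementing proposals incrementally instead of regenerating a full kernel at once.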

Related papers