Fugu-MT Paper Translation (Abstract): Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

Paper overview: Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

arxiv url: http://arxiv.org/abs/2603.28342v1
Date: Mon, 30 Mar 2026 12:12:49 GMT
Status: Translation complete
Last updated in system: 2026-03-31 23:18:45.378254
Title: Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
Authors: He Du, Qiming Ge, Jiakai Hu, Aijun Yang, Zheng Cai, Zixian Huang, Sheng Yuan, Qinxiu Cheng, Xinchen Xie, Yicheng Chen, Yining Li, Jiaxing Xie, Huanan Dong, Yaguang Wu, Xiangjun Huang, Jian Yang, Hui Wang, Bowen Zhou, Bowen Li, Qipeng Guo, Kai Chen
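Abstract summary: Kernel-Smith is a framework for high-performance GPU kernel and operator generation. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them. On the training side, it converts long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals.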
Reference score (independently computed attention score): 48.656549870801285
License: http://creativecommons.org/licenses/by/4.0/
Abstract: We present Kernel-Smith, a framework for high-performance GPU kernel and operator generation that combines a stable evaluation-driven evolutionary agent with an evolution-oriented post-training recipe. On the agent side, Kernel-Smith maintains a population of executable candidates and iteratively improves them using an archive of top-performing and diverse programs together with structured execution feedback on compilation, correctness, and speedup. To make this search reliable, we build backend-specific evaluation services for Triton on NVIDIA GPUs and MACA on MetaX GPUs. On the training side, we convert long-horizon evolution trajectories into step-centric supervision and reinforcement learning signals by retaining correctness-preserving, high-gain revisions, so that the model is optimized as a strong local improver inside the evolutionary loop rather than as a one-shot generator. Under a unified evolutionary protocol, Kernel-Smith-235B-RL achieves state-of-the-art overall performance on KernelBench with the NVIDIA Triton backend, attaining the best average speedup ratio and outperforming frontier proprietary models including Gemini-3.0-pro and Claude-4.6-opus. We further validate the framework on the MetaX MACA backend, where our Kernel-Smith-MACA-30B surpasses large-scale counterparts such as DeepSeek-V3.2-think and Qwen3-235B-2507-think, highlighting potential for seamless adaptation across heterogeneous platforms. Beyond benchmark results, the same workflow produces upstream contributions to production systems including SGLang and LMDeploy, demonstrating that LLM-driven kernel optimization can transfer from controlled evaluation to practical deployment.
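The two mechanisms the abstract describes, an archive of top-performing and diverse candidates and the retention of correctness-preserving, high-gain revisions as training signal, can be illustrated with a small Python sketch. This is not the authors' implementation. The names `Feedback`, `Step`, `update_archive`, and `select_training_steps`, the string-keyed diversity buckets, and the `min_gain` threshold are all hypothetical choices made for illustration; the abstract does not say how diversity is measured or what counts as "high-gain".

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Feedback:
    """Structured execution feedback for one candidate (field names are assumptions)."""
    compiled: bool
    correct: bool
    speedup: float  # reference runtime divided by candidate runtime


@dataclass(frozen=True)
class Step:
    """One revision in an evolution trajectory: parent result and child result."""
    parent: Feedback
    child: Feedback


def is_valid(fb: Feedback) -> bool:
    """A candidate is usable only if it both compiles and passes correctness checks."""
    return fb.compiled and fb.correct


def update_archive(
    archive: Dict[str, Tuple[str, Feedback]],
    key: str,
    source: str,
    fb: Feedback,
) -> bool:
    """Keep the fastest valid candidate per diversity bucket.

    Bucketing by a string key is an assumed stand-in for whatever
    diversity measure the real system uses.
    """
    if not is_valid(fb):
        return False
    current = archive.get(key)
    if current is None or fb.speedup > current[1].speedup:
        archive[key] = (source, fb)
        return True
    return False


def select_training_steps(trajectory: List[Step], min_gain: float = 1.05) -> List[Step]:
    """Retain revisions where parent and child are both valid and the child
    is at least min_gain times faster than the parent.

    min_gain is a hypothetical knob, not a value from the paper.
    """
    kept = []
    for step in trajectory:
        # Skip steps with an invalid parent or child: correctness must be preserved.
        if not (is_valid(step.parent) and is_valid(step.child)):
            continue
        if step.child.speedup >= min_gain * step.parent.speedup:
            kept.append(step)
    return kept
```

In this sketch, a revision that raises speedup from 1.0 to 1.5 while staying correct would be retained, whereas a faster but incorrect revision, or one with a gain below the threshold, would be dropped.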
