Fugu-MT Paper Translation (Abstract): Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

Paper summary: Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations

arxiv url: http://arxiv.org/abs/2602.05885v1
Date: Thu, 05 Feb 2026 17:01:09 GMT
Status: Translation complete
Last updated in system: 2026-02-06 18:49:09.06789
Title: Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
Authors: Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He
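Abstract summary: We study reinforcement learning (RL) for kernel generation. We propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. We incorporate Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome lazy optimization.
Reference score (attention score computed independently by this site): 32.98036846113632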
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization: models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The trained model, Dr.Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr.Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
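The abstract attributes the bias in GRPO to self-inclusion: each sample's own reward is part of the baseline it is compared against. The following is a minimal sketch of that idea, not the paper's implementation. The function names, the per-turn grouping in `turn_level_loo`, and the omission of GRPO's standard-deviation normalization are all assumptions made for illustration.

```python
from typing import Dict, List


def mean_baseline_advantages(rewards: List[float]) -> List[float]:
    """Group-mean baseline (GRPO-style, std normalization omitted).

    Each sample's own reward is included in its baseline.
    """
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def loo_advantages(rewards: List[float]) -> List[float]:
    """Leave-one-out baseline: sample i is compared with the mean of the others."""
    n = len(rewards)
    if n < 2:
        # With a single sample there is nothing to compare against.
        return [0.0 for _ in rewards]
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


def turn_level_loo(returns_by_turn: Dict[int, List[float]]) -> Dict[int, List[float]]:
    """Hypothetical turn-level variant: apply the leave-one-out baseline
    separately to the group of samples available at each turn."""
    return {turn: loo_advantages(rets) for turn, rets in returns_by_turn.items()}


# Illustrative rewards for one group of four samples (made-up numbers).
group = [1.0, 0.0, 0.0, 0.0]
with_self = mean_baseline_advantages(group)
without_self = loo_advantages(group)
```

For a group of size G, the group-mean advantage equals the leave-one-out advantage scaled by (G - 1) / G, so including the sample itself shrinks every advantage, most noticeably when groups are small.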
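The two headline numbers in the abstract (the share of kernels with at least a 1.2x speedup, and the same share when the best candidate across all turns is selected) can be illustrated with a small sketch. The data layout, the function names, and the rule that an incorrect kernel never counts as a hit are assumptions; the abstract does not specify how the benchmark computes these rates.

```python
from typing import List, Tuple

# One turn of one task: (kernel is correct, speedup over the reference).
Turn = Tuple[bool, float]


def meets(turn: Turn, threshold: float) -> bool:
    """A turn counts only if the kernel is correct and fast enough (assumed rule)."""
    correct, speedup = turn
    return correct and speedup >= threshold


def last_turn_rate(tasks: List[List[Turn]], threshold: float = 1.2) -> float:
    """Fraction of tasks whose final-turn kernel reaches the threshold."""
    hits = sum(1 for turns in tasks if turns and meets(turns[-1], threshold))
    return hits / len(tasks)


def best_of_turns_rate(tasks: List[List[Turn]], threshold: float = 1.2) -> float:
    """Fraction of tasks where any turn's kernel reaches the threshold."""
    hits = sum(1 for turns in tasks if any(meets(t, threshold) for t in turns))
    return hits / len(tasks)


# Made-up results for four tasks with two turns each.
tasks = [
    [(True, 1.0), (True, 1.3)],    # improves on the last turn
    [(True, 1.5), (True, 1.1)],    # regresses on the last turn
    [(False, 2.0), (True, 0.9)],   # fast but incorrect, then correct but slow
    [(True, 1.25), (False, 3.0)],  # last turn is incorrect
]
last = last_turn_rate(tasks)
best = best_of_turns_rate(tasks)
```

Selecting the best candidate across turns can only raise the rate relative to taking the last turn, because a later turn may regress or break a kernel that an earlier turn got right.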


Related papers