Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
- URL: http://arxiv.org/abs/2602.05885v2
- Date: Fri, 06 Feb 2026 07:05:49 GMT
- Title: Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations
- Authors: Wei Liu, Jiawei Xu, Yingru Li, Longtao Zheng, Tianjian Li, Qian Liu, Junxian He
- Abstract summary: We study reinforcement learning (RL) for kernel generation. We propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation. We introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome lazy optimization.
- Score: 32.98036846113632
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: High-quality kernels are critical for scalable AI systems, and enabling LLMs to generate such code would advance AI development. However, training LLMs for this task requires sufficient data and a robust environment, and the process is often vulnerable to reward hacking and lazy optimization. In these cases, models may hack training rewards and prioritize trivial correctness over meaningful speedup. In this paper, we systematically study reinforcement learning (RL) for kernel generation. We first design KernelGYM, a robust distributed GPU environment that supports reward-hacking checks, data collection from multi-turn interactions, and long-term RL training. Building on KernelGYM, we investigate effective multi-turn RL methods and identify a biased policy gradient issue caused by self-inclusion in GRPO. To solve this, we propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation for multi-turn RL. To alleviate lazy optimization, we incorporate mismatch correction for training stability and introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS). The trained model, Dr Kernel-14B, reaches performance competitive with Claude-4.5-Sonnet on KernelBench. Finally, we study sequential test-time scaling for Dr Kernel-14B. On the KernelBench Level-2 subset, 31.6% of the generated kernels achieve at least a 1.2x speedup over the Torch reference, surpassing Claude-4.5-Sonnet (26.7%) and GPT-5 (28.6%). When selecting the best candidate across all turns, this 1.2x speedup rate further increases to 47.8%. All resources, including the environment, training code, models, and dataset, are available at https://www.github.com/hkust-nlp/KernelGYM.
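The self-inclusion issue mentioned in the abstract can be illustrated with a small sketch. This is not the paper's TRLOO implementation: the abstract does not give its formula, so the code below only contrasts a plain group-mean baseline (the sample's own reward is part of its baseline) with the standard leave-one-out baseline (the baseline uses only the other samples). The function names and the toy rewards are hypothetical, and the standard-deviation normalization used in GRPO is omitted for simplicity.

```python
from typing import List


def group_mean_advantages(rewards: List[float]) -> List[float]:
    """Advantage with a group-mean baseline (the sample is included in its own baseline)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def leave_one_out_advantages(rewards: List[float]) -> List[float]:
    """Advantage with a leave-one-out baseline (mean reward of the *other* samples)."""
    n = len(rewards)
    if n < 2:
        raise ValueError("leave-one-out needs at least two samples")
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]


# Hypothetical rewards for four sampled responses to one prompt.
rewards = [1.0, 0.0, 0.0, 0.0]
print(group_mean_advantages(rewards))
print(leave_one_out_advantages(rewards))
```

With a mean baseline, each self-inclusive advantage equals the leave-one-out advantage scaled by (n - 1) / n, so the two differ most for small groups.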
Related papers
- Fine-Tuning GPT-5 for GPU Kernel Generation [5.109141377873154]
We present Makora's environment and tools for reinforcement learning finetuning of frontier models. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0%. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite.
arXiv Detail & Related papers (2026-02-11T16:22:54Z) - CVeDRL: An Efficient Code Verifier via Difficulty-aware Reinforcement Learning [57.24524263804788]
Code verifiers play a critical role in post-verification for LLM-based code generation. Existing supervised fine-tuning methods suffer from data scarcity, high failure rates, and poor inference efficiency. We show that naive RL with only functionality rewards fails to generate effective unit tests for difficult branches and samples.
arXiv Detail & Related papers (2026-01-30T10:33:29Z) - RLoop: An Self-Improving Framework for Reinforcement Learning with Iterative Policy Initialization [65.23034604711489]
We introduce RLoop, a self-improving framework for training large reasoning models. RLoop transforms the standard training process into a virtuous cycle: it first uses RL to explore the solution space from a given policy, then filters the successful trajectories to create an expert dataset. Our experiments show RLoop mitigates forgetting and substantially improves generalization, boosting average accuracy by 9% and pass@32 by over 15% compared to vanilla RL.
arXiv Detail & Related papers (2025-11-06T11:27:16Z) - ConCuR: Conciseness Makes State-of-the-Art Kernel Generation [5.010229074860956]
A key challenge for kernel generation is the scarcity of high-quality data. We develop a pipeline that generates and curates high-quality kernels with reasoning traces. We show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks.
arXiv Detail & Related papers (2025-10-08T15:41:15Z) - Kevin: Multi-Turn RL for Generating CUDA Kernels [0.0]
We develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings. In our evaluation setup, Kevin shows significant gains over its base model. We also study its behavior across test-time scaling axes.
arXiv Detail & Related papers (2025-07-16T06:33:07Z) - DiffuCoder: Understanding and Improving Masked Diffusion Models for Code Generation [68.19756761027351]
Diffusion large language models (dLLMs) are compelling alternatives to autoregressive (AR) models. We investigate their denoising processes and reinforcement learning methods. Our work provides deeper insight into the machinery of dLLM generation and offers an effective, diffusion-native RL training framework.
arXiv Detail & Related papers (2025-06-25T17:35:47Z) - Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z) - R1-Code-Interpreter: LLMs Reason with Code via Supervised and Multi-stage Reinforcement Learning [23.795932850992816]
We present R1-Code-Interpreter, an extension of text-only Large Language Models (LLMs) trained via multi-turn supervised fine-tuning (SFT) and reinforcement learning (RL). We show that training a general-purpose Code Interpreter across 144 diverse reasoning and planning tasks presents significant challenges due to task heterogeneity and scarcity of effective samples. Our final model, R1-CI-14B, improves average accuracy on the 37 test tasks from 44.1% to 72.4%, outperforming text-only GPT-4o (58.6%) and GPT-4o with Code Interpreter (70.9%).
arXiv Detail & Related papers (2025-05-27T18:47:33Z) - Kernel Ridge Regression for Efficient Learning of High-Capacity Hopfield Networks [0.0]
We propose Kernel Ridge Regression (KRR) as an efficient kernel-based alternative for learning high-capacity Hopfield networks. KRR utilizes the kernel trick and predicts bipolar states via regression, crucially offering a non-iterative, closed-form solution for learning dual variables. Our results demonstrate that KRR achieves state-of-the-art storage capacity (reaching a storage load of 1.5) and noise robustness, comparable to KLR.
arXiv Detail & Related papers (2025-04-17T01:17:28Z) - Liger Kernel: Efficient Triton Kernels for LLM Training [6.373771349397682]
Training Large Language Models (LLMs) efficiently at scale presents a formidable challenge, driven by their ever-increasing computational demands. We introduce Liger-Kernel, an open-sourced set of Triton kernels developed specifically for LLM training. With kernel optimization techniques like kernel operation fusing and input chunking, our kernels achieve on average a 20% increase in training throughput and a 60% reduction in GPU memory usage.
arXiv Detail & Related papers (2024-10-14T18:17:01Z) - DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning [61.10299147201369]
This paper introduces a novel autonomous RL approach, called DigiRL, for training in-the-wild device control agents.
We build a scalable and parallelizable Android learning environment equipped with a VLM-based evaluator.
We demonstrate the effectiveness of DigiRL using the Android-in-the-Wild dataset, where our 1.3B VLM trained with RL achieves a 49.5% absolute improvement.
arXiv Detail & Related papers (2024-06-14T17:49:55Z) - Learning to Optimize for Reinforcement Learning [58.01132862590378]
Reinforcement learning (RL) is essentially different from supervised learning, and in practice, learned optimizers do not work well even in simple RL tasks.
The agent-gradient distribution is non-independent and identically distributed, leading to inefficient meta-training.
We show that, although only trained in toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.
arXiv Detail & Related papers (2023-02-03T00:11:02Z) - PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.