Kevin: Multi-Turn RL for Generating CUDA Kernels
- URL: http://arxiv.org/abs/2507.11948v1
- Date: Wed, 16 Jul 2025 06:33:07 GMT
- Title: Kevin: Multi-Turn RL for Generating CUDA Kernels
- Authors: Carlo Baronio, Pietro Marsella, Ben Pan, Simon Guo, Silas Alberti,
- Abstract summary: We develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings. In our evaluation setup, Kevin shows significant gains over its base model. We also study its behavior across test-time scaling axes.
- Score: 0.0
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Writing GPU kernels is a challenging task and critical for AI systems' efficiency. It is also highly iterative: domain experts write code and improve performance through execution feedback. Moreover, it presents verifiable rewards like correctness and speedup, making it a natural environment to apply Reinforcement Learning (RL). To explicitly incorporate the iterative nature of this process into training, we develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings, such as learning from long trajectories and effective reward attribution across turns. We present Kevin - K(ernel D)evin, the first model trained with multi-turn RL for CUDA kernel generation and optimization. In our evaluation setup, Kevin shows significant gains over its base model (QwQ-32B), improving correctness of generated kernels (in pure CUDA) from 56% to 82% and mean speedup from 0.53x to 1.10x of baseline (PyTorch Eager), and surpassing frontier models like o4-mini (0.78x). Finally, we study its behavior across test-time scaling axes: we found scaling serial refinement more beneficial than parallel sampling. In particular, when given more refinement turns, Kevin shows a higher rate of improvement.
Related papers
- CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components. CUDA Agent delivers 100%, 100%, and 92% faster rate over torch.compile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z) - Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations [32.98036846113632]
We study reinforcement learning (RL) for kernel generation. We propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation. We introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome the issue.
arXiv Detail & Related papers (2026-02-05T17:01:09Z) - CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization [36.794824560677064]
CudaForge is a training-free multi-agent workflow for kernel generation and optimization. By leveraging base models like OpenAI-o3, CudaForge achieves 97.6% correctness of generated kernels and an average 1.68x speedup.
arXiv Detail & Related papers (2025-10-23T22:52:00Z) - EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models [27.430839306140157]
Large Language Models (LLMs) show promise for automating kernel optimization. General-purpose LLM code evolution methods cannot meet the strict correctness requirements of kernel optimization. EvoEngineer provides guidance for designing and adapting optimization strategies to achieve a balance between performance and correctness. Our method achieves a maximum speedup of 36.75x among all operations over PyTorch kernels and delivers the highest speedup on 28 (56.0%) of 50 operations that achieve over 2x acceleration.
arXiv Detail & Related papers (2025-10-04T10:00:25Z) - Kernel Ridge Regression for Efficient Learning of High-Capacity Hopfield Networks [0.0]
We propose Kernel Ridge Regression (KRR) as an efficient kernel-based alternative for learning high-capacity Hopfield networks. KRR utilizes the kernel trick and predicts bipolar states via regression, crucially offering a non-iterative, closed-form solution for learning dual variables. Our results demonstrate that KRR achieves state-of-the-art storage capacity (reaching $\beta$ = 1.5) and noise robustness, comparable to KLR.
arXiv Detail & Related papers (2025-04-17T01:17:28Z) - Hyperspherical Normalization for Scalable Deep Reinforcement Learning [57.016639036237315]
SimbaV2 is a novel reinforcement learning architecture designed to stabilize optimization. It scales up effectively with larger models and greater compute, achieving state-of-the-art performance on 57 continuous control tasks.
arXiv Detail & Related papers (2025-02-21T08:17:24Z) - Stochastic Kernel Regularisation Improves Generalisation in Deep Kernel Machines [23.09717258810923]
Recent work developed convolutional deep kernel machines, achieving 92.7% test accuracy on CIFAR-10.
We introduce several modifications to improve the convolutional deep kernel machine's generalisation.
The resulting model achieves 94.5% test accuracy on CIFAR-10.
arXiv Detail & Related papers (2024-10-08T16:15:53Z) - KernelWarehouse: Rethinking the Design of Dynamic Convolution [16.101179962553385]
KernelWarehouse redefines the basic concepts of "kernels", "assembling kernels" and "attention function".
We verify the effectiveness of KernelWarehouse on ImageNet and MS-COCO datasets using various ConvNet architectures.
arXiv Detail & Related papers (2024-06-12T05:16:26Z) - InceptionNeXt: When Inception Meets ConvNeXt [147.50287103414115]
We build a series of networks, namely InceptionNeXt, which not only enjoy high throughput but also maintain competitive performance. InceptionNeXt achieves 1.6x higher training throughput than ConvNeXt-T and attains a 0.2% top-1 accuracy improvement on ImageNet-1K.
arXiv Detail & Related papers (2023-03-29T17:59:58Z) - Scaling Up 3D Kernels with Bayesian Frequency Re-parameterization for Medical Image Segmentation [25.62587471067468]
RepUX-Net is a pure CNN architecture with a simple large kernel block design.
Inspired by the spatial frequency in the human visual system, we extend the design so that kernel convergence varies in an element-wise setting.
arXiv Detail & Related papers (2023-03-10T08:38:34Z) - Learning to Optimize for Reinforcement Learning [58.01132862590378]
Reinforcement learning (RL) is essentially different from supervised learning, and in practice, learned optimizers do not work well even in simple RL tasks.
The agent-gradient distribution is not independent and identically distributed, leading to inefficient meta-training.
We show that, although only trained in toy tasks, our learned optimizer can generalize to unseen complex tasks in Brax.
arXiv Detail & Related papers (2023-02-03T00:11:02Z) - Accelerating Training and Inference of Graph Neural Networks with Fast Sampling and Pipelining [58.10436813430554]
Mini-batch training of graph neural networks (GNNs) requires a lot of computation and data movement.
We argue in favor of performing mini-batch training with neighborhood sampling in a distributed multi-GPU environment.
We present a sequence of improvements to mitigate these bottlenecks, including a performance-engineered neighborhood sampler.
We also conduct an empirical analysis that supports the use of sampling for inference, showing that test accuracies are not materially compromised.
arXiv Detail & Related papers (2021-10-16T02:41:35Z) - Integrating Circle Kernels into Convolutional Neural Networks [30.950819638148104]
The square kernel is a standard unit for contemporary Convolutional Neural Networks (CNNs).
We propose using circle kernels with isotropic receptive fields for the convolution.
Our training takes an approximately equivalent amount of computation compared with the corresponding CNN with square kernels.
arXiv Detail & Related papers (2021-07-06T07:59:36Z) - Kernel Based Progressive Distillation for Adder Neural Networks [71.731127378807]
Adder Neural Networks (ANNs), which contain only additions, offer a new way of developing deep neural networks with low energy consumption.
There is an accuracy drop when replacing all convolution filters by adder filters.
We present a novel method for further improving the performance of ANNs without increasing the trainable parameters.
arXiv Detail & Related papers (2020-09-28T03:29:19Z) - Learning to Prune Deep Neural Networks via Reinforcement Learning [64.85939668308966]
PuRL is a deep reinforcement learning based algorithm for pruning neural networks.
It achieves sparsity and accuracy comparable to current state-of-the-art methods.
arXiv Detail & Related papers (2020-07-09T13:06:07Z) - PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)