CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
- URL: http://arxiv.org/abs/2602.24286v1
- Date: Fri, 27 Feb 2026 18:58:05 GMT
- Title: CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation
- Authors: Weinan Dai, Hanlin Wu, Qiying Yu, Huan-ang Gao, Jiahao Li, Chengquan Jiang, Weiqiang Lou, Yufan Song, Hongli Yu, Jiaze Chen, Wei-Ying Ma, Ya-Qin Zhang, Jingjing Liu, Mingxuan Wang, Xin Liu, Hao Zhou
- Abstract summary: CUDA Agent is a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components. It generates kernels faster than torch.compile on 100%, 100%, and 92% of the KernelBench Level-1, Level-2, and Level-3 splits.
- Score: 51.72529978689561
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: GPU kernel optimization is fundamental to modern deep learning but remains a highly specialized task requiring deep hardware expertise. Despite strong performance in general programming, large language models (LLMs) remain uncompetitive with compiler-based systems such as torch.compile for CUDA kernel generation. Existing CUDA code generation approaches either rely on training-free refinement or fine-tune models within fixed multi-turn execution-feedback loops, but both paradigms fail to fundamentally improve the model's intrinsic CUDA optimization ability, resulting in limited performance gains. We present CUDA Agent, a large-scale agentic reinforcement learning system that develops CUDA kernel expertise through three components: a scalable data synthesis pipeline, a skill-augmented CUDA development environment with automated verification and profiling to provide reliable reward signals, and reinforcement learning algorithmic techniques enabling stable training. CUDA Agent achieves state-of-the-art results on KernelBench, generating kernels faster than torch.compile on 100%, 100%, and 92% of the Level-1, Level-2, and Level-3 splits, and outperforming the strongest proprietary models such as Claude Opus 4.5 and Gemini 3 Pro by about 40% on the hardest Level-3 setting.
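The reward recipe the abstract describes (automated verification plus profiling) is concrete enough to sketch. The snippet below is a minimal illustration only, assuming candidate and reference kernels are exposed as Python callables on CUDA tensors; names such as kernel_reward, the tolerance, and the iteration counts are illustrative assumptions, not the paper's actual harness.

```python
# Minimal sketch of a verification-then-profiling reward. All names and
# thresholds here are illustrative assumptions, not from the paper.
import torch

def kernel_reward(candidate, reference, inputs,
                  atol=1e-4, n_warmup=5, n_iters=20):
    """0.0 if the kernel fails or is wrong; otherwise speedup vs. reference."""
    try:
        out = candidate(*inputs)
    except Exception:
        return 0.0                       # build/runtime failure: no reward
    if not torch.allclose(out, reference(*inputs), atol=atol):
        return 0.0                       # numerical mismatch: no reward

    def time_ms(fn):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        for _ in range(n_warmup):        # warm-up to exclude JIT/cache effects
            fn(*inputs)
        torch.cuda.synchronize()
        start.record()
        for _ in range(n_iters):
            fn(*inputs)
        end.record()
        torch.cuda.synchronize()
        return start.elapsed_time(end) / n_iters   # device-side milliseconds

    return time_ms(reference) / time_ms(candidate)  # >1.0 means faster
```

Since the abstract's baseline is torch.compile, the reference here could be a compiled module (e.g. `torch.compile(ref_module)`), so a reward above 1.0 corresponds to the "faster than torch.compile" criterion the benchmark numbers report.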
Related papers
- StitchCUDA: An Automated Multi-Agents End-to-End GPU Programming Framework with Rubric-based Agentic Reinforcement Learning [26.264303471292845]
We propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation. Experiments show that StitchCUDA achieves nearly a 100% success rate on end-to-end programming tasks, with a 1.72x better speedup than the multi-agent baseline.
arXiv Detail & Related papers (2026-03-03T06:04:49Z)
- K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model [57.440609834690385]
Existing approaches treat Large Language Models (LLMs) as rapid code generators within evolutionary loops. We propose search via a co-evolving world model and build K-Search based on this method. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels.
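The evolutionary-loop paradigm this summary describes is easy to picture in a few lines. The sketch below shows only that baseline pattern (LLM as mutation operator, hardware measurement as selection); llm_rewrite and measure_speedup are hypothetical stand-ins, and none of this is K-Search's actual implementation, whose stated contribution is the co-evolving world model layered on top.

```python
# Hedged sketch of an LLM-in-the-loop evolutionary search over kernel source.
# llm_rewrite(src) -> str and measure_speedup(src) -> float are hypothetical
# callables standing in for model sampling and hardware profiling.
import random

def evolve_kernel(seed_src, llm_rewrite, measure_speedup,
                  pop_size=8, generations=10):
    population = [seed_src]
    for _ in range(generations):
        # Mutation: ask the LLM for variants of randomly chosen parents
        children = [llm_rewrite(random.choice(population))
                    for _ in range(pop_size)]
        # Selection: keep the fastest candidates, scored on real hardware
        population = sorted(set(population + children),
                            key=measure_speedup, reverse=True)[:pop_size]
    return population[0]
```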
arXiv Detail & Related papers (2026-02-22T11:06:22Z)
- KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning [3.4998382481249286]
We release KernelBlaster as an open-source agentic framework, accompanied by a test harness, verification components, and a reproducible evaluation. Our method achieves mean speed-ups of 1.43x, 2.50x, and 1.50x on KernelBench Levels 1, 2, and 3, respectively.
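The "memory-augmented in-context" idea in the title can be sketched as an episode store that conditions the next prompt on similar past tasks. Everything below (the Episode fields, the token-overlap similarity) is an illustrative assumption rather than KernelBlaster's actual design.

```python
# Toy sketch of memory-augmented in-context optimization: past episodes are
# retrieved by similarity and prepended to the next generation prompt.
from dataclasses import dataclass, field

@dataclass
class Episode:
    task_desc: str
    kernel_src: str
    speedup: float

@dataclass
class KernelMemory:
    episodes: list = field(default_factory=list)

    def add(self, ep):
        self.episodes.append(ep)

    def retrieve(self, task_desc, k=3):
        # Token overlap as a stand-in for a real embedding-based retriever
        q = set(task_desc.split())
        ranked = sorted(self.episodes,
                        key=lambda e: len(q & set(e.task_desc.split())),
                        reverse=True)
        return ranked[:k]

def build_prompt(task_desc, memory):
    shots = "\n\n".join(
        f"# Past task: {e.task_desc} (speedup {e.speedup:.2f}x)\n{e.kernel_src}"
        for e in memory.retrieve(task_desc))
    return f"{shots}\n\n# New task: {task_desc}\n# Write an optimized CUDA kernel:"
```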
arXiv Detail & Related papers (2026-02-15T19:48:43Z)
- DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels [17.979042914049842]
Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs. CuKe is an augmented dataset of high-performance kernels. DICE is a series of diffusion large language models designed for kernel generation.
arXiv Detail & Related papers (2026-02-12T08:45:13Z)
- Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
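The "Kernel Launch Overhead" tax is easy to make visible in isolation. The sketch below times many tiny launches against one arithmetically equivalent fused launch; the shapes and the op choice are illustrative, not the paper's workload.

```python
# Illustrative measurement of launch overhead: 100 tiny elementwise kernels
# vs. one fused, arithmetically equivalent launch. Requires a CUDA device.
import time
import torch

x = torch.randn(1024, 1024, device="cuda")

def many_small_launches(t):
    for _ in range(100):
        t = t + 1.0          # each add is a separate kernel launch
    return t

def one_fused_launch(t):
    return t + 100.0         # same result, a single launch

for fn in (many_small_launches, one_fused_launch):
    fn(x)                    # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    fn(x)
    torch.cuda.synchronize()
    print(f"{fn.__name__}: {(time.perf_counter() - t0) * 1e3:.3f} ms")
```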
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
- CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization [36.794824560677064]
CudaForge is a training-free multi-agent workflow for kernel generation and optimization. By leveraging base models like OpenAI-o3, CudaForge achieves 97.6% correctness on generated kernels and an average 1.68x speedup.
arXiv Detail & Related papers (2025-10-23T22:52:00Z)
- CUDA-L1: Improving CUDA Optimization via Contrastive Reinforcement Learning [14.142680357918328]
We introduce an automated learning framework for CUDA optimization that employs a novel contrastive RL algorithm. CUDA-L1 achieves significant performance improvements on the CUDA optimization task.
arXiv Detail & Related papers (2025-07-18T17:43:56Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion (a conceptual sketch follows this summary).
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
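The two-step decomposition described above maps naturally onto code: a small computational core plus a declarative loop nest around it. The NumPy sketch below mirrors the idea only; the actual TPP work targets JIT-generated native microkernels, and every name here is illustrative.

```python
# Conceptual sketch of the TPP-style split: (1) a block "primitive" doing the
# math, (2) a declaratively specified loop nest driving it. NumPy stands in
# for a real JIT-compiled microkernel.
import itertools
import numpy as np

B = 16                                   # block (microkernel) size
M, N, K = 4, 4, 4                        # logical grid of blocks

def gemm_primitive(a_blk, b_blk, c_blk):
    c_blk += a_blk @ b_blk               # step 1: the computational core

def run_loops(extents, body):
    # Step 2: the loop nest is data (extents), not hand-written control flow
    for idx in itertools.product(*(range(n) for n in extents)):
        body(*idx)

A = np.random.rand(M * B, K * B)
Bmat = np.random.rand(K * B, N * B)
C = np.zeros((M * B, N * B))

run_loops((M, N, K), lambda i, j, k: gemm_primitive(
    A[i*B:(i+1)*B, k*B:(k+1)*B],
    Bmat[k*B:(k+1)*B, j*B:(j+1)*B],
    C[i*B:(i+1)*B, j*B:(j+1)*B]))

assert np.allclose(C, A @ Bmat)          # blocked result matches plain GEMM
```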
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- CUDA: Convolution-based Unlearnable Datasets [77.70422525613084]
Large-scale training of modern deep learning models heavily relies on publicly available data on the web.
Recent works aim to make data unlearnable for deep learning models by adding small, specially designed noises.
These methods are vulnerable to adversarial training (AT) and/or are computationally heavy.
arXiv Detail & Related papers (2023-03-07T22:57:23Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We propose a hybrid approach to the development of deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)