CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
- URL: http://arxiv.org/abs/2511.01884v2
- Date: Wed, 05 Nov 2025 02:10:35 GMT
- Title: CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization
- Authors: Zijian Zhang, Rong Wang, Shiyang Li, Yuebo Luo, Mingyi Hong, Caiwen Ding
- Abstract summary: CudaForge is a training-free multi-agent workflow for kernel generation and optimization. By leveraging base models like OpenAI-o3, CudaForge achieves 97.6% correctness of generated kernels and an average 1.68× speedup.
- Score: 36.794824560677064
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Developing efficient CUDA kernels is increasingly critical for AI applications such as large-scale LLM training. However, manual kernel design is both costly and time-consuming, motivating automatic approaches that leverage LLMs for code generation. Existing methods for automatic kernel generation, however, often produce low-efficiency kernels, incur high computational overhead, and fail to generalize across settings. In this work, we propose CudaForge, a training-free multi-agent workflow for CUDA kernel generation and optimization. Our workflow is inspired by the iterative workflow of human experts, which contains steps such as developing initial kernels, testing correctness, analyzing hardware feedback, and iterative improvement. More specifically, CudaForge employs two LLM agents, a Coder and a Judge, that iteratively generate, correct, and optimize CUDA kernels while integrating hardware feedback such as Nsight Compute (NCU) metrics. In extensive evaluations, we show that CudaForge, by leveraging base models like OpenAI-o3, achieves 97.6% correctness of generated kernels and an average 1.68× speedup over PyTorch baselines, substantially surpassing state-of-the-art models including OpenAI-o3 and Kevin on KernelBench. Beyond accuracy and speed, CudaForge demonstrates strong generalization across GPUs (A100, RTX 6000, 4090, 3090) and base models (OpenAI-o3, GPT-5, gpt-oss-120B, Claude-Sonnet-4, QwQ-32B), while maintaining high efficiency. In particular, generating an optimized kernel takes about 26.5 minutes on one RTX 6000 and incurs about $0.3 API cost, which is significantly cheaper than existing agentic work that costs 6 H100 hours and $5 API cost per kernel. Our results highlight that multi-agent, training-free workflows can enable cost-effective, generalizable, and high-performance CUDA kernel optimization. Code available at https://github.com/OptimAI-Lab/CudaForge
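The Coder/Judge loop described in the abstract can be pictured with a minimal Python sketch. This is not the authors' implementation: the function `optimize_kernel`, the `Attempt` record, the callables `coder`, `judge`, `check`, and `profile`, and the `runtime_ms` metric key are all hypothetical names chosen for illustration. The toy stand-ins at the bottom replace the LLM calls, the correctness test, and the hardware profiler so the sketch runs without a GPU.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class Attempt:
    source: str        # candidate kernel source text
    runtime_ms: float  # measured runtime of a kernel that passed the check


def optimize_kernel(
    task: str,
    coder: Callable[[str, str], str],                     # (task, feedback) -> kernel source
    judge: Callable[[str, bool, Dict[str, float]], str],  # (source, correct, metrics) -> feedback
    check: Callable[[str], bool],                         # correctness test (hypothetical)
    profile: Callable[[str], Dict[str, float]],           # hardware metrics (hypothetical)
    max_rounds: int = 10,
) -> Optional[Attempt]:
    """Alternate Coder and Judge calls, keeping the fastest correct kernel."""
    best: Optional[Attempt] = None
    feedback = ""  # no feedback exists before the first attempt
    for _ in range(max_rounds):
        source = coder(task, feedback)
        correct = check(source)
        # Only kernels that pass the correctness check are profiled.
        metrics = profile(source) if correct else {}
        if correct:
            attempt = Attempt(source, metrics["runtime_ms"])
            if best is None or attempt.runtime_ms < best.runtime_ms:
                best = attempt
        # The Judge sees the outcome and returns guidance for the next round.
        feedback = judge(source, correct, metrics)
    return best


# Toy stand-ins (assumptions for illustration only).
versions = iter(["v0_buggy", "v1_naive", "v2_tiled"])
runtimes = {"v1_naive": 4.0, "v2_tiled": 2.5}

result = optimize_kernel(
    task="matmul",
    coder=lambda task, fb: next(versions),
    judge=lambda src, ok, m: "reduce memory stalls" if ok else "fix the bug",
    check=lambda src: src in runtimes,
    profile=lambda src: {"runtime_ms": runtimes[src]},
    max_rounds=3,
)
print(result.source, result.runtime_ms)  # v2_tiled 2.5
```

In this sketch the Judge's feedback is a plain string passed to the next Coder call. How the real system structures that feedback, and which NCU metrics it passes along, is not specified in the abstract.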
Related papers
- StitchCUDA: An Automated Multi-Agents End-to-End GPU Programing Framework with Rubric-based Agentic Reinforcement Learning [26.264303471292845]
We propose StitchCUDA, a multi-agent framework for end-to-end GPU program generation. Experiments show that StitchCUDA achieves a nearly 100% success rate on end-to-end programming tasks, with 1.72x better speedup than the multi-agent baseline.
arXiv Detail & Related papers (2026-03-03T06:04:49Z)
- CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components. CUDA Agent delivers 100%, 100%, and 92% faster rate over torch.compile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z) - GPU Kernel Optimization Beyond Full Builds: An LLM Framework with Minimal Executable Programs [5.25288153386589]
Large language model methods assume kernels can be compiled and executed cheaply tuning.<n>We present an end-to-end LLM framework with performance feedback that optimize kernels without building the full application.<n>The framework integrates Automatic Error Repair and Performance Pattern Inheritance to fix faults, preserve correctness, reuse effective tiling/memory/synchronization strategies, and reduce search cost.
arXiv Detail & Related papers (2025-12-15T07:20:15Z) - EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models [27.430839306140157]
Large Language Models (LLMs) for automating kernel optimization promise promise.<n>General-purpose LLM code evolution methods cannot meet strict correctness requirements of kernel optimization.<n>EvoEngineer provides guidance for designing and adapting optimization strategies to achieve a balance between performance and correctness.<n>Our method achieves a maximum speedup of textbf36.75$times among all operations over PyTorch kernels and delivers the highest speedup on textbf28 (textbf56.0%) of 50 operations that achieve over textbf2times$ acceleration.
arXiv Detail & Related papers (2025-10-04T10:00:25Z) - Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization [25.135006275638172]
We introduce robust-kbench, a new benchmark for rigorous evaluation of kernel performance and correctness across varied scenarios.<n>We also present a comprehensive agentic framework that translates torch code to kernels and iteratively improve their runtime setting.<n>Our approach produces kernels outperforming torch implementations for practical applications, including forward and backward passes.
arXiv Detail & Related papers (2025-09-16T11:08:30Z) - Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks [11.253534066141668]
It is imperative to automate low-level kernel development to meet performance and productivity demands.<n>Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPU.<n>We present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels)
arXiv Detail & Related papers (2025-07-31T02:26:58Z) - Kevin: Multi-Turn RL for Generating CUDA Kernels [0.0]
We develop a flexible multi-turn RL recipe that addresses unique challenges encountered in real-world settings.<n>In our evaluation setup, Kevin shows significant gains over its base model.<n>We also study its behavior across test-time scaling axes.
arXiv Detail & Related papers (2025-07-16T06:33:07Z) - KernelBench: Can LLMs Write Efficient GPU Kernels? [36.4117525096377]
KernelBench is an open-source framework for evaluating language models' ability to write fast and correct kernels.<n>We introduce a new evaluation metric fast_p, which measures the percentage of generated kernels that are functionally correct.<n>Our experiments show that frontier reasoning models perform the best out of the box but still fall short overall.
arXiv Detail & Related papers (2025-02-14T19:30:53Z) - Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent [48.791943145735]
We show the potential to reduce Ansor's search time while enhancing kernel quality.
We apply this approach to the first 300 kernels that Ansor generates.
This result has been replicated in 20 well-known deep-learning models.
arXiv Detail & Related papers (2024-06-28T16:34:22Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Tensor Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.