cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution
- URL: http://arxiv.org/abs/2512.16465v2
- Date: Tue, 23 Dec 2025 07:16:16 GMT
- Title: cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution
- Authors: Jinwu Chen, Qidie Wu, Bin Li, Lin Ma, Xin Si, Yang Hu, Shouyi Yin, Jun Yang
- Abstract summary: cuPilot is a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. On GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Optimizing CUDA kernels is a challenging and labor-intensive task, given the need for hardware-software co-design expertise and the proprietary nature of high-performance kernel libraries. While recent large language models (LLMs) combined with evolutionary algorithms show promise in automatic kernel optimization, existing approaches often fall short in performance due to their suboptimal agent designs and mismatched evolution representations. This work identifies these mismatches and proposes cuPilot, a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. Key contributions include a strategy-coordinated evolution algorithm, roofline-guided prompting, and strategy-level population initialization. Experimental results show that the kernels generated by cuPilot achieve an average speedup of 3.09$\times$ over PyTorch on a benchmark of 100 kernels. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units. The generated kernels are open-sourced at https://github.com/champloo2878/cuPilot-Kernels.git.
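The roofline-guided prompting mentioned in the abstract presumably builds on the standard roofline model, which bounds a kernel's attainable throughput by the minimum of the hardware's compute peak and its memory bandwidth times the kernel's arithmetic intensity. A minimal sketch of that calculation (function name and the illustrative hardware figures are assumptions, not from the paper):

```python
# Roofline model: attainable GFLOP/s is capped either by the compute
# peak or by memory bandwidth * arithmetic intensity (FLOPs per byte).
def roofline_attainable_gflops(peak_gflops, bandwidth_gbs, flops, bytes_moved):
    intensity = flops / bytes_moved              # arithmetic intensity
    return min(peak_gflops, bandwidth_gbs * intensity)

# Illustrative example: an FP32 GEMM of size 4096^3 on A100-class
# figures (numbers are only indicative, not measured).
peak, bw = 19500.0, 2039.0                       # GFLOP/s, GB/s
flops = 2 * 4096**3                              # 2*M*N*K multiply-adds
bytes_moved = 3 * 4 * 4096**2                    # read A, B; write C (4 B each)
attainable = roofline_attainable_gflops(peak, bw, flops, bytes_moved)
bound = "compute-bound" if attainable == peak else "memory-bound"
```

For this example the arithmetic intensity is high enough that the kernel is compute-bound, so a roofline-guided prompt would steer the optimizer toward tensor-core utilization rather than memory-traffic reduction.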
Related papers
- CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components. CUDA Agent achieves 100%, 100%, and 92% rates of kernels faster than torch.compile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z) - K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model [57.440609834690385]
Existing approaches treat Large Language Models (LLMs) as rapid code generators within evolutionary loops. We propose Search via a Co-Evolving World Model and build K-Search on this method. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels.
arXiv Detail & Related papers (2026-02-22T11:06:22Z) - AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis [13.239454996851771]
Modern AI models demand high-performance computation kernels. The AKG kernel agent (AI-driven Kernel Generator) is designed to support multiple domain-specific languages. The system's modular design allows rapid integration of backend DSLs and hardware targets, and the system achieves an average speedup of 1.46$\times$ over PyTorch Eager baselines.
arXiv Detail & Related papers (2025-12-29T12:42:05Z) - KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta [8.852510847427164]
This paper presents KernelEvolve, an agentic kernel coding framework that tackles kernel development at scale for deep learning recommendation models (DLRM). KernelEvolve takes kernel specifications as input and automates kernel generation and optimization for recommendation models across heterogeneous hardware architectures. We show that KernelEvolve reduces development time from weeks to hours and delivers substantial performance improvements over PyTorch baselines across diverse production use cases in heterogeneous AI systems at scale.
arXiv Detail & Related papers (2025-12-29T06:31:55Z) - Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems [1.2289544895833646]
We present a framework for comparing multi-agent PyTorch optimization systems. We show that exploit-heavy strategies perform best when paired with error-fixing agents. The best implementation achieves an average 2.88$\times$ speedup on an H100 GPU.
arXiv Detail & Related papers (2025-11-21T05:37:38Z) - CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization [36.794824560677064]
CudaForge is a training-free multi-agent workflow for kernel generation and optimization. Leveraging base models like OpenAI o3, CudaForge achieves 97.6% correctness of generated kernels and an average 1.68$\times$ speedup.
arXiv Detail & Related papers (2025-10-23T22:52:00Z) - Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization [25.135006275638172]
We introduce robust-kbench, a new benchmark for rigorous evaluation of kernel performance and correctness across varied scenarios. We also present a comprehensive agentic framework that translates torch code to CUDA kernels and iteratively improves their runtime. Our approach produces kernels that outperform torch implementations in practical applications, including forward and backward passes.
arXiv Detail & Related papers (2025-09-16T11:08:30Z) - Astra: A Multi-Agent System for GPU Kernel Performance Optimization [10.715861478214961]
We introduce Astra, the first multi-agent system for GPU kernel optimization. Within Astra, specialized agents collaborate through code generation, profiling, and planning to produce kernels that are both correct and high-performance.
arXiv Detail & Related papers (2025-09-09T08:39:50Z) - Explore as a Storm, Exploit as a Raindrop: On the Benefit of Fine-Tuning Kernel Schedulers with Coordinate Descent [48.791943145735]
We show the potential to reduce Ansor's search time while enhancing kernel quality.
We apply this approach to the first 300 kernels that Ansor generates.
This result has been replicated in 20 well-known deep-learning models.
arXiv Detail & Related papers (2024-06-28T16:34:22Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Structural Kernel Search via Bayesian Optimization and Symbolical Optimal Transport [5.1672267755831705]
For Gaussian processes, selecting the kernel is a crucial task, often done manually by the expert.
We propose a novel, efficient search method through a general, structured kernel space.
arXiv Detail & Related papers (2022-10-21T09:30:21Z) - Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z) - PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
This list is automatically generated from the titles and abstracts of the papers on this site.