KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
- URL: http://arxiv.org/abs/2512.23236v1
- Date: Mon, 29 Dec 2025 06:31:55 GMT
- Title: KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
- Authors: Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Abdul Zainul-Abedin, Ketan Singh, Hongtao Yu, Wenyuan Chi, Barney Huang, Sean Zhang, Noah Weller, Zach Marine, Wyatt Cook, Carole-Jean Wu, Gaoxiang Liu,
- Abstract summary: This paper presents KernelEvolve-an agentic kernel coding framework to tackle heterogeneous at-scale for deep learning recommendation model (DLRM)<n> KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation model across heterogeneous hardware architectures.<n>We show KernelEvolve reduces development time from weeks to hours and substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at-scale.
- Score: 8.852510847427164
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges - model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve-an agentic kernel coding framework-to tackle heterogeneity at-scale for DLRM. KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation model across heterogeneous hardware architectures. KernelEvolve does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is described as graph-based search with selection policy, universal operator, fitness function, and termination rule, dynamically adapts to runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly-available KernelBench suite, achieving 100% pass rate on all 250 problems across three difficulty levels, and 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at-scale. Beyond performance efficiency improvements, KernelEvolve significantly mitigates the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
Related papers
- CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components.<n>Agent delivers 100%, 100%, and 92% faster rate over torchcompile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z) - K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model [57.440609834690385]
Existing approaches treat Large Language Models (LLMs) as rapid code generators within evolutionary loops.<n>We propose Search via Co-Evolving World Model and build K-Search based on this method.<n>We evaluate K-Search on diverse, complex kernels FlashInfer, including GQA, MLA, and MoE kernels.
arXiv Detail & Related papers (2026-02-22T11:06:22Z) - EmboCoach-Bench: Benchmarking AI Agents on Developing Embodied Robots [68.29056647487519]
Embodied AI is fueled by high-fidelity simulation and large-scale data collection.<n>However, this scaling capability remains bottlenecked by a reliance on labor-intensive manual oversight.<n>We introduce textscEmboCoach-Bench, a benchmark evaluating the capacity of LLM agents to autonomously engineer embodied policies.
arXiv Detail & Related papers (2026-01-29T11:33:49Z) - AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis [13.239454996851771]
Modern AI models demand high-performance computation kernels.<n>Akg kernel agent (AI-driven Kernel Generator) is designed to support multiple domain-specific languages.<n>System's modular design allows rapid integration of backend DSLs and hardware targets.<n>System achieves an average speedup of 1.46$times over PyTorch Eager baselines.
arXiv Detail & Related papers (2025-12-29T12:42:05Z) - cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution [15.701861287574296]
cuPilot is a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution.<n>On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units.
arXiv Detail & Related papers (2025-12-18T12:34:00Z) - TritonForge: Profiling-Guided Framework for Automated Triton Kernel Optimization [24.065109818256605]
TritonForge is a profiling-guided framework for automated GPU kernel optimization.<n>It integrates kernel analysis, runtime profiling, and iterative code transformation to streamline the process.<n>It achieves up to 5x performance improvement over baseline implementations and on average 1.76x of the cases are successful.
arXiv Detail & Related papers (2025-12-09T23:44:35Z) - CudaForge: An Agent Framework with Hardware Feedback for CUDA Kernel Optimization [36.794824560677064]
CudaForge is a training-free multi-agent workflow for kernel generation and optimization.<n>By leveraging base models like OpenAI-o3, CudaForge achieves 97.6% correctness generated kernels and an average 1.68$times$ speedup.
arXiv Detail & Related papers (2025-10-23T22:52:00Z) - Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks [11.253534066141668]
It is imperative to automate low-level kernel development to meet performance and productivity demands.<n>Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPU.<n>We present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels)
arXiv Detail & Related papers (2025-07-31T02:26:58Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task.
The design targets a Xilinx Artix-7 FPGA, using in total around the 40% of the available hardware resources.
It reduces the classification time by three orders of magnitude, with a small 4.5% impact on the accuracy, if compared to its software, full precision counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z) - Kernel Identification Through Transformers [54.3795894579111]
Kernel selection plays a central role in determining the performance of Gaussian Process (GP) models.
This work addresses the challenge of constructing custom kernel functions for high-dimensional GP regression models.
We introduce a novel approach named KITT: Kernel Identification Through Transformers.
arXiv Detail & Related papers (2021-06-15T14:32:38Z) - Kernel methods through the roof: handling billions of points efficiently [94.31450736250918]
Kernel methods provide an elegant and principled approach to nonparametric learning, but so far could hardly be used in large scale problems.
Recent advances have shown the benefits of a number of algorithmic ideas, for example combining optimization, numerical linear algebra and random projections.
Here, we push these efforts further to develop and test a solver that takes full advantage of GPU hardware.
arXiv Detail & Related papers (2020-06-18T08:16:25Z) - PolyScientist: Automatic Loop Transformations Combined with Microkernels
for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of this site (including all information) and is not responsible for any consequences.