Related papers: AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis

URL: http://arxiv.org/abs/2512.23424v1
Date: Mon, 29 Dec 2025 12:42:05 GMT
Title: AKG kernel Agent: A Multi-Agent Framework for Cross-Platform Kernel Synthesis
Authors: Jinye Du, Quan Yuan, Zuyao Zhang, Yanzhi Yi, Jiahui Hu, Wangyi Chen, Yiyang Zhu, Qishui Zheng, Wenxiang Zou, Xiangyu Chang, Zuohe Zheng, Zichun Ye, Chao Liu, Shanni Li, Renwei Zhang, Yiping Deng, Xinwei Hu, Xuefeng Jin, Jie Zhao
Abstract summary: Modern AI models demand high-performance computation kernels. AKG kernel agent (AI-driven Kernel Generator) is designed to support multiple domain-specific languages. The system's modular design allows rapid integration of backend DSLs and hardware targets. The system achieves an average speedup of 1.46$\times$ over PyTorch Eager baselines.
Score: 13.239454996851771
License: http://creativecommons.org/licenses/by/4.0/
Abstract: Modern AI models demand high-performance computation kernels. The growing complexity of LLMs, multimodal architectures, and recommendation systems, combined with techniques like sparsity and quantization, creates significant computational challenges. Moreover, frequent hardware updates and diverse chip architectures further complicate this landscape, requiring tailored kernel implementations for each platform. However, manual optimization cannot keep pace with these demands, creating a critical bottleneck in AI system development. Recent advances in LLM code generation capabilities have opened new possibilities for automating kernel development. In this work, we propose AKG kernel agent (AI-driven Kernel Generator), a multi-agent system that automates kernel generation, migration, and performance tuning. AKG kernel agent is designed to support multiple domain-specific languages (DSLs), including Triton, TileLang, CPP, and CUDA-C, enabling it to target different hardware backends while maintaining correctness and portability. The system's modular design allows rapid integration of new DSLs and hardware targets. When evaluated on KernelBench using Triton DSL across GPU and NPU backends, AKG kernel agent achieves an average speedup of 1.46$\times$ over PyTorch Eager baseline implementations, demonstrating its effectiveness in accelerating kernel development for modern AI workloads.

Related papers

CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components. CUDA Agent delivers 100%, 100%, and 92% faster rate over torch.compile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z)
Hexagon-MLIR: An AI Compilation Stack For Qualcomm's Neural Processing Units (NPUs) [3.8043062351078585]
Hexagon-MLIR is an open-source compilation stack that targets Qualcomm Hexagon Neural Processing Unit (NPU). It provides unified support for lowering Triton kernels and PyTorch models.
arXiv Detail & Related papers (2026-02-23T12:12:39Z)
K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model [57.440609834690385]
Existing approaches treat Large Language Models (LLMs) as rapid code generators within evolutionary loops. We propose Search via Co-Evolving World Model and build K-Search based on this method. We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels.
arXiv Detail & Related papers (2026-02-22T11:06:22Z)
KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta [8.852510847427164]
This paper presents KernelEvolve, an agentic kernel coding framework to tackle heterogeneity at scale for deep learning recommendation models (DLRM). KernelEvolve is designed to take kernel specifications as input and automate the process of kernel generation and optimization for recommendation models across heterogeneous hardware architectures. We show that KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and for heterogeneous AI systems at scale.
arXiv Detail & Related papers (2025-12-29T06:31:55Z)
cuPilot: A Strategy-Coordinated Multi-agent Framework for CUDA Kernel Evolution [15.701861287574296]
cuPilot is a strategy-coordinated multi-agent framework that introduces strategy as an intermediate semantic representation for kernel evolution. On the GEMM tasks, cuPilot showcases sophisticated optimizations and achieves high utilization of critical hardware units.
arXiv Detail & Related papers (2025-12-18T12:34:00Z)
Optimizing PyTorch Inference with LLM-Based Multi-Agent Systems [1.2289544895833646]
We present a framework for comparing multi-agent PyTorch optimization systems. We show that exploit-heavy strategies perform best when paired with error-fixing agents. The best implementation achieves an average 2.88x speedup on an H100 GPU.
arXiv Detail & Related papers (2025-11-21T05:37:38Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
STARK: Strategic Team of Agents for Refining Kernels [23.717055490630596]
We introduce an agentic framework for GPU kernel optimization that explores the design space through multi-agent collaboration. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents.
arXiv Detail & Related papers (2025-10-19T20:41:46Z)
xLLM Technical Report [57.13120905321185]
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework. xLLM builds a novel decoupled service-engine architecture. xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources.
arXiv Detail & Related papers (2025-10-16T13:53:47Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
An Adaptive Device-Edge Co-Inference Framework Based on Soft Actor-Critic [72.35307086274912]
High-dimensional parameter models and large-scale mathematical calculation restrict execution efficiency, especially for Internet of Things (IoT) devices. We propose a new Deep Reinforcement Learning (DRL) method, Soft Actor-Critic for discrete (SAC-d), which generates the exit point and compressing bits by soft policy iterations. Based on a latency- and accuracy-aware reward design, such a computation can adapt well to complex environments like dynamic wireless channels and arbitrary processing, and is capable of supporting the 5G URL
arXiv Detail & Related papers (2022-01-09T09:31:50Z)
PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels. We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.