AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units
- URL: http://arxiv.org/abs/2601.07160v1
- Date: Mon, 12 Jan 2026 03:12:58 GMT
- Title: AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units
- Authors: Xinzi Cao, Jianyang Zhai, Pengfei Li, Zhiheng Hu, Cen Yan, Bingxu Mu, Guanghuan Fang, Bin She, Jiayu Li, Yihan Su, Dongyang Tao, Xiansong Huang, Fan Xu, Feidiao Yang, Yao Lu, Chang-Dong Wang, Yutong Lu, Weicheng Xue, Bin Zhou, Yonghong Tian
- Abstract summary: We propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations. We also design NPUKernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels.
- Score: 39.846358001824996
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: To meet the ever-increasing demand for computational efficiency, Neural Processing Units (NPUs) have become critical in modern AI infrastructure. However, unlocking their full potential requires developing high-performance compute kernels using vendor-specific Domain-Specific Languages (DSLs), a task that demands deep hardware expertise and is labor-intensive. While Large Language Models (LLMs) have shown promise in general code generation, they struggle with the strict constraints and scarcity of training data in the NPU domain. Our preliminary study reveals that state-of-the-art general-purpose LLMs fail to generate functional complex kernels for Ascend NPUs, yielding a near-zero success rate. To address these challenges, we propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations, and KernelGen-LM, a domain-adaptive model trained via supervised fine-tuning and reinforcement learning with execution feedback. Furthermore, we design NPUKernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels. Experimental results demonstrate that our approach significantly bridges the gap between general LLMs and hardware-specific coding. Specifically, the compilation success rate on complex Level-2 kernels improves from 0% to 95.5% (Pass@10), while functional correctness achieves 64.3% compared to the baseline's complete failure. These results highlight the critical role of domain-specific reasoning and rigorous evaluation in automating accelerator-aware code generation.
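The abstract reports its headline numbers as Pass@10 but does not say how the metric is computed. As a minimal sketch, assuming the commonly used unbiased pass@k estimator (the probability that at least one of k samples drawn from n generations succeeds, given c successes), the calculation could look like the following. The function names and the per-task `(n, c)` input format are illustrative assumptions, not taken from the paper.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one task.

    n: number of generated samples for the task
    c: number of samples that passed (compiled, or were functionally correct)
    k: sampling budget being evaluated (for example 10 for Pass@10)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # If fewer than k samples failed, any draw of k samples contains a success.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over tasks, where each entry is a hypothetical (n, c) pair."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With n equal to k, as in a setup that generates exactly ten samples per kernel, the estimate for a task reduces to 1 if any sample passed and 0 otherwise, so the benchmark-level score is simply the fraction of tasks with at least one passing sample.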
Related papers
- K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model [57.440609834690385]
Existing approaches treat Large Language Models (LLMs) as rapid code generators within evolutionary loops.
We propose Search via Co-Evolving World Model and build K-Search based on this method.
We evaluate K-Search on diverse, complex kernels from FlashInfer, including GQA, MLA, and MoE kernels.
arXiv Detail & Related papers (2026-02-22T11:06:22Z)
- AscendCraft: Automatic Ascend NPU Kernel Generation via DSL-Guided Transcompilation [8.878393510726008]
We present AscendCraft, a DSL-guided approach for automatic AscendC kernel generation.
AscendCraft achieves 98.1% compilation success and 90.4% functional correctness.
We also demonstrate that DSL-guided transcompilation can enable LLMs to generate both correct and competitive NPU kernels.
arXiv Detail & Related papers (2026-01-30T09:34:59Z)
- EvolVE: Evolutionary Search for LLM-based Verilog Generation and Optimization [0.2796197251957245]
We present EvolVE, the first framework to analyze multiple evolution strategies on chip design tasks.
We also introduce IC-RTL, targeting industry-scale problems derived from the National Integrated Circuit Contest.
arXiv Detail & Related papers (2026-01-26T01:53:54Z)
- Towards Automated Kernel Generation in the Era of LLMs [17.69471168609145]
Kernel engineering is a time-consuming and non-scalable process.
Recent advances in large language models (LLMs) and agentic systems have opened new possibilities for automating kernel generation and optimization.
The field remains fragmented, lacking a systematic perspective for LLM-driven kernel generation.
arXiv Detail & Related papers (2026-01-22T07:53:52Z)
- QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code [52.66657751895655]
Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation.
This paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation.
We propose a self-evolving prompt optimization method that enables LLMs to evolve their internal prompt strategies.
arXiv Detail & Related papers (2025-11-03T03:20:26Z)
- ConCuR: Conciseness Makes State-of-the-Art Kernel Generation [5.010229074860956]
A key challenge for kernel generation is the scarcity of high-quality data.
We develop a pipeline that generates and curates high-quality kernels with reasoning traces.
We show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks.
arXiv Detail & Related papers (2025-10-08T15:41:15Z)
- Evolution of Kernels: Automated RISC-V Kernel Optimization with Large Language Models [26.985412258634256]
Large language models (LLMs) have shown promise for automated kernel optimization, demonstrating success in domains with comprehensive technical documents and mature codebases.
We present Evolution of Kernels (EoK), a novel LLM-based evolutionary program search framework that automates kernel design for domains with limited reference material.
EoK achieves a median 1.27x speedup, surpassing human experts on all 80 evaluated kernel design tasks.
arXiv Detail & Related papers (2025-09-14T08:11:06Z)
- InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling [71.37579508777843]
Large language models (LLMs) have revolutionized artificial intelligence by enabling complex reasoning capabilities.
To address this gap, we present InternBootcamp, an open-source framework comprising 1000+ domain-diverse task environments.
arXiv Detail & Related papers (2025-08-12T05:00:00Z)
- Thinking Longer, Not Larger: Enhancing Software Engineering Agents via Scaling Test-Time Compute [61.00662702026523]
We propose a unified Test-Time Compute (TTC) scaling framework that leverages increased inference-time computation instead of larger models.
Our framework incorporates two complementary strategies: internal TTC and external TTC.
We demonstrate that our 32B model achieves a 46% issue resolution rate, surpassing significantly larger models such as DeepSeek R1 671B and OpenAI o1.
arXiv Detail & Related papers (2025-03-31T07:31:32Z)
- Enhancing Dropout-based Bayesian Neural Networks with Multi-Exit on FPGA [20.629635991749808]
This paper proposes an algorithm and hardware co-design framework that can generate field-programmable gate array (FPGA)-based accelerators for efficient BayesNNs.
At the algorithm level, we propose novel multi-exit dropout-based BayesNNs with reduced computational and memory overheads.
At the hardware level, this paper introduces a transformation framework that can generate FPGA-based accelerators for the proposed efficient BayesNNs.
arXiv Detail & Related papers (2024-06-20T17:08:42Z)
- Efficient kernel surrogates for neural network-based regression [0.8030359871216615]
We study the performance of the Conjugate Kernel (CK), an efficient approximation to the Neural Tangent Kernel (NTK).
We show that the CK performance is only marginally worse than that of the NTK and, in certain cases, is shown to be superior.
In addition to providing a theoretical grounding for using CKs instead of NTKs, our framework suggests a recipe for improving DNN accuracy inexpensively.
arXiv Detail & Related papers (2023-10-28T06:41:47Z)
- Inducing Gaussian Process Networks [80.40892394020797]
We propose inducing Gaussian process networks (IGN), a simple framework for simultaneously learning the feature space as well as the inducing points.
The inducing points, in particular, are learned directly in the feature space, enabling a seamless representation of complex structured domains.
We report on experimental results for real-world data sets showing that IGNs provide significant advances over state-of-the-art methods.
arXiv Detail & Related papers (2022-04-21T05:27:09Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)