QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
- URL: http://arxiv.org/abs/2511.20100v1
- Date: Tue, 25 Nov 2025 09:17:47 GMT
- Title: QiMeng-Kernel: Macro-Thinking Micro-Coding Paradigm for LLM-Based High-Performance GPU Kernel Generation
- Authors: Xinguo Zhu, Shaohui Peng, Jiaming Guo, Yunji Chen, Qi Guo, Yuanbo Wen, Hang Qin, Ruizhi Chen, Qirui Zhou, Ke Gao, Yanjun Wu, Chen Zhao, Ling Li
- Abstract summary: Macro Thinking Micro Coding (MTMC) is a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. It achieves near 100% and 70% accuracy at KernelBench Levels 1-2 and 3, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs and 2.2x over expert-optimized PyTorch Eager kernels.
- Score: 41.53673797546332
- License: http://creativecommons.org/licenses/by-nc-sa/4.0/
- Abstract: Developing high-performance GPU kernels is critical for AI and scientific computing, but remains challenging due to its reliance on expert crafting and its poor portability. While LLMs offer promise for automation, both general-purpose and finetuned LLMs suffer from two fundamental and conflicting limitations: correctness and efficiency. The key reason is that existing LLM-based approaches directly generate entire optimized low-level programs, requiring exploration of an extremely vast space encompassing both optimization policies and implementation code. To address the challenge of exploring an intractable space, we propose Macro Thinking Micro Coding (MTMC), a hierarchical framework inspired by the staged optimization strategy of human experts. It decouples optimization strategy from implementation details, ensuring efficiency through high-level strategy and correctness through low-level implementation. Specifically, Macro Thinking employs reinforcement learning to guide lightweight LLMs in efficiently exploring and learning semantic optimization strategies that maximize hardware utilization. Micro Coding leverages general-purpose LLMs to incrementally implement the stepwise optimization proposals from Macro Thinking, avoiding full-kernel generation errors. Together, they effectively navigate the vast optimization space and the intricate implementation details, enabling LLMs to generate high-performance GPU kernels. Comprehensive results on widely adopted benchmarks demonstrate the superior performance of MTMC on GPU kernel generation in both accuracy and running time. On KernelBench, MTMC achieves near 100% and 70% accuracy at Levels 1-2 and 3, over 50% higher than SOTA general-purpose and domain-finetuned LLMs, with up to 7.3x speedup over LLMs and 2.2x over expert-optimized PyTorch Eager kernels. On the more challenging TritonBench, MTMC attains up to 59.64% accuracy and a 34x speedup.
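The abstract's macro/micro split is easiest to see as a propose-implement-verify loop: the macro policy picks one semantic optimization at a time, and the micro coder edits the current kernel to apply just that step, keeping only edits that stay correct and run faster. Below is a minimal Python sketch of that loop under those assumptions; all helper names (`propose`, `implement`, `is_correct`, `runtime_ms`) are hypothetical stand-ins, not the paper's API.

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Step:
    """One semantic optimization proposal, e.g. 'tile the K-loop by 32'."""
    description: str

def generate_kernel(
    task: str,
    baseline: str,
    propose: Callable[[str, str], Optional[Step]],   # macro policy (RL-guided lightweight LLM)
    implement: Callable[[str, Step], str],           # micro coder (general-purpose LLM)
    is_correct: Callable[[str], bool],
    runtime_ms: Callable[[str], float],
    max_steps: int = 8,
) -> str:
    """Refine a correct baseline kernel one optimization step at a time."""
    kernel = baseline
    for _ in range(max_steps):
        step = propose(task, kernel)         # Macro Thinking: choose a strategy
        if step is None:                     # policy sees no further gain
            break
        candidate = implement(kernel, step)  # Micro Coding: apply only that edit
        # Accept the edit only if correctness holds and the kernel gets faster.
        if is_correct(candidate) and runtime_ms(candidate) < runtime_ms(kernel):
            kernel = candidate
    return kernel
```

Incremental editing is what keeps the search tractable: each micro step changes a small region of an already-correct kernel instead of regenerating the whole program.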
Related papers
- OPT-Engine: Benchmarking the Limits of LLMs in Optimization Modeling via Complexity Scaling [13.57588221678224]
Large Language Models (LLMs) have demonstrated impressive progress in optimization modeling. The boundaries of their capabilities in automated formulation and problem solving remain poorly understood. We propose OPT-ENGINE, a benchmark framework designed to evaluate LLMs on optimization modeling with controllable and scalable difficulty levels.
arXiv Detail & Related papers (2026-01-09T09:22:33Z) - KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit [15.810081332925584]
KernelBand is a novel framework that formulates kernel optimization as a hierarchical multi-armed bandit problem. We show that KernelBand significantly outperforms state-of-the-art methods, achieving superior performance with fewer tokens while exhibiting consistent improvement without saturation as computational resources increase.
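As a concrete illustration of the bandit view (a standard UCB1 rule on a flat set of arms, a simplification rather than KernelBand's hierarchical, hardware-aware algorithm), each arm is a candidate optimization strategy and the reward is the measured speedup:

```python
import math
import random

def ucb1_select(counts, rewards, t):
    """Standard UCB1 arm selection: mean reward plus an exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm  # try every arm once before exploiting
    return max(
        range(len(counts)),
        key=lambda a: rewards[a] / counts[a] + math.sqrt(2 * math.log(t) / counts[a]),
    )

# Toy usage: three "optimization strategies" with unknown average speedups.
true_speedup = [1.1, 1.6, 1.3]
counts, rewards = [0, 0, 0], [0.0, 0.0, 0.0]
for t in range(1, 201):
    arm = ucb1_select(counts, rewards, t)
    reward = random.gauss(true_speedup[arm], 0.1)  # noisy measured speedup
    counts[arm] += 1
    rewards[arm] += reward
print("pulls per strategy:", counts)  # concentrates on the best strategy
```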
arXiv Detail & Related papers (2025-11-24T08:11:50Z) - STARK: Strategic Team of Agents for Refining Kernels [23.717055490630596]
We introduce an agentic framework for GPU kernel optimization that explores the design space through multi-agent collaboration. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents.
arXiv Detail & Related papers (2025-10-19T20:41:46Z) - MM-HELIX: Boosting Multimodal Long-Chain Reflective Reasoning with Holistic Platform and Adaptive Hybrid Policy Optimization [103.74675519953898]
Long-chain reflective reasoning is a prerequisite for solving complex real-world problems. We build a benchmark consisting of 1,260 samples across 42 challenging synthetic tasks. We generate post-training data and explore learning paradigms for exploiting such data.
arXiv Detail & Related papers (2025-10-09T17:53:58Z) - MAHL: Multi-Agent LLM-Guided Hierarchical Chiplet Design with Adaptive Debugging [30.305211001929496]
Large Language Models (LLMs) show promise in extending their abilities to 2.5D integration. However, LLMs face challenges such as flattened design, high validation cost, and imprecise parameter optimization. We propose MAHL, a hierarchical LLM-based chiplet design generation framework.
arXiv Detail & Related papers (2025-08-08T05:47:31Z) - GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization [0.0]
This paper introduces an automated methodology for iteratively refining accelerator kernels. Our methodology employs LLMs in a multi-stage, evolutionary process. We detail how this approach navigates the challenges of the AMD MI300 target architecture.
arXiv Detail & Related papers (2025-06-25T19:59:34Z) - QiMeng-Attention: SOTA Attention Operator is generated by SOTA Attention Algorithm [24.09018606185114]
We propose an LLM-friendly Thinking Language (LLM-TL) to help LLMs decouple the generation of high-level optimization logic from low-level implementation on GPUs. Along with a 2-stage reasoning workflow, TL-Code generation and translation, the LLMs can automatically generate FlashAttention implementations on diverse GPUs.
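A minimal sketch of this kind of two-stage, generate-then-translate workflow is shown below; the `llm` callable and both prompts are hypothetical, and the actual LLM-TL grammar is defined in the paper.

```python
from typing import Callable

def generate_attention_kernel(llm: Callable[[str], str], spec: str, gpu: str) -> str:
    """Two-stage workflow: reason in a high-level thinking language first,
    then translate the fixed plan into low-level code for one GPU target."""
    # Stage 1: produce TL-Code -- the optimization logic (tiling, online
    # softmax, memory staging) without any GPU implementation details.
    tl_code = llm(
        f"Describe, in the thinking language, an optimized attention "
        f"algorithm for: {spec}"
    )
    # Stage 2: translate the plan for a specific target, so the same TL-Code
    # can be retargeted to diverse GPUs.
    return llm(f"Translate this plan into a {gpu} kernel:\n{tl_code}")
```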
arXiv Detail & Related papers (2025-06-14T05:38:19Z) - Large language models as uncertainty-calibrated optimizers for experimental discovery [4.968931211284832]
We show that training language models with the uncertainty-aware objectives of traditional optimization methods enables their use as reliable, uncertainty-calibrated optimizers guided by natural language interfaces. Our method nearly doubles the discovery rate of high-yielding reaction conditions, from 24% to 43%, within 50 experiments starting from 10 unsuccessful conditions.
arXiv Detail & Related papers (2025-04-08T17:59:57Z) - PerfCodeGen: Improving Performance of LLM Generated Code with Execution Feedback [78.89596149768458]
Large Language Models (LLMs) are widely adopted for assisting in software development tasks. We propose PerfCodeGen, a training-free framework that enhances the performance of LLM-generated code.
arXiv Detail & Related papers (2024-11-18T06:22:38Z) - Improving Parallel Program Performance with LLM Optimizers via Agent-System Interfaces [9.880183350366792]
A key challenge in improving parallel program performance is efficiently mapping tasks to processors and data to memory. We introduce a framework that automates mapper development with generative optimization. Our approach finds mappers that surpass expert-written mappers by up to 1.34X speedup across nine benchmarks.
arXiv Detail & Related papers (2024-10-21T04:08:37Z) - LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning [69.95292905263393]
We show that gradient-based optimization and high-level LLM-driven optimization are complementary to each other and can effectively collaborate in a combined optimization framework.
arXiv Detail & Related papers (2024-05-30T06:24:14Z) - Revisiting Zeroth-Order Optimization for Memory-Efficient LLM Fine-Tuning: A Benchmark [166.40879020706151]
This paper proposes a shift towards BP-free, zeroth-order (ZO) optimization as a solution for reducing memory costs during fine-tuning.
Unlike traditional ZO-SGD methods, our work expands the exploration to a wider array of ZO optimization techniques.
Our study unveils previously overlooked optimization principles, highlighting the importance of task alignment, the role of the forward gradient method, and the balance between algorithm complexity and fine-tuning performance.
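For context, the two-point gradient estimator at the core of most ZO methods uses only function evaluations, which is why it avoids backpropagation's memory cost; the sketch below is a generic textbook version under that assumption, not this benchmark's code.

```python
import numpy as np

def zo_gradient(f, x, rng, eps=1e-3, n_samples=16):
    """Two-point zeroth-order gradient estimate -- no backpropagation needed.
    Averages (f(x + eps*u) - f(x - eps*u)) / (2*eps) * u over random Gaussian
    directions u, which in expectation matches the (smoothed) gradient of f."""
    g = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        g += (f(x + eps * u) - f(x - eps * u)) / (2 * eps) * u
    return g / n_samples

# Toy usage: minimize f(x) = ||x||^2 with ZO-SGD, using function values only.
rng = np.random.default_rng(0)
f = lambda x: float(x @ x)
x = np.ones(8)
for _ in range(200):
    x = x - 0.05 * zo_gradient(f, x, rng)
print(round(f(x), 6))  # converges close to 0
```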
arXiv Detail & Related papers (2024-02-18T14:08:48Z) - FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, and peer and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.