Related papers: A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization

URL: http://arxiv.org/abs/2601.12698v3
Date: Fri, 23 Jan 2026 00:54:57 GMT
Title: A Two-Stage GPU Kernel Tuner Combining Semantic Refactoring and Search-Based Optimization
Authors: Qiuyi Qu, Yicheng Sui, Yufei Sun, Rui Chen, Xiaofei Zhang, Yuzhi Zhang, Haofeng Wang, Ge Lan,
Abstract summary: This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop.<n>The proposed method can be extended to deliver automated performance optimization for real production workloads.
Score: 9.49293344824955
License: http://creativecommons.org/licenses/by-nc-sa/4.0/
Abstract: GPU code optimization is a key performance bottleneck for HPC workloads as well as large-model training and inference. Although compiler optimizations and hand-written kernels can partially alleviate this issue, achieving near-hardware-limit performance still relies heavily on manual code refactoring and parameter tuning. Recent progress in LLM-agent-based kernel generation and optimization has been reported, yet many approaches primarily focus on direct code rewriting, where parameter choices are often implicit and hard to control, or require human intervention, leading to unstable performance gains. This paper introduces a template-based rewriting layer on top of an agent-driven iterative loop: kernels are semantically refactored into explicitly parameterizable templates, and template parameters are then optimized via search-based autotuning, yielding more stable and higher-quality speedups. Experiments on a set of real-world kernels demonstrate speedups exceeding 3x in the best case. We extract representative CUDA kernels from SGLang as evaluation targets; the proposed agentic tuner iteratively performs templating, testing, analysis, and planning, and leverages profiling feedback to execute constrained parameter search under hardware resource limits. Compared to agent-only direct rewriting, the template-plus-search design significantly reduces the randomness of iterative optimization, making the process more interpretable and enabling a more systematic approach toward high-performance configurations. The proposed method can be further extended to OpenCL, HIP, and other backends to deliver automated performance optimization for real production workloads.

Related papers

$\ abla$-Reasoner: LLM Reasoning via Test-Time Gradient Descent in Latent Space [71.23672814629448]
$nabla$-Reasoner is an iterative generation framework that integrates differentiable optimization over token logits into the decoding loop.<n>$nabla$-Reasoner achieves over 20% accuracy improvement on a challenging mathematical reasoning benchmark.
arXiv Detail & Related papers (2026-03-05T08:42:54Z)
PRAGMA: A Profiling-Reasoned Multi-Agent Framework for Automatic Kernel Optimization [12.24680414520151]
PRAGMA is a profile-guided AI kernel generation framework.<n>It integrates execution feedback and fine-grained hardware profiling into the reasoning loop.<n>We evaluate PRAGMA on KernelBench, covering GPU and CPU backends.
arXiv Detail & Related papers (2025-11-09T12:01:43Z)
STARK: Strategic Team of Agents for Refining Kernels [23.717055490630596]
We introduce an agentic framework for GPU kernel optimization that explores the design space through multi-agent collaboration.<n>This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively.<n>We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents.
arXiv Detail & Related papers (2025-10-19T20:41:46Z)
Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization [25.135006275638172]
We introduce robust-kbench, a new benchmark for rigorous evaluation of kernel performance and correctness across varied scenarios.<n>We also present a comprehensive agentic framework that translates torch code to kernels and iteratively improve their runtime setting.<n>Our approach produces kernels outperforming torch implementations for practical applications, including forward and backward passes.
arXiv Detail & Related papers (2025-09-16T11:08:30Z)
COrAL: Order-Agnostic Language Modeling for Efficient Iterative Refinement [80.18490952057125]
Iterative refinement has emerged as an effective paradigm for enhancing the capabilities of large language models (LLMs) on complex tasks. We propose Context-Wise Order-Agnostic Language Modeling (COrAL) to overcome these challenges. Our approach models multiple token dependencies within manageable context windows, enabling the model to perform iterative refinement internally.
arXiv Detail & Related papers (2024-10-12T23:56:19Z)
RTLRewriter: Methodologies for Large Models aided RTL Code Optimization [21.61206887869307]
This paper introduces RTLRewriter, an innovative framework that leverages large models to optimize RTL code. A circuit partition pipeline is utilized for fast synthesis and efficient rewriting. A specialized search engine is designed to identify useful optimization guides, algorithms, and code snippets.
arXiv Detail & Related papers (2024-09-04T09:59:37Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
Performance Embeddings: A Similarity-based Approach to Automatic Performance Optimization [71.69092462147292]
Performance embeddings enable knowledge transfer of performance tuning between applications. We demonstrate this transfer tuning approach on case studies in deep neural networks, dense and sparse linear algebra compositions, and numerical weather prediction stencils.
arXiv Detail & Related papers (2023-03-14T15:51:35Z)
Reducing the Variance of Gaussian Process Hyperparameter Optimization with Preconditioning [54.01682318834995]
Preconditioning is a highly effective step for any iterative method involving matrix-vector multiplication. We prove that preconditioning has an additional benefit that has been previously unexplored. It simultaneously can reduce variance at essentially negligible cost.
arXiv Detail & Related papers (2021-07-01T06:43:11Z)
Static Neural Compiler Optimization via Deep Reinforcement Learning [1.458855293397494]
In this paper, we employ a deep reinforcement learning approach to the phase-ordering problem. Provided with sub-sequences constituting LLVM's O3 sequence, our agent learns to outperform the O3 sequence on the set of source codes used for training. We believe that the models trained using our approach can be integrated into modern compilers as neural optimization agents.
arXiv Detail & Related papers (2020-08-20T13:16:29Z)
Global Optimization of Gaussian processes [52.77024349608834]
We propose a reduced-space formulation with trained Gaussian processes trained on few data points. The approach also leads to significantly smaller and computationally cheaper sub solver for lower bounding. In total, we reduce time convergence by orders of orders of the proposed method.
arXiv Detail & Related papers (2020-05-21T20:59:11Z)

This list is automatically generated from the titles and abstracts of the papers in this site.