Related papers: STARK: Strategic Team of Agents for Refining Kernels

STARK: Strategic Team of Agents for Refining Kernels

URL: http://arxiv.org/abs/2510.16996v1
Date: Sun, 19 Oct 2025 20:41:46 GMT
Title: STARK: Strategic Team of Agents for Refining Kernels
Authors: Juncheng Dong, Yang Yang, Tao Liu, Yang Wang, Feng Qi, Vahid Tarokh, Kaushik Rangadurai, Shuang Yang,
Abstract summary: We introduce an agentic framework for GPU kernel optimization that explores the design space through multi-agent collaboration.<n>This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively.<n>We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents.
Score: 23.717055490630596
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The efficiency of GPU kernels is central to the progress of modern AI, yet optimizing them remains a difficult and labor-intensive task due to complex interactions between memory hierarchies, thread scheduling, and hardware-specific characteristics. While recent advances in large language models (LLMs) provide new opportunities for automated code generation, existing approaches largely treat LLMs as single-shot generators or naive refinement tools, limiting their effectiveness in navigating the irregular kernel optimization landscape. We introduce an LLM agentic framework for GPU kernel optimization that systematically explores the design space through multi-agent collaboration, grounded instruction, dynamic context management, and strategic search. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents: our system produces correct solutions where baselines often fail, and achieves kernels with up to 16x faster runtime performance. These results highlight the potential of agentic LLM frameworks to advance fully automated, scalable GPU kernel optimization.

Related papers

K-Search: LLM Kernel Generation via Co-Evolving Intrinsic World Model [57.440609834690385]
Existing approaches treat Large Language Models (LLMs) as rapid code generators within evolutionary loops.<n>We propose Search via Co-Evolving World Model and build K-Search based on this method.<n>We evaluate K-Search on diverse, complex kernels FlashInfer, including GQA, MLA, and MoE kernels.
arXiv Detail & Related papers (2026-02-22T11:06:22Z)
Towards Automated Kernel Generation in the Era of LLMs [17.69471168609145]
Kernel engineering is a time-consuming and non-scalable process.<n>Recent advances in large language models (LLMs) and agentic systems have opened new possibilities for automating kernel generation and optimization.<n>The field remains fragmented, lacking a systematic perspective for LLM-driven kernel generation.
arXiv Detail & Related papers (2026-01-22T07:53:52Z)
FlashInfer-Bench: Building the Virtuous Cycle for AI-driven LLM Systems [39.33711841865621]
FlashInfer-Bench is a framework that connects kernel generation, benchmarking, and deployment.<n>Built on real serving traces, FlashInfer-Bench includes a curated dataset, a robust correctness- and performance-aware benchmarking framework, and a public leaderboard.<n>Using FlashInfer-Bench, we evaluate the performance and limitations of LLM agents, compare the trade-offs among different GPU programming languages, and provide insights for future agent design.
arXiv Detail & Related papers (2026-01-01T06:18:53Z)
KernelBand: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit [15.810081332925584]
KernelBand is a novel framework that formulates kernel optimization as a hierarchical multi-armed bandit problem.<n>We show that KernelBand significantly outperforms state-of-the-art methods, achieving superior performance with fewer tokens while exhibiting consistent improvement without saturation as computational resources increase.
arXiv Detail & Related papers (2025-11-24T08:11:50Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the ''Three Taxes'' (Bulk Synchronous, Inter- Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework.<n>We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution.<n>We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
xLLM Technical Report [57.13120905321185]
We introduce xLLM, an intelligent and efficient Large Language Model (LLM) inference framework.<n>xLLM builds a novel decoupled service-engine architecture.<n>xLLM-Engine co-optimizes system and algorithm designs to fully saturate computing resources.
arXiv Detail & Related papers (2025-10-16T13:53:47Z)
Evolution of Kernels: Automated RISC-V Kernel Optimization with Large Language Models [26.985412258634256]
Large language models (LLMs) have shown promise for automated kernel optimization, demonstrating success in domains with comprehensive technical documents and mature scarcitys.<n>We present Evolution of Kernels (EoK), a novel LLM-based evolutionary program search framework that automates kernel design for domains with limited reference material.<n>EoK achieves a median 1.27x speedup, surpassing human experts on all 80 evaluated kernel design tasks.
arXiv Detail & Related papers (2025-09-14T08:11:06Z)
Astra: A Multi-Agent System for GPU Kernel Performance Optimization [10.715861478214961]
We introduce Astra, the first multi-agent system for GPU kernel optimization.<n>Within Astra, specialized agents collaborate through code generation, profiling, and planning to produce kernels that are both correct and high-performance.
arXiv Detail & Related papers (2025-09-09T08:39:50Z)
GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization [0.0]
This paper introduces an automated methodology for iteratively refining accelerator kernels.<n>Our methodology employs LLMs in a multi-stage, evolutionary process.<n>We detail how this approach navigates the challenges of the AMD MI300 target architecture.
arXiv Detail & Related papers (2025-06-25T19:59:34Z)
CUDA-LLM: LLMs Can Write Efficient CUDA Kernels [9.287036563375617]
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation.<n>We propose a novel framework called textbfFeature SearchReinforcement (FSR) FSR jointly optimize compilation and functional correctness.
arXiv Detail & Related papers (2025-06-10T10:51:03Z)
Training of Scaffolded Language Models with Language Supervision: A Survey [62.59629932720519]
This survey organizes the literature on the design and optimization of emerging structures around post-trained LMs.<n>We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools.
arXiv Detail & Related papers (2024-10-21T18:06:25Z)
LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning [69.95292905263393]
We show that gradient-based and high-level LLMs can effectively collaborate a combined optimization framework.<n>In this paper, we show that these complementary to each other and can effectively collaborate a combined optimization framework.
arXiv Detail & Related papers (2024-05-30T06:24:14Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose the kernel development in two steps: 1) Expressing the computational core using Processing Primitives (TPPs) and 2) Expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels. We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)

This list is automatically generated from the titles and abstracts of the papers in this site.