Related papers: Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks

URL: http://arxiv.org/abs/2507.23194v1
Date: Thu, 31 Jul 2025 02:26:58 GMT
Title: Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks
Authors: Jianghui Wang, Vinay Joshi, Saptarshi Majumder, Xu Chao, Bin Ding, Ziqiong Liu, Pratik Prabhanjan Brahma, Dong Li, Zicheng Liu, Emad Barsoum
Abstract summary: It is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs. We present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels).
Score: 11.253534066141668
License: http://creativecommons.org/licenses/by/4.0/
Abstract: The demand for AI-generated GPU kernels is rapidly growing, influenced by the need for scalable, hardware-optimized solutions in both industry and academia. As deep learning workloads grow in complexity and diversity, it is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs, aiming to reduce manual optimization efforts while achieving near-expert performance on hardware like AMD MI300X. The Triton language, a Python-based DSL for GPU programming, has emerged as a popular target for such AI-generated kernels due to its balance of performance and ease of coding. In this work, we present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels), a framework that leverages cutting-edge LLMs to generate performant Triton code specifically for AMD GPUs, including the AMD MI300X and MI250. GEAK leverages inference-time compute scaling to produce Triton-based GPU kernels using a reasoning loop adapted from Reflexion-style feedback mechanisms. On two evaluation benchmarks, GEAK significantly outperformed the baselines of directly prompting frontier LLMs as well as Reflexion-based generation pipelines, achieving correctness of up to 63% and execution speedups of up to 2.59x. These results highlight the promise of GEAK-like agentic code generation for accelerating the adoption of diverse hardware platforms and democratizing access to expert-level kernel performance.
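
The abstract describes a reasoning loop adapted from Reflexion-style feedback but does not spell out its stages. As a rough illustration only, a generic generate, evaluate, and retry loop might look like the Python sketch below. Everything here is an assumption made for illustration: `reflexion_loop`, `Attempt`, `toy_generate`, and `toy_evaluate` are hypothetical names, the generator is a stand-in for an LLM call, and the evaluator is a stand-in for compiling and testing a Triton kernel. This is not GEAK's actual pipeline.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Attempt:
    code: str       # candidate produced in this iteration
    passed: bool    # whether the evaluator accepted it
    feedback: str   # evaluator message for this candidate


def reflexion_loop(
    generate: Callable[[str, List[str]], str],
    evaluate: Callable[[str], Tuple[bool, str]],
    task: str,
    max_iters: int = 4,
) -> Tuple[Optional[str], List[Attempt]]:
    """Hypothetical helper: retry generation, feeding failures back as context."""
    reflections: List[str] = []
    history: List[Attempt] = []
    for _ in range(max_iters):
        code = generate(task, reflections)
        passed, feedback = evaluate(code)
        history.append(Attempt(code, passed, feedback))
        if passed:
            return code, history
        # Keep the failure message so the next generation can react to it.
        reflections.append(feedback)
    return None, history


def toy_generate(task: str, reflections: List[str]) -> str:
    # Stand-in for an LLM call: "repairs" its answer after seeing any feedback.
    return "x + y" if reflections else "x - y"


def toy_evaluate(code: str) -> Tuple[bool, str]:
    # Stand-in for compiling and unit-testing a kernel.
    result = eval(code, {"x": 2, "y": 3})
    if result == 5:
        return True, "ok"
    return False, f"expected 5, got {result}"


best, history = reflexion_loop(toy_generate, toy_evaluate, "add two numbers")
print(best, len(history))
```

In this toy run the first candidate fails, its error message is appended to the reflections, and the second candidate passes. Spending more iterations per task is one simple form of inference-time compute scaling; a real system would also need a performance measurement step, which this sketch omits.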

Related papers

AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs [87.8306870967343]
We introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton performs supervised fine-tuning (SFT) to be equipped with essential Triton programming expertise using a high-quality data gathering pipeline. Experiments across five evaluation channels of TritonBench and KernelBench illustrate that our 8B model AutoTriton achieves performance comparable to mainstream large models.
arXiv Detail & Related papers (2025-07-08T05:38:24Z)
Omniwise: Predicting GPU Kernels Performance with LLMs [0.06666419797034795]
We introduce Omniwise, the first end-to-end, self-supervised fine-tuning pipeline that applies large language models (LLMs) to GPU kernel performance prediction. It can predict key performance metrics, including memory bandwidth, cache hit rates, GFLOPs, and arithmetic intensity, directly from kernel code without the need for code execution or profiling tools. Our approach achieves over 90% of predictions within 10% relative error on GPU kernels executed on AMD MI250 and MI300X architectures.
arXiv Detail & Related papers (2025-06-25T23:36:44Z)
GPU Kernel Scientist: An LLM-Driven Framework for Iterative Kernel Optimization [0.18416014644193066]
"GPU Kernel Scientist" is an automated methodology for iteratively refining accelerator kernels. Our methodology employs LLMs in a multi-stage, evolutionary process. We detail how this approach navigates the challenges of the AMD MI300 target architecture.
arXiv Detail & Related papers (2025-06-25T19:59:34Z)
CUDA-LLM: LLMs Can Write Efficient CUDA Kernels [9.287036563375617]
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. We propose a novel framework called Feature Search and Reinforcement (FSR). FSR jointly optimizes compilation and functional correctness.
arXiv Detail & Related papers (2025-06-10T10:51:03Z)
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators [59.625889531331815]
Triton is a high-level Python-like language designed for building efficient GPU kernels. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation.
arXiv Detail & Related papers (2025-02-20T17:21:27Z)
Liger Kernel: Efficient Triton Kernels for LLM Training [6.373771349397682]
Training Large Language Models (LLMs) efficiently at scale presents a formidable challenge, driven by their ever-increasing computational demands. We introduce Liger-Kernel, an open-sourced set of Triton kernels developed specifically for LLM training. With kernel optimization techniques like kernel operation fusing and input chunking, our kernels achieve on average a 20% increase in training throughput and a 60% reduction in GPU memory usage.
arXiv Detail & Related papers (2024-10-14T18:17:01Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast, untapped consumer-level GPUs. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
Towards making the most of NLP-based device mapping optimization for OpenCL kernels [5.6596607119831575]
We extend the work of Cummins et al., namely Deeptune, which tackles the problem of optimal device selection (CPU or GPU) for accelerated OpenCL kernels. We propose four different models that provide enhanced contextual information about source code. Experimental results show that our proposed methodology surpasses that of Cummins et al., providing up to 4% improvement in prediction accuracy.
arXiv Detail & Related papers (2022-08-30T10:20:55Z)
FPGA-optimized Hardware acceleration for Spiking Neural Networks [69.49429223251178]
This work presents the development of a hardware accelerator for an SNN, with off-line training, applied to an image recognition task. The design targets a Xilinx Artix-7 FPGA, using around 40% of the available hardware resources in total. It reduces the classification time by three orders of magnitude, with a small 4.5% impact on accuracy compared to its full-precision software counterpart.
arXiv Detail & Related papers (2022-01-18T13:59:22Z)
PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution for the development of deep learning kernels. We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)

This list is automatically generated from the titles and abstracts of the papers on this site.