Fine-Tuning GPT-5 for GPU Kernel Generation
- URL: http://arxiv.org/abs/2602.11000v1
- Date: Wed, 11 Feb 2026 16:22:54 GMT
- Title: Fine-Tuning GPT-5 for GPU Kernel Generation
- Authors: Ali Tehrani, Yahya Emara, Essam Wissam, Wojciech Paluch, Waleed Atallah, Łukasz Dudziak, Mohamed S. Abdelfattah
- Abstract summary: We present Makora's environment and tools for reinforcement learning fine-tuning of frontier models. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0%. When integrated into a full coding agent, it solves up to 97.4% of problems in an expanded KernelBench suite.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Developing efficient GPU kernels is essential for scaling modern AI systems, yet it remains a complex task due to intricate hardware architectures and the need for specialized optimization expertise. Although Large Language Models (LLMs) demonstrate strong capabilities in general sequential code generation, they face significant challenges in GPU code generation because of the scarcity of high-quality labeled training data, compiler biases when generating synthetic solutions, and limited generalization across hardware generations. This precludes supervised fine-tuning (SFT) as a scalable methodology for improving current LLMs. In contrast, reinforcement learning (RL) offers a data-efficient and adaptive alternative but requires access to relevant tools, careful selection of training problems, and a robust evaluation environment. We present Makora's environment and tools for reinforcement learning finetuning of frontier models and report our results from fine-tuning GPT-5 for Triton code generation. In the single-attempt setting, our fine-tuned model improves kernel correctness from 43.7% to 77.0% (+33.3 percentage points) and increases the fraction of problems outperforming TorchInductor from 14.8% to 21.8% (+7 percentage points) compared to baseline GPT-5, while exceeding prior state-of-the-art models on KernelBench. When integrated into a full coding agent, it is able to solve up to 97.4% of problems in an expanded KernelBench suite, outperforming the PyTorch TorchInductor compiler on 72.9% of problems with a geometric mean speedup of 2.12x. Our work demonstrates that targeted post-training with reinforcement learning can unlock LLM capabilities in highly specialized technical domains where traditional supervised learning is limited by data availability, opening new pathways for AI-assisted accelerator programming.
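The abstract's headline "geometric mean speedup of 2.12x" aggregates per-problem speedups over TorchInductor multiplicatively rather than arithmetically, which keeps a single outlier kernel from dominating the average. A minimal sketch of that aggregation (the per-problem speedups below are made up for illustration, not taken from the paper):

```python
import math

def geometric_mean_speedup(speedups):
    """Geometric mean of per-problem speedups vs. a baseline compiler.

    Computed as exp(mean(log(s))), which equals (s1 * s2 * ... * sn) ** (1/n).
    """
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

# Hypothetical per-problem speedups over TorchInductor (illustrative only).
speedups = [0.9, 1.5, 2.0, 4.0]
print(round(geometric_mean_speedup(speedups), 3))  # ≈ 1.813, i.e. (10.8)**0.25
```

Note the slowdown (0.9x) pulls the geometric mean down symmetrically with an equivalent speedup, which an arithmetic mean would not do.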
Related papers
- CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components. CUDA Agent delivers 100%, 100%, and 92% faster rates over torch.compile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z)
- Dr. Kernel: Reinforcement Learning Done Right for Triton Kernel Generations [32.98036846113632]
We study reinforcement learning (RL) for kernel generation. We propose Turn-level Reinforce-Leave-One-Out (TRLOO) to provide unbiased advantage estimation. We introduce Profiling-based Rewards (PR) and Profiling-based Rejection Sampling (PRS) to overcome the issue.
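Leave-one-out advantage estimation of the kind TRLOO builds on uses, as the baseline for each sampled kernel, the mean reward of the other samples in the same group, which keeps the estimator unbiased without a learned value function. A minimal sketch of the group-level computation (the rewards below are hypothetical; the turn-level extension in the paper is not shown):

```python
def rloo_advantages(rewards):
    """Reinforce-Leave-One-Out advantages for one group of sampled rollouts.

    The baseline for sample i is the mean reward of all *other* samples,
    so each advantage is r_i minus a baseline that excludes r_i itself.
    """
    n = len(rewards)
    total = sum(rewards)
    return [r - (total - r) / (n - 1) for r in rewards]

# Hypothetical correctness/speed rewards for 4 sampled kernels.
print(rloo_advantages([1.0, 0.0, 0.5, 0.5]))
```

By construction the advantages in a group sum to zero, so samples are reinforced only relative to their peers.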
arXiv Detail & Related papers (2026-02-05T17:01:09Z)
- AscendKernelGen: A Systematic Study of LLM-Based Kernel Generation for Neural Processing Units [39.846358001824996]
We propose AscendKernelGen, a generation-evaluation integrated framework for NPU kernel development. We introduce Ascend-CoT, a high-quality dataset incorporating chain-of-thought reasoning derived from real-world kernel implementations. We also design NPU KernelBench, a comprehensive benchmark for assessing compilation, correctness, and performance across varying complexity levels.
arXiv Detail & Related papers (2026-01-12T03:12:58Z)
- QiMeng-NeuComBack: Self-Evolving Translation from IR to Assembly Code [52.66657751895655]
Large Language Models (LLMs) offer a compelling new paradigm: Neural Compilation. This paper introduces NeuComBack, a novel benchmark dataset specifically designed for IR-to-assembly compilation. We propose a self-evolving prompt optimization method that enables LLMs to evolve their internal prompt strategies.
arXiv Detail & Related papers (2025-11-03T03:20:26Z)
- QueST: Incentivizing LLMs to Generate Difficult Problems [77.75835742350644]
Large Language Models have achieved strong performance on reasoning tasks, solving competition-level coding and math problems. Existing competitive coding datasets contain only thousands to tens of thousands of problems. We propose QueST, a novel framework which combines difficulty-aware graph sampling and difficulty-aware rejection fine-tuning.
arXiv Detail & Related papers (2025-10-20T16:29:53Z)
- ConCuR: Conciseness Makes State-of-the-Art Kernel Generation [5.010229074860956]
A key challenge for kernel generation is the scarcity of high-quality data. We develop a pipeline that generates and curates high-quality kernels with reasoning traces. We show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks.
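The reasoning-length difficulty metric mentioned above amounts to averaging the length of the reasoning traces produced for a task. A toy sketch, using whitespace token counts as a stand-in for a real tokenizer (the traces below are invented for illustration):

```python
def avg_reasoning_length(traces):
    """Mean token count of reasoning traces; longer traces suggest a
    harder kernel generation task. Whitespace split approximates tokens."""
    return sum(len(t.split()) for t in traces) / len(traces)

# Hypothetical reasoning traces for one kernel task (illustrative only).
traces = ["tile the reduction then vectorize loads", "use one warp per row"]
print(avg_reasoning_length(traces))
```

In practice one would use the model's own tokenizer rather than whitespace splitting.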
arXiv Detail & Related papers (2025-10-08T15:41:15Z)
- LIFT: LLM-Based Pragma Insertion for HLS via GNN Supervised Fine-Tuning [38.679497621876926]
LIFT is a large language model (LLM)-based coding assistant for HLS that automatically generates performance-critical pragmas. We fine-tune the LLM by tightly integrating and supervising the training process with a graph neural network (GNN).
arXiv Detail & Related papers (2025-04-29T21:42:59Z)
- FusionLLM: A Decentralized LLM Training System on Geo-distributed GPUs with Adaptive Compression [55.992528247880685]
Decentralized training faces significant challenges regarding system design and efficiency.
We present FusionLLM, a decentralized training system designed and implemented for training large deep neural networks (DNNs).
We show that our system and method achieve a 1.45-9.39x speedup over baseline methods while ensuring convergence.
arXiv Detail & Related papers (2024-10-16T16:13:19Z)
- Building Math Agents with Multi-Turn Iterative Preference Learning [56.71330214021884]
This paper studies the complementary direct preference learning approach to further improve model performance. Existing direct preference learning algorithms are originally designed for the single-turn chat task. We introduce a multi-turn direct preference learning framework, tailored for this context.
arXiv Detail & Related papers (2024-09-04T02:41:04Z)
- OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling [62.19438812624467]
Large language models (LLMs) have exhibited their problem-solving abilities in mathematical reasoning. We propose OptiBench, a benchmark for end-to-end optimization problem-solving with human-readable inputs and outputs.
arXiv Detail & Related papers (2024-07-13T13:27:57Z)
- Learning Performance-Improving Code Edits [107.21538852090208]
We introduce a framework for adapting large language models (LLMs) to high-level program optimization.
First, we curate a dataset of over 77,000 competitive C++ programming submission pairs with performance-improving edits made by human programmers.
For prompting, we propose retrieval-based few-shot prompting and chain-of-thought; for fine-tuning, performance-conditioned generation and synthetic data augmentation based on self-play.
arXiv Detail & Related papers (2023-02-15T18:59:21Z)