DICE: Diffusion Large Language Models Excel at Generating CUDA Kernels
- URL: http://arxiv.org/abs/2602.11715v1
- Date: Thu, 12 Feb 2026 08:45:13 GMT
- Authors: Haolei Bai, Lingcheng Kong, Xueyi Chen, Jianmian Wang, Zhiqiang Tao, Huan Wang
- Abstract summary: Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs. CuKe is an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. DICE is a series of diffusion large language models designed for CUDA kernel generation.
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Diffusion large language models (dLLMs) have emerged as a compelling alternative to autoregressive (AR) LLMs, owing to their capacity for parallel token generation. This paradigm is particularly well-suited for code generation, where holistic structural planning and non-sequential refinement are critical. Despite this potential, tailoring dLLMs for CUDA kernel generation remains challenging, obstructed not only by the high specialization but also by the severe lack of high-quality training data. To address these challenges, we construct CuKe, an augmented supervised fine-tuning dataset optimized for high-performance CUDA kernels. On top of it, we propose a bi-phase curated reinforcement learning (BiC-RL) framework consisting of a CUDA kernel infilling stage and an end-to-end CUDA kernel generation stage. Leveraging this training framework, we introduce DICE, a series of diffusion large language models designed for CUDA kernel generation, spanning three parameter scales, 1.7B, 4B, and 8B. Extensive experiments on KernelBench demonstrate that DICE significantly outperforms both autoregressive and diffusion LLMs of comparable scale, establishing a new state-of-the-art for CUDA kernel generation.
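The abstract's claim that dLLMs generate tokens in parallel, rather than strictly left-to-right, can be illustrated with a toy "unmasking" decoding loop. This is a minimal sketch under stated assumptions, not the DICE model: the confidence scores are simulated with random numbers, and the "prediction" simply copies a target sequence, where a real dLLM would fill masked positions from per-token probabilities computed over the whole sequence.

```python
import random

MASK = "<mask>"

def toy_denoise_step(seq, target, k):
    """Reveal up to k masked positions per step (simulated confidence)."""
    masked = [i for i, t in enumerate(seq) if t == MASK]
    # Stand-in for model confidence: random scores. A real dLLM would rank
    # positions by predicted token probability from a full forward pass.
    chosen = sorted(masked, key=lambda i: random.random())[:k]
    for i in chosen:
        seq[i] = target[i]  # a real model predicts; here we copy the target
    return seq

def diffusion_decode(target, tokens_per_step=2, seed=0):
    """Start fully masked; iteratively unmask several tokens at once."""
    random.seed(seed)
    seq = [MASK] * len(target)
    steps = 0
    while MASK in seq:
        seq = toy_denoise_step(seq, target, tokens_per_step)
        steps += 1
    return seq, steps

# 7 tokens revealed 2 at a time finish in 4 steps, versus 7 for strictly
# sequential AR decoding -- the parallelism the abstract refers to.
out, steps = diffusion_decode(["__global__", "void", "add", "(", ")", "{", "}"])
```

Note that non-sequential refinement falls out naturally here: any masked position may be filled at any step, which is why the paradigm suits holistic code structure rather than left-to-right emission.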
Related papers
- CUDA Agent: Large-Scale Agentic RL for High-Performance CUDA Kernel Generation [51.72529978689561]
CUDA Agent is a large-scale agentic reinforcement learning system that develops kernel expertise through three components. It delivers 100%, 100%, and 92% faster rates over torch.compile on KernelBench.
arXiv Detail & Related papers (2026-02-27T18:58:05Z) - KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning [3.4998382481249286]
We release KernelBlaster as an open-source agentic framework, accompanied by a test harness, verification components, and a reproducible evaluation. Our method achieves mean speed-ups of 1.43x, 2.50x, and 1.50x on KernelBench Levels 1, 2, and 3, respectively.
arXiv Detail & Related papers (2026-02-15T19:48:43Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - ConCuR: Conciseness Makes State-of-the-Art Kernel Generation [5.010229074860956]
A key challenge for kernel generation is the scarcity of high-quality data. We develop a pipeline that generates and curates high-quality kernels with reasoning traces. We show that the average reasoning length can serve as a metric to assess the difficulty of kernel generation tasks.
arXiv Detail & Related papers (2025-10-08T15:41:15Z) - EvoEngineer: Mastering Automated CUDA Kernel Code Evolution with Large Language Models [27.430839306140157]
Large Language Models (LLMs) show promise for automating kernel optimization. General-purpose LLM code-evolution methods cannot meet the strict correctness requirements of kernel optimization. EvoEngineer provides guidance for designing and adapting optimization strategies to achieve a balance between performance and correctness. Our method achieves a maximum speedup of 36.75x over PyTorch kernels among all operations and delivers the highest speedup on 28 (56.0%) of the 50 operations that achieve over 2x acceleration.
arXiv Detail & Related papers (2025-10-04T10:00:25Z) - Towards Robust Agentic CUDA Kernel Benchmarking, Verification, and Optimization [25.135006275638172]
We introduce robust-kbench, a new benchmark for rigorous evaluation of kernel performance and correctness across varied scenarios. We also present a comprehensive agentic framework that translates torch code to CUDA kernels and iteratively improves their runtime. Our approach produces kernels that outperform torch implementations in practical applications, including forward and backward passes.
arXiv Detail & Related papers (2025-09-16T11:08:30Z) - HPCTransCompile: An AI Compiler Generated Dataset for High-Performance CUDA Transpilation and LLM Preliminary Exploration [13.53425131505526]
Deep learning has driven exponential increases in model parameters and computational demands. NVIDIA GPUs and their CUDA-based software ecosystem provide robust support for parallel computing, and this ecosystem has established a dominant position in the field of parallel software. However, translating CUDA code to other platforms poses significant challenges due to differences in parallel programming paradigms and hardware.
arXiv Detail & Related papers (2025-06-12T06:48:33Z) - CUDA-LLM: LLMs Can Write Efficient CUDA Kernels [9.287036563375617]
Large Language Models (LLMs) have demonstrated strong capabilities in general-purpose code generation. We propose a novel framework called Feature Search and Reinforcement (FSR), which jointly optimizes compilation and functional correctness.
arXiv Detail & Related papers (2025-06-10T10:51:03Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Compacting Binary Neural Networks by Sparse Kernel Selection [58.84313343190488]
This paper is motivated by a previously revealed phenomenon that the binary kernels in successful BNNs are nearly power-law distributed.
We develop the Permutation Straight-Through Estimator (PSTE) that is able to not only optimize the selection process end-to-end but also maintain the non-repetitive occupancy of selected codewords.
Experiments verify that our method reduces both the model size and bit-wise computational costs, and achieves accuracy improvements compared with state-of-the-art BNNs under comparable budgets.
arXiv Detail & Related papers (2023-03-25T13:53:02Z) - LKD-Net: Large Kernel Convolution Network for Single Image Dehazing [70.46392287128307]
We propose a novel Large Kernel Convolution Dehaze Block (LKD Block) consisting of the Decomposition depth-wise Large Kernel Convolution Block (DLKCB) and the Channel Enhanced Feed-forward Network (CEFN). The designed DLKCB can split a depth-wise large kernel convolution into a smaller depth-wise convolution and a depth-wise dilated convolution without introducing massive parameters and computational overhead.
Our LKD-Net dramatically outperforms the Transformer-based method Dehamer with only 1.79% #Param and 48.9% FLOPs.
arXiv Detail & Related papers (2022-09-05T06:56:48Z) - PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution to the development of deep learning kernels.
We use the advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
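Several entries above report per-benchmark speedup figures (e.g. "mean speed-ups of 1.43x ... on KernelBench Levels 1, 2, and 3"). The exact metric varies by paper; the sketch below assumes per-problem speedups defined as baseline time divided by generated-kernel time, and shows two common aggregates plus a KernelBench-style "fraction faster than threshold" metric. The sample timings are hypothetical, for illustration only.

```python
import math

def arithmetic_mean_speedup(speedups):
    """Plain average of per-problem speedups."""
    return sum(speedups) / len(speedups)

def geometric_mean_speedup(speedups):
    """Geometric mean; less sensitive to a single outlier kernel."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))

def fast_p(speedups, p=1.0):
    """Fraction of problems whose speedup exceeds p (KernelBench-style)."""
    return sum(s > p for s in speedups) / len(speedups)

# Hypothetical per-problem timings in milliseconds (not from any paper above).
times_baseline = [2.0, 4.0, 1.0]   # e.g. torch reference implementations
times_generated = [1.0, 4.0, 0.5]  # e.g. LLM-generated kernels
speedups = [b / g for b, g in zip(times_baseline, times_generated)]
```

One design note: arithmetic means of ratios inflate easily (one 30x kernel dominates), which is why threshold metrics such as fast_p are often reported alongside mean speedups.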
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the information listed here and is not responsible for any consequences.