OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization
- URL: http://arxiv.org/abs/2602.12305v1
- Date: Thu, 12 Feb 2026 04:50:19 GMT
- Title: OptiML: An End-to-End Framework for Program Synthesis and CUDA Kernel Optimization
- Authors: Arijit Bhattacharjee, Heng Ping, Son Vu Le, Paul Bogdan, Nesreen K. Ahmed, Ali Jannesari
- Abstract summary: We present OptiML, an end-to-end framework that maps either natural-language intent or input CUDA code to performance-optimized CUDA kernels. A search-based optimizer (OptiML-X) then refines either synthesized or user-provided kernels using Monte Carlo Tree Search over LLM-driven edits, guided by a hardware-aware reward derived from profiler feedback.
- Score: 21.882017397032964
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Generating high-performance CUDA kernels remains challenging due to the need to navigate a combinatorial space of low-level transformations under noisy and expensive hardware feedback. Although large language models can synthesize functionally correct CUDA code, achieving competitive performance requires systematic exploration and verification of optimization choices. We present OptiML, an end-to-end framework that maps either natural-language intent or input CUDA code to performance-optimized CUDA kernels by formulating kernel optimization as search under verification. OptiML consists of two decoupled stages. When the input is natural language, a Mixture-of-Thoughts generator (OptiML-G) acts as a proposal policy over kernel implementation strategies, producing an initial executable program. A search-based optimizer (OptiML-X) then refines either synthesized or user-provided kernels using Monte Carlo Tree Search over LLM-driven edits, guided by a hardware-aware reward derived from profiler feedback. Each candidate transformation is compiled, verified, and profiled with Nsight Compute, and evaluated by a composite objective that combines runtime with hardware bottleneck proxies and guardrails against regressions. We evaluate OptiML in both synthesis-and-optimize and optimization-only settings on a diverse suite of CUDA kernels. Results show that OptiML consistently discovers verified performance improvements over strong LLM baselines and produces interpretable optimization trajectories grounded in profiler evidence.
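The optimization stage described in the abstract (tree search over candidate edits, with each candidate verified, profiled, and scored by a composite objective) can be illustrated with a small toy sketch. This is not the authors' implementation: the edit names, the `verify` and `profile` stubs (which stand in for real compilation, correctness checking, and Nsight Compute profiling), the reward weights, and the UCT constant are all hypothetical choices made for illustration.

```python
import math
import random
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical edit vocabulary; in OptiML the edits are proposed by an LLM.
EDITS = ("tile_shared_mem", "unroll_inner_loop", "vectorize_loads", "fuse_epilogue")

def verify(edits):
    """Stub correctness check: pretend one edit combination is invalid."""
    return not ("fuse_epilogue" in edits and "vectorize_loads" in edits)

def profile(edits):
    """Stub profiler: returns (runtime_ms, memory_stall_fraction)."""
    runtime, stall = 10.0, 0.60
    if "tile_shared_mem" in edits:
        runtime *= 0.6
        stall -= 0.25
    if "unroll_inner_loop" in edits:
        runtime *= 0.9
    if "vectorize_loads" in edits:
        runtime *= 0.8
        stall -= 0.10
    if "fuse_epilogue" in edits:
        runtime *= 1.1  # a slowdown that the guardrail may penalise
    return runtime, max(stall, 0.0)

BASE_RUNTIME, _ = profile(())

def reward(edits, w_stall=0.3, regress_penalty=1.0):
    """Assumed composite objective: speedup plus a bottleneck proxy,
    with guardrails for failed verification and runtime regressions."""
    if not verify(edits):
        return 0.0  # unverified candidates earn nothing
    runtime, stall = profile(edits)
    score = BASE_RUNTIME / runtime + w_stall * (1.0 - stall)
    if runtime > BASE_RUNTIME:
        score -= regress_penalty
    return max(score, 0.0)

@dataclass
class Node:
    edits: tuple
    parent: Optional["Node"] = None
    children: dict = field(default_factory=dict)
    visits: int = 0
    total: float = 0.0

    def untried(self):
        return [e for e in EDITS if e not in self.edits and e not in self.children]

def uct_child(node, c=1.4):
    """Pick the child maximising mean reward plus an exploration bonus."""
    return max(
        node.children.values(),
        key=lambda ch: ch.total / ch.visits
        + c * math.sqrt(math.log(node.visits) / ch.visits),
    )

def mcts(iterations=200, seed=0):
    rng = random.Random(seed)
    root = Node(edits=())
    best_edits, best_reward = (), reward(())
    for _ in range(iterations):
        node = root
        # Selection: descend through fully expanded nodes.
        while not node.untried() and node.children:
            node = uct_child(node)
        # Expansion: apply one untried edit, if any remain.
        options = node.untried()
        if options:
            edit = rng.choice(options)
            child = Node(edits=node.edits + (edit,), parent=node)
            node.children[edit] = child
            node = child
        # Evaluation: verify and profile the candidate.
        r = reward(node.edits)
        if r > best_reward:
            best_edits, best_reward = node.edits, r
        # Backpropagation: update statistics up to the root.
        while node is not None:
            node.visits += 1
            node.total += r
            node = node.parent
    return best_edits, best_reward

if __name__ == "__main__":
    edits, score = mcts()
    print("best edit sequence:", edits, "reward:", round(score, 3))
```

In a real pipeline the stubs would be replaced by actual compilation, a functional test against a reference output, and profiler measurements, and the reward would need to account for measurement noise, for example by averaging repeated runs.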
Related papers
- KernelBlaster: Continual Cross-Task CUDA Optimization via Memory-Augmented In-Context Reinforcement Learning [3.4998382481249286]
We release KernelBlaster as an open-source agentic framework, accompanied by a test harness, verification components, and a reproducible evaluation. Our method achieves mean speed-ups of 1.43x, 2.50x, and 1.50x on KernelBench Levels 1, 2, and 3, respectively.
arXiv Detail & Related papers (2026-02-15T19:48:43Z)
- LOOPRAG: Enhancing Loop Transformation Optimization with Retrieval-Augmented Large Language Models [23.6344001089164]
LOOPRAG is a retrieval-augmented generation framework designed to guide Large Language Models (LLMs) in performing effective loop optimization. We introduce a parameter-driven method to harness loop properties, which trigger various loop transformations, and generate diverse yet legal example codes. To enhance correct and efficient code generation, we introduce a feedback-based iterative mechanism that incorporates compilation, testing and performance results.
arXiv Detail & Related papers (2025-12-12T11:09:48Z)
- STARK: Strategic Team of Agents for Refining Kernels [23.717055490630596]
We introduce an agentic framework for GPU kernel optimization that explores the design space through multi-agent collaboration. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents.
arXiv Detail & Related papers (2025-10-19T20:41:46Z)
- Combining Large Language Models and Gradient-Free Optimization for Automatic Control Policy Synthesis [2.8593976574111264]
Large language models (LLMs) have shown promise as generators of symbolic control policies. We propose a hybrid approach that decouples structural synthesis from parameter optimization. We show that combining symbolic program synthesis with numerical optimization yields interpretable yet high-performing policies.
arXiv Detail & Related papers (2025-10-01T00:42:15Z)
- REASONING COMPILER: LLM-Guided Optimizations for Efficient Model Serving [6.19179006129561]
We introduce a novel compilation framework (dubbed Reasoning Compiler) that formulates optimization as a sequential, context-aware decision process. Our approach demonstrates the potential of LLM-guided reasoning to transform the landscape of compiler optimization.
arXiv Detail & Related papers (2025-06-02T07:02:46Z)
- Make Optimization Once and for All with Fine-grained Guidance [78.14885351827232]
Learning to Optimize (L2O) enhances optimization efficiency with integrated neural networks. L2O paradigms achieve great outcomes, e.g., refitting, generating unseen solutions iteratively or directly. Our analyses explore a general framework for learning optimization, called Diff-L2O, focusing on augmenting solutions from a wider view.
arXiv Detail & Related papers (2025-03-14T14:48:12Z)
- Scaffolded Language Models with Language Supervision for Mixed-Autonomy: A Survey [52.00674453604779]
This survey organizes the literature on the design and optimization of emerging structures around post-trained LMs. We refer to this overarching structure as scaffolded LMs and focus on LMs that are integrated into multi-step processes with tools.
arXiv Detail & Related papers (2024-10-21T18:06:25Z)
- OptiBench Meets ReSocratic: Measure and Improve LLMs for Optimization Modeling [62.19438812624467]
Large language models (LLMs) have exhibited their problem-solving abilities in mathematical reasoning. We propose OptiBench, a benchmark for end-to-end optimization problem-solving with human-readable inputs and outputs.
arXiv Detail & Related papers (2024-07-13T13:27:57Z)
- LLM as a Complementary Optimizer to Gradient Descent: A Case Study in Prompt Tuning [69.95292905263393]
We show that gradient-based optimization and high-level LLM-based optimization are complementary to each other and can effectively collaborate in a combined optimization framework.
arXiv Detail & Related papers (2024-05-30T06:24:14Z)
- Localized Zeroth-Order Prompt Optimization [54.964765668688806]
We propose a novel algorithm, namely localized zeroth-order prompt optimization (ZOPO).
ZOPO incorporates a Gaussian process derived from the Neural Tangent Kernel into standard zeroth-order optimization for an efficient search of well-performing local optima in prompt optimization.
Remarkably, ZOPO outperforms existing baselines in terms of both the optimization performance and the query efficiency.
arXiv Detail & Related papers (2024-03-05T14:18:15Z)
- An Empirical Evaluation of Zeroth-Order Optimization Methods on AI-driven Molecule Optimization [78.36413169647408]
We study the effectiveness of various ZO optimization methods for optimizing molecular objectives.
We show the advantages of ZO sign-based gradient descent (ZO-signGD).
We demonstrate the potential effectiveness of ZO optimization methods on widely used benchmark tasks from the Guacamol suite.
arXiv Detail & Related papers (2022-10-27T01:58:10Z)
- Learning to Superoptimize Real-world Programs [79.4140991035247]
We propose a framework to learn to superoptimize real-world programs by using neural sequence-to-sequence models.
We introduce the Big Assembly benchmark, a dataset consisting of over 25K real-world functions mined from open-source projects in x86-64 assembly.
arXiv Detail & Related papers (2021-09-28T05:33:21Z)
This list is automatically generated from the titles and abstracts of the papers in this site.