AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs
- URL: http://arxiv.org/abs/2507.05687v1
- Date: Tue, 08 Jul 2025 05:38:24 GMT
- Title: AutoTriton: Automatic Triton Programming with Reinforcement Learning in LLMs
- Authors: Shangzhan Li, Zefan Wang, Ye He, Yuxuan Li, Qi Shi, Jianling Li, Yonggang Hu, Wanxiang Che, Xu Han, Zhiyuan Liu, Maosong Sun
- Abstract summary: We introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton performs supervised fine-tuning (SFT) to acquire essential Triton programming expertise using a high-quality data-gathering pipeline. Experiments across five evaluation channels of TritonBench and KernelBench show that our 8B model AutoTriton achieves performance comparable to mainstream large models.
- Score: 87.8306870967343
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Kernel development in deep learning requires optimizing computational units across hardware while balancing memory management, parallelism, and hardware-specific optimizations through extensive empirical tuning. Although domain-specific languages like Triton simplify GPU programming by abstracting low-level details, developers must still manually tune critical parameters such as tile sizes and memory access patterns through iterative experimentation, creating substantial barriers to optimal performance and wider adoption. In this work, we introduce AutoTriton, the first model dedicated to Triton programming powered by reinforcement learning (RL). AutoTriton first performs supervised fine-tuning (SFT) to acquire essential Triton programming expertise using a high-quality data-gathering pipeline, and then conducts RL with the Group Relative Policy Optimization (GRPO) algorithm, combining a rule-based reward and an execution-based reward to further improve its Triton programming ability. Experiments across five evaluation channels of TritonBench and KernelBench show that our 8B model AutoTriton achieves performance comparable to mainstream large models, including Claude-4-Sonnet and DeepSeek-R1-0528. Further experimental analysis demonstrates the crucial role of each module within AutoTriton, including the SFT stage, the RL stage, and the reward design strategy. These findings underscore the promise of RL for automatically generating high-performance kernels, and since such kernels are core components of AI systems, this breakthrough establishes an important foundation for building more efficient AI systems. The model and code will be available at https://github.com/AI9Stars/AutoTriton.
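The abstract names the two reward signals but does not give their form. Below is a minimal, hypothetical sketch of how a rule-based reward (a format check on the generated code) and an execution-based reward (compile, run, and compare against a reference) could be combined and converted into GRPO's group-relative advantages. The function names, the `run_and_check` harness, and the 0.2/0.8 weighting are illustrative assumptions, not the paper's implementation.

```python
import torch

def rule_based_reward(code: str) -> float:
    # Hypothetical format check: 1.0 if the completion looks like a
    # Triton kernel (a @triton.jit function using tl.* ops), else 0.0.
    return 1.0 if "@triton.jit" in code and "tl." in code else 0.0

def execution_based_reward(code: str, run_and_check) -> float:
    # Hypothetical harness: run_and_check compiles the candidate kernel,
    # executes it, and returns True if its output matches a reference.
    try:
        return 1.0 if run_and_check(code) else 0.0
    except Exception:
        return 0.0  # compilation or runtime failure earns no reward

def combined_rewards(codes, run_and_check, w_rule=0.2, w_exec=0.8):
    # Weighted combination of the two signals; the 0.2/0.8 split is an
    # assumption for illustration, not a value from the paper.
    return torch.tensor([
        w_rule * rule_based_reward(c)
        + w_exec * execution_based_reward(c, run_and_check)
        for c in codes
    ])

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # GRPO's group-relative advantage: each of the G sampled completions
    # is scored against the mean and std of its own group.
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)
```

In GRPO-style training, a group of G candidate kernels would be sampled per prompt, scored with `combined_rewards`, and the resulting `grpo_advantages` would weight the policy-gradient update.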
Related papers
- Ring-lite: Scalable Reasoning via C3PO-Stabilized Reinforcement Learning for LLMs [51.21041884010009]
Ring-lite is a Mixture-of-Experts (MoE)-based large language model optimized via reinforcement learning (RL). Our approach matches the performance of state-of-the-art (SOTA) small-scale reasoning models on challenging benchmarks.
arXiv Detail & Related papers (2025-06-17T17:12:34Z)
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators [59.625889531331815]
Triton is a high-level Python-like language designed for building efficient GPU kernels (a minimal example kernel is sketched after this list). Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation.
arXiv Detail & Related papers (2025-02-20T17:21:27Z)
- TD3: Tucker Decomposition Based Dataset Distillation Method for Sequential Recommendation [50.23504065567638]
This paper introduces TD3, a novel Dataset Distillation method within a meta-learning framework. TD3 distills a fully expressive synthetic sequence summary from the original data. An augmentation technique allows the learner to closely fit the synthetic summary, ensuring an accurate update of it in the outer-loop.
arXiv Detail & Related papers (2025-02-05T03:13:25Z)
- AutoSculpt: A Pattern-based Model Auto-pruning Framework Using Reinforcement Learning and Graph Learning [32.10443611442628]
AutoSculpt is a pattern-based automated pruning framework for deep neural networks (DNNs). It automatically identifies and prunes regular patterns within DNN architectures that can be recognized by existing inference engines. It achieves pruning rates of up to 90% and nearly 18% improvement in FLOPs reduction, outperforming all baselines.
arXiv Detail & Related papers (2024-12-24T02:05:51Z)
- Quantized Distillation: Optimizing Driver Activity Recognition Models for Resource-Constrained Environments [34.80538284957094]
This paper introduces a lightweight framework for resource-efficient driver activity recognition.
The framework enhances 3D MobileNet, a neural architecture optimized for speed in video classification.
It achieves a threefold reduction in model size and a 1.4-fold improvement in inference time.
arXiv Detail & Related papers (2023-11-10T10:07:07Z)
- A Learned Performance Model for Tensor Processing Units [5.733911161090224]
We demonstrate a method of learning performance models from a corpus of graph programs for Tensor Processing Unit (TPU) instances.
We show that our learned model outperforms a heavily-optimized analytical performance model on two tasks.
It helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
arXiv Detail & Related papers (2020-08-03T17:24:52Z)
- Auto-PyTorch Tabular: Multi-Fidelity MetaLearning for Efficient and Robust AutoDL [53.40030379661183]
Auto-PyTorch is a framework to enable fully automated deep learning (AutoDL).
It combines multi-fidelity optimization with portfolio construction for warmstarting and ensembling of deep neural networks (DNNs).
We show that Auto-PyTorch performs better than several state-of-the-art competitors on average.
arXiv Detail & Related papers (2020-06-24T15:15:17Z)
- PolyScientist: Automatic Loop Transformations Combined with Microkernels for Optimization of Deep Learning Primitives [55.79741270235602]
We develop a hybrid solution for the development of deep learning kernels.
We use advanced polyhedral technology to automatically tune the outer loops for performance.
arXiv Detail & Related papers (2020-02-06T08:02:34Z)
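As referenced in the TritonBench entry above, here is a minimal sketch of a Triton kernel: the canonical element-wise vector-add from the Triton tutorials. The BLOCK_SIZE tile parameter is exactly the kind of knob that, per the AutoTriton abstract, developers otherwise tune by hand; the default of 1024 below is an assumed value for illustration, not a recommendation from either paper.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance handles one tile of BLOCK_SIZE elements.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements  # guard the final, partially filled tile
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor, block_size: int = 1024) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    # One program per tile; block_size is the hand-tuned tile size.
    grid = (triton.cdiv(n, block_size),)
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=block_size)
    return out
```

Even in this trivial kernel, the best tile size depends on the GPU's occupancy and memory bandwidth, which is the iterative tuning burden AutoTriton aims to automate.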