ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming
- URL: http://arxiv.org/abs/2503.14985v2
- Date: Wed, 26 Mar 2025 04:06:37 GMT
- Title: ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming
- Authors: Dewei Wang, Wei Zhu, Liyang Ling, Ettore Tiotto, Quintin Wang, Whitney Tsang, Julian Opperman, Jacky Deng
- Abstract summary: Triton is a DSL that offers a more user-friendly and portable alternative by programming at a higher level. We propose ML-Triton, which features a multi-level compilation flow and programming interface. Our approach achieves performance above 95% of expert-written kernels on Intel GPUs.
- Score: 2.4665562732779773
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: In the era of LLMs, dense operations such as GEMM and MHA are critical components. These operations are well-suited for parallel execution using a tile-based approach. While traditional GPU programming often relies on low-level interfaces like CUDA or SYCL, Triton has emerged as a DSL that offers a more user-friendly and portable alternative by programming at a higher level. The current Triton starts at the workgroup (a.k.a. threadblock) level and directly lowers to the per-thread level, then attempts to coalesce and amend through a series of passes that promote information from the low-level representation. We believe this is premature lowering, based on the following observations. 1. GPUs have a hierarchical structure both physically and logically; modern GPUs often feature SIMD units capable of directly operating on tiles on a warp or warpgroup basis, such as blocked load and blocked MMA. 2. Multi-level, gradual lowering keeps the compiler decoupled and clean by separating inter-layer and intra-layer considerations. 3. Kernel developers often need fine control to get good performance on the latest hardware; FlashAttention-2, for example, advocates explicit data partitioning between warps for a performance boost. In this context, we propose ML-Triton, which features a multi-level compilation flow and programming interface. Our approach begins at the workgroup level and progressively lowers to the warp and intrinsic levels, implementing a multi-level lowering that aligns with the hierarchical nature of GPUs. Additionally, we extend the Triton language to support user-set compiler hints and warp-level programming, enabling researchers to get good out-of-the-box performance without awaiting compiler updates. Experimental results demonstrate that our approach achieves performance above 95% of expert-written kernels on Intel GPUs, as measured by the geometric mean.
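To make the tile-based, workgroup-level programming model concrete, the sketch below is a minimal Triton GEMM tile kernel written against the standard upstream Triton API (boundary masking and autotuning omitted for brevity; M, N, K are assumed to be multiples of the block sizes). It illustrates the baseline model the paper starts from, not ML-Triton itself; the paper's warp-level and compiler-hint extensions are only indicated by a hypothetical comment, since the abstract does not spell out their syntax.

```python
import triton
import triton.language as tl

@triton.jit
def gemm_kernel(a_ptr, b_ptr, c_ptr, M, N, K,
                stride_am, stride_ak, stride_bk, stride_bn,
                stride_cm, stride_cn,
                BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr, BLOCK_K: tl.constexpr):
    # Each program instance (workgroup) owns one BLOCK_M x BLOCK_N tile of C.
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    offs_m = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
    offs_n = pid_n * BLOCK_N + tl.arange(0, BLOCK_N)
    offs_k = tl.arange(0, BLOCK_K)

    # A hypothetical ML-Triton-style hint could appear around here, e.g. how to
    # split this workgroup tile across warps; the exact API is an assumption.
    acc = tl.zeros((BLOCK_M, BLOCK_N), dtype=tl.float32)
    for k in range(0, K, BLOCK_K):
        a = tl.load(a_ptr + offs_m[:, None] * stride_am + (k + offs_k)[None, :] * stride_ak)
        b = tl.load(b_ptr + (k + offs_k)[:, None] * stride_bk + offs_n[None, :] * stride_bn)
        acc += tl.dot(a, b)  # tile-level MMA; the compiler decides the warp/intrinsic mapping
    tl.store(c_ptr + offs_m[:, None] * stride_cm + offs_n[None, :] * stride_cn, acc)
```

Under the multi-level flow described in the abstract, such a workgroup-level tile would be lowered first to per-warp sub-tiles and then to hardware intrinsics (e.g., blocked loads and blocked MMA), rather than jumping straight to per-thread code.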
Related papers
- Geak: Introducing Triton Kernel AI Agent & Evaluation Benchmarks [11.253534066141668]
It is imperative to automate low-level kernel development to meet performance and productivity demands. Major cloud providers, semiconductor companies, and research institutions are now investing heavily in AI-driven code generation for GPUs. We present an evaluation suite for Triton-based GPU kernels and GEAK (Generating Efficient AI-centric GPU Kernels).
arXiv Detail & Related papers (2025-07-31T02:26:58Z)
- Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving [4.309392302169281]
Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead. PD achieves up to 2.2x higher throughput, 20x lower TTFT, and 2.5x lower TBT than vLLM; outperforms SG by up to 2x; and matches or exceeds disaggregated vLLM.
arXiv Detail & Related papers (2025-07-09T07:27:18Z)
- TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators [59.625889531331815]
Triton is a high-level Python-like language designed for building efficient GPU kernels. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation.
arXiv Detail & Related papers (2025-02-20T17:21:27Z)
- FlashAttention on a Napkin: A Diagrammatic Approach to Deep Learning IO-Awareness [0.0]
Methods like FlashAttention have achieved a 6x performance improvement over native PyTorch by avoiding unnecessary data transfers. This paper extends Neural Circuit Diagrams for deep learning models to consider resource usage and the distribution of tasks across a GPU hierarchy. We develop a methodology for representing intermediate-level pseudocode with diagrams, allowing hardware-aware algorithms to be derived step-by-step.
arXiv Detail & Related papers (2024-12-04T13:52:04Z)
- ThunderKittens: Simple, Fast, and Adorable AI Kernels [43.32681787348603]
We present ThunderKittens (TK), a framework for writing performant AI kernels while remaining easy to use and maintain.
We show the value of TK by providing kernels that match or outperform prior kernels for a range of AI operations.
arXiv Detail & Related papers (2024-10-27T10:07:16Z)
- Hierarchical Mixture of Experts: Generalizable Learning for High-Level Synthesis [43.612837464039686]
High-level synthesis (HLS) is a widely used tool in designing for Field Programmable Gate Arrays (FPGAs). We propose a more domain-generalizable model structure: a two-level hierarchical Mixture of Experts (MoE). In the low-level MoE, we apply MoE on three natural granularities of a program: node, basic block, and graph. The high-level MoE learns to aggregate the three granularities for the final decision.
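As an informal illustration of that two-level structure (not the paper's actual model; the dimensions, gating scheme, and regression head below are assumptions), a hierarchical MoE that aggregates node-, basic-block-, and graph-level representations might be sketched as follows:

```python
import torch
import torch.nn as nn

class GranularityMoE(nn.Module):
    """Low-level MoE: soft gating over a few experts for one program granularity."""
    def __init__(self, in_dim, hidden, n_experts=4):
        super().__init__()
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
            for _ in range(n_experts)
        )
        self.gate = nn.Linear(in_dim, n_experts)

    def forward(self, x):
        w = torch.softmax(self.gate(x), dim=-1)                  # (B, n_experts)
        outs = torch.stack([e(x) for e in self.experts], dim=1)  # (B, n_experts, hidden)
        return (w.unsqueeze(-1) * outs).sum(dim=1)               # (B, hidden)

class HierarchicalMoE(nn.Module):
    """High-level MoE: learns to aggregate the three granularity-level outputs."""
    def __init__(self, in_dim=64, hidden=64):
        super().__init__()
        self.node_moe = GranularityMoE(in_dim, hidden)
        self.block_moe = GranularityMoE(in_dim, hidden)
        self.graph_moe = GranularityMoE(in_dim, hidden)
        self.high_gate = nn.Linear(3 * hidden, 3)
        self.head = nn.Linear(hidden, 1)  # e.g., predict a design-quality metric

    def forward(self, node_feat, block_feat, graph_feat):
        reps = [self.node_moe(node_feat), self.block_moe(block_feat), self.graph_moe(graph_feat)]
        w = torch.softmax(self.high_gate(torch.cat(reps, dim=-1)), dim=-1)  # (B, 3)
        fused = sum(w[:, i:i + 1] * reps[i] for i in range(3))              # weighted mix
        return self.head(fused)
```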
arXiv Detail & Related papers (2024-10-25T00:27:53Z)
- Hierarchical Resource Partitioning on Modern GPUs: A Reinforcement Learning Approach [1.076745840431781]
We propose a method for comprehensively co-optimizing the setup of hierarchical partitioning and the selection of co-scheduling groups from a given set of jobs.
This results in a throughput improvement of up to 1.87x compared to time-sharing scheduling.
arXiv Detail & Related papers (2024-05-14T16:40:06Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast, untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs), and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
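As a rough, framework-free illustration of that two-step decomposition (plain NumPy; the primitive below is only a stand-in for a TPP, and the declarative loop specification is a toy format assumed here, not the actual API), a blocked GEMM can be expressed as a tile primitive plus a loop nest described as data:

```python
import numpy as np

def tile_gemm_primitive(C_tile, A_tile, B_tile):
    """Computational core: a small dense tile update (stand-in for a TPP microkernel)."""
    C_tile += A_tile @ B_tile

def run_loop_nest(loop_spec, body):
    """Interpret a declarative loop specification: [(name, extent, step), ...]."""
    def rec(idx, bound):
        if idx == len(loop_spec):
            body(**bound)
            return
        name, extent, step = loop_spec[idx]
        for v in range(0, extent, step):
            bound[name] = v
            rec(idx + 1, bound)
    rec(0, {})

# Blocked GEMM expressed as: primitive + declarative loops around it.
M = N = K = 128
TM = TN = TK = 32
A, B = np.random.rand(M, K), np.random.rand(K, N)
C = np.zeros((M, N))
run_loop_nest(
    [("i", M, TM), ("j", N, TN), ("k", K, TK)],
    lambda i, j, k: tile_gemm_primitive(C[i:i+TM, j:j+TN], A[i:i+TM, k:k+TK], B[k:k+TK, j:j+TN]),
)
assert np.allclose(C, A @ B)
```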
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
- Cramming: Training a Language Model on a Single GPU in One Day [64.18297923419627]
Recent trends in language modeling have focused on increasing performance through scaling.
We investigate the downstream performance achievable with a transformer-based language model trained completely from scratch with masked language modeling for a single day on a single consumer GPU.
We provide evidence that even in this constrained setting, performance closely follows scaling laws observed in large-compute settings.
arXiv Detail & Related papers (2022-12-28T18:59:28Z)
- Systolic Computing on GPUs for Productive Performance [2.8064596842326575]
We propose a language and compiler to productively build high-performance systolic arrays that run on GPUs.
A programmer specifies a projection of a dataflow computation onto a linear systolic array, while leaving the detailed implementation of the projection to the compiler.
The compiler implements the specified projection and maps the linear systolic array to the SIMD execution units and vector registers of GPUs.
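For intuition only, the toy simulation below shows one way a matrix-vector product can be laid out on a linear systolic array of processing elements, with each PE holding one element of x and partial sums handed from PE to PE; it is an informal illustration of the projection idea, not the paper's language, its compiler, or its actual GPU mapping.

```python
import numpy as np

def linear_systolic_matvec(A, x):
    """Simulate y = A @ x on a linear array of PEs.

    PE j holds x[j] (weight-stationary). The partial sum for each output row
    flows left-to-right through the PEs, picking up A[i, j] * x[j] at PE j.
    On a GPU, each PE position would roughly correspond to a SIMD lane or
    vector-register slot. (Row-level pipelining is omitted to keep it short.)
    """
    m, n = A.shape
    y = np.zeros(m)
    for i in range(m):        # rows of A stream through the array
        partial = 0.0
        for j in range(n):    # hop j -> j+1: hand the partial sum to the next PE
            partial += A[i, j] * x[j]
        y[i] = partial
    return y

A = np.arange(12, dtype=float).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(linear_systolic_matvec(A, x), A @ x)
```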
arXiv Detail & Related papers (2020-10-29T18:49:54Z)
- MPLP++: Fast, Parallel Dual Block-Coordinate Ascent for Dense Graphical Models [96.1052289276254]
This work introduces a new MAP-solver, based on the popular Dual Block-Coordinate Ascent principle.
Surprisingly, by making a small change to the low-performing solver, we derive the new solver MPLP++, which outperforms all existing solvers by a large margin.
arXiv Detail & Related papers (2020-04-16T16:20:53Z)
This list is automatically generated from the titles and abstracts of the papers in this site.