Related papers: The Anatomy of a Triton Attention Kernel

URL: http://arxiv.org/abs/2511.11581v1
Date: Tue, 07 Oct 2025 13:34:51 GMT
Title: The Anatomy of a Triton Attention Kernel
Authors: Burkhard Ringlein, Jan van Lunteren, Radu Stoica, Thomas Parnell
Abstract summary: A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures. We develop a state-of-the-art paged attention kernel that builds exclusively on the domain-specific just-in-time compiled language Triton. We describe our high-level approach, the key algorithmic and system-level improvements, the parameter auto-tuning required to unlock efficiency, and the integrations into a popular inference server.
Score: 0.0
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: A long-standing goal in both industry and academia is to develop an LLM inference platform that is portable across hardware architectures, eliminates the need for low-level hand-tuning, and still delivers best-in-class efficiency. In this work, we demonstrate that portable, efficient cross-platform LLM inference is indeed possible and share our experience. We develop a state-of-the-art paged attention kernel, the core performance-critical component of many LLM deployments, that builds exclusively on the domain-specific just-in-time compiled language Triton to achieve state-of-the-art performance on both NVIDIA and AMD GPUs. We describe our high-level approach, the key algorithmic and system-level improvements, the parameter auto-tuning required to unlock efficiency, and the integrations into a popular inference server that are necessary to bring the performance of a generic Triton attention kernel from 19.7% of the state-of-the-art to 105.9%. Our results highlight how open-source domain-specific languages can be leveraged to unlock model portability across different GPU vendors.

Related papers

RooflineBench: A Benchmarking Framework for On-Device LLMs via Roofline Analysis [53.90240071275054]
The transition toward localized intelligence through Small Language Models (SLMs) has intensified the need for rigorous performance characterization on resource-constrained edge hardware. We propose a systematic framework that unifies architectural primitives and hardware constraints through the lens of operational intensity (OI). By defining an inference-potential region, we introduce the Relative Inference Potential as a novel metric to compare efficiency differences between Large Language Models (LLMs) on the same hardware substrate.
arXiv Detail & Related papers (2026-02-12T03:02:22Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
STARK: Strategic Team of Agents for Refining Kernels [23.717055490630596]
We introduce an agentic framework for GPU kernel optimization that explores the design space through multi-agent collaboration. This framework mimics the workflow of expert engineers, enabling LLMs to reason about hardware trade-offs, incorporate profiling feedback, and refine kernels iteratively. We evaluate our approach on KernelBench, a benchmark for LLM-based kernel optimization, and demonstrate substantial improvements over baseline agents.
arXiv Detail & Related papers (2025-10-19T20:41:46Z)
MultiKernelBench: A Multi-Platform Benchmark for Kernel Generation [17.461533973039064]
MultiKernelBench is a benchmark for the generation of deep learning kernels using large language models (LLMs). It spans 285 tasks across 14 well-defined kernel categories and supports three major hardware platforms. We show significant variation in task difficulty, poor generalization to platforms with less training exposure, and the effectiveness of targeted prompting strategies.
arXiv Detail & Related papers (2025-07-20T00:58:33Z)
GPU Performance Portability needs Autotuning [0.0]
As LLMs grow in complexity, achieving state-of-the-art performance requires tight co-design across algorithms, software, and hardware. We make the case for combining just-in-time (JIT) compilation with comprehensive kernel parameter autotuning. Our results highlight autotuning as a promising path to unlocking model portability across GPU vendors.
arXiv Detail & Related papers (2025-04-30T12:57:21Z)
NVLM: Open Frontier-Class Multimodal LLMs [64.00053046838225]
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks. We propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. We develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks.
arXiv Detail & Related papers (2024-09-17T17:59:06Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system that unlocks the vast untapped potential of consumer-level GPUs. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
QIGen: Generating Efficient Kernels for Quantized Inference on Large Language Models [22.055655390093722]
We present an automatic code generation approach for supporting quantized generative inference on LLMs such as LLaMA or OPT on off-the-shelf CPUs. Results on CPU-based inference for LLaMA models show that our approach can lead to high performance and high accuracy, comparing favorably to the best existing open-source solution.
arXiv Detail & Related papers (2023-07-07T17:46:08Z)
LAMM: Language-Assisted Multi-Modal Instruction-Tuning Dataset, Framework, and Benchmark [81.42376626294812]
We present LAMM, a Language-Assisted Multi-Modal instruction-tuning dataset, framework, and benchmark. Our aim is to establish LAMM as a growing ecosystem for training and evaluating MLLMs. We present a comprehensive dataset and benchmark, which cover a wide range of vision tasks for 2D and 3D vision.
arXiv Detail & Related papers (2023-06-11T14:01:17Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
EfficientViT: Multi-Scale Linear Attention for High-Resolution Dense Prediction [67.11722682878722]
This work presents EfficientViT, a new family of high-resolution vision models with novel multi-scale linear attention. Our multi-scale linear attention achieves the global receptive field and multi-scale learning. EfficientViT delivers remarkable performance gains over previous state-of-the-art models.
arXiv Detail & Related papers (2022-05-29T20:07:23Z)

This list is automatically generated from the titles and abstracts of the papers on this site.