Related papers: Iris: First-Class Multi-GPU Programming Experience in Triton

URL: http://arxiv.org/abs/2511.12500v1
Date: Sun, 16 Nov 2025 08:24:45 GMT
Title: Iris: First-Class Multi-GPU Programming Experience in Triton
Authors: Muhammad Awad, Muhammad Osama, Brandon Potter,
Abstract summary: We present Iris, a multi-GPU communication library implemented entirely in Python and Triton. Iris provides tile-based symmetric memory abstractions that naturally align with Triton's programming model. We demonstrate that Iris achieves near-optimal bandwidth utilization in microbenchmarks and delivers up to 1.79x speedup over PyTorch and RCCL.
Score: 0.09290947230642188
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: Multi-GPU programming traditionally requires developers to navigate complex trade-offs between performance and programmability. High-performance implementations typically rely on low-level HIP/CUDA communication libraries that demand substantial engineering effort for even basic overlap patterns, while simpler abstractions often sacrifice performance. We present Iris, a multi-GPU communication library implemented entirely in Python and Triton that eliminates this trade-off. Iris provides tile-based symmetric memory abstractions that naturally align with Triton's programming model, enabling developers to write single-source kernels that seamlessly interleave computation and communication. We demonstrate a taxonomy of compute-communication overlap patterns--from bulk-synchronous to fine-grained workgroup specialization--that can be implemented with minimal code changes in Iris, often requiring just a few additional lines within the same Triton kernel. Our evaluation shows that Iris achieves near-optimal bandwidth utilization in microbenchmarks and delivers up to 1.79x speedup over PyTorch and RCCL for GEMM+All-Scatter workloads, demonstrating that high-level implementations can match or exceed heavily-optimized libraries while dramatically simplifying multi-GPU programming.

Related papers

ParallelKittens: Systematic and Practical Simplification of Multi-GPU AI Kernels [40.94392896555992]
Existing systems mitigate this through compute-communication overlap but often fail to meet theoretical bandwidth across workloads and new accelerators. Instead of operator-specific techniques, we ask whether a small set of simple, reusable principles can guide the optimal performance of workloads. ParallelKittens (PK) kernels achieve up to 2.33x speedup on parallel workloads.
arXiv Detail & Related papers (2025-11-17T21:48:33Z)
Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z)
NGPU-LM: GPU-Accelerated N-Gram Language Model for Context-Biasing in Greedy ASR Decoding [54.88765757043535]
This work rethinks data structures for statistical n-gram language models to enable fast and parallel operations for GPU-optimized inference. Our approach, named NGPU-LM, introduces customizable greedy decoding for all major ASR model types with less than 7% computational overhead. The proposed approach can eliminate more than 50% of the accuracy gap between greedy and beam search for out-of-domain scenarios while avoiding significant slowdown caused by beam search.
arXiv Detail & Related papers (2025-05-28T20:43:10Z)
Efficient and Adaptable Overlapping for Computation and Communication via Signaling and Reordering [13.185314408519107]
Generative models have achieved remarkable success across various applications, driving the demand for multi-GPU computing. We propose FlashOverlap, which utilizes a novel signaling mechanism. Experiments show that FlashOverlap achieves up to 1.65x speedup through overlap, outperforming existing works in most cases.
arXiv Detail & Related papers (2025-04-28T06:37:57Z)
ML-Triton, A Multi-Level Compilation and Language Extension to Triton GPU Programming [2.4665562732779773]
Triton is a DSL that offers a more user-friendly and portable alternative by programming at a higher level. We propose ML-Triton, which features a multi-level compilation flow and programming interface. Our approach achieves performance above 95% of expert-written kernels on Intel GPU.
arXiv Detail & Related papers (2025-03-19T08:31:39Z)
TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators [59.625889531331815]
Triton is a high-level Python-like language designed for building efficient GPU kernels. Despite advances in large language models (LLMs) for conventional code generation, these models struggle to generate accurate, performance-optimized Triton code. In this work, we introduce TritonBench, the first comprehensive benchmark for Triton operator generation.
arXiv Detail & Related papers (2025-02-20T17:21:27Z)
Liger Kernel: Efficient Triton Kernels for LLM Training [6.373771349397682]
Training Large Language Models (LLMs) efficiently at scale presents a formidable challenge, driven by their ever-increasing computational demands. We introduce Liger-Kernel, an open-sourced set of Triton kernels developed specifically for LLM training. With kernel optimization techniques like kernel operation fusing and input chunking, our kernels achieve on average a 20% increase in training throughput and a 60% reduction in GPU memory usage.
arXiv Detail & Related papers (2024-10-14T18:17:01Z)
FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the vast untapped potential of consumer-level GPUs. This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE). MoE achieves better accuracy and over 80% reduction in computation but leaves challenges for efficient deployment on FPGA. Our work, dubbed Edge-MoE, solves these challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels. We decompose kernel development into two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion. We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
arXiv Detail & Related papers (2023-04-25T05:04:44Z)
Synthesizing Collective Communication Algorithms for Heterogeneous Networks with TACCL [1.5528708400965123]
We present TACCL, a synthesizer for collective communication primitives for large-scale multi-GPU systems. TACCL encodes a profiled topology and input size into a synthesis problem to generate optimized communication algorithms. Using TACCL's algorithms speeds up the end-to-end training of an internal mixture of experts model by 17%.
arXiv Detail & Related papers (2021-11-08T23:20:52Z)

This list is automatically generated from the titles and abstracts of the papers on this site.