FlashDMoE: Fast Distributed MoE in a Single Kernel
- URL: http://arxiv.org/abs/2506.04667v2
- Date: Mon, 09 Jun 2025 05:15:13 GMT
- Title: FlashDMoE: Fast Distributed MoE in a Single Kernel
- Authors: Osayamen Jonathan Aimuyo, Byungsoo Oh, Rachee Singh
- Abstract summary: FlashDMoE is a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a \emph{single persistent GPU kernel}. We show that FlashDMoE achieves up to \textbf{9}$\times$ higher GPU utilization, \textbf{6}$\times$ lower latency, \textbf{5.7}$\times$ higher throughput, and \textbf{4}$\times$ better overlap efficiency compared to state-of-the-art baselines.
- Score: 2.246222223318928
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The computational sparsity of Mixture-of-Experts (MoE) models enables sub-linear growth in compute cost as model size increases, thus offering a scalable path to training massive neural networks. However, existing implementations suffer from \emph{low GPU utilization}, \emph{significant latency overhead}, and a fundamental \emph{inability to leverage task locality}, primarily due to CPU-managed scheduling, host-initiated communication, and frequent kernel launches. To overcome these limitations, we develop FlashDMoE, a fully GPU-resident MoE operator that fuses expert computation and inter-GPU communication into a \emph{single persistent GPU kernel}. FlashDMoE enables fine-grained pipelining of dispatch, compute, and combine phases, eliminating launch overheads and reducing idle gaps. Unlike existing work, FlashDMoE obviates bulk-synchronous collectives for one-sided, device-initiated, inter-GPU (R)DMA transfers, thus unlocking \emph{payload efficiency}, where we eliminate bloated or redundant network payloads in sparsely activated layers. When evaluated on a single 8-H100 GPU node with MoE models having up to 128 experts and 16K token sequences, FlashDMoE achieves up to \textbf{9}$\times$ higher GPU utilization, \textbf{6}$\times$ lower latency, \textbf{5.7}$\times$ higher throughput, and \textbf{4}$\times$ better overlap efficiency compared to state-of-the-art baselines, despite using FP32 while baselines use FP16. FlashDMoE demonstrates that principled GPU kernel-hardware co-design is key to unlocking the performance ceiling of large-scale distributed ML workloads.
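To make the fused design concrete, here is a minimal sketch of a persistent-kernel work loop in that spirit: thread blocks stay resident for the whole MoE layer and repeatedly pull tile-sized dispatch, expert-GEMM, and combine tasks from a device-side queue, so there are no per-phase kernel launches and no host-driven collectives. The task layout, the scheduling policy, and the placeholder comments standing in for the GEMM and the device-initiated (R)DMA puts are illustrative assumptions, not FlashDMoE's actual implementation.

```cuda
#include <cuda_runtime.h>

// Hypothetical task descriptor: one tile of work per queue entry.
enum TaskKind { DISPATCH = 0, EXPERT_GEMM = 1, COMBINE = 2, HALT = 3 };

struct Task {
    int kind;     // TaskKind
    int tile_id;  // which token/expert tile this task touches
    int peer;     // destination GPU for dispatch/combine payloads
};

__global__ void persistent_moe_layer(const Task* queue, int* head, int n_tasks) {
    __shared__ Task t;
    for (;;) {
        if (threadIdx.x == 0) {                        // block leader grabs the next task
            int idx = atomicAdd(head, 1);              // device-side scheduling, no host involved
            if (idx < n_tasks) t = queue[idx];
            else t = Task{HALT, 0, 0};
        }
        __syncthreads();
        if (t.kind == HALT) return;                    // queue drained: the single kernel exits

        if (t.kind == DISPATCH) {
            // Pack the tokens routed to this expert tile and issue a one-sided,
            // device-initiated put to GPU t.peer (e.g. an NVSHMEM-style put),
            // sending only the tokens that were actually routed there.
        } else if (t.kind == EXPERT_GEMM) {
            // Run the expert FFN GEMM on tile t.tile_id once its tokens have
            // arrived; arrival is tracked with device-visible flags rather than
            // kernel boundaries, which is what enables fine-grained pipelining.
        } else { // COMBINE
            // Scale partial expert outputs by their gate weights and return
            // them to the owning GPU, again with a one-sided transfer.
        }
        __syncthreads();                               // move on to the next task
    }
}
```

A real implementation also needs per-tile readiness flags and a proper GEMM pipeline; the point of the sketch is only that scheduling, compute, and communication all live inside one launch.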
Related papers
- BlockFFN: Towards End-Side Acceleration-Friendly Mixture-of-Experts with Chunk-Level Activation Sparsity [66.94629945519125]
We introduce a novel MoE architecture, BlockFFN, as well as its efficient training and deployment techniques. Specifically, we use a router integrating ReLU activation and RMSNorm for differentiable and flexible routing. Next, to promote both token-level sparsity (TLS) and chunk-level sparsity (CLS), CLS-aware training objectives are designed, making BlockFFN more acceleration-friendly.
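As a rough illustration of such a router, the kernel below (a sketch, not BlockFFN's code) computes per-expert logits for each token, applies ReLU so that most gates are exactly zero, and then RMS-normalizes the surviving activations over the expert dimension; the ordering of ReLU and RMSNorm, the single-thread normalization, and all shapes are assumptions made for clarity.

```cuda
#include <cuda_runtime.h>

// One block per token; dynamic shared memory holds the n_experts logits.
// Hypothetical launch: relu_rmsnorm_router<<<tokens, 128, n_experts * sizeof(float)>>>(...);
__global__ void relu_rmsnorm_router(const float* __restrict__ x,      // [tokens, d_model]
                                    const float* __restrict__ w,      // [n_experts, d_model]
                                    float* __restrict__ gates,        // [tokens, n_experts]
                                    int d_model, int n_experts, float eps) {
    extern __shared__ float logits[];
    const float* xt = x + (size_t)blockIdx.x * d_model;

    // Raw routing scores: dot(x_t, w_e), then ReLU for sparse, non-negative gates.
    for (int e = threadIdx.x; e < n_experts; e += blockDim.x) {
        float acc = 0.f;
        for (int k = 0; k < d_model; ++k) acc += xt[k] * w[(size_t)e * d_model + k];
        logits[e] = fmaxf(acc, 0.f);
    }
    __syncthreads();

    // RMSNorm over the expert dimension keeps gate magnitudes stable.
    if (threadIdx.x == 0) {
        float ss = 0.f;
        for (int e = 0; e < n_experts; ++e) ss += logits[e] * logits[e];
        float inv_rms = rsqrtf(ss / n_experts + eps);
        for (int e = 0; e < n_experts; ++e)
            gates[(size_t)blockIdx.x * n_experts + e] = logits[e] * inv_rms;
    }
}
```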
arXiv Detail & Related papers (2025-07-11T17:28:56Z)
- Nexus: Proactive Intra-GPU Disaggregation of Prefill and Decode in LLM Serving [4.309392302169281]
Engine-level prefill-decode (PD) disaggregation avoids interference but incurs higher hardware and coordination overhead. Nexus achieves up to 2.2x higher throughput, 20x lower TTFT, and 2.5x lower TBT than vLLM; outperforms SG by up to 2x; and matches or exceeds disaggregated vLLM.
arXiv Detail & Related papers (2025-07-09T07:27:18Z)
- FloE: On-the-Fly MoE Inference on Memory-constrained GPU [22.2581000412208]
FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. FloE achieves a 9.3x compression of parameters per expert in Mixtral-8x7B. It enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5x.
arXiv Detail & Related papers (2025-05-09T10:53:47Z)
- MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism [26.923312725688735]
Mixture-of-Experts (MoE) showcases tremendous potential to scale large language models (LLMs) with enhanced performance and reduced computational complexity. We present MegaScale-Infer, an efficient and cost-effective system for serving large-scale MoE models.
arXiv Detail & Related papers (2025-04-03T04:20:44Z)
- MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
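The core of such a schedule is keeping the PCIe copy engine and the SMs busy at the same time. The host-side sketch below is a minimal double-buffered approximation: the next pinned weight page is prefetched on an I/O stream while the current page is consumed on a compute stream. The page size, the toy compute kernel, and the two-slot buffering are assumptions for illustration; CGOPipe itself also coordinates CPU work and disk I/O.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Stand-in for the per-page expert computation.
__global__ void expert_forward(const float* page, float* acc, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) acc[i] += page[i];
}

int main() {
    const int page_elems = 1 << 20, n_pages = 8;
    const size_t page_bytes = page_elems * sizeof(float);

    // Pinned host copies of the expert weight pages (left uninitialized here);
    // pinned memory is what makes the H2D copies truly asynchronous.
    std::vector<float*> host_pages(n_pages);
    for (auto& p : host_pages) cudaMallocHost((void**)&p, page_bytes);

    // Two device slots: compute reads one while the I/O stream fills the other.
    float *dev_page[2], *acc;
    cudaMalloc((void**)&dev_page[0], page_bytes);
    cudaMalloc((void**)&dev_page[1], page_bytes);
    cudaMalloc((void**)&acc, page_bytes);
    cudaMemset(acc, 0, page_bytes);

    cudaStream_t io, compute;
    cudaStreamCreate(&io);
    cudaStreamCreate(&compute);
    cudaEvent_t copied[2], consumed[2];
    for (int b = 0; b < 2; ++b) { cudaEventCreate(&copied[b]); cudaEventCreate(&consumed[b]); }

    cudaMemcpyAsync(dev_page[0], host_pages[0], page_bytes, cudaMemcpyHostToDevice, io);
    cudaEventRecord(copied[0], io);

    for (int i = 0; i < n_pages; ++i) {
        int cur = i & 1, nxt = (i + 1) & 1;
        if (i + 1 < n_pages) {                             // prefetch page i+1 while page i computes
            cudaStreamWaitEvent(io, consumed[nxt], 0);     // its slot must be done with its last use
            cudaMemcpyAsync(dev_page[nxt], host_pages[i + 1], page_bytes,
                            cudaMemcpyHostToDevice, io);
            cudaEventRecord(copied[nxt], io);
        }
        cudaStreamWaitEvent(compute, copied[cur], 0);      // page i must have landed
        expert_forward<<<(page_elems + 255) / 256, 256, 0, compute>>>(dev_page[cur], acc, page_elems);
        cudaEventRecord(consumed[cur], compute);
    }
    cudaDeviceSynchronize();
    return 0;
}
```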
arXiv Detail & Related papers (2024-11-18T01:06:12Z)
- FLUX: Fast Software-based Communication Overlap On GPUs Through Kernel Fusion [9.5114389643299]
This paper proposes a novel method, Flux, to significantly hide communication latencies with dependent computations for GPUs.
Flux can potentially overlap up to 96% of communication given a fused kernel.
Overall, it can achieve up to 1.24x speedups for training over Megatron-LM on a cluster of 128 GPUs with various GPU generations and interconnects.
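Flux achieves this overlap inside a single fused kernel; the sketch below only approximates the idea at stream granularity, chunking the output and shipping each chunk to a peer GPU as soon as its compute finishes instead of waiting for one bulk transfer at the end. The chunk count, the stand-in compute kernel, and the peer-to-peer copy transport are assumptions for illustration; fusing the transfer into the compute kernel, as Flux does, also removes the per-chunk launch and event overheads that remain here.

```cuda
#include <cuda_runtime.h>
#include <cstdio>

// Stand-in for computing one output chunk (e.g. one GEMM tile).
__global__ void produce_chunk(float* out, int n, float v) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = v;
}

int main() {
    int n_dev = 0;
    cudaGetDeviceCount(&n_dev);
    if (n_dev < 2) { std::printf("needs at least 2 GPUs\n"); return 0; }

    const int chunk = 1 << 20, n_chunks = 4;
    float *src, *dst;
    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);                 // allow direct GPU0 -> GPU1 copies if supported
    cudaMalloc((void**)&src, (size_t)chunk * n_chunks * sizeof(float));
    cudaSetDevice(1);
    cudaMalloc((void**)&dst, (size_t)chunk * n_chunks * sizeof(float));

    cudaSetDevice(0);
    cudaStream_t compute, comm;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&comm);
    cudaEvent_t done[n_chunks];
    for (int c = 0; c < n_chunks; ++c) cudaEventCreate(&done[c]);

    for (int c = 0; c < n_chunks; ++c) {
        float* s = src + (size_t)c * chunk;
        produce_chunk<<<(chunk + 255) / 256, 256, 0, compute>>>(s, chunk, (float)c);
        cudaEventRecord(done[c], compute);
        // Ship chunk c on the comm stream while chunk c+1 computes on the compute stream.
        cudaStreamWaitEvent(comm, done[c], 0);
        cudaMemcpyPeerAsync(dst + (size_t)c * chunk, 1, s, 0,
                            (size_t)chunk * sizeof(float), comm);
    }
    cudaDeviceSynchronize();
    return 0;
}
```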
arXiv Detail & Related papers (2024-06-11T00:17:39Z)
- Flash-LLM: Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity [12.663030430488922]
We propose Flash-LLM for enabling low-cost and highly efficient large generative model inference on high-performance Tensor Cores.
At the SpMM kernel level, Flash-LLM significantly outperforms the state-of-the-art libraries Sputnik and SparTA by an average of 2.9x and 1.5x, respectively.
arXiv Detail & Related papers (2023-09-19T03:20:02Z)
- FusionAI: Decentralized Training and Deploying LLMs with Massive Consumer-Level GPUs [57.12856172329322]
We envision a decentralized system unlocking the potential of vast untapped consumer-level GPUs.
This system faces critical challenges, including limited CPU and GPU memory, low network bandwidth, peer variability, and device heterogeneity.
arXiv Detail & Related papers (2023-09-03T13:27:56Z)
- Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M$^3$ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z)
- FlexGen: High-Throughput Generative Inference of Large Language Models with a Single GPU [89.2451963569343]
FlexGen is a generation engine for running large language model (LLM) inference on a single commodity GPU.
When running OPT-175B on a single 16GB GPU, FlexGen achieves significantly higher throughput compared to state-of-the-art offloading systems.
On the HELM benchmark, FlexGen can benchmark a 30B model with a 16GB GPU on 7 representative sub-scenarios in 21 hours.
arXiv Detail & Related papers (2023-03-13T05:19:28Z)
- VersaGNN: a Versatile accelerator for Graph neural networks [81.1667080640009]
We propose \textit{VersaGNN}, an ultra-efficient, systolic-array-based versatile hardware accelerator.
\textit{VersaGNN} achieves on average 3712$\times$ speedup with 1301.25$\times$ energy reduction on CPU, and 35.4$\times$ speedup with 17.66$\times$ energy reduction on GPU.
arXiv Detail & Related papers (2021-05-04T04:10:48Z)
- DistGNN: Scalable Distributed Training for Large-Scale Graph Neural Networks [58.48833325238537]
Full-batch training on Graph Neural Networks (GNN) to learn the structure of large graphs is a critical problem that needs to scale to hundreds of compute nodes to be feasible.
In this paper, we present DistGNN, which optimizes the well-known Deep Graph Library (DGL) for full-batch training on CPU clusters.
Our results on four common GNN benchmark datasets show up to 3.7x speed-up using a single CPU socket and up to 97x speed-up using 128 CPU sockets.
arXiv Detail & Related papers (2021-04-14T08:46:35Z)