Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism
- URL: http://arxiv.org/abs/2512.21487v1
- Date: Thu, 25 Dec 2025 03:22:03 GMT
- Title: Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism
- Authors: Xinglin Pan, Shaohuai Shi, Wenxiang Lin, Yuxin Wang, Zhenheng Tang, Wei Wang, Xiaowen Chu
- Abstract summary: FinDEP is a fine-grained task scheduling algorithm for DEP that maximizes task overlap to improve MoE inference throughput. Experiments on four GPU systems with DeepSeek-V2 and Qwen3-MoE show FinDEP improves throughput by up to 1.61x over prior methods.
- Score: 32.0077995747942
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The mixture-of-experts (MoE) architecture scales model size with sublinear computational increase but suffers from memory-intensive inference due to KV caches and sparse expert activation. Recent disaggregated expert parallelism (DEP) distributes attention and experts to dedicated GPU groups but lacks support for shared experts and efficient task scheduling, limiting performance. We propose FinDEP, a fine-grained task scheduling algorithm for DEP that maximizes task overlap to improve MoE inference throughput. FinDEP introduces three innovations: 1) partitioning computation/communication into smaller tasks for fine-grained pipelining, 2) formulating a scheduling optimization supporting variable granularity and ordering, and 3) developing an efficient solver for this large search space. Experiments on four GPU systems with DeepSeek-V2 and Qwen3-MoE show FinDEP improves throughput by up to 1.61x over prior methods, achieving up to 1.24x speedup on a 32-GPU system.
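No code accompanies the abstract, so as a rough illustration of innovation 1) and the granularity search in 2)-3), here is a toy Python sketch that enumerates chunk counts under an assumed linear cost model and picks the granularity minimizing the makespan of a two-stage dispatch/compute pipeline. The cost model and all names are hypothetical, not FinDEP's actual formulation.

```python
# Toy illustration of fine-grained pipelining for disaggregated expert
# parallelism (DEP). NOT FinDEP's actual algorithm: the linear cost model,
# two-stage pipeline, and all names here are hypothetical.

def stage_costs(num_chunks, comm_total=10.0, comp_total=8.0, overhead=0.3):
    """Per-chunk cost of the all-to-all dispatch (comm) and the expert
    GEMM (comp), with a fixed per-chunk launch overhead."""
    comm = comm_total / num_chunks + overhead
    comp = comp_total / num_chunks + overhead
    return comm, comp

def makespan(num_chunks, **kw):
    """Makespan of a 2-stage pipeline: chunk i's compute can start only
    after its dispatch finishes and the previous compute has drained."""
    comm, comp = stage_costs(num_chunks, **kw)
    comm_done = comp_done = 0.0
    for _ in range(num_chunks):
        comm_done += comm                       # dispatches run back-to-back
        comp_done = max(comp_done, comm_done) + comp
    return comp_done

if __name__ == "__main__":
    # Coarse-grained (1 chunk) serializes comm and comp; finer chunks
    # overlap them until the per-chunk overhead starts to dominate.
    for n in (1, 2, 4, 8, 16, 32):
        print(f"{n:2d} chunks -> makespan {makespan(n):6.2f}")
    print("best granularity:", min(range(1, 65), key=makespan))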
Related papers
- FastUSP: A Multi-Level Collaborative Acceleration Framework for Distributed Diffusion Model Inference [11.772150619675527]
Unified Sequence Parallelism (USP) has emerged as the state-of-the-art approach for distributed attention computation. Existing USP implementations suffer from excessive kernel launch overhead and suboptimal communication scheduling. We propose FastUSP, a framework that integrates compile-level optimization, communication-level optimization, and operator-level optimization.
arXiv Detail & Related papers (2026-02-11T15:19:57Z) - Spava: Accelerating Long-Video Understanding via Sequence-Parallelism-aware Approximate Attention [63.69228529380251]
Spava is a sequence-parallel framework with optimized attention for long-video inference. Spava delivers speedups of 12.72x, 1.70x, and 1.18x over FlashAttn, ZigZagRing, and APB, without notable performance loss.
arXiv Detail & Related papers (2026-01-29T09:23:13Z) - A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems [28.86723467729703]
Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching.
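Challenge 1) is at heart a load-balancing problem. As a generic (not paper-specific) illustration, the sketch below greedily assigns the heaviest experts first to the least-loaded NDP unit, i.e. classic longest-processing-time scheduling; balance_experts and its interface are hypothetical.

```python
# Generic greedy LPT balancing of per-expert token loads across NDP units.
# A sketch of how load imbalance might be mitigated; NOT the scheduling
# framework proposed in the paper.
import heapq

def balance_experts(expert_loads, num_units):
    """expert_loads: {expert_id: tokens routed this step}.
    Returns unit_id -> list of expert_ids, balancing total tokens."""
    heap = [(0, u, []) for u in range(num_units)]  # (load, unit, experts)
    heapq.heapify(heap)
    for eid, load in sorted(expert_loads.items(), key=lambda kv: -kv[1]):
        total, unit, experts = heapq.heappop(heap)  # least-loaded unit
        experts.append(eid)
        heapq.heappush(heap, (total + load, unit, experts))
    return {unit: experts for _, unit, experts in heap}

print(balance_experts({0: 900, 1: 40, 2: 300, 3: 310, 4: 60}, num_units=2))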
arXiv Detail & Related papers (2026-01-07T15:02:57Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
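To make the "Bulk Synchronous" tax concrete, here is a minimal, simulated contrast between barrier-style execution and overlapping the next step's communication with the current computation. Sleeps stand in for kernels and collectives; nothing here reflects the paper's actual system.

```python
# Toy contrast: bulk-synchronous execution (barrier after every step)
# vs. overlapped execution (next step's "communication" launched while
# the current "computation" runs). Purely schematic.
import time
from concurrent.futures import ThreadPoolExecutor

def comm(i): time.sleep(0.05)   # stand-in for an all-reduce
def comp(i): time.sleep(0.05)   # stand-in for a GPU kernel

def bsp(steps=10):
    for i in range(steps):
        comm(i)    # everyone waits at the barrier...
        comp(i)    # ...before compute starts

def overlapped(steps=10):
    with ThreadPoolExecutor(max_workers=1) as pool:
        fut = pool.submit(comm, 0)
        for i in range(steps):
            fut.result()                        # comm for step i done
            if i + 1 < steps:
                fut = pool.submit(comm, i + 1)  # prefetch next comm
            comp(i)                             # overlaps with next comm

for fn in (bsp, overlapped):
    t0 = time.perf_counter(); fn()
    print(fn.__name__, round(time.perf_counter() - t0, 2), "s")
```

With these toy timings the overlapped schedule hides nearly all communication behind compute, which is the kind of gap the 10-20% end-to-end figure points at.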
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Parm: Efficient Training of Large Sparsely-Activated Models with Dedicated Schedules [15.680276212483292]
We propose Parm, a system that accelerates MP+EP+ESP training by designing two dedicated schedules for placing communication tasks.
Parm achieves 1.13x to 5.77x speedup on 1296 manually configured MoE layers and approximately 3x improvement on two real-world MoE models.
arXiv Detail & Related papers (2024-06-30T05:55:11Z) - Edge-MoE: Memory-Efficient Multi-Task Vision Transformer Architecture with Task-level Sparsity via Mixture-of-Experts [60.1586169973792]
M³ViT is the latest multi-task ViT model that introduces mixture-of-experts (MoE).
MoE achieves better accuracy and over 80% computation reduction, but leaves challenges for efficient deployment on FPGA.
Our work, dubbed Edge-MoE, solves the challenges to introduce the first end-to-end FPGA accelerator for multi-task ViT with a collection of architectural innovations.
arXiv Detail & Related papers (2023-05-30T02:24:03Z) - Harnessing Deep Learning and HPC Kernels via High-Level Loop and Tensor Abstractions on CPU Architectures [67.47328776279204]
This work introduces a framework to develop efficient, portable Deep Learning and High Performance Computing kernels.
We decompose the kernel development in two steps: 1) expressing the computational core using Tensor Processing Primitives (TPPs) and 2) expressing the logical loops around TPPs in a high-level, declarative fashion.
We demonstrate the efficacy of our approach using standalone kernels and end-to-end workloads that outperform state-of-the-art implementations on diverse CPU platforms.
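The two-step recipe can be loosely mimicked in plain Python: a tiny tile "primitive" is written once, and the logical loop nest around it is expressed separately as tiling. This is only an analogy; the real framework generates optimized CPU kernels, and all names here are illustrative.

```python
# Loose analogy to the abstract's two-step recipe: a small tensor
# "primitive" (a tile GEMM) plus a separate logical loop nest that tiles
# over it. Illustrative only, not the paper's framework.
import numpy as np

def tile_gemm(C, A, B):
    """The 'processing primitive': C += A @ B on one tile (in place)."""
    C += A @ B

def tiled_matmul(A, B, tile=64):
    """The logical loops: iterate tiles, delegate math to the primitive."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i in range(0, M, tile):
        for j in range(0, N, tile):
            for k in range(0, K, tile):
                tile_gemm(C[i:i+tile, j:j+tile],
                          A[i:i+tile, k:k+tile],
                          B[k:k+tile, j:j+tile])
    return C

A = np.random.rand(128, 96)
B = np.random.rand(96, 80)
assert np.allclose(tiled_matmul(A, B), A @ B)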
arXiv Detail & Related papers (2023-04-25T05:04:44Z) - Distributed Deep Learning Inference Acceleration using Seamless Collaboration in Edge Computing [93.67044879636093]
This paper studies inference acceleration using distributed convolutional neural networks (CNNs) in collaborative edge computing.
We design a novel task collaboration scheme, named HALP, in which the overlapping zone of the sub-tasks on secondary edge servers (ESs) is executed on the host ES.
Experimental results show that HALP can accelerate CNN inference in VGG-16 by 1.7-2.0x for a single task and 1.7-1.8x for 4 tasks per batch on GTX 1080 Ti and Jetson AGX Xavier.
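A minimal sketch of the partitioning idea, assuming row-wise splitting of a convolution input: each secondary ES receives a disjoint core slice, and the receptive-field halo rows (the "overlapping zone" that HALP runs on the host ES) are tracked separately. split_rows is a hypothetical helper, not the paper's code.

```python
# Schematic row-wise split of one conv layer's input across edge servers.
# Core slices go to secondary ESs; the halo rows required by the
# receptive field are listed separately (executed on the host ES in
# HALP's scheme). Hypothetical helper, not the paper's implementation.
def split_rows(height, num_servers, halo):
    """Return per-server (core_start, core_end, halo_row_ranges)."""
    base, parts = height // num_servers, []
    for s in range(num_servers):
        start = s * base
        end = height if s == num_servers - 1 else start + base
        halo_rows = []
        if start > 0:
            halo_rows.append((start - halo, start))   # rows above the slice
        if end < height:
            halo_rows.append((end, end + halo))       # rows below the slice
        parts.append((start, end, halo_rows))
    return parts

# 224-row input, 3 secondary servers, 3x3 conv -> halo of 1 row per side
for part in split_rows(224, 3, halo=1):
    print(part)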
arXiv Detail & Related papers (2022-07-22T18:39:09Z) - aphBO-2GP-3B: A budgeted asynchronous parallel multi-acquisition functions for constrained Bayesian optimization on high-performing computing architecture [4.738678765150249]
An asynchronous constrained batch-parallel Bayesian optimization method is proposed to solve computationally expensive simulation-based optimization problems.
The advantages of this method are three-fold.
The aphBO-2GP-3B framework is demonstrated using two high-fidelity expensive industrial applications.
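The abstract leaves the algorithm unspecified; the skeleton below shows only the asynchronous batch-parallel control flow (a new candidate is proposed the moment any worker frees up, with no batch barrier). The random propose() is a placeholder for the method's constrained two-GP, multi-acquisition machinery.

```python
# Generic skeleton of asynchronous batch-parallel optimization under a
# fixed evaluation budget. The random "acquisition" is a placeholder for
# aphBO-2GP-3B's GP/multi-acquisition machinery; names are hypothetical.
import random
from concurrent.futures import ProcessPoolExecutor, wait, FIRST_COMPLETED

def expensive_sim(x):                  # stand-in for a simulation run
    return (x - 0.3) ** 2

def propose(history):                  # placeholder acquisition function
    return random.uniform(0.0, 1.0)

def async_optimize(budget=20, workers=4):
    history = []                       # [(x, f(x)), ...]
    with ProcessPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(expensive_sim, x): x
                   for x in (propose(history) for _ in range(workers))}
        launched = workers
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for fut in done:
                history.append((pending.pop(fut), fut.result()))
                if launched < budget:  # refill immediately, no batch barrier
                    x = propose(history)
                    pending[pool.submit(expensive_sim, x)] = x
                    launched += 1
    return min(history, key=lambda p: p[1])

if __name__ == "__main__":
    print("best:", async_optimize())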
arXiv Detail & Related papers (2020-03-20T18:02:27Z)