A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems
- URL: http://arxiv.org/abs/2601.03992v1
- Date: Wed, 07 Jan 2026 15:02:57 GMT
- Title: A Scheduling Framework for Efficient MoE Inference on Edge GPU-NDP Systems
- Authors: Qi Wu, Chao Fang, Jiayuan Chen, Ye Lin, Yueqi Zhang, Yichuan Bai, Yuan Du, Li Du
- Abstract summary: Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching.
- Score: 28.86723467729703
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) models facilitate edge deployment by decoupling model capacity from active computation, yet their large memory footprint drives the need for GPU systems with near-data processing (NDP) capabilities that offload experts to dedicated processing units. However, deploying MoE models on such edge-based GPU-NDP systems faces three critical challenges: 1) severe load imbalance across NDP units due to non-uniform expert selection and expert parallelism, 2) insufficient GPU utilization during expert computation within NDP units, and 3) extensive data pre-profiling necessitated by unpredictable expert activation patterns for pre-fetching. To address these challenges, this paper proposes an efficient inference framework featuring three key optimizations. First, the underexplored tensor parallelism in MoE inference is exploited to partition and compute large expert parameters across multiple NDP units simultaneously, targeting the low-batch scenarios typical of edge deployment. Second, a load-balancing-aware scheduling algorithm distributes expert computations across the NDP units and the GPU to maximize resource utilization. Third, a dataset-free pre-fetching strategy proactively loads frequently accessed experts to minimize activation delays. Experimental results show that our framework enables GPU-NDP systems to achieve an average 2.41x and up to 2.56x speedup in end-to-end latency compared to state-of-the-art approaches, significantly enhancing MoE inference efficiency in resource-constrained environments.
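To make the three optimizations above more concrete, the following minimal Python sketch illustrates a greedy load-balancing-aware assignment of activated experts to NDP units and the GPU, together with a dataset-free (online) prefetch heuristic. The device names, relative cost figures, top-k policy, and all function names are illustrative assumptions, not the authors' algorithm; in particular, the paper's scheduler also exploits tensor parallelism to split a single expert across multiple NDP units, which the sketch collapses into a single per-device cost.

```python
# Minimal sketch (not the paper's implementation): greedy, load-balancing-aware
# placement of activated experts onto NDP units and the GPU, plus a dataset-free
# prefetch hint based on running activation counts. All costs are illustrative.
from collections import Counter
from dataclasses import dataclass, field
import heapq

@dataclass
class Device:
    name: str
    cost_per_expert: float          # assumed relative latency of one expert FFN
    load: float = 0.0               # accumulated latency on this device
    assigned: list = field(default_factory=list)

def schedule_experts(activated_experts, devices):
    """Greedily place each activated expert on the device that finishes it earliest."""
    heap = [(d.load, i) for i, d in enumerate(devices)]
    heapq.heapify(heap)
    for expert_id in activated_experts:
        load, i = heapq.heappop(heap)
        dev = devices[i]
        dev.load = load + dev.cost_per_expert
        dev.assigned.append(expert_id)
        heapq.heappush(heap, (dev.load, i))
    return devices

class PrefetchTracker:
    """Dataset-free prefetching: track activation frequency online and
    suggest the hottest experts to keep resident (no offline profiling)."""
    def __init__(self, top_k=4):
        self.counts = Counter()
        self.top_k = top_k
    def observe(self, activated_experts):
        self.counts.update(activated_experts)
    def prefetch_set(self):
        return [e for e, _ in self.counts.most_common(self.top_k)]

if __name__ == "__main__":
    devices = [Device("GPU", cost_per_expert=1.0)] + [
        Device(f"NDP{i}", cost_per_expert=2.5) for i in range(4)
    ]
    activated = [3, 7, 3, 12, 7, 3, 9, 1]   # expert ids chosen by the router
    tracker = PrefetchTracker()
    tracker.observe(activated)
    for d in schedule_experts(activated, devices):
        print(d.name, d.assigned, round(d.load, 2))
    print("prefetch:", tracker.prefetch_set())
```

In a real deployment, the per-device cost would be estimated from measured NDP and GPU throughput rather than fixed constants.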
Related papers
- TriMoE: Augmenting GPU with AMX-Enabled CPU and DIMM-NDP for High-Throughput MoE Inference via Offloading [38.243293392367086]
TriMoE is a novel GPU-CPU-NDP architecture that exploits an AMX-enabled CPU to map hot, warm, and cold experts onto their optimal compute units. Experiments demonstrate that TriMoE achieves up to 2.83x speedup over state-of-the-art solutions.
arXiv Detail & Related papers (2026-03-01T11:27:37Z) - HALO: Semantic-Aware Distributed LLM Inference in Lossy Edge Network [50.33808558714122]
Large language model (LLM) inference at the edge enables prompt service responsiveness while protecting user privacy. HALO is a novel framework that boosts distributed LLM inference over lossy edge networks. Experimental results from a Raspberry Pi cluster demonstrate that HALO achieves a 3.41x end-to-end speedup for LLaMA-series LLMs under unreliable network conditions.
arXiv Detail & Related papers (2026-01-16T07:37:23Z) - Efficient MoE Inference with Fine-Grained Scheduling of Disaggregated Expert Parallelism [32.0077995747942]
FinDEP is a fine-grained task scheduling algorithm for disaggregated expert parallelism (DEP) that maximizes task overlap to improve MoE inference throughput. Experiments on four GPU systems with DeepSeek-V2 and Qwen3-MoE show that FinDEP improves throughput by up to 1.61x over prior methods.
arXiv Detail & Related papers (2025-12-25T03:22:03Z) - Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing [29.98726492279776]
Mixture-of-Experts (MoE) has become a dominant architecture in large language models, but MoE models incur high inference costs due to memory-intensive parameter caching. Remoe is a heterogeneous MoE inference system tailored for serverless computing.
arXiv Detail & Related papers (2025-12-21T10:27:50Z) - Eliminating Multi-GPU Performance Taxes: A Systems Approach to Efficient Distributed LLMs [61.953548065938385]
We introduce the "Three Taxes" (Bulk Synchronous, Inter-Kernel Data Locality, and Kernel Launch Overhead) as an analytical framework. We propose moving beyond the rigid BSP model to address key inefficiencies in distributed GPU execution. We observe a 10-20% speedup in end-to-end latency over BSP-based approaches.
arXiv Detail & Related papers (2025-11-04T01:15:44Z) - Efficient Edge LLMs Deployment via HessianAware Quantization and CPU GPU Collaborative [31.74122603714625]
The Mixture-of-Experts (MoE) architecture enhances model capacity through sparse activation, but it faces two major difficulties in practical deployment: under limited memory, efficient offloading and collaborative inference of expert modules struggle to balance latency and throughput. This paper proposes an efficient MoE edge deployment scheme based on Hessian-Aware Quantization (HAQ) and CPU-GPU collaborative inference.
arXiv Detail & Related papers (2025-08-10T12:59:57Z) - Intra-DP: A High Performance Collaborative Inference System for Mobile Edge Computing [67.98609858326951]
Intra-DP is a high-performance collaborative inference system optimized for deep neural networks (DNNs) on mobile devices. The evaluation demonstrates that Intra-DP reduces per-inference latency by up to 50% and energy consumption by up to 75% compared to state-of-the-art baselines.
arXiv Detail & Related papers (2025-07-08T09:50:57Z) - A Multi-Head Ensemble Multi-Task Learning Approach for Dynamical Computation Offloading [62.34538208323411]
We propose a multi-head ensemble multi-task learning (MEMTL) approach with a shared backbone and multiple prediction heads (PHs).
MEMTL outperforms benchmark methods in both inference accuracy and mean squared error without requiring additional training data.
arXiv Detail & Related papers (2023-09-02T11:01:16Z) - SqueezeLLM: Dense-and-Sparse Quantization [80.32162537942138]
The main bottleneck for generative inference with LLMs is memory bandwidth, rather than compute, particularly for single-batch inference.
We introduce SqueezeLLM, a post-training quantization framework that enables lossless compression to ultra-low precisions of up to 3 bits.
Our framework incorporates two novel ideas: (i) sensitivity-based non-uniform quantization, which searches for the optimal bit-precision assignment based on second-order information; and (ii) the Dense-and-Sparse decomposition, which stores outliers and sensitive weight values in an efficient sparse format (a minimal sketch of this decomposition appears after this list).
arXiv Detail & Related papers (2023-06-13T08:57:54Z) - Design and Prototyping Distributed CNN Inference Acceleration in Edge Computing [85.74517957717363]
HALP accelerates inference by orchestrating seamless collaboration among edge devices (EDs) in edge computing.
Experiments show that distributed inference with HALP achieves 1.7x inference acceleration for VGG-16.
Model selection combined with HALP's distributed inference is also shown to significantly improve service reliability.
arXiv Detail & Related papers (2022-11-24T19:48:30Z) - EdgeBERT: Sentence-Level Energy Optimizations for Latency-Aware Multi-Task NLP Inference [82.1584439276834]
Transformer-based language models such as BERT provide significant accuracy improvements for a multitude of natural language processing (NLP) tasks.
We present EdgeBERT, an in-depth algorithm-hardware co-design for latency-aware energy optimization in multi-task NLP.
arXiv Detail & Related papers (2020-11-28T19:21:47Z)
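As noted in the SqueezeLLM entry above, a Dense-and-Sparse decomposition keeps a small fraction of outlier and sensitive weights in a full-precision sparse structure while the dense remainder is quantized to very low precision. The sketch below is a minimal NumPy illustration of that idea under simplifying assumptions (a uniform symmetric quantizer and a magnitude-based outlier threshold); SqueezeLLM itself uses sensitivity-based non-uniform codebooks, so this is not the authors' implementation.

```python
# Illustrative sketch of a dense-and-sparse weight decomposition (not the
# SqueezeLLM code): outliers stay in a sparse full-precision structure, and
# the dense remainder is quantized with a simple uniform 3-bit quantizer.
import numpy as np

def dense_sparse_decompose(w, outlier_fraction=0.005, bits=3):
    """Split w into (quantized dense part, scale, sparse outlier part)."""
    threshold = np.quantile(np.abs(w), 1.0 - outlier_fraction)
    outlier_mask = np.abs(w) > threshold
    sparse_part = np.where(outlier_mask, w, 0.0)     # kept in full precision
    dense_part = np.where(outlier_mask, 0.0, w)

    # Uniform symmetric quantization of the dense remainder; SqueezeLLM
    # instead uses non-uniform, sensitivity-based codebooks.
    levels = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(dense_part)) / levels
    q = np.round(dense_part / scale).astype(np.int8)
    return q, scale, sparse_part

def reconstruct(q, scale, sparse_part):
    return q.astype(np.float32) * scale + sparse_part

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(256, 256)).astype(np.float32)
    w[0, 0] = 25.0                                   # inject an outlier
    q, scale, sparse = dense_sparse_decompose(w)
    err = np.mean(np.abs(reconstruct(q, scale, sparse) - w))
    print(f"outliers kept: {int((sparse != 0).sum())}, mean abs error: {err:.4f}")
```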