Related papers: Enabling MoE on the Edge via Importance-Driven Expert Scheduling

URL: http://arxiv.org/abs/2508.18983v1
Date: Tue, 26 Aug 2025 12:32:09 GMT
Title: Enabling MoE on the Edge via Importance-Driven Expert Scheduling
Authors: Guoying Zhu, Meng Li, Haipeng Dai, Xuechen Liu, Weijun Wang, Keran Li, Jun Xiao, Ligeng Chen, Wei Wang
Abstract summary: MoE is a key technique for scaling Large Language Models by activating only a subset of experts per query. We leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory. This design reduces memory usage and data transfer, while largely eliminating PCIe overhead.
Score: 21.860330824352527
License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
Abstract: The Mixture of Experts (MoE) architecture has emerged as a key technique for scaling Large Language Models by activating only a subset of experts per query. Deploying MoE on consumer-grade edge hardware, however, is constrained by limited device memory, making dynamic expert offloading essential. Unlike prior work that treats offloading purely as a scheduling problem, we leverage expert importance to guide decisions, substituting low-importance activated experts with functionally similar ones already cached in GPU memory, thereby preserving accuracy. As a result, this design reduces memory usage and data transfer, while largely eliminating PCIe overhead. In addition, we introduce a scheduling policy that maximizes the reuse ratio of GPU-cached experts, further boosting efficiency. Extensive evaluations show that our approach delivers 48% lower decoding latency with over 60% expert cache hit rate, while maintaining nearly lossless accuracy.
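The substitution idea in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes the router's gate weight serves as the importance signal, and the names `schedule_experts`, `similarity`, and `importance_threshold` are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Set, Tuple


def schedule_experts(
    activated: List[Tuple[int, float]],        # (expert id, gate weight) chosen by the router
    cached: Set[int],                          # expert ids currently resident in GPU memory
    similarity: Dict[Tuple[int, int], float],  # hypothetical precomputed similarity table
    importance_threshold: float = 0.2,         # assumed cutoff for "low importance"
) -> Tuple[List[Tuple[int, float]], List[int]]:
    """Return (experts to run with their gate weights, experts to fetch over PCIe)."""
    to_run: List[Tuple[int, float]] = []
    to_fetch: List[int] = []
    # Cached experts that are themselves activated cannot double as substitutes.
    used = {e for e, _ in activated if e in cached}
    for expert, weight in activated:
        if expert in cached:
            to_run.append((expert, weight))  # cache hit, no transfer needed
            continue
        candidates = [c for c in sorted(cached) if c not in used]
        if weight < importance_threshold and candidates:
            # Low-importance miss: reuse the most similar cached expert instead of loading.
            substitute = max(candidates, key=lambda c: similarity.get((expert, c), 0.0))
            used.add(substitute)
            to_run.append((substitute, weight))
        else:
            # Important miss (or nothing to substitute): pay the transfer cost.
            to_fetch.append(expert)
            to_run.append((expert, weight))
    return to_run, to_fetch


if __name__ == "__main__":
    run, fetch = schedule_experts(
        activated=[(3, 0.7), (5, 0.1)],
        cached={0, 1},
        similarity={(5, 0): 0.4, (5, 1): 0.9},
    )
    print(run, fetch)
```

In this toy call, expert 3 carries a high gate weight and is fetched, while the low-weight expert 5 is replaced by cached expert 1, so only one expert crosses the PCIe link.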

Related papers

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters. ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76x higher throughput than the state-of-the-art systems.
arXiv Detail & Related papers (2026-01-29T02:51:59Z)
MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts [29.437264687850874]
We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves up to 2.34x speedup over the state-of-the-art offloading framework.
arXiv Detail & Related papers (2025-11-18T03:40:19Z)
MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts [17.518573710849513]
MoBiLE is a plug-and-play offloading-based MoE inference framework with a mixture of big-little experts. MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.
arXiv Detail & Related papers (2025-10-14T10:22:44Z)
MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z)
One-Prompt Strikes Back: Sparse Mixture of Experts for Prompt-based Continual Learning [52.966712416640085]
We propose SMoPE, a novel framework that integrates the benefits of both task-specific and shared prompt strategies. SMoPE consistently outperforms task-specific prompt methods and achieves performance competitive with state-of-the-art approaches.
arXiv Detail & Related papers (2025-09-29T08:54:58Z)
MoE-Beyond: Learning-Based Expert Activation Prediction on Edge Devices [0.0]
We introduce MoE-Beyond, a learning-based expert activation predictor trained to predict expert activations during autoregressive decoding. Our predictor generalizes effectively across unseen prompts from the WebGLM-QA dataset, achieving 97.5% accuracy and an 86.6% F1-score.
arXiv Detail & Related papers (2025-08-23T20:28:32Z)
QuartDepth: Post-Training Quantization for Real-Time Depth Estimation on the Edge [55.75103034526652]
We propose QuartDepth which adopts post-training quantization to quantize MDE models with hardware accelerations for ASICs. Our approach involves quantizing both weights and activations to 4-bit precision, reducing the model size and computation cost. We design a flexible and programmable hardware accelerator by supporting kernel fusion and customized instruction programmability.
arXiv Detail & Related papers (2025-03-20T21:03:10Z)
Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference. MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
arXiv Detail & Related papers (2025-03-20T02:31:57Z)
eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference [6.642099288463585]
We propose eMoE, a memory efficient inference system for large language models (LLMs). eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.
arXiv Detail & Related papers (2025-03-10T01:11:52Z)
DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference [14.676716521856813]
Mixture-of-Experts (MoE) models face significant deployment challenges on memory-constrained devices. We present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.
arXiv Detail & Related papers (2024-12-16T07:59:21Z)
HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference. Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency. HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference [41.41316718220569]
ExpertFlow is designed to enhance inference efficiency by accommodating flexible routing and enabling efficient expert scheduling between CPU and GPU. Our experiments demonstrate that ExpertFlow achieves up to 93.72% GPU memory savings and enhances inference speed by 2 to 10 times compared to baseline methods.
arXiv Detail & Related papers (2024-10-23T15:24:54Z)
Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)

This list is automatically generated from the titles and abstracts of the papers on this site.