MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
- URL: http://arxiv.org/abs/2511.14102v1
- Date: Tue, 18 Nov 2025 03:40:19 GMT
- Title: MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts
- Authors: Wenfeng Wang, Jiacheng Liu, Xiaofeng Hou, Xinfeng Xia, Peng Tang, Mingxuan Zhang, Chao Li, Minyi Guo
- Abstract summary: We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves at most 2.34x speedup over the state-of-the-art offloading framework.
- Score: 29.437264687850874
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The immense memory requirements of state-of-the-art Mixture-of-Experts (MoE) models present a significant challenge for inference, often exceeding the capacity of a single accelerator. While offloading experts to host memory is a common solution, it introduces a severe I/O bottleneck over the PCIe bus, as the data-dependent nature of expert selection places these synchronous transfers directly on the critical path of execution, crippling performance. This paper argues that the I/O bottleneck can be overcome by trading a small amount of cheap, on-device computation to hide the immense cost of data movement. We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. This foresight enables a runtime orchestrator to prefetch these experts from host memory, effectively overlapping the expensive I/O with useful computation and hiding the latency from the critical path. To maximize performance, an adaptive governor, guided by an Amortization Roofline Model, dynamically tunes the speculation strategy to the underlying hardware. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves at most 2.34x speedup over the state-of-the-art offloading framework. Our work establishes a new, principled approach for managing data-dependent memory access in resource-limited environments, making MoE inference more accessible on commodity hardware.
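The prefetching idea in the abstract can be illustrated with a small simulation. This is a minimal sketch, not the paper's implementation: `ExpertCache`, `count_blocking_loads`, the LRU eviction policy, and the toy expert traces are all hypothetical, and the sketch ignores quantization, transfer timing, and the adaptive governor. It only counts how many expert loads land on the critical path with and without predictions from a draft model.

```python
from collections import OrderedDict


class ExpertCache:
    """LRU set of expert IDs resident in accelerator memory (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.resident = OrderedDict()

    def __contains__(self, expert_id):
        return expert_id in self.resident

    def load(self, expert_id):
        """Make an expert resident, evicting the least recently used one if full."""
        if expert_id in self.resident:
            self.resident.move_to_end(expert_id)
            return
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[expert_id] = True


def count_blocking_loads(actual, predicted, capacity):
    """Count expert loads that stall decoding (misses on the critical path).

    actual[t]    : experts the real router selects at step t
    predicted[t] : experts the draft model guessed for step t (empty = no prefetch)
    """
    cache = ExpertCache(capacity)
    blocking = 0
    for t, needed in enumerate(actual):
        for expert in needed:
            if expert not in cache:
                blocking += 1  # synchronous transfer: compute waits on I/O
            cache.load(expert)
        # While step t computes, prefetch the experts predicted for step t + 1.
        if t + 1 < len(predicted):
            for expert in predicted[t + 1]:
                cache.load(expert)
    return blocking


# Toy traces (hypothetical): two experts per step, four steps.
actual = [[0, 1], [2, 3], [4, 5], [0, 1]]
no_prefetch = [[] for _ in actual]
perfect = actual
partly_wrong = [[0, 1], [2, 3], [4, 7], [0, 1]]  # one wrong guess at step 2

print(count_blocking_loads(actual, no_prefetch, capacity=4))
print(count_blocking_loads(actual, perfect, capacity=4))
print(count_blocking_loads(actual, partly_wrong, capacity=4))
```

With room for four experts, the trace without prefetching stalls on all 8 loads, perfect prediction leaves only the 2 cold-start loads, and a single misprediction adds 1 more stall. A wrong guess costs one blocking load rather than corrupting the output, because the real router still decides which experts run.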
Related papers
- ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling [56.88966608455977]
ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters. ZipMoE achieves up to 72.77% inference latency reduction and up to 6.76x higher throughput than the state-of-the-art systems.
arXiv Detail & Related papers (2026-01-29T02:51:59Z)
- Least-Loaded Expert Parallelism: Load Balancing An Imbalanced Mixture-of-Experts [74.40169987564724]
Expert parallelism (EP) is designed to scale MoE models by distributing experts across multiple devices. Under extreme imbalance, EP can funnel a disproportionate number of tokens to a small number of experts, leading to compute- and memory-bound failures. We propose Least-Loaded Expert Parallelism (LLEP), a novel EP algorithm that dynamically reroutes excess tokens and associated expert parameters from overloaded devices to underutilized ones.
arXiv Detail & Related papers (2026-01-23T18:19:15Z)
- ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference [8.296993547783808]
ExpertFlow is a runtime system for MoE inference that combines adaptive expert prefetching and cache-aware routing. Our evaluation demonstrates that ExpertFlow reduces model stall time to less than 0.1% of the baseline.
arXiv Detail & Related papers (2025-10-30T17:29:27Z)
- MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z)
- Enabling MoE on the Edge via Importance-Driven Expert Scheduling [21.860330824352527]
MoE is a key technique for scaling Large Language Models by activating only a subset of experts per query. We leverage expert importance to guide decisions, substituting low-cached activated experts with functionally similar ones already cached in GPU memory. This design reduces memory usage and data transfer, while largely eliminating PCIe overhead.
arXiv Detail & Related papers (2025-08-26T12:32:09Z)
- D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving [14.607254882119507]
The Mixture-of-Experts (MoE) model is a sparse variant of large language models (LLMs). Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices. We propose D$^{2}$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most proper bit-width to each expert.
arXiv Detail & Related papers (2025-04-17T05:37:35Z)
- A Universal Framework for Compressing Embeddings in CTR Prediction [68.27582084015044]
We introduce a Model-agnostic Embedding Compression (MEC) framework that compresses embedding tables by quantizing pre-trained embeddings. Our approach consists of two stages: first, we apply popularity-weighted regularization to balance code distribution between high- and low-frequency features. Experiments on three datasets reveal that our method reduces memory usage by over 50x while maintaining or improving recommendation performance.
arXiv Detail & Related papers (2025-02-21T10:12:34Z)
- Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference [14.57414071160821]
We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We present on-device results demonstrating 2x speedups on mobile devices.
arXiv Detail & Related papers (2024-11-27T18:59:48Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z)
- Temporal Feature Matters: A Framework for Diffusion Model Quantization [105.3033493564844]
Diffusion models rely on the time-step for the multi-round denoising. We introduce a novel quantization framework that includes three strategies. This framework preserves most of the temporal information and ensures high-quality end-to-end generation.
arXiv Detail & Related papers (2024-07-28T17:46:15Z)
This list is automatically generated from the titles and abstracts of the papers in this site.