ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
- URL: http://arxiv.org/abs/2510.26730v1
- Date: Thu, 30 Oct 2025 17:29:27 GMT
- Title: ExpertFlow: Adaptive Expert Scheduling and Memory Coordination for Efficient MoE Inference
- Authors: Zixu Shen, Kexin Chu, Yifan Zhang, Dawei Xiang, Runxin Wu, Wei Zhang,
- Abstract summary: ExpertFlow is a runtime system for MoE inference that combines adaptive expert prefetching and cache-aware routing. Our evaluation demonstrates that ExpertFlow reduces model stall time to less than 0.1% of the baseline.
- Score: 8.296993547783808
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: The expansion of large language models is increasingly limited by the constrained memory capacity of modern GPUs. To mitigate this, Mixture-of-Experts (MoE) architectures activate only a small portion of parameters during inference, significantly lowering both memory demand and computational overhead. However, conventional MoE inference approaches, which select active experts independently at each layer, often introduce considerable latency because of frequent parameter transfers between host and GPU memory. In addition, current cross-layer prediction strategies, which are typically based on fixed steps, lack adaptability across different hardware platforms and workloads, thereby reducing their robustness and effectiveness. To address these challenges, we present ExpertFlow, a runtime system for MoE inference that combines adaptive expert prefetching and cache-aware routing. ExpertFlow continuously adjusts its prediction horizon for expert activation by leveraging runtime statistics such as transfer bandwidth, parameter dimensionality, and model feedback signals. Furthermore, it incorporates a hybrid cross-layer prediction scheme that fuses pregating information with intermediate computational states to anticipate future expert needs. By adaptively refining prefetching decisions and aligning them with actual usage behavior, ExpertFlow effectively decreases cache misses and removes latency caused by expert swap-ins. Our evaluation demonstrates that ExpertFlow reduces model stall time to less than 0.1% of the baseline, highlighting its capability to optimize MoE inference under stringent memory constraints.
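The abstract states that the prediction horizon is adjusted from runtime statistics such as transfer bandwidth and parameter size, but it gives no formula. The following is a minimal sketch of one way such a rule could look, not the authors' method: it assumes the horizon is the number of layers whose compute time is needed to hide one expert transfer, and it smooths measurements with an exponential moving average. The names `HorizonController`, `observe`, `horizon`, `alpha`, and `max_horizon` are all hypothetical.

```python
import math


class HorizonController:
    """Hypothetical controller: picks how many layers ahead to prefetch an expert.

    Assumption (not from the paper): the horizon is chosen so that the time to
    copy one expert from host to GPU is covered by the compute time of the
    layers that run in the meantime.
    """

    def __init__(self, expert_bytes, max_horizon=8, alpha=0.5):
        self.expert_bytes = expert_bytes  # size of one expert's parameters
        self.max_horizon = max_horizon    # upper bound on lookahead depth
        self.alpha = alpha                # EMA weight given to the newest sample
        self.bandwidth = None             # smoothed bytes per second
        self.layer_compute_s = None       # smoothed seconds per layer

    def _ema(self, old, new):
        # The first sample initializes the estimate; later samples are blended in.
        return new if old is None else (1 - self.alpha) * old + self.alpha * new

    def observe(self, bytes_moved, transfer_s, layer_compute_s):
        """Record one measured transfer and one measured layer compute time."""
        if bytes_moved <= 0 or transfer_s <= 0 or layer_compute_s <= 0:
            raise ValueError("measurements must be positive")
        self.bandwidth = self._ema(self.bandwidth, bytes_moved / transfer_s)
        self.layer_compute_s = self._ema(self.layer_compute_s, layer_compute_s)

    def horizon(self):
        """Return the lookahead depth in layers, clamped to [1, max_horizon]."""
        if self.bandwidth is None:
            return 1  # no statistics yet: fall back to next-layer prefetch
        transfer_s = self.expert_bytes / self.bandwidth
        steps = math.ceil(transfer_s / self.layer_compute_s)
        return max(1, min(self.max_horizon, steps))


# Usage: a 100 MB expert, with measurements fed in as they arrive.
ctrl = HorizonController(expert_bytes=100e6)
ctrl.observe(bytes_moved=1e9, transfer_s=0.1, layer_compute_s=0.004)
fast_link = ctrl.horizon()
ctrl.observe(bytes_moved=1e9, transfer_s=1.0, layer_compute_s=0.004)
slow_link = ctrl.horizon()
```

In this sketch, a drop in measured bandwidth makes each transfer take longer, so the controller prefetches further ahead. A fixed-step predictor would keep the same depth on both links.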
Related papers
- TS-Memory: Plug-and-Play Memory for Time Series Foundation Models [63.21390142212087]
Time Series Foundation Models (TSFMs) achieve strong zero-shot forecasting through large-scale pre-training. Existing solutions face a trade-off: Parametric Adaptation can cause catastrophic forgetting, while Non-Parametric Retrieval improves forecasts but incurs high latency due to datastore search. We propose Parametric Memory Distillation and implement it as TS-Memory, a lightweight memory adapter that augments frozen TSFMs.
arXiv Detail & Related papers (2026-02-12T04:16:19Z) - MoE-SpeQ: Speculative Quantized Decoding with Proactive Expert Prefetching and Offloading for Mixture-of-Experts [29.437264687850874]
We present MoE-SpeQ, a new inference system built on a novel co-design of speculative execution and expert offloading. MoE-SpeQ employs a small, on-device draft model to predict the sequence of required experts for future tokens. Our evaluation on memory-constrained devices shows that for the Phi-MoE model, MoE-SpeQ achieves up to a 2.34x speedup over the state-of-the-art offloading framework.
arXiv Detail & Related papers (2025-11-18T03:40:19Z) - OptPipe: Memory- and Scheduling-Optimized Pipeline Parallelism for LLM Training [13.814101909348183]
Pipeline parallelism (PP) has become a standard technique for scaling large language model (LLM) training across multiple devices. In this work, we revisit the pipeline scheduling problem from a principled optimization perspective. We formulate scheduling as a constrained optimization problem that jointly accounts for memory capacity, activation reuse, and pipeline bubble minimization.
arXiv Detail & Related papers (2025-10-06T01:06:33Z) - CSGO: Generalized Optimization for Cold Start in Wireless Collaborative Edge LLM Systems [62.24576366776727]
We propose a latency-aware scheduling framework to minimize total inference latency. We show that the proposed method significantly reduces cold-start latency compared to baseline strategies.
arXiv Detail & Related papers (2025-08-15T07:49:22Z) - The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks [56.37880529653111]
The demand for large AI model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In this paper, we investigate the LAIM-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment.
arXiv Detail & Related papers (2025-05-14T08:18:55Z) - Klotski: Efficient Mixture-of-Expert Inference via Expert-Aware Multi-Batch Pipeline [39.52960562420227]
Mixture of Experts (MoE) enables the scaling of language models up to trillions of parameters without significantly increasing computational costs. Offloading techniques utilise memory from the CPU and disk and parallelise the I/O and computation for efficiency. We propose Klotski, an efficient MoE inference engine that significantly reduces pipeline bubbles through a novel expert-aware multi-batch pipeline paradigm.
arXiv Detail & Related papers (2025-02-09T08:47:06Z) - Dovetail: A CPU/GPU Heterogeneous Speculative Decoding for LLM inference [31.901686946969786]
Dovetail is an inference method that leverages the complementary characteristics of heterogeneous devices and the advantages of speculative decoding. Dovetail achieves inference speedups ranging from 1.79x to 10.1x across different devices, while maintaining consistency and stability in the distribution of generated texts.
arXiv Detail & Related papers (2024-12-25T15:45:18Z) - ExpertFlow: Optimized Expert Activation and Token Allocation for Efficient Mixture-of-Experts Inference [41.41316718220569]
ExpertFlow is designed to enhance inference efficiency by accommodating flexible routing and enabling efficient expert scheduling between CPU and GPU.
Our experiments demonstrate that ExpertFlow achieves up to 93.72% GPU memory savings and enhances inference speed by 2 to 10 times compared to baseline methods.
arXiv Detail & Related papers (2024-10-23T15:24:54Z) - Temporal Feature Matters: A Framework for Diffusion Model Quantization [105.3033493564844]
Diffusion models rely on the time-step for multi-round denoising. We introduce a novel quantization framework that includes three strategies. This framework preserves most of the temporal information and ensures high-quality end-to-end generation.
arXiv Detail & Related papers (2024-07-28T17:46:15Z) - SHERL: Synthesizing High Accuracy and Efficient Memory for Resource-Limited Transfer Learning [63.93193829913252]
We propose an innovative memory-efficient transfer learning (METL) strategy called SHERL for resource-limited scenarios.
In the early route, intermediate outputs are consolidated via an anti-redundancy operation.
In the late route, utilizing minimal late pre-trained layers could alleviate the peak demand on memory overhead.
arXiv Detail & Related papers (2024-07-10T10:22:35Z) - Sparser is Faster and Less is More: Efficient Sparse Attention for Long-Range Transformers [58.5711048151424]
We introduce SPARSEK Attention, a novel sparse attention mechanism designed to overcome computational and memory obstacles.
Our approach integrates a scoring network and a differentiable top-k mask operator, SPARSEK, to select a constant number of KV pairs for each query.
Experimental results reveal that SPARSEK Attention outperforms previous sparse attention methods.
arXiv Detail & Related papers (2024-06-24T15:55:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site.