Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
- URL: http://arxiv.org/abs/2509.08342v1
- Date: Wed, 10 Sep 2025 07:28:24 GMT
- Title: Accelerating Mixture-of-Expert Inference with Adaptive Expert Split Mechanism
- Authors: Jiaming Yan, Jianchun Liu, Hongli Xu, Liusheng Huang
- Abstract summary: MoEpic is an efficient MoE inference system with a novel expert split mechanism. Experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost.
- Score: 29.862588578556366
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) has emerged as a promising architecture for modern large language models (LLMs). However, massive parameters impose heavy GPU memory (i.e., VRAM) demands, hindering the widespread adoption of MoE LLMs. Offloading the expert parameters to CPU RAM offers an effective way to alleviate the VRAM requirements for MoE inference. Existing approaches typically cache a small subset of experts in VRAM and dynamically prefetch experts from RAM during inference, leading to significant degradation in inference speed due to the poor cache hit rate and substantial expert loading latency. In this work, we propose MoEpic, an efficient MoE inference system with a novel expert split mechanism. Specifically, each expert is vertically divided into two segments: top and bottom. MoEpic caches the top segment of hot experts, so that more experts will be stored under the limited VRAM budget, thereby improving the cache hit rate. During each layer's inference, MoEpic predicts and prefetches the activated experts for the next layer. Since the top segments of cached experts are exempt from fetching, the loading time is reduced, which allows efficient transfer-computation overlap. Nevertheless, the performance of MoEpic critically depends on the cache configuration (i.e., each layer's VRAM budget and expert split ratio). To this end, we propose a divide-and-conquer algorithm based on fixed-point iteration for adaptive cache configuration. Extensive experiments on popular MoE LLMs demonstrate that MoEpic can save about half of the GPU cost, while lowering the inference latency by about 37.51%-65.73% compared to the baselines.
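One natural way to realize the vertical split described in the abstract is to partition each expert FFN along its hidden dimension, so the expert's output decomposes exactly into a "top" partial sum (kept in VRAM for hot experts) and a "bottom" partial sum (fetched from CPU RAM and overlapped with computation). The sketch below is a minimal illustration under that assumption only; the paper does not specify the exact split, all class and variable names are hypothetical, and a plain ReLU FFN stands in for the gated experts used in real MoE LLMs.

```python
# Minimal sketch (not the authors' implementation) of splitting a two-layer FFN
# expert along its hidden dimension into a resident "top" segment and an
# offloaded "bottom" segment. All names are illustrative assumptions.

import torch


class SplitExpert:
    """Expert y = relu(x @ W1) @ W2, split on the hidden dimension."""

    def __init__(self, w1: torch.Tensor, w2: torch.Tensor, split_ratio: float):
        d_ff = w1.shape[1]
        k = int(d_ff * split_ratio)               # size of the cached "top" segment
        # Top segment: first k hidden units (would stay in VRAM for hot experts).
        self.w1_top, self.w2_top = w1[:, :k], w2[:k, :]
        # Bottom segment: remaining hidden units (would live in CPU RAM).
        self.w1_bot, self.w2_bot = w1[:, k:], w2[k:, :]

    def forward_top(self, x: torch.Tensor) -> torch.Tensor:
        # Partial output from the resident segment; can start immediately.
        return torch.relu(x @ self.w1_top) @ self.w2_top

    def forward_bottom(self, x: torch.Tensor) -> torch.Tensor:
        # Partial output from the prefetched segment; overlaps with the transfer.
        return torch.relu(x @ self.w1_bot) @ self.w2_bot

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The two partial sums reconstruct the full expert output exactly.
        return self.forward_top(x) + self.forward_bottom(x)


if __name__ == "__main__":
    torch.manual_seed(0)
    d_model, d_ff = 64, 256
    w1, w2 = torch.randn(d_model, d_ff), torch.randn(d_ff, d_model)
    x = torch.randn(4, d_model)
    expert = SplitExpert(w1, w2, split_ratio=0.25)
    full = torch.relu(x @ w1) @ w2
    assert torch.allclose(expert.forward(x), full, atol=1e-4)
```

Because the split is exact, caching only the top segment trades no accuracy, only latency: the bottom segment of a predicted expert can be prefetched during the previous layer's computation, which is the transfer-computation overlap the abstract refers to.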
Related papers
- MELINOE: Fine-Tuning Enables Memory-Efficient Inference for Mixture-of-Experts Models [13.907916161242794]
Mixture-of-Experts (MoE) model architectures can significantly reduce the number of activated parameters per token. Their large overall parameter counts and model sizes have precluded their widespread usage in resource-constrained settings. We present MELINOE, a method that fine-tunes an MoE model to more strongly prefer activating a smaller number of experts per sequence.
arXiv Detail & Related papers (2026-01-30T14:40:18Z) - MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts [17.518573710849513]
MoBiLE is a plug-and-play offloading-based MoE inference framework with a mixture of big-little experts. MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.
arXiv Detail & Related papers (2025-10-14T10:22:44Z) - FloE: On-the-Fly MoE Inference on Memory-constrained GPU [22.2581000412208]
FloE is built on the insight that there exists substantial untapped redundancy within sparsely activated experts. FloE achieves a 9.3x compression of parameters per expert in Mixtral-8x7B. It enables deployment on a GPU with only 11GB VRAM, reducing the memory footprint by up to 8.5x.
arXiv Detail & Related papers (2025-05-09T10:53:47Z) - Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference. MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
arXiv Detail & Related papers (2025-03-20T02:31:57Z) - QuantCache: Adaptive Importance-Guided Quantization with Hierarchical Latent and Layer Caching for Video Generation [84.91431271257437]
Diffusion Transformers (DiTs) have emerged as a dominant architecture in video generation. DiTs come with significant drawbacks, including increased computational and memory costs. We propose QuantCache, a novel training-free inference acceleration framework.
arXiv Detail & Related papers (2025-03-09T10:31:51Z) - fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [9.956997242640728]
fMoE is a fine-grained expert offloading system for MoE serving. We show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z) - DAOP: Data-Aware Offloading and Predictive Pre-Calculation for Efficient MoE Inference [14.676716521856813]
Mixture-of-Experts (MoE) models face significant deployment challenges on memory-constrained devices. We present DAOP, an on-device MoE inference engine to optimize parallel GPU-CPU execution. DAOP outperforms traditional expert caching and prefetching methods by up to 8.20x and offloading techniques by 1.35x while maintaining accuracy.
arXiv Detail & Related papers (2024-12-16T07:59:21Z) - MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs [55.95879347182669]
The MoE architecture is renowned for its ability to increase model capacity without a proportional increase in inference cost.
MoE-Lightning introduces a novel CPU-GPU-I/O pipelining schedule, CGOPipe, with paged weights to achieve high resource utilization.
MoE-Lightning can achieve up to 10.3x higher throughput than state-of-the-art offloading-enabled LLM inference systems for Mixtral 8x7B on a single T4 GPU (16GB).
arXiv Detail & Related papers (2024-11-18T01:06:12Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - vTensor: Flexible Virtual Tensor Management for Efficient LLM Serving [53.972175896814505]
Large Language Models (LLMs) are widely used across various domains, processing millions of daily requests.
arXiv Detail & Related papers (2024-07-22T14:37:58Z) - PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference [57.53291046180288]
Large Language Models (LLMs) have shown remarkable comprehension abilities but face challenges in GPU memory usage during inference.
We propose PyramidInfer, a method that compresses the KV cache by retaining only crucial context layer by layer.
PyramidInfer improves throughput by 2.2x compared to Accelerate, with over 54% GPU memory reduction in the KV cache.
arXiv Detail & Related papers (2024-05-21T06:46:37Z)