eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
- URL: http://arxiv.org/abs/2503.06823v1
- Date: Mon, 10 Mar 2025 01:11:52 GMT
- Title: eMoE: Task-aware Memory Efficient Mixture-of-Experts-Based (MoE) Model Inference
- Authors: Suraiya Tairin, Shohaib Mahmud, Haiying Shen, Anand Iyer
- Abstract summary: We propose eMoE, a memory efficient inference system for large language models (LLMs). eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. It also enables processing prompts 40x longer, batches 4.5x larger, and achieves 1.5x higher throughput.
- Score: 6.642099288463585
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: In recent years, Mixture-of-Experts (MoE) has emerged as an effective approach for enhancing the capacity of deep neural networks (DNNs) with sub-linear computational costs. However, storing all experts on GPUs incurs significant memory overhead, increasing the monetary cost of MoE-based inference. To address this, we propose eMoE, a memory-efficient inference system for MoE-based large language models (LLMs) that builds on observations from our experimental measurements. eMoE reduces memory usage by predicting and loading only the required experts based on recurrent patterns in expert routing. Because we found that reusing the same experts for subsequent prompts has minimal impact on perplexity, eMoE invokes the expert predictor every few prompts rather than for each prompt, which reduces loading latency while maintaining accuracy. In addition, it skips predictions for tasks less sensitive to routing accuracy. Finally, it uses task-aware scheduling to minimize inference latency by considering Service Level Objectives (SLOs), task-specific output lengths, and expert loading latencies. Experimental results show that, compared to existing systems, eMoE reduces memory consumption by up to 80% while maintaining accuracy and reduces inference latency by up to 17%. It also enables processing prompts 40x longer and batches 4.5x larger, and achieves 1.5x higher throughput.
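The abstract's idea of invoking the expert predictor only every few prompts, and skipping it for tasks less sensitive to routing accuracy, can be illustrated with a minimal sketch. This is not the authors' implementation: the class `PeriodicExpertSelector`, the per-task counters, the `interval` parameter, the fallback `default_experts` set, and the toy predictor are all hypothetical names and assumptions introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set


@dataclass
class PeriodicExpertSelector:
    """Hypothetical helper: reuses a predicted expert set across several prompts."""
    predict: Callable[[str], Set[int]]  # stand-in for an expert predictor (assumption)
    interval: int                       # re-predict every `interval` prompts per task
    insensitive_tasks: Set[str] = field(default_factory=set)
    default_experts: Set[int] = field(default_factory=set)
    _seen: Dict[str, int] = field(default_factory=dict)
    _cached: Dict[str, Set[int]] = field(default_factory=dict)
    predictor_calls: int = 0

    def experts_for(self, task: str, prompt: str) -> Set[int]:
        # Tasks assumed insensitive to routing accuracy skip prediction entirely.
        if task in self.insensitive_tasks:
            return self.default_experts
        count = self._seen.get(task, 0)
        # Invoke the predictor only on every `interval`-th prompt of this task.
        if count % self.interval == 0:
            self._cached[task] = self.predict(prompt)
            self.predictor_calls += 1
        self._seen[task] = count + 1
        return self._cached[task]


def toy_predictor(prompt: str) -> Set[int]:
    # Toy rule (assumption): derive two expert ids from the prompt length.
    return {len(prompt) % 8, (len(prompt) + 3) % 8}


selector = PeriodicExpertSelector(
    predict=toy_predictor,
    interval=3,
    insensitive_tasks={"summarization"},
    default_experts={0, 1},
)
prompts = ["a", "bb", "ccc", "dddd"]
chosen = [selector.experts_for("qa", p) for p in prompts]
print(chosen, selector.predictor_calls)
```

With `interval=3`, the first three prompts of a task share one prediction and the fourth triggers a new one, so the predictor runs twice for four prompts. How eMoE actually chooses the interval and decides which tasks are insensitive is described in the paper, not here.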
Related papers
- Remoe: Towards Efficient and Low-Cost MoE Inference in Serverless Computing [29.98726492279776]
Mixture-of-Experts (MoE) has become a dominant architecture in large language models.
MoE incurs high inference costs due to memory-intensive parameter caching.
We propose Remoe, a heterogeneous MoE inference system tailored for serverless computing.
arXiv Detail & Related papers (2025-12-21T10:27:50Z)
- BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference [11.5035097836611]
The growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity.
Prefetching aims to hide this latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency.
The critical challenge is to maintain both high inference speed and model accuracy when prefetching fails.
arXiv Detail & Related papers (2025-11-13T07:56:50Z)
- MoBiLE: Efficient Mixture-of-Experts Inference on Consumer GPU with Mixture of Big Little Experts [17.518573710849513]
MoBiLE is a plug-and-play offloading-based MoE inference framework with a mixture of big-little experts.
MoBiLE achieves a speedup of 1.60x to 1.72x compared to the baseline on a consumer GPU system, with negligible degradation in accuracy.
arXiv Detail & Related papers (2025-10-14T10:22:44Z)
- MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation.
We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z)
- Enabling MoE on the Edge via Importance-Driven Expert Scheduling [21.860330824352527]
MoE is a key technique for scaling Large Language Models by activating only a subset of experts per query.
We leverage expert importance to guide decisions, substituting low-cached activated experts with functionally similar ones already cached in GPU memory.
This design reduces memory usage and data transfer, while largely eliminating PCIe overhead.
arXiv Detail & Related papers (2025-08-26T12:32:09Z)
- PreMoe: Lightening MoEs on Constrained Memory by Expert Pruning and Retrieval [36.9586523272496]
Mixture-of-experts (MoE) architectures enable scaling large language models (LLMs) to vast parameter counts without a proportional rise in computational costs.
However, the significant memory demands of large MoE models hinder their deployment across various computational environments.
We introduce PreMoe, a novel framework that enables efficient deployment of massive MoE models in memory-constrained environments.
arXiv Detail & Related papers (2025-05-23T08:59:16Z)
- D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving [14.607254882119507]
The mixture-of-experts (MoE) model is a sparse variant of large language models (LLMs).
Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices.
We propose D$^{2}$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most proper bit-width to each expert.
arXiv Detail & Related papers (2025-04-17T05:37:35Z)
- Mixture of Lookup Experts [63.787712153454464]
Mixture-of-Experts (MoE) activates only a subset of experts during inference.
MoLE is a new MoE architecture that is efficient in both communication and VRAM usage.
arXiv Detail & Related papers (2025-03-20T02:31:57Z)
- ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
Mixture-of-Experts (MoE) Transformer, the backbone of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token.
We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
arXiv Detail & Related papers (2025-03-10T03:15:54Z)
- fMoE: Fine-Grained Expert Offloading for Large Mixture-of-Experts Serving [9.956997242640728]
fMoE is a fine-grained expert offloading system for MoE serving.
We show that fMoE reduces inference latency by 47% and improves expert hit rate by 36% over state-of-the-art solutions.
arXiv Detail & Related papers (2025-02-07T22:51:17Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- AdapMoE: Adaptive Sensitivity-based Expert Gating and Management for Efficient MoE Inference [13.263938935671646]
AdapMoE is an algorithm-system co-design framework for efficient MoE inference.
AdapMoE features adaptive expert gating and management to reduce the on-demand loading overheads.
We show AdapMoE consistently outperforms existing techniques, reducing the average number of activated experts by 25% and achieving a 1.35x speedup without degrading accuracy.
arXiv Detail & Related papers (2024-08-19T03:27:15Z)
- Efficient Expert Pruning for Sparse Mixture-of-Experts Language Models: Enhancing Performance and Reducing Inference Costs [30.07344792770254]
We introduce a gradient-free evolutionary strategy named EEP (Efficient Expert Pruning) to enhance the pruning of experts in SMoE models.
EEP relies solely on model inference (i.e., no gradient computation) and achieves greater sparsity while maintaining or even improving performance on downstream tasks.
We demonstrate that pruning up to 75% of experts in Mixtral $8\times7$B-Instruct results in a substantial reduction in parameters with minimal performance loss.
arXiv Detail & Related papers (2024-07-01T03:57:35Z)
- SEER-MoE: Sparse Expert Efficiency through Regularization for Mixture-of-Experts [49.01990048827639]
We introduce SEER-MoE, a framework for reducing both the memory footprint and compute requirements of pre-trained MoE models.
The first stage involves pruning the total number of experts using a heavy-hitters counting guidance, while the second stage employs a regularization-based fine-tuning strategy to recover accuracy loss.
Our empirical studies demonstrate the effectiveness of our method, resulting in a sparse MoE model optimized for inference efficiency with minimal accuracy trade-offs.
arXiv Detail & Related papers (2024-04-07T22:13:43Z)
- Merging Experts into One: Improving Computational Efficiency of Mixture of Experts [71.44422347502409]
A sparse Mixture of Experts (MoE) can reduce the cost by activating a small subset of parameters.
Can we retain the advantages of adding more experts without substantially increasing the computational costs?
We propose a computation-efficient approach called Merging Experts into One (MEO), which reduces the computation cost to that of a single expert.
arXiv Detail & Related papers (2023-10-15T13:28:42Z)
- Task-Specific Expert Pruning for Sparse Mixture-of-Experts [105.20605021416276]
The Mixture-of-Experts (MoE) model is powerful for large-scale pre-training.
MoE is hard to deploy in cloud or mobile environments.
We propose a general method to progressively drop the non-professional experts for the target downstream task.
arXiv Detail & Related papers (2022-06-01T07:09:01Z)
This list is automatically generated from the titles and abstracts of the papers in this site.