MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
- URL: http://arxiv.org/abs/2505.11415v2
- Date: Wed, 21 May 2025 13:20:21 GMT
- Title: MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems
- Authors: Yinsicheng Jiang, Yao Fu, Yeqi Huang, Ping Nie, Zhan Lu, Leyang Xue, Congjie He, Man-Kit Sit, Jilong Xue, Li Dong, Ziming Miao, Dayou Du, Tairan Xu, Kai Zou, Edoardo Ponti, Luo Mai
- Abstract summary: The sparse MoE architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently. Existing benchmarks often fail to capture the resulting cost-accuracy-performance trade-offs accurately. We introduce MoE-CAP, a benchmark specifically designed for MoE systems.
- Score: 25.47183452671666
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: The sparse Mixture-of-Experts (MoE) architecture is increasingly favored for scaling Large Language Models (LLMs) efficiently, but it depends on heterogeneous compute and memory resources. These factors jointly affect system Cost, Accuracy, and Performance (CAP), making trade-offs inevitable. Existing benchmarks often fail to capture these trade-offs accurately, complicating practical deployment decisions. To address this, we introduce MoE-CAP, a benchmark specifically designed for MoE systems. Our analysis reveals that achieving an optimal balance across CAP is difficult with current hardware; MoE systems typically optimize two of the three dimensions at the expense of the third, a dynamic we term the MoE-CAP trade-off. To visualize this, we propose the CAP Radar Diagram. We further introduce sparsity-aware performance metrics, Sparse Memory Bandwidth Utilization (S-MBU) and Sparse Model FLOPS Utilization (S-MFU), to enable accurate performance benchmarking of MoE systems across diverse hardware platforms and deployment scenarios.
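For intuition, the sparsity-aware metrics described in the abstract can be approximated as in the sketch below. This is a minimal illustration under simplifying assumptions (batch-size-1 decoding, weights read once per decode step, roughly 2 FLOPs per active parameter per token), not the paper's reference implementation; the `MoEModelProfile` fields, the Mixtral-8x7B-like parameter counts, and the accelerator peak numbers are illustrative assumptions.

```python
# Hedged sketch of sparsity-aware utilization metrics (S-MBU / S-MFU).
# Assumptions (not from the paper): only shared weights plus the top-k
# activated experts' weights are read per decode step; ~2 FLOPs per
# active parameter per generated token; batch size 1.

from dataclasses import dataclass


@dataclass
class MoEModelProfile:
    active_experts: int     # experts activated per token (top-k)
    expert_params: float    # parameters per expert, summed over all layers
    shared_params: float    # attention/embedding/other always-active parameters
    bytes_per_param: float  # e.g. 2.0 for FP16/BF16 weights


def s_mbu(m: MoEModelProfile, tokens_per_s: float, peak_bw_gbs: float) -> float:
    """Sparse Memory Bandwidth Utilization: count only the weight bytes
    actually read per token (shared + activated experts), not all experts."""
    active_bytes = (m.shared_params + m.active_experts * m.expert_params) * m.bytes_per_param
    achieved_gbs = active_bytes * tokens_per_s / 1e9
    return achieved_gbs / peak_bw_gbs


def s_mfu(m: MoEModelProfile, tokens_per_s: float, peak_tflops: float) -> float:
    """Sparse Model FLOPS Utilization: count only FLOPs of shared layers
    plus activated experts (~2 FLOPs per active parameter per token)."""
    active_params = m.shared_params + m.active_experts * m.expert_params
    achieved_tflops = 2.0 * active_params * tokens_per_s / 1e12
    return achieved_tflops / peak_tflops


# Illustrative numbers: a top-2-of-8 MoE with roughly Mixtral-8x7B-sized
# experts, decoding 50 tokens/s on an accelerator with ~3350 GB/s memory
# bandwidth and ~989 TFLOPS dense BF16 peak.
model = MoEModelProfile(active_experts=2, expert_params=5.6e9,
                        shared_params=1.6e9, bytes_per_param=2.0)
print(f"S-MBU: {s_mbu(model, 50, 3350):.1%}")  # counts only the 2 active experts
print(f"S-MFU: {s_mfu(model, 50, 989):.2%}")
```

Counting only activated experts is the key difference from dense MBU/MFU: plugging the full parameter footprint into the same formulas would overstate both utilization figures for an MoE, which is the mismatch the paper's sparsity-aware metrics are intended to correct.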
Related papers
- Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations. We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
arXiv Detail & Related papers (2025-06-22T19:26:55Z) - Harnessing On-Device Large Language Model: Empirical Results and Implications for AI PC [8.837470787975308]
Large Language Models (LLMs) on edge devices offer significant privacy benefits. These on-device LLMs inherently face performance limitations due to reduced model capacity and necessary compression techniques. We introduce a systematic methodology -- encompassing model capability, development efficiency, and system resources -- for evaluating on-device LLMs.
arXiv Detail & Related papers (2025-05-21T02:23:01Z) - D$^{2}$MoE: Dual Routing and Dynamic Scheduling for Efficient On-Device MoE-based LLM Serving [14.607254882119507]
The Mixture-of-Experts (MoE) model is a sparse variant of large language models (LLMs). Despite its benefits, MoE is still too expensive to deploy on resource-constrained edge devices. We propose D$^2$MoE, an algorithm-system co-design framework that matches diverse task requirements by dynamically allocating the most suitable bit-width to each expert.
arXiv Detail & Related papers (2025-04-17T05:37:35Z) - COSMOS: A Hybrid Adaptive Optimizer for Memory-Efficient Training of LLMs [81.01082659623552]
Large Language Models (LLMs) have demonstrated remarkable success across various domains. Their optimization remains a significant challenge due to the complex and high-dimensional loss landscapes they inhabit.
arXiv Detail & Related papers (2025-02-24T18:42:19Z) - Federated Fine-Tuning of LLMs: Framework Comparison and Research Directions [59.5243730853157]
Federated learning (FL) provides a privacy-preserving solution for fine-tuning pre-trained large language models (LLMs) using distributed private datasets. This article conducts a comparative analysis of three advanced federated LLM (FedLLM) frameworks that integrate knowledge distillation (KD) and split learning (SL) to mitigate these issues.
arXiv Detail & Related papers (2025-01-08T11:37:06Z) - MoE-CAP: Benchmarking Cost, Accuracy and Performance of Sparse Mixture-of-Experts Systems [26.493762260392284]
MoE-CAP is a benchmarking method for evaluating sparse MoE systems. Its key innovation is a sparsity-aware CAP analysis model, the first to integrate cost, performance, and accuracy metrics into a single diagram.
arXiv Detail & Related papers (2024-12-10T00:19:28Z) - GUIDE: A Global Unified Inference Engine for Deploying Large Language Models in Heterogeneous Environments [1.0558515062670693]
Deploying large language models (LLMs) in real-world scenarios remains a critical challenge. Such deployments often suffer inefficiencies in memory utilization, latency, and throughput. We develop a framework to address these issues, achieving prediction errors between 9.9% and 42.3% for key metrics such as batch latency, TTFT, and decode throughput.
arXiv Detail & Related papers (2024-12-06T05:46:43Z) - HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models.
Our approach employs activation sparsity to extract experts.
Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - Can SAM Boost Video Super-Resolution? [78.29033914169025]
We propose a simple yet effective module, the SAM-guidEd refinEment Module (SEEM).
This lightweight plug-in module is specifically designed to leverage the attention mechanism for the generation of semantic-aware features.
We apply our SEEM to two representative methods, EDVR and BasicVSR, resulting in consistently improved performance with minimal implementation effort.
arXiv Detail & Related papers (2023-05-11T02:02:53Z)