MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
- URL: http://arxiv.org/abs/2505.05799v1
- Date: Fri, 09 May 2025 05:32:21 GMT
- Title: MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design
- Authors: Haojie Duanmu, Xiuhong Li, Zhihang Yuan, Size Zheng, Jiangfei Duan, Xingcheng Zhang, Dahua Lin
- Abstract summary: MxMoE is a mixed-precision optimization framework for Mixture-of-Experts (MoE) models. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations.
- Score: 41.7649957078564
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) models face deployment challenges due to their large parameter counts and computational demands. We explore quantization for MoE models and highlight two key insights: 1) linear blocks exhibit varying quantization sensitivity, and 2) divergent expert activation frequencies create heterogeneous computational characteristics. Based on these observations, we introduce MxMoE, a mixed-precision optimization framework for MoE models that considers both algorithmic and system perspectives. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations. Additionally, MxMoE automatically generates optimized mixed-precision GroupGEMM kernels, enabling parallel execution of GEMMs with different precisions. Evaluations show that MxMoE outperforms existing methods, achieving 2.4 lower Wikitext-2 perplexity than GPTQ at 2.25-bit and delivering up to 3.4x speedup over full precision, as well as up to 29.4% speedup over uniform quantization at equivalent accuracy with 5-bit weight-activation quantization. Our code is available at https://github.com/cat538/MxMoE.
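The mechanism the abstract describes (sensitivity- and frequency-aware bit allocation, then precision-grouped GEMM execution) can be illustrated with a minimal Python sketch. This is not MxMoE's actual search procedure or kernel generator; the candidate bit-widths, the greedy scoring heuristic, and the field names `sensitivity`, `act_freq`, and `params` are assumptions made purely for illustration.

```python
# Illustrative sketch of sensitivity- and frequency-aware bit allocation for the
# linear blocks of an MoE model, followed by grouping GEMMs by precision.
# This is NOT MxMoE's actual algorithm; the heuristic and fields are assumed.
from collections import defaultdict

CANDIDATE_BITS = [2, 3, 4, 8]  # assumed candidate weight precisions


def allocate_bits(blocks, avg_bit_budget):
    """blocks: list of dicts with 'name', 'sensitivity', 'act_freq', 'params'.
    Greedily give more bits to blocks that are sensitive and frequently
    activated, while keeping the parameter-weighted average within budget."""
    bits = {b["name"]: min(CANDIDATE_BITS) for b in blocks}
    total_params = sum(b["params"] for b in blocks)

    def avg_bits():
        return sum(bits[b["name"]] * b["params"] for b in blocks) / total_params

    # Heuristic importance score (not the paper's objective).
    ranked = sorted(blocks, key=lambda b: b["sensitivity"] * b["act_freq"],
                    reverse=True)
    for b in ranked:
        for cand in CANDIDATE_BITS:
            if cand <= bits[b["name"]]:
                continue
            previous = bits[b["name"]]
            bits[b["name"]] = cand
            if avg_bits() > avg_bit_budget:
                bits[b["name"]] = previous  # revert: budget exceeded
                break
    return bits


def group_by_precision(blocks, bits):
    """Bucket expert GEMMs by assigned precision so each bucket can run as one
    grouped GEMM with a kernel specialized for that bit-width."""
    groups = defaultdict(list)
    for b in blocks:
        groups[bits[b["name"]]].append(b["name"])
    return dict(groups)
```

A real system would replace the greedy heuristic with MxMoE's accuracy/performance co-optimization and dispatch each precision bucket to a generated mixed-precision GroupGEMM kernel.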
Related papers
- Routing Mamba: Scaling State Space Models with Mixture-of-Experts Projection [88.47928738482719]
Linear State Space Models (SSMs) offer remarkable performance gains in sequence modeling. Recent advances, such as Mamba, further enhance SSMs with input-dependent gating and hardware-aware implementations. We introduce Routing Mamba (RoM), a novel approach that scales SSM parameters using sparse mixtures of linear projection experts.
arXiv Detail & Related papers (2025-06-22T19:26:55Z) - MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance [10.817003682434425]
Mixture-of-Experts (MoE) large language models (LLMs) leverage dynamic routing and sparse activation to enhance efficiency and scalability. Post-training quantization (PTQ) encounters severe accuracy degradation and diminished performance when applied to MoE models. This paper investigates the impact of MoE's sparse and dynamic characteristics on quantization.
arXiv Detail & Related papers (2025-05-02T08:51:55Z) - MiLo: Efficient Quantized MoE Inference with Mixture of Low-Rank Compensators [17.024303421560578]
We introduce MiLo, a novel method that augments highly quantized MoEs with a mixture of low-rank compensators. MiLo does not rely on calibration data, allowing it to generalize to different MoE models and datasets without overfitting to a calibration set. Our evaluation shows that MiLo outperforms existing methods on SoTA MoE models across various tasks.
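One plausible reading of a low-rank compensator is a rank-r term that absorbs the residual left by aggressive quantization, i.e. W ≈ fake_quant(W) + A·B. The sketch below builds such a term from a truncated SVD of the residual; the `fake_quant` helper, the per-tensor symmetric scheme, and the residual-SVD construction are illustrative assumptions, not necessarily MiLo's exact method.

```python
# Hedged sketch: augment a heavily quantized weight with a rank-r compensator so
# that W is approximated by fake_quant(W) + A @ B. The residual-SVD construction
# below is an illustration, not necessarily MiLo's exact method.
import numpy as np


def fake_quant(w, bits=3):
    """Symmetric per-tensor fake quantization (illustrative, not MiLo's scheme)."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.round(w / scale).clip(-qmax - 1, qmax) * scale


def low_rank_compensator(w, bits=3, rank=16):
    """Return the quantized weight plus factors A (out x r) and B (r x in) that
    absorb most of the quantization residual."""
    w_q = fake_quant(w, bits)
    residual = w - w_q
    u, s, vt = np.linalg.svd(residual, full_matrices=False)
    a = u[:, :rank] * s[:rank]
    b = vt[:rank, :]
    return w_q, a, b


# Usage: w_q + a @ b approximates the original weight while storing only
# r * (out + in) extra values on top of the low-bit tensor.
```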
arXiv Detail & Related papers (2025-04-03T14:54:17Z) - MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness [12.059149430757863]
Mixture-of-Experts (MoE) has become a mainstream form of Large Language Models (LLMs). MoQa decouples the data-model distribution complexity of MoEs in multiple analysis stages. Experiments show that MoQa achieves a 1.69-2.18 perplexity decrease in language modeling tasks and a 1.58%-8.91% accuracy improvement in zero-shot inference tasks.
arXiv Detail & Related papers (2025-03-27T03:52:25Z) - EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to a 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z) - Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with little accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z) - Mixture of Experts with Mixture of Precisions for Tuning Quality of Service [0.0]
This paper presents an adaptive serving approach for the efficient deployment of MoE models.
By dynamically determining the number of quantized experts, we offer a fine-grained range of configurations for tuning throughput and model quality.
Results highlight the practical applicability of our approach in dynamic and accuracy-sensitive applications.
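A hedged sketch of the throughput/quality knob described above: given a memory budget, pick how many experts must be served in low precision while the rest stay at full precision for quality. The memory model and the function `pick_quantized_expert_count` are hypothetical and not the paper's serving policy.

```python
# Hypothetical sketch: choose how many experts to serve in low precision so the
# expert weights fit a memory budget; fewer quantized experts preserves quality.
# The simple per-expert memory model below is an assumption, not the paper's.
def pick_quantized_expert_count(num_experts, fp16_bytes_per_expert,
                                int4_bytes_per_expert, memory_budget_bytes):
    for k in range(num_experts + 1):  # k = number of experts quantized to INT4
        used = (k * int4_bytes_per_expert
                + (num_experts - k) * fp16_bytes_per_expert)
        if used <= memory_budget_bytes:
            return k
    return num_experts  # even fully quantized, the budget may still be tight
```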
arXiv Detail & Related papers (2024-07-19T15:42:49Z) - QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts [47.01697456105496]
Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models. MoE suffers from significant memory overheads due to the vast parameter size. Post-training quantization offers a powerful approach for model compression.
arXiv Detail & Related papers (2024-06-12T12:44:48Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group level. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
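A hedged sketch of salience-driven, group-wise allocation: promote the most salient weight groups to a higher bit-width until a target average bit-width is reached. The two-level {2, 4}-bit scheme and the ranking rule are illustrative stand-ins for SliM-LLM's allocator, not its actual formulation.

```python
# Hedged sketch of salience-driven, group-wise bit allocation: the most salient
# groups get the higher bit-width until the average hits the target budget.
# The two-level {2, 4}-bit scheme is an illustrative stand-in for SliM-LLM.
import numpy as np


def allocate_group_bits(salience, low_bit=2, high_bit=4, target_avg=3.0):
    """salience: 1-D array with one importance score per weight group."""
    n = len(salience)
    bits = np.full(n, low_bit)
    # Number of groups to promote so the average bit-width equals target_avg.
    n_high = int(round(n * (target_avg - low_bit) / (high_bit - low_bit)))
    promoted = np.argsort(salience)[::-1][:n_high]
    bits[promoted] = high_bit
    return bits
```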
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation [104.0979785739202]
Mixture-of-Expert (MoE) models have obtained state-of-the-art performance in Neural Machine Translation (NMT) tasks.
Existing MoE models mostly consider a homogeneous design where the same number of experts of the same size are placed uniformly throughout the network.
We develop AutoMoE, a framework for designing heterogeneous MoEs under computational constraints.
arXiv Detail & Related papers (2022-10-14T05:32:17Z) - Quaternion Factorization Machines: A Lightweight Solution to Intricate Feature Interaction Modelling [76.89779231460193]
The factorization machine (FM) is capable of automatically learning high-order interactions among features to make predictions without the need for manual feature engineering.
We propose the quaternion factorization machine (QFM) and quaternion neural factorization machine (QNFM) for sparse predictive analytics.
arXiv Detail & Related papers (2021-04-05T00:02:36Z)