MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness
- URL: http://arxiv.org/abs/2503.21135v1
- Date: Thu, 27 Mar 2025 03:52:25 GMT
- Title: MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness
- Authors: Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun Liang, Xiang Chen
- Abstract summary: Mix-of-Experts (MoE) has become the main form of Large Language Models (LLMs). MoQa decouples the data-model distribution complexity of MoEs in multiple analysis stages. Experiments show that MoQa achieves a 1.69~2.18 perplexity decrease in language modeling tasks and a 1.58%~8.91% accuracy improvement in zero-shot inference tasks.
- Score: 12.059149430757863
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advances in artificial intelligence, Mix-of-Experts (MoE) has become the main form of Large Language Models (LLMs), and its demand for model compression is increasing. Quantization is an effective method that not only compresses the models but also significantly accelerates their performance. Existing quantization methods have gradually shifted the focus from parameter scaling to the analysis of data distributions. However, their analysis is designed for dense LLMs and relies on the simple one-model-all-data mapping, which is unsuitable for MoEs. This paper proposes a new quantization framework called MoQa. MoQa decouples the data-model distribution complexity of MoEs in multiple analysis stages, quantitatively revealing the dynamics during sparse data activation, data-parameter mapping, and inter-expert correlations. Based on these, MoQa identifies particular experts' and parameters' significance with optimal data-model distribution awareness and proposes a series of fine-grained mix-quantization strategies adaptive to various data activation and expert combination scenarios. Moreover, MoQa discusses the limitations of existing quantization and analyzes the impact of each analysis stage, showing novel insights for MoE quantization. Experiments show that MoQa achieves a 1.69~2.18 perplexity decrease in language modeling tasks and a 1.58%~8.91% accuracy improvement in zero-shot inference tasks. We believe MoQa will play a role in future MoE construction, optimization, and compression.
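The abstract does not disclose MoQa's concrete procedure, so the sketch below only illustrates the general idea of mixed-precision MoE quantization driven by calibration statistics: estimate how often data activates each expert, keep the most significant experts at higher precision, and quantize the rest more aggressively. The function names, the frequency-based significance proxy, and all thresholds are illustrative assumptions, not MoQa's algorithm.

```python
# Minimal mixed-precision MoE quantization sketch (illustrative, not MoQa).
import numpy as np

def expert_activation_frequency(router_logits: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Fraction of calibration tokens routed to each expert under top-k routing."""
    num_experts = router_logits.shape[1]
    top_experts = np.argsort(-router_logits, axis=1)[:, :top_k]
    counts = np.bincount(top_experts.ravel(), minlength=num_experts)
    return counts / counts.sum()

def assign_bit_widths(freq: np.ndarray, high_bits: int = 4, low_bits: int = 2,
                      keep_ratio: float = 0.25) -> np.ndarray:
    """Keep the most frequently activated experts at higher precision."""
    num_high = max(1, round(keep_ratio * len(freq)))
    bits = np.full(len(freq), low_bits)
    bits[np.argsort(-freq)[:num_high]] = high_bits
    return bits

def quantize_symmetric(w: np.ndarray, bits: int) -> np.ndarray:
    """Plain symmetric round-to-nearest quantization of one weight tensor."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.abs(w).max() / qmax
    return np.clip(np.round(w / scale), -qmax, qmax) * scale

# Toy calibration run: 8 experts, router logits for 1024 tokens.
rng = np.random.default_rng(0)
freq = expert_activation_frequency(rng.normal(size=(1024, 8)))
bits = assign_bit_widths(freq)
experts = [rng.normal(size=(16, 16)) for _ in range(8)]
quantized = [quantize_symmetric(w, b) for w, b in zip(experts, bits)]
print("per-expert bits:", bits)
```

In the actual framework, the significance signal would come from the multi-stage data-model distribution analysis (data-parameter mapping and inter-expert correlations) rather than raw routing frequency alone.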
Related papers
- Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models [10.623996218106564]
We introduce a novel parameterization methodology that facilitates the mapping of specific experts into a shared latent space.
All expert operations are systematically decomposed into two principal components: a shared projection into a lower-dimensional latent space, followed by expert-specific transformations.
This factorized approach substantially diminishes parameter count and computational requirements.
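As a rough illustration of the factorization described above (the dimensions, names, and single-projection layout below are assumptions, not the paper's implementation), every expert can share one down-projection into the latent space while only a small expert-specific map differs per expert, which is where the parameter savings come from:

```python
# Hedged sketch of a shared-latent-space expert factorization (illustrative only).
import numpy as np

d_model, d_latent, num_experts = 64, 16, 8
rng = np.random.default_rng(0)

W_shared = rng.normal(size=(d_model, d_latent)) / np.sqrt(d_model)                # shared projection
W_expert = rng.normal(size=(num_experts, d_latent, d_model)) / np.sqrt(d_latent)  # per-expert maps

def latent_expert(x: np.ndarray, expert_id: int) -> np.ndarray:
    """x: (batch, d_model) -> (batch, d_model) through one factorized expert."""
    z = x @ W_shared                # shared low-dimensional latent code
    return z @ W_expert[expert_id]  # cheap expert-specific transformation

x = rng.normal(size=(4, d_model))
print(latent_expert(x, expert_id=3).shape)  # (4, 64)

# Parameter count versus standard d_model x d_model experts.
standard = num_experts * d_model * d_model
factored = d_model * d_latent + num_experts * d_latent * d_model
print(f"standard: {standard} params, factored: {factored} params")
```

The factored form shrinks the expert parameters whenever d_latent is well below d_model, at the cost of forcing all experts through the same latent bottleneck.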
arXiv Detail & Related papers (2025-03-29T14:35:34Z) - Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size [0.0]
We present a novel approach to selective model quantization using Entropy-Weighted Quantization (EWQ).
EWQ determines which blocks can be safely quantized without causing significant performance degradation, independent of model architecture or size.
Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy scores within 0.5% of unquantized models.
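The summary above does not spell out how EWQ scores blocks, so the sketch below only illustrates the generic idea of entropy-weighted block selection using a weight-histogram entropy; the scoring rule, bin count, threshold, and the "lower entropy is safer" heuristic are all assumptions rather than the paper's method.

```python
# Generic entropy-based block selection sketch (illustrative, not the EWQ paper's rule).
import numpy as np

def weight_entropy(weights: np.ndarray, num_bins: int = 256) -> float:
    """Shannon entropy (bits) of a histogram over one block's weights."""
    hist, _ = np.histogram(weights, bins=num_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def blocks_safe_to_quantize(block_weights: list, threshold: float) -> list:
    """Indices of blocks whose entropy score falls below the threshold."""
    return [i for i, w in enumerate(block_weights) if weight_entropy(w) < threshold]

# Toy model: 12 "blocks" whose weight distributions have different shapes.
rng = np.random.default_rng(0)
blocks = [rng.laplace(scale=0.01, size=10_000) if i % 2 == 0
          else rng.normal(scale=0.01, size=10_000) for i in range(12)]
for i, w in enumerate(blocks):
    print(f"block {i}: entropy = {weight_entropy(w):.2f} bits")
print("quantize:", blocks_safe_to_quantize(blocks, threshold=6.7))
```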
arXiv Detail & Related papers (2025-03-06T18:54:32Z) - Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z) - QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts [47.01697456105496]
Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models.
MoE suffers from significant memory overheads due to the vast parameter size.
Post-training quantization offers a powerful approach for model compression.
arXiv Detail & Related papers (2024-06-12T12:44:48Z) - Analysis of Atom-level pretraining with Quantum Mechanics (QM) data for Graph Neural Networks Molecular property models [0.0]
We show how atom-level pretraining with quantum mechanics (QM) data can mitigate violations of assumptions regarding the distributional similarity between training and test data.
This is the first time that hidden state molecular representations are analyzed to compare the effects of molecule-level and atom-level pretraining on QM data.
arXiv Detail & Related papers (2024-05-23T17:51:05Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate the resource demands of large language models by compressing and accelerating them.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z) - MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore accelerating large-model inference via conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
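A heavily simplified sketch of this idea follows (not the authors' implementation: the contiguous neuron grouping and the activation-sum routing below are naive stand-ins; the paper groups co-activating neurons and trains a router). It partitions a dense ReLU FFN's hidden neurons into expert groups and evaluates only a few groups per token.

```python
# Simplified MoEfication-style sketch: split a dense FFN into expert slices (illustrative).
import numpy as np

d_model, d_ff, num_experts = 32, 128, 8
neurons_per_expert = d_ff // num_experts
rng = np.random.default_rng(0)
W1 = 0.1 * rng.normal(size=(d_model, d_ff))
W2 = 0.1 * rng.normal(size=(d_ff, d_model))

# Naive grouping: contiguous neuron slices as experts (the paper clusters co-activating neurons).
expert_slices = [slice(i * neurons_per_expert, (i + 1) * neurons_per_expert)
                 for i in range(num_experts)]

def moefied_ffn(x: np.ndarray, top_k: int = 2) -> np.ndarray:
    """Run only the top-k expert slices whose neurons respond most to x."""
    hidden = np.maximum(x @ W1, 0.0)  # full activations, used here only to score experts;
                                      # the paper trains a router, so this full pass is avoidable
    scores = np.array([hidden[s].sum() for s in expert_slices])
    out = np.zeros(d_model)
    for e in np.argsort(-scores)[:top_k]:
        s = expert_slices[e]
        out += np.maximum(x @ W1[:, s], 0.0) @ W2[s, :]  # compute only the selected experts
    return out

x = rng.normal(size=d_model)
dense_out = np.maximum(x @ W1, 0.0) @ W2
print("top-2 approximation error:", np.linalg.norm(moefied_ffn(x) - dense_out))
```

With top_k equal to num_experts the output matches the dense FFN exactly; shrinking top_k trades a small approximation error for proportionally less computation, which is the effect MoEfication exploits.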
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox, where "scale" metrics perform well overall but poorly on sub-partitions of the data.
We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
arXiv Detail & Related papers (2021-06-01T19:19:49Z) - Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts [2.794896499906838]
We consider the class of softmax-gated Gaussian MoE (SGMoE) models, i.e., MoE models with softmax gating functions and Gaussian experts.
To the best of our knowledge, we are the first to investigate the $l_1$-regularization properties of SGMoE models from a non-asymptotic perspective.
We provide a lower bound on the regularization parameter of the Lasso penalty that ensures non-asymptotic theoretical control of the Kullback--Leibler loss of the Lasso estimator for SGMoE models.
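For orientation, an $l_1$-penalized maximum-likelihood estimator for a gated mixture-of-experts conditional density has roughly the following shape; the notation below is generic and illustrative rather than the paper's exact formulation, which places the penalty on specific gating and expert parameters under its own conditions.

```latex
% Generic Lasso-penalized MLE for an MoE conditional density g_psi(y|x), where psi
% collects the softmax-gating and Gaussian-expert parameters and lambda is the
% regularization parameter whose admissible lower bound the paper characterizes.
\hat{\psi} \in \operatorname*{arg\,min}_{\psi}
  \; -\frac{1}{n} \sum_{i=1}^{n} \log g_{\psi}(y_i \mid x_i)
  \; + \; \lambda \, \lVert \psi \rVert_{1}
```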
arXiv Detail & Related papers (2020-09-22T15:23:35Z)