MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness
- URL: http://arxiv.org/abs/2503.21135v2
- Date: Sat, 17 May 2025 12:20:30 GMT
- Title: MoQa: Rethinking MoE Quantization with Multi-stage Data-model Distribution Awareness
- Authors: Zihao Zheng, Xiuping Cui, Size Zheng, Maoliang Li, Jiayu Chen, Yun Liang, Xiang Chen,
- Abstract summary: Mix-of-Experts (MoE) has become the main form of Large Language Models (LLMs). MoQa is an expert-level mix-precision base quantization with distribution awareness. MoQa achieves a 2.74~6.44 PPL decrease and 1.85%~3.77% average accuracy improvements on novel distributions.
- Score: 8.021289706876606
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: With the advances in artificial intelligence, Mix-of-Experts (MoE) has become the main form of Large Language Models (LLMs), and its demand for model compression is increasing. Quantization is an effective method that not only compresses models but also significantly accelerates their inference. Existing quantization methods have gradually shifted their focus from parameter scaling to the analysis of data distributions. However, their analysis is designed for dense LLMs and is suboptimal for MoE quantization due to MoEs' complex data-model distribution. To address this problem, we decouple the complexity of MoEs' data-model distribution into a multi-stage analysis and reveal MoEs' inherent dynamics. The analysis shows that the expert performance of an MoE varies dynamically both within and across data distributions. Based on these findings, we design two quantization strategies with data-model distribution awareness and integrate them into an end-to-end framework for MoE quantization, named MoQa. MoQa uses an expert-level mix-precision base quantization with distribution awareness. Moreover, MoQa uses a channel-level quantization adjustment to dynamically adjust expert performance and adapt to novel distributions. Experiments show that MoQa's base quantization achieves a 0.49~8.51 PPL decrease on known distributions. With the adjustments, MoQa achieves a 2.74~6.44 PPL decrease and 1.85%~3.77% average accuracy improvements on novel distributions. We believe MoQa will play a role in future MoE construction, optimization, and compression.
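As a rough illustration of the first strategy, the sketch below assigns per-expert bit-widths from expert activation frequencies measured on calibration data. This is a hypothetical reading of "expert-level mix-precision base quantization with distribution awareness", not MoQa's published algorithm; the greedy allocation rule, the function names, and the 2/3/4-bit palette are all assumptions.

```python
# Minimal sketch of expert-level mixed-precision bit allocation for an MoE layer.
# Hypothetical: uses expert activation frequency on calibration data as the
# "data-model distribution" signal; MoQa's actual criterion is more elaborate.
import numpy as np

def allocate_expert_bits(activation_freq, avg_bits=3.0, choices=(2, 3, 4)):
    """Assign a bit-width per expert so the mean stays near `avg_bits`,
    giving more bits to experts that fire more often on calibration data."""
    n = len(activation_freq)
    order = np.argsort(activation_freq)[::-1]          # most-used experts first
    bits = np.full(n, min(choices), dtype=int)
    budget = avg_bits * n - bits.sum()                 # extra bits we may spend
    for idx in order:                                  # spend the budget greedily
        step = min(max(choices) - bits[idx], int(budget))
        bits[idx] += step
        budget -= step
        if budget <= 0:
            break
    return bits

def quantize_expert(weight, n_bits):
    """Symmetric round-to-nearest quantization of one expert's weight matrix."""
    scale = np.abs(weight).max() / (2 ** (n_bits - 1) - 1)
    q = np.clip(np.round(weight / scale), -(2 ** (n_bits - 1)), 2 ** (n_bits - 1) - 1)
    return q * scale                                   # dequantized (fake-quant) weights

# Toy usage: 8 experts, activation frequencies measured on a calibration set.
freq = np.array([0.30, 0.22, 0.15, 0.11, 0.09, 0.06, 0.04, 0.03])
bits = allocate_expert_bits(freq, avg_bits=3.0)
experts = [np.random.randn(64, 64) for _ in freq]
quantized = [quantize_expert(w, b) for w, b in zip(experts, bits)]
print(bits, bits.mean())
```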
Related papers
- MPQ-DMv2: Flexible Residual Mixed Precision Quantization for Low-Bit Diffusion Models with Temporal Distillation [74.34220141721231]
We present MPQ-DMv2, an improved Mixed Precision Quantization framework for extremely low-bit Diffusion Models.
arXiv Detail & Related papers (2025-07-06T08:16:50Z) - MxMoE: Mixed-precision Quantization for MoE with Accuracy and Performance Co-Design [41.7649957078564]
MxMoE is a mixed-precision optimization framework for Mixture-of-Experts (MoE) models. MxMoE navigates the design space defined by parameter sensitivity, expert activation dynamics, and hardware resources to derive efficient mixed-precision configurations.
arXiv Detail & Related papers (2025-05-09T05:32:21Z) - 3D Gaussian Splatting Data Compression with Mixture of Priors [23.015728369640136]
3DGS data compression is crucial for enabling efficient storage and transmission in 3D scene modeling. We propose a novel Mixture of Priors (MoP) strategy to address these two challenges. Our proposed 3DGS data compression framework achieves state-of-the-art performance across multiple benchmarks.
arXiv Detail & Related papers (2025-05-06T08:42:39Z) - MoEQuant: Enhancing Quantization for Mixture-of-Experts Large Language Models via Expert-Balanced Sampling and Affinity Guidance [10.817003682434425]
Mixture-of-Experts (MoE) large language models (LLMs) leverage dynamic routing and sparse activation to enhance efficiency and scalability. Post-training quantization (PTQ) encounters severe accuracy degradation and diminished performance when applied to MoE models. This paper investigates the impact of MoE's sparse and dynamic characteristics on quantization.
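One way to read "expert-balanced sampling" is to pick calibration sequences so that every expert receives enough routed tokens before PTQ statistics are collected. The sketch below is an illustrative greedy selection under that assumption, not MoEQuant's actual algorithm; the function name and scoring rule are hypothetical.

```python
# Illustrative sketch of expert-balanced calibration sampling (not MoEQuant's
# exact method): pick calibration sequences so every expert sees enough
# routed tokens before quantization statistics are gathered.
import numpy as np

def expert_balanced_selection(routing_counts, n_select):
    """routing_counts: (n_samples, n_experts) tokens each sample routes to each expert.
    Greedily add the sample that most helps the currently least-covered expert."""
    n_samples, n_experts = routing_counts.shape
    covered = np.zeros(n_experts)
    chosen, remaining = [], set(range(n_samples))
    for _ in range(n_select):
        starved = np.argmin(covered)                    # expert with the least coverage
        best = max(remaining, key=lambda i: routing_counts[i, starved])
        chosen.append(best)
        remaining.remove(best)
        covered += routing_counts[best]
    return chosen, covered

# Toy usage: 100 candidate sequences, 8 experts, skewed routing statistics.
rng = np.random.default_rng(0)
counts = rng.poisson(lam=rng.uniform(1, 20, size=8), size=(100, 8))
idx, covered = expert_balanced_selection(counts, n_select=16)
print(sorted(idx), covered)
```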
arXiv Detail & Related papers (2025-05-02T08:51:55Z) - Beyond Standard MoE: Mixture of Latent Experts for Resource-Efficient Language Models [10.623996218106564]
We introduce a novel parameterization methodology that facilitates the mapping of specific experts into a shared latent space.
All expert operations are systematically decomposed into two principal components: a shared projection into a lower-dimensional latent space, followed by expert-specific transformations.
This factorized approach substantially diminishes parameter count and computational requirements.
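The decomposition described above maps naturally onto a small module: a down-projection shared by all experts, followed by a per-expert transform in the latent space. The PyTorch sketch below illustrates the parameter savings; the shared up-projection, activation choice, and dimensions are assumptions for illustration, not the paper's exact architecture.

```python
# Minimal sketch (PyTorch) of the factorization described above: every expert is
# a shared down-projection into a latent space followed by a small
# expert-specific transform, instead of a full-width FFN per expert.
import torch
import torch.nn as nn

class LatentExpertFFN(nn.Module):
    def __init__(self, d_model=1024, d_latent=128, n_experts=8):
        super().__init__()
        self.down = nn.Linear(d_model, d_latent)              # shared across experts
        self.expert_transforms = nn.ModuleList(
            [nn.Linear(d_latent, d_latent) for _ in range(n_experts)]
        )
        self.up = nn.Linear(d_latent, d_model)                # shared up-projection (assumption)
        self.act = nn.GELU()

    def forward(self, x, expert_id):
        z = self.act(self.down(x))                            # shared latent projection
        z = self.act(self.expert_transforms[expert_id](z))    # expert-specific part
        return self.up(z)

ffn = LatentExpertFFN()
tokens = torch.randn(4, 1024)
out = ffn(tokens, expert_id=3)
print(out.shape)                                              # torch.Size([4, 1024])
# Each added expert costs only d_latent**2 + d_latent parameters,
# versus roughly 2 * d_model * d_ffn for a standard dense expert.
```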
arXiv Detail & Related papers (2025-03-29T14:35:34Z) - Universality of Layer-Level Entropy-Weighted Quantization Beyond Model Architecture and Size [0.0]
We present a novel approach to selective model quantization using Entropy-Weighted Quantization (EWQ). EWQ determines which blocks can be safely quantized without causing significant performance degradation, independent of model architecture or size. Our method outperforms uniform quantization approaches, maintaining Massive Multitask Language Understanding (MMLU) accuracy scores within 0.5% of unquantized models.
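A hedged sketch of the general idea: score each transformer block by the Shannon entropy of its weight-value histogram and quantize blocks in entropy order until a target fraction is reached. EWQ's actual entropy measure and selection direction are not given in the summary, so both are assumptions here.

```python
# Hedged sketch of entropy-weighted block selection: score each block by the
# Shannon entropy of its weight-value histogram, then quantize blocks in
# entropy order up to a target fraction.  Which end of the ranking counts as
# "safe to quantize" is an assumption; EWQ defines its own criterion.
import numpy as np

def weight_entropy(weights, n_bins=256):
    """Shannon entropy (bits) of the empirical weight-value distribution."""
    hist, _ = np.histogram(weights.ravel(), bins=n_bins)
    p = hist / hist.sum()
    p = p[p > 0]
    return float(-(p * np.log2(p)).sum())

def select_blocks_to_quantize(block_weights, quantize_fraction=0.5):
    """block_weights: dict block_name -> flat weight array.
    Returns the block names chosen for quantization (lowest-entropy first)."""
    scores = {name: weight_entropy(w) for name, w in block_weights.items()}
    ranked = sorted(scores, key=scores.get)              # ascending entropy (assumption)
    k = int(round(quantize_fraction * len(ranked)))
    return ranked[:k], scores

# Toy usage: 6 "blocks" with different weight statistics.
rng = np.random.default_rng(1)
blocks = {f"layer{i}": rng.normal(scale=0.5 + 0.2 * i, size=10_000) for i in range(6)}
chosen, scores = select_blocks_to_quantize(blocks, quantize_fraction=0.5)
print(chosen)
```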
arXiv Detail & Related papers (2025-03-06T18:54:32Z) - Pushing the Limits of Large Language Model Quantization via the Linearity Theorem [71.3332971315821]
We present a "line theoremarity" establishing a direct relationship between the layer-wise $ell$ reconstruction error and the model perplexity increase due to quantization.
This insight enables two novel applications: (1) a simple data-free LLM quantization method using Hadamard rotations and MSE-optimal grids, dubbed HIGGS, and (2) an optimal solution to the problem of finding non-uniform per-layer quantization levels.
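The sketch below illustrates the data-free pipeline named here: a randomized Hadamard rotation makes weight entries approximately Gaussian, after which each group is rounded to a quantization grid. For simplicity the grid is uniform with a brute-force MSE-searched scale; HIGGS itself uses MSE-optimal (non-uniform) grids for the Gaussian distribution, so treat this only as an approximation of the idea.

```python
# Hedged sketch of a HIGGS-style data-free pipeline: randomized Hadamard
# rotation, then per-row rounding to a grid whose scale is chosen to minimize MSE.
import numpy as np
from scipy.linalg import hadamard

def random_hadamard_rotate(W, rng):
    """Right-multiply W (out x in, `in` a power of 2) by a randomized Hadamard matrix."""
    n = W.shape[1]
    H = hadamard(n) / np.sqrt(n)             # orthonormal Hadamard matrix
    signs = rng.choice([-1.0, 1.0], size=n)  # random sign flips keep it orthogonal
    return W @ (H * signs), H, signs

def quantize_group(w, n_bits=4, n_scale_candidates=64):
    """Uniform symmetric quantization with a brute-force MSE-optimal scale."""
    qmax = 2 ** (n_bits - 1) - 1
    best, best_err = w, np.inf
    for s in np.linspace(np.abs(w).max() / qmax / 4, np.abs(w).max() / qmax, n_scale_candidates):
        deq = np.clip(np.round(w / s), -qmax - 1, qmax) * s
        err = np.mean((deq - w) ** 2)
        if err < best_err:
            best, best_err = deq, err
    return best

rng = np.random.default_rng(0)
W = rng.standard_t(df=4, size=(32, 256))       # heavy-tailed "weights"
W_rot, H, signs = random_hadamard_rotate(W, rng)
W_q = np.stack([quantize_group(row) for row in W_rot])
W_hat = W_q @ (H * signs).T                    # rotation is orthogonal, so transpose inverts it
print("relative error:", np.linalg.norm(W_hat - W) / np.linalg.norm(W))
```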
arXiv Detail & Related papers (2024-11-26T15:35:44Z) - Efficient and Effective Weight-Ensembling Mixture of Experts for Multi-Task Model Merging [111.8456671452411]
Multi-task learning (MTL) leverages a shared model to accomplish multiple tasks and facilitate knowledge transfer.
We propose a Weight-Ensembling Mixture of Experts (WEMoE) method for multi-task model merging.
We show that WEMoE and E-WEMoE outperform state-of-the-art (SOTA) model merging methods in terms of MTL performance, generalization, and robustness.
arXiv Detail & Related papers (2024-10-29T07:16:31Z) - QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts [47.01697456105496]
Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models. MoE suffers from significant memory overheads due to the vast parameter size. Post-training quantization offers a powerful approach for model compression.
arXiv Detail & Related papers (2024-06-12T12:44:48Z) - Analysis of Atom-level pretraining with Quantum Mechanics (QM) data for Graph Neural Networks Molecular property models [0.0]
We show how atom-level pretraining with quantum mechanics (QM) data can mitigate violations of assumptions regarding the distributional similarity between training and test data.
This is the first time that hidden state molecular representations are analyzed to compare the effects of molecule-level and atom-level pretraining on QM data.
arXiv Detail & Related papers (2024-05-23T17:51:05Z) - SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models [63.118592279833656]
Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). We propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths group-wise. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths.
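To illustrate group-wise salience-driven allocation within a single weight matrix, the sketch below approximates salience by the product of weight magnitude and per-channel activation magnitude (an AWQ-style proxy) and gives the most salient column groups a higher bit-width. SliM-LLM's actual salience metric and allocation policy may differ; the fractions and bit-widths are placeholders.

```python
# Hedged sketch of group-wise, salience-driven mixed precision within one matrix.
import numpy as np

def groupwise_mixed_precision(W, act_scale, group_size=32, high_bits=4, low_bits=2, high_frac=0.25):
    """Quantize column groups of W, giving `high_bits` to the most salient groups."""
    out_dim, in_dim = W.shape
    n_groups = in_dim // group_size
    groups = W[:, : n_groups * group_size].reshape(out_dim, n_groups, group_size)
    scales = act_scale[: n_groups * group_size].reshape(n_groups, group_size)

    salience = (np.abs(groups) * scales).mean(axis=(0, 2))       # one score per group
    n_high = int(round(high_frac * n_groups))
    high_groups = set(np.argsort(salience)[::-1][:n_high])

    deq = np.empty_like(groups)
    for g in range(n_groups):
        bits = high_bits if g in high_groups else low_bits
        qmax = 2 ** (bits - 1) - 1
        s = np.abs(groups[:, g, :]).max() / qmax
        deq[:, g, :] = np.clip(np.round(groups[:, g, :] / s), -qmax - 1, qmax) * s
    return deq.reshape(out_dim, n_groups * group_size)

rng = np.random.default_rng(2)
W = rng.normal(size=(128, 256))
act_scale = rng.uniform(0.1, 2.0, size=256)          # per-input-channel activation magnitude
W_q = groupwise_mixed_precision(W, act_scale)
print(np.linalg.norm(W_q - W) / np.linalg.norm(W))
```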
arXiv Detail & Related papers (2024-05-23T16:21:48Z) - LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit [55.73370804397226]
Quantization, a key compression technique, can effectively mitigate these demands by compressing and accelerating large language models.
We present LLMC, a plug-and-play compression toolkit, to fairly and systematically explore the impact of quantization.
Powered by this versatile toolkit, our benchmark covers three key aspects: calibration data, algorithms (three strategies), and data formats.
arXiv Detail & Related papers (2024-05-09T11:49:05Z) - Modular Quantization-Aware Training for 6D Object Pose Estimation [52.9436648014338]
Edge applications demand efficient 6D object pose estimation on resource-constrained embedded platforms.
We introduce Modular Quantization-Aware Training (MQAT), an adaptive and mixed-precision quantization-aware training strategy.
MQAT guides a systematic gradated modular quantization sequence and determines module-specific bit precisions, leading to quantized models that outperform those produced by state-of-the-art uniform and mixed-precision quantization techniques.
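The sketch below shows the general shape of a gradated, module-wise quantization-aware training loop: modules are fake-quantized one after another, each at its own bit-width, with a straight-through estimator so gradients keep flowing. The schedule, bit-widths, and toy model are placeholders, not MQAT's published configuration.

```python
# Hedged sketch of module-wise quantization-aware training with a
# straight-through estimator; schedule and bit-widths are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuant(nn.Module):
    def __init__(self, n_bits):
        super().__init__()
        self.qmax = 2 ** (n_bits - 1) - 1

    def forward(self, w):
        scale = w.abs().max() / self.qmax
        w_q = torch.clamp(torch.round(w / scale), -self.qmax - 1, self.qmax) * scale
        return w + (w_q - w).detach()        # straight-through estimator

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))
schedule = [(0, 4), (2, 8)]                  # (module index, bit-width), quantized in order
quantizers = {}
opt = torch.optim.SGD(model.parameters(), lr=1e-3)

for module_idx, bits in schedule:            # gradated sequence: add one module at a time
    quantizers[module_idx] = FakeQuant(bits)
    for _ in range(100):                     # short fine-tuning stage per module
        x, y = torch.randn(32, 64), torch.randint(0, 8, (32,))
        h = x
        for i, layer in enumerate(model):
            if isinstance(layer, nn.Linear) and i in quantizers:
                h = F.linear(h, quantizers[i](layer.weight), layer.bias)
            else:
                h = layer(h)
        loss = F.cross_entropy(h, y)
        opt.zero_grad()
        loss.backward()
        opt.step()
print("trained with module-wise fake quantization")
```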
arXiv Detail & Related papers (2023-03-12T21:01:54Z) - Compression of Generative Pre-trained Language Models via Quantization [62.80110048377957]
We find that previous quantization methods fail on generative tasks due to the homogeneous word embeddings.
We propose a token-level contrastive distillation to learn distinguishable word embeddings, and a module-wise dynamic scaling to make quantizers adaptive to different modules.
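A hedged sketch of a token-level contrastive distillation loss: each student (quantized-model) token representation is pulled toward the teacher's representation of the same token and pushed away from other tokens in the batch, InfoNCE-style. The paper's exact loss, temperature, and choice of negatives may differ.

```python
# Hedged sketch of token-level contrastive distillation between a quantized
# student and a full-precision teacher.
import torch
import torch.nn.functional as F

def token_contrastive_distillation(student_tokens, teacher_tokens, temperature=0.1):
    """student_tokens, teacher_tokens: (n_tokens, dim) hidden states for the same tokens."""
    s = F.normalize(student_tokens, dim=-1)
    t = F.normalize(teacher_tokens, dim=-1)
    logits = s @ t.T / temperature                   # similarity of every student-teacher pair
    targets = torch.arange(s.size(0), device=s.device)
    return F.cross_entropy(logits, targets)          # positive = matching token, negatives = rest

# Toy usage: 16 tokens with 64-dimensional hidden states.
teacher = torch.randn(16, 64)
student = teacher + 0.3 * torch.randn(16, 64)        # noisy "quantized" representations
loss = token_contrastive_distillation(student, teacher)
print(float(loss))
```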
arXiv Detail & Related papers (2022-03-21T02:11:35Z) - Sparse MoEs meet Efficient Ensembles [49.313497379189315]
We study the interplay of two popular classes of such models: ensembles of neural networks and sparse mixture of experts (sparse MoEs).
We present Efficient Ensemble of Experts (E$^3$), a scalable and simple ensemble of sparse MoEs that takes the best of both classes of models, while using up to 45% fewer FLOPs than a deep ensemble.
arXiv Detail & Related papers (2021-10-07T11:58:35Z) - MoEfication: Conditional Computation of Transformer Models for Efficient Inference [66.56994436947441]
Transformer-based pre-trained language models can achieve superior performance on most NLP tasks due to large parameter capacity, but also lead to huge computation cost.
We explore to accelerate large-model inference by conditional computation based on the sparse activation phenomenon.
We propose to transform a large model into its mixture-of-experts (MoE) version with equal model size, namely MoEfication.
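One way to picture this transformation: cluster the dense FFN's intermediate neurons into expert groups and, at inference, activate only the groups that respond most to a token. The sketch below uses k-means on the first FFN matrix and a cheating group score computed from the full pre-activation; MoEfication's actual splitting and expert-selection methods (e.g., a trained router that avoids the full computation) are more refined.

```python
# Hedged sketch of "MoEfying" a dense FFN by clustering its intermediate neurons.
import numpy as np
from scipy.cluster.vq import kmeans2

def split_ffn_into_experts(W1, n_experts=8, seed=0):
    """W1: (d_ffn, d_model) first FFN weight.  Returns neuron indices per expert."""
    _, labels = kmeans2(W1, n_experts, minit="++", seed=seed)
    return [np.where(labels == e)[0] for e in range(n_experts)]

def moe_ffn_forward(x, W1, W2, expert_neurons, top_k=2):
    """Run the FFN using only the top_k neuron groups for token x (d_model,)."""
    pre_act = W1 @ x                                         # in practice a trained router
    group_scores = [np.abs(pre_act[idx]).sum() for idx in expert_neurons]  # predicts this
    active = np.argsort(group_scores)[::-1][:top_k]          # pick the most responsive groups
    h = np.zeros_like(pre_act)
    for e in active:
        idx = expert_neurons[e]
        h[idx] = np.maximum(pre_act[idx], 0.0)               # ReLU on selected neurons only
    return W2 @ h                                            # (d_model,)

rng = np.random.default_rng(3)
d_model, d_ffn = 64, 256
W1, W2 = rng.normal(size=(d_ffn, d_model)), rng.normal(size=(d_model, d_ffn))
experts = split_ffn_into_experts(W1, n_experts=8)
x = rng.normal(size=d_model)
y_sparse = moe_ffn_forward(x, W1, W2, experts, top_k=2)
y_dense = W2 @ np.maximum(W1 @ x, 0.0)
print(np.linalg.norm(y_sparse - y_dense) / np.linalg.norm(y_dense))
```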
arXiv Detail & Related papers (2021-10-05T02:14:38Z) - Post-mortem on a deep learning contest: a Simpson's paradox and the complementary roles of scale metrics versus shape metrics [61.49826776409194]
We analyze a corpus of models made publicly-available for a contest to predict the generalization accuracy of neural network (NN) models.
We identify what amounts to a Simpson's paradox: "scale" metrics perform well overall but perform poorly on sub-partitions of the data.
We present two novel shape metrics, one data-independent, and the other data-dependent, which can predict trends in the test accuracy of a series of NNs.
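As a hedged illustration of the metric family rather than the paper's exact definitions, a data-independent "shape" metric can be computed from a layer's weight eigenvalue spectrum, for example the maximum-likelihood power-law exponent of its tail, as in the sketch below.

```python
# Hedged illustration of a spectral "shape" metric: fit a power-law exponent
# to the tail of a layer's weight eigenvalue spectrum (eigenvalues of W^T W)
# with the continuous-power-law MLE.  Not the paper's exact metric definition.
import numpy as np

def spectral_power_law_alpha(W, tail_fraction=0.2):
    """Return the MLE power-law exponent of the top `tail_fraction` eigenvalues."""
    eigvals = np.linalg.svd(W, compute_uv=False) ** 2     # eigenvalues of W^T W
    eigvals = np.sort(eigvals)[::-1]
    n_tail = max(2, int(tail_fraction * len(eigvals)))
    tail = eigvals[:n_tail]
    x_min = tail[-1]
    return 1.0 + n_tail / np.sum(np.log(tail / x_min))

rng = np.random.default_rng(4)
W_random = rng.normal(size=(512, 512))                    # untrained-looking layer
W_heavy = rng.normal(size=(512, 512)) * rng.pareto(2.0, size=(512, 1))  # heavier-tailed rows
print("alpha (random): %.2f" % spectral_power_law_alpha(W_random))
print("alpha (heavy-tailed): %.2f" % spectral_power_law_alpha(W_heavy))
```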
arXiv Detail & Related papers (2021-06-01T19:19:49Z) - Non-asymptotic oracle inequalities for the Lasso in high-dimensional mixture of experts [2.794896499906838]
We consider the class of softmax-gated Gaussian MoE (SGMoE) models with softmax gating functions and Gaussian experts.
To the best of our knowledge, we are the first to investigate the $\ell_1$-regularization properties of SGMoE models from a non-asymptotic perspective.
We provide a lower bound on the regularization parameter of the Lasso penalty that ensures non-asymptotic theoretical control of the Kullback--Leibler loss of the Lasso estimator for SGMoE models.
arXiv Detail & Related papers (2020-09-22T15:23:35Z)
This list is automatically generated from the titles and abstracts of the papers on this site.