Effective MoE-based LLM Compression by Exploiting Heterogeneous Inter-Group Experts Routing Frequency and Information Density
- URL: http://arxiv.org/abs/2602.09316v2
- Date: Wed, 11 Feb 2026 21:20:04 GMT
- Title: Effective MoE-based LLM Compression by Exploiting Heterogeneous Inter-Group Experts Routing Frequency and Information Density
- Authors: Zhendong Mi, Yixiao Chen, Pu Zhao, Xiaodong Yu, Hao Wang, Yanzhi Wang, Shaoyi Huang
- Abstract summary: Mixture-of-Experts (MoE) based Large Language Models (LLMs) have achieved superior performance. Massive memory overhead caused by storing multiple expert networks severely hinders their practical deployment. We propose RFID-MoE, an effective framework for MoE compression by exploiting heterogeneous Routing Frequency and Information Density.
- Score: 30.94369556247692
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) based Large Language Models (LLMs) have achieved superior performance, yet the massive memory overhead caused by storing multiple expert networks severely hinders their practical deployment. Singular Value Decomposition (SVD)-based compression has emerged as a promising post-training technique; however, most existing methods apply uniform rank allocation or rely solely on static weight properties. This overlooks the substantial heterogeneity in expert utilization observed in MoE models, where frequent routing patterns and intrinsic information density vary significantly across experts. In this work, we propose RFID-MoE, an effective framework for MoE compression by exploiting heterogeneous Routing Frequency and Information Density. We first introduce a fused metric that combines expert activation frequency with effective rank to measure expert importance, adaptively allocating higher ranks to critical expert groups under a fixed budget. Moreover, instead of discarding compression residuals, we reconstruct them via a parameter-efficient sparse projection mechanism to recover lost information with minimal parameter overhead. Extensive experiments on representative MoE LLMs (e.g., Qwen3, DeepSeekMoE) across multiple compression ratios demonstrate that RFID-MoE consistently outperforms state-of-the-art methods like MoBE and D2-MoE. Notably, RFID-MoE achieves a perplexity of 16.92 on PTB with the Qwen3-30B model at a 60% compression ratio, reducing perplexity by over 8.0 compared to baselines, and improves zero-shot accuracy on HellaSwag by approximately 8%.
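The abstract describes a fused importance metric (activation frequency combined with effective rank) that drives rank allocation under a fixed budget, but it does not give the formula. The following is a minimal sketch of that idea under stated assumptions, not the authors' implementation: the entropy-based definition of effective rank, the convex-combination fusion with a hypothetical weight `alpha`, the proportional allocation rule, and all function names are illustrative choices. Singular values are taken as plain lists; in practice they could be obtained from each expert group's weight matrix, for example with `numpy.linalg.svd`.

```python
import math


def effective_rank(singular_values):
    """Entropy-based effective rank: exp(H(p)) with p_i = s_i / sum(s).

    Assumption: this common definition stands in for the paper's
    "information density" measure, which the abstract does not specify.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


def fused_importance(frequencies, eranks, alpha=0.5):
    """Hypothetical fusion: convex combination of the normalized routing
    frequency and the normalized effective rank of each expert group."""
    f_total = sum(frequencies)
    e_total = sum(eranks)
    return [
        alpha * f / f_total + (1.0 - alpha) * e / e_total
        for f, e in zip(frequencies, eranks)
    ]


def allocate_ranks(scores, total_rank_budget, min_rank=1):
    """Split a fixed rank budget across groups in proportion to their scores.

    Every group first receives min_rank; the spare budget is shared
    proportionally and rounded with the largest-remainder rule.
    """
    n = len(scores)
    if total_rank_budget < n * min_rank:
        raise ValueError("budget too small to give every group min_rank")
    spare = total_rank_budget - n * min_rank
    total = sum(scores)
    shares = [spare * s / total for s in scores]
    ranks = [min_rank + int(share) for share in shares]
    leftover = total_rank_budget - sum(ranks)
    # Hand out the remaining ranks to the largest fractional shares.
    order = sorted(range(n), key=lambda i: shares[i] - int(shares[i]), reverse=True)
    for i in order[: max(leftover, 0)]:  # max() guards against float round-off
        ranks[i] += 1
    return ranks


# Toy example: two expert groups with made-up spectra and routing counts.
spectra = [[10.0, 8.0, 6.0, 4.0], [10.0, 0.5, 0.1, 0.1]]
routing_counts = [900, 100]
eranks = [effective_rank(s) for s in spectra]
scores = fused_importance(routing_counts, eranks, alpha=0.5)
print(allocate_ranks(scores, total_rank_budget=6))
```

In this toy setup the first group is both routed to more often and has a flatter singular spectrum, so it receives the larger share of the budget. The residual reconstruction step mentioned in the abstract is not covered by this sketch.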
Related papers
- MoDES: Accelerating Mixture-of-Experts Multimodal Large Language Models via Dynamic Expert Skipping [52.02659589971978]
We propose MoDES, the first training-free framework that adaptively skips experts to enable efficient and accurate MoE MLLM inference. MoDES significantly enhances inference speed, improving the prefilling time by 2.16$\times$ and the decoding time by 1.26$\times$.
arXiv Detail & Related papers (2025-11-19T18:48:27Z)
- PuzzleMoE: Efficient Compression of Large Mixture-of-Experts Models via Sparse Expert Merging and Bit-packed inference [17.441141633991197]
We introduce PuzzleMoE, a training-free MoE compression method that achieves high accuracy and efficient inference through two key innovations. First, PuzzleMoE performs sparse expert merging by identifying element-wise weight redundancy and specialization. Second, to avoid the overhead of storing binary masks and signs, PuzzleMoE introduces a bit-packed encoding scheme that reuses underutilized exponent bits.
arXiv Detail & Related papers (2025-11-06T20:53:02Z)
- MoBE: Mixture-of-Basis-Experts for Compressing MoE-based LLMs [25.218350580080447]
MoBE achieves notably lower accuracy drops compared to prior works. MoBE can reduce the parameter counts of Qwen3-235B-A22B-2507, DeepSeek-V3-0324 (671B) and Kimi-K2-Instruct (1T) by 24%-30% with only 1%-2% accuracy drop.
arXiv Detail & Related papers (2025-08-07T10:48:24Z)
- MoNE: Replacing Redundant Experts with Lightweight Novices for Structured Pruning of MoE [12.498106165046233]
Mixture-of-Experts (MoE) enables efficient scaling of large language models by activating only a subset of experts per input token. MoNE replaces redundant experts with lightweight novices to achieve effective and robust model compression.
arXiv Detail & Related papers (2025-07-01T03:02:59Z)
- ResMoE: Space-efficient Compression of Mixture of Experts LLMs via Residual Restoration [61.579842548990754]
Mixture-of-Experts (MoE) Transformer, the backbone of multiple phenomenal language models, leverages sparsity by activating only a fraction of model parameters for each input token. We introduce ResMoE, an innovative MoE approximation framework that utilizes Wasserstein barycenter to extract a common expert (barycenter expert) and approximate the residuals between this barycenter expert and the original ones.
arXiv Detail & Related papers (2025-03-10T03:15:54Z)
- Optimizing Singular Spectrum for Large Language Model Compression [95.7621116637755]
We introduce SoCo, a novel compression framework that learns to rescale the decomposed components of SVD in a data-driven manner. Thanks to the learnable singular spectrum, SoCo adaptively prunes components according to the sparsified importance scores. Experimental evaluations across multiple LLMs and benchmarks demonstrate that SoCo surpasses the state-of-the-art methods in model compression.
arXiv Detail & Related papers (2025-02-20T23:18:39Z)
- HOBBIT: A Mixed Precision Expert Offloading System for Fast MoE Inference [54.40808356999408]
We present HOBBIT, a mixed precision expert offloading system to enable flexible and efficient MoE inference.
Our key insight is that dynamically replacing less critical cache-miss experts with low precision versions can substantially reduce expert-loading latency.
HOBBIT achieves up to a 9.93x speedup in decoding compared to state-of-the-art MoE offloading systems.
arXiv Detail & Related papers (2024-11-03T04:25:46Z)
- MoE-I$^2$: Compressing Mixture of Experts Models through Inter-Expert Pruning and Intra-Expert Low-Rank Decomposition [32.97035551579975]
We introduce a two-stage compression method tailored for MoE to reduce the model size and decrease the computational cost.
Experiments on Qwen1.5-MoE-A2.7B, DeepSeek-V2-Lite, and Mixtral-8$\times$7B demonstrate that our proposed methods can both reduce the model size and enhance inference efficiency.
arXiv Detail & Related papers (2024-11-01T20:37:58Z)
- Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
- MoDeGPT: Modular Decomposition for Large Language Model Compression [59.361006801465344]
This paper introduces Modular Decomposition (MoDeGPT), a novel structured compression framework. MoDeGPT partitions the Transformer block into modules comprised of matrix pairs and reduces the hidden dimensions. Our experiments show MoDeGPT, without backward propagation, matches or surpasses previous structured compression methods.
arXiv Detail & Related papers (2024-08-19T01:30:14Z)
This list is automatically generated from the titles and abstracts of the papers in this site.