Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
- URL: http://arxiv.org/abs/2510.02345v1
- Date: Sat, 27 Sep 2025 10:45:58 GMT
- Title: Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression
- Authors: Peijun Zhu, Ning Yang, Jiayu Wei, Jinghang Wu, Haijun Zhang
- Abstract summary: Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively.
- Score: 14.086434595924716
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively. Our method employs an online clustering procedure that periodically regroups experts using a fused metric of parameter and activation similarity, which stabilizes expert utilization. To our knowledge, this is one of the first frameworks to leverage the semantic embedding capability of the router to dynamically reconfigure the model's architecture during training for substantial efficiency gains. Within each cluster, we decompose expert weights into a shared base matrix and extremely low-rank residual adapters, achieving up to fivefold parameter reduction per group while preserving specialization. This structure enables a two-stage hierarchical routing strategy: tokens are first assigned to a cluster, then to specific experts within it, drastically reducing the routing search space and the volume of all-to-all communication. Furthermore, a heterogeneous precision scheme, which stores shared bases in FP16 and residual factors in INT4, coupled with dynamic offloading of inactive clusters, reduces peak memory consumption to levels comparable to dense models. Evaluated on GLUE and WikiText-103, our framework matches the quality of standard MoE models while reducing total parameters by approximately 80%, improving throughput by 10% to 20%, and lowering expert load variance by a factor of over three. Our work demonstrates that structural reorganization is a principled path toward scalable, efficient, and memory-effective MoE LLMs.
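The abstract's pipeline has four moving parts; the short sketches below illustrate each one under stated assumptions and are not the authors' implementation. First, the online re-clustering step: a minimal sketch assuming cosine similarity in both parameter and activation space, fused by a mixing weight alpha, with a greedy equal-size grouping standing in for whatever clustering algorithm the paper actually uses (`fused_similarity` and `regroup_experts` are hypothetical names).

```python
# A minimal sketch of the periodic expert re-clustering step (hypothetical
# names; the fused metric's exact form and the clustering algorithm are
# assumptions, not the authors' code).
import torch
import torch.nn.functional as F

def fused_similarity(weights: torch.Tensor, acts: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Fuse parameter- and activation-space cosine similarity.

    weights: (E, D) flattened expert weights
    acts:    (E, H) per-expert mean activation (or router embedding)
    returns: (E, E) fused similarity matrix
    """
    w = F.normalize(weights, dim=-1)
    a = F.normalize(acts, dim=-1)
    return alpha * (w @ w.T) + (1.0 - alpha) * (a @ a.T)

def regroup_experts(sim: torch.Tensor, num_clusters: int) -> list[list[int]]:
    """Greedy equal-size grouping: repeatedly seed a cluster with the most
    globally similar unassigned expert, then attach its nearest peers."""
    E = sim.size(0)
    size = E // num_clusters
    unassigned, clusters = set(range(E)), []
    while unassigned:
        seed = max(unassigned, key=lambda e: sim[e].sum().item())
        unassigned.remove(seed)
        peers = sorted(unassigned, key=lambda e: -sim[seed, e].item())[:size - 1]
        for p in peers:
            unassigned.remove(p)
        clusters.append([seed] + peers)
    return clusters
```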
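Next, the per-cluster structured compression. A sketch assuming the shared base is the cluster mean and each residual adapter is a truncated SVD of the deviation from it; the abstract only specifies the base-plus-low-rank form, not this particular fit.

```python
# A sketch of the decomposition W_e ~ B + U_e V_e (assumptions: the shared
# base B is the cluster mean; residuals come from truncated SVD).
import torch

def decompose_cluster(expert_weights: list[torch.Tensor], rank: int = 4):
    """Return a shared base plus a rank-`rank` residual adapter per expert."""
    base = torch.stack(expert_weights).mean(dim=0)        # shared base, shape (out, in)
    adapters = []
    for W in expert_weights:
        U, S, Vh = torch.linalg.svd(W - base, full_matrices=False)
        adapters.append((U[:, :rank] * S[:rank], Vh[:rank, :]))  # (out, r), (r, in)
    return base, adapters

def expert_forward(x: torch.Tensor, base: torch.Tensor, adapter) -> torch.Tensor:
    """y = x (B + U V)^T, computed without materializing the full weight."""
    U, V = adapter
    return x @ base.T + (x @ V.T) @ U.T
```

For a cluster of E experts with (out, in) weight matrices, storage falls from E * out * in to out * in + E * r * (out + in), which for small rank r approaches the fivefold per-group reduction claimed above.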
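The two-stage routing then operates over this grouped structure. A sketch assuming a hard top-1 cluster choice followed by top-k selection among that cluster's experts; the gate shapes and names are illustrative.

```python
# A sketch of two-stage hierarchical routing (assumption: top-1 cluster,
# then top-k experts inside it; class and argument names are illustrative).
import torch

class HierarchicalRouter(torch.nn.Module):
    def __init__(self, d_model: int, num_clusters: int,
                 experts_per_cluster: int, k: int = 2):
        super().__init__()
        self.cluster_gate = torch.nn.Linear(d_model, num_clusters)
        self.expert_gate = torch.nn.Linear(d_model, num_clusters * experts_per_cluster)
        self.epc, self.k = experts_per_cluster, k

    def forward(self, x: torch.Tensor):
        # Stage 1: each token picks one cluster, shrinking the search space
        # and confining all-to-all traffic to that cluster's device group.
        cluster = self.cluster_gate(x).argmax(dim=-1)                  # (...,)
        logits = self.expert_gate(x).view(*x.shape[:-1], -1, self.epc) # (..., C, epc)
        # Stage 2: top-k only among the chosen cluster's experts.
        idx = cluster[..., None, None].expand(*cluster.shape, 1, self.epc)
        local = logits.gather(-2, idx).squeeze(-2)                     # (..., epc)
        weights, local_top = torch.topk(local.softmax(dim=-1), self.k, dim=-1)
        expert_ids = cluster[..., None] * self.epc + local_top         # global ids
        return expert_ids, weights
```

Each token now scores only experts_per_cluster gates in stage 2 instead of num_clusters * experts_per_cluster, and its dispatch stays within one cluster, which is where the routing and all-to-all savings come from.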
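Finally, the memory scheme. A sketch assuming symmetric per-tensor INT4 quantization for the residual factors (the paper may use finer-grained scales) and simple on-demand paging of inactive clusters; `ClusterStore` is a hypothetical wrapper.

```python
# A sketch of the heterogeneous precision scheme: FP16 shared bases stay
# resident, INT4 residual factors are paged in on demand (assumptions noted).
import torch

def quantize_int4(t: torch.Tensor):
    """Symmetric per-tensor quantization to the INT4 range [-8, 7]; stored
    in int8 here for simplicity (a real kernel packs two values per byte)."""
    scale = t.abs().max().clamp(min=1e-8) / 7.0
    return torch.clamp(torch.round(t / scale), -8, 7).to(torch.int8), scale

def dequantize_int4(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.to(torch.float16) * scale

class ClusterStore:
    """FP16 shared base resident on the accelerator; INT4 residual factors
    live off-device until the cluster is activated."""
    def __init__(self, base, adapters, device="cuda"):
        # assumes a CUDA device is available; pass device="cpu" otherwise
        self.device = device
        self.base = base.half().to(device)
        self.packed = [(quantize_int4(U), quantize_int4(V)) for U, V in adapters]

    def activate(self):
        # Page the cluster in: dequantize and move residuals on demand.
        return [(dequantize_int4(*qu).to(self.device),
                 dequantize_int4(*qv).to(self.device))
                for qu, qv in self.packed]
```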
Related papers
- Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs [22.399470395813577]
Dynamic Expert Sharing (DES) is a novel technique that shifts MoE optimization from token-centric pruning to sequence-level coreset selection. DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy.
arXiv Detail & Related papers (2026-01-31T20:01:47Z)
- SimpleMem: Efficient Lifelong Memory for LLM Agents [73.74399447715052]
We introduce SimpleMem, an efficient memory framework based on semantic lossless compression. We propose a three-stage pipeline designed to maximize information density and token utilization. Experiments on benchmark datasets show that our method consistently outperforms baseline approaches in accuracy, retrieval efficiency, and inference cost.
arXiv Detail & Related papers (2026-01-05T21:02:49Z) - Improving Recursive Transformers with Mixture of LoRAs [2.672804414228544]
Mixture of LoRAs (MoL) inserts Low-Rank Adaptation (LoRA) experts inside a shared feed-forward network (FFN). MoL enables token-conditional weight-space modulation of the shared FFN without untying backbone parameters. ModernALBERT achieves state-of-the-art performance among compact models.
arXiv Detail & Related papers (2025-12-14T23:39:30Z) - Hierarchical LoRA MoE for Efficient CTR Model Scaling [56.608809143548946]
HiLoMoE is a hierarchical LoRA MoE framework that enables holistic scaling in a parameter-efficient manner. Unlike conventional stacking, HiLoMoE routes based on prior layer scores rather than outputs, allowing all layers to execute in parallel.
arXiv Detail & Related papers (2025-10-12T03:54:11Z) - SkipGPT: Dynamic Layer Pruning Reinvented with Token Awareness and Module Decoupling [16.742839354514512]
We introduce SkipGPT, a dynamic layer pruning framework for optimizing large language models. We show that SkipGPT removes over 40% of model parameters while matching or exceeding the performance of the original dense model.
arXiv Detail & Related papers (2025-06-04T17:26:31Z) - Reinforced Model Merging [53.84354455400038]
We present an innovative framework termed Reinforced Model Merging (RMM), which encompasses an environment and agent tailored for merging tasks. By utilizing data subsets during the evaluation process, we address the bottleneck in the reward feedback phase, accelerating RMM by up to 100 times.
arXiv Detail & Related papers (2025-03-27T08:52:41Z) - ALoRE: Efficient Visual Adaptation via Aggregating Low Rank Experts [71.91042186338163]
ALoRE is a novel PETL method that reuses the hypercomplex parameterized space constructed by the Kronecker product to Aggregate Low Rank Experts. Thanks to this artful design, ALoRE adds negligible extra parameters and can be effortlessly merged into the frozen backbone.
arXiv Detail & Related papers (2024-12-11T12:31:30Z) - Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design [59.00758127310582]
We propose a novel framework, Read-ME, that transforms pre-trained dense LLMs into smaller MoE models. Our approach employs activation sparsity to extract experts. Read-ME outperforms other popular open-source dense models of similar scales.
arXiv Detail & Related papers (2024-10-24T19:48:51Z) - Bypass Back-propagation: Optimization-based Structural Pruning for Large Language Models via Policy Gradient [57.9629676017527]
We propose an optimization-based structural pruning method that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks. Experiments conducted on LLaMA, LLaMA-2, LLaMA-3, Vicuna, and Mistral models demonstrate the promising performance of our method in efficiency and effectiveness.
arXiv Detail & Related papers (2024-06-15T09:31:03Z) - Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection [16.062265609569003]
Mixture-of-Experts (MoE) architectures have emerged as a paradigm-shifting approach for large language models (LLMs). We propose a novel expert routing framework that incorporates: (1) an efficient routing mechanism with lightweight computation; (2) an adaptive bidirectional selection mechanism leveraging resonance between experts and tokens; and (3) a module that determines the lower bounds of expert capacity based on dynamic token distribution analysis.
arXiv Detail & Related papers (2024-05-24T02:50:44Z) - DIVA: A Dirichlet Process Mixtures Based Incremental Deep Clustering
Algorithm via Variational Auto-Encoder [26.93881074862267]
We propose a nonparametric deep clustering framework that employs an infinite mixture of Gaussians as a prior. We name the framework DIVA, a Dirichlet Process-based Incremental deep clustering framework via Variational Auto-Encoder. Our framework outperforms state-of-the-art baselines, exhibiting superior performance in classifying complex data with dynamically changing features.
arXiv Detail & Related papers (2023-05-23T13:44:12Z)
- Efficient Micro-Structured Weight Unification and Pruning for Neural Network Compression [56.83861738731913]
Deep Neural Network (DNN) models are essential for practical applications, especially on resource-limited devices. Previous unstructured or structured weight pruning methods can hardly deliver real inference acceleration. We propose a generalized weight unification framework at a hardware-compatible micro-structured level to achieve a high degree of compression and acceleration.
arXiv Detail & Related papers (2021-06-15T17:22:59Z)
This list is automatically generated from the titles and abstracts of the papers on this site. This site does not guarantee the quality of its content (including all information) and is not responsible for any consequences.