Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs
- URL: http://arxiv.org/abs/2602.00879v1
- Date: Sat, 31 Jan 2026 20:01:47 GMT
- Title: Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs
- Authors: Hao Mark Chen, Zhiwen Mo, Royson Lee, Qianzhou Wang, Da Li, Shell Xu Hu, Wayne Luk, Timothy Hospedales, Hongxiang Fan
- Abstract summary: Dynamic Expert Sharing (DES) is a novel technique that shifts MoE optimization from token-centric pruning to sequence-level coreset selection. DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy.
- Score: 22.399470395813577
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Among parallel decoding paradigms, diffusion large language models (dLLMs) have emerged as a promising candidate that balances generation quality and throughput. However, their integration with Mixture-of-Experts (MoE) architectures is constrained by an expert explosion: as the number of tokens generated in parallel increases, the number of distinct experts activated grows nearly linearly. This results in substantial memory traffic that pushes inference into a memory-bound regime, negating the efficiency gains of both MoE and parallel decoding. To address this challenge, we propose Dynamic Expert Sharing (DES), a novel technique that shifts MoE optimization from token-centric pruning and conventional expert skipping methods to sequence-level coreset selection. To maximize expert reuse, DES identifies a compact, high-utility set of experts to satisfy the requirements of an entire parallel decoding block. We introduce two innovative selection strategies: (1) Intra-Sequence Sharing (DES-Seq), which adapts optimal allocation to the sequence level, and (2) Saliency-Aware Voting (DES-Vote), a novel mechanism that allows tokens to collectively elect a coreset based on aggregated router weights. Extensive experiments on MoE dLLMs demonstrate that DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy, effectively decoupling memory overhead from the degree of parallelism.
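To make the selection mechanism concrete, here is a minimal sketch of the saliency-aware voting idea (DES-Vote) described in the abstract. It is an illustration under assumptions, not the authors' implementation: the function names, the top-k re-routing inside the coreset, and the renormalization step are hypothetical; only the core move is taken from the abstract, namely that all tokens in a parallel decoding block pool their router weights and collectively elect a small shared set of experts.

```python
import torch

def des_vote_coreset(router_weights: torch.Tensor, coreset_size: int) -> torch.Tensor:
    """Elect a shared expert coreset for one parallel decoding block.

    router_weights: [num_tokens, num_experts] softmax gate scores for every
                    token decoded in parallel in this block.
    coreset_size:   budget of unique experts allowed for the whole block.
    """
    # Each token "votes" with its gate mass; aggregate saliency per expert.
    votes = router_weights.sum(dim=0)                    # [num_experts]
    return torch.topk(votes, k=coreset_size).indices     # elected coreset

def route_within_coreset(router_weights: torch.Tensor, coreset: torch.Tensor, top_k: int = 2):
    """Restrict each token's routing to the elected coreset (hypothetical step)."""
    restricted = router_weights[:, coreset]              # [num_tokens, coreset_size]
    gates, local_idx = torch.topk(restricted, k=top_k, dim=-1)
    gates = gates / gates.sum(dim=-1, keepdim=True)      # renormalize over coreset
    return gates, coreset[local_idx]                     # weights, global expert ids
```

With, say, 64 experts, 32 tokens decoded in parallel, and coreset_size=8, the block touches at most 8 unique expert weight matrices instead of the nearly linear growth described above, which is the memory-traffic effect the abstract attributes to DES.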
Related papers
- SERE: Similarity-based Expert Re-routing for Efficient Batch Decoding in MoE Models [19.56443760368644]
We present SERE, a Similarity-based Expert Re-routing method for Efficient batch decoding in MoE models. SERE dynamically reduces the number of active experts by re-routing tokens from secondary experts to their most similar primary counterparts (a minimal sketch of this re-routing idea appears after the list). SERE achieves up to 2.0x speedup with minimal quality loss, providing a practical solution for cost-efficient and latency-sensitive large-scale MoE deployment.
arXiv Detail & Related papers (2026-02-07T16:51:16Z)
- XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference [26.398691056715037]
Mixture-of-Experts (MoE) architectures are increasingly used to efficiently scale large language models. We address this issue by modeling batch-aware expert selection as a modular optimization problem. The proposed method, namely XShare, requires no retraining and dynamically adapts to each batch by maximizing the total gating score of selected experts.
arXiv Detail & Related papers (2026-02-06T23:33:07Z)
- Token-Level LLM Collaboration via FusionRoute [60.72307345997823]
FusionRoute is a token-level multi-LLM collaboration framework. It selects the most suitable expert at each decoding step and contributes a complementary logit that refines or corrects the selected expert's next-token distribution. It outperforms sequence- and token-level collaboration baselines, model merging, and direct fine-tuning.
arXiv Detail & Related papers (2026-01-08T16:53:16Z)
- MC#: Mixture Compressor for Mixture-of-Experts Large Models [86.64315380917827]
Mixture-of-Experts (MoE) effectively scales large language models (LLMs) and vision-language models (VLMs) by increasing capacity through sparse activation. We propose MC# (Mixture-Compressor-sharp), a framework that combines static quantization and dynamic expert pruning.
arXiv Detail & Related papers (2025-10-13T03:12:46Z)
- Breaking the MoE LLM Trilemma: Dynamic Expert Clustering with Structured Compression [14.086434595924716]
Mixture-of-Experts (MoE) Large Language Models (LLMs) face a trilemma of load imbalance, parameter redundancy, and communication overhead. We introduce a unified framework based on dynamic expert clustering and structured compression to address these issues cohesively.
arXiv Detail & Related papers (2025-09-27T10:45:58Z)
- Pangu Embedded: An Efficient Dual-system LLM Reasoner with Metacognition [95.54406667705999]
Pangu Embedded is an efficient Large Language Model (LLM) reasoner developed on Ascend Neural Processing Units (NPUs). It addresses the significant computational costs and inference latency challenges prevalent in existing reasoning-optimized LLMs. It delivers rapid responses and state-of-the-art reasoning quality within a single, unified model architecture.
arXiv Detail & Related papers (2025-05-28T14:03:02Z)
- EPS-MoE: Expert Pipeline Scheduler for Cost-Efficient MoE Inference [49.94169109038806]
This paper introduces EPS-MoE, a novel expert pipeline scheduler for MoE that surpasses existing parallelism schemes. Our results demonstrate up to 52.4% improvement in prefill throughput compared to existing parallel inference methods.
arXiv Detail & Related papers (2024-10-16T05:17:49Z)
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network (FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
- Mixture Compressor for Mixture-of-Experts LLMs Gains More [71.0473038084673]
We propose a training-free Mixture-Compressor for Mixture-of-Experts large language models (MoE-LLMs). Our MC integrates static quantization and dynamic pruning to collaboratively achieve extreme compression for MoE-LLMs with less accuracy loss. For instance, at 2.54 bits, MC compresses 76.6% of the model, with only a 3.8% average accuracy loss.
arXiv Detail & Related papers (2024-10-08T18:09:38Z)
- T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model Finetuning [31.276142111455847]
Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. We design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX). Rank-1 experts enable a mix-and-match mechanism to quadratically expand the vector subspace of experts with linear parameter overheads, achieving approximate error reduction with optimal...
arXiv Detail & Related papers (2024-04-13T12:14:58Z)
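The SERE entry above describes re-routing tokens from secondary experts to their most similar primary counterparts. The sketch below illustrates that idea under assumptions and is not the paper's implementation: a hypothetical per-expert embedding (expert_embeds) stands in for whatever similarity signal SERE actually uses, and the function name and argument layout are made up for illustration.

```python
import torch
import torch.nn.functional as F

def similarity_reroute(primary_ids, secondary_ids, expert_embeds):
    """Re-route each token's secondary expert to its most similar primary expert.

    primary_ids:   [num_tokens] top-1 expert id per token in the batch (kept active).
    secondary_ids: [num_tokens] expert ids that would otherwise also be loaded.
    expert_embeds: [num_experts, d] hypothetical per-expert representation used
                   as a similarity proxy (an assumption of this sketch).
    """
    kept = torch.unique(primary_ids)                  # experts that stay active
    e = F.normalize(expert_embeds, dim=-1)
    sim = e @ e[kept].T                               # cosine similarity: [num_experts, num_kept]
    nearest_kept = kept[sim.argmax(dim=-1)]           # best active substitute per expert
    return nearest_kept[secondary_ids]                # [num_tokens] re-routed expert ids
```

The effect mirrors the coreset idea sketched earlier: every token's secondary expert is served by an already-active expert, so the number of unique experts loaded for the batch shrinks without retraining the router.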