XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference
- URL: http://arxiv.org/abs/2602.07265v1
- Date: Fri, 06 Feb 2026 23:33:07 GMT
- Title: XShare: Collaborative in-Batch Expert Sharing for Faster MoE Inference
- Authors: Daniil Vankov, Nikita Ivkin, Kyle Ulrich, Xiang Song, Ashish Khetan, George Karypis
- Abstract summary: Mixture-of-Experts (MoE) architectures are increasingly used to efficiently scale large language models. We address this issue by modeling batch-aware expert selection as a modular optimization problem. The proposed method, namely XShare, requires no retraining and dynamically adapts to each batch by maximizing the total gating score of selected experts.
- Score: 26.398691056715037
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture-of-Experts (MoE) architectures are increasingly used to efficiently scale large language models. However, in production inference, request batching and speculative decoding significantly amplify expert activation, eroding these efficiency benefits. We address this issue by modeling batch-aware expert selection as a modular optimization problem and designing efficient greedy algorithms for different deployment settings. The proposed method, namely XShare, requires no retraining and dynamically adapts to each batch by maximizing the total gating score of selected experts. It reduces expert activation by up to 30% under standard batching, cuts peak GPU load by up to 3x in expert-parallel deployments, and achieves up to 14% throughput gains in speculative decoding via hierarchical, correlation-aware expert selection, even when requests in a batch are drawn from heterogeneous datasets.
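The greedy, batch-aware selection the abstract describes can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the paper's algorithm: the function name `greedy_shared_experts`, the exhaustive marginal-gain loop, and routing each token to its top-k among the shared set are all assumptions made for the sketch.

```python
import numpy as np

def greedy_shared_experts(gates, budget, k):
    """Greedily pick `budget` experts to share across the whole batch,
    maximizing the total gating score when each token is then routed
    to its top-k experts among the selected subset.

    gates: (n_tokens, n_experts) array of router gating scores.
    Returns the list of selected expert indices, in selection order.
    """
    n_tokens, n_experts = gates.shape
    selected = []
    for _ in range(budget):
        best_e, best_gain = None, -1.0
        for e in range(n_experts):
            if e in selected:
                continue
            cand = selected + [e]
            sub = gates[:, cand]
            # Each token's best k scores among the candidate subset
            # (if fewer than k experts are selected yet, all are kept).
            topk = np.sort(sub, axis=1)[:, -k:]
            gain = topk.sum()
            if gain > best_gain:
                best_gain, best_e = gain, e
        selected.append(best_e)
    return selected

# Toy batch of 2 tokens over 3 experts: expert 0 dominates both tokens,
# so a batch-aware greedy pass should pick it first.
gates = np.array([[0.9, 0.05, 0.05],
                  [0.8, 0.10, 0.10]])
print(greedy_shared_experts(gates, budget=2, k=1))  # expert 0 selected first
```

The marginal-gain structure is what makes a greedy pass reasonable for this kind of modular/submodular objective; a production version would amortize the inner loop rather than re-sorting per candidate.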
Related papers
- Dynamic Expert Sharing: Decoupling Memory from Parallelism in Mixture-of-Experts Diffusion LLMs [22.399470395813577]
Dynamic Expert Sharing (DES) is a novel technique that shifts MoE optimization from token-centric pruning to sequence-level coreset selection. DES reduces unique expert activations by over 55% and latency by up to 38%, while retaining 99% of vanilla accuracy.
arXiv Detail & Related papers (2026-01-31T20:01:47Z)
- Token-Level LLM Collaboration via FusionRoute [60.72307345997823]
FusionRoute is a token-level multi-LLM collaboration framework. It selects the most suitable expert at each decoding step and contributes a complementary logit that refines or corrects the selected expert's next-token distribution. It outperforms both sequence- and token-level collaboration, model merging, and direct fine-tuning.
arXiv Detail & Related papers (2026-01-08T16:53:16Z)
- Chain-of-Experts: Unlocking the Communication Power of Mixture-of-Experts Models [58.54288496296157]
Chain-of-Experts (CoE) is a new Mixture-of-Experts (MoE) architecture that introduces sequential expert communication within each layer. To support dynamic expert selection across iterations, CoE employs a dedicated router at each step within a layer.
arXiv Detail & Related papers (2025-06-23T02:15:43Z)
- Advancing MoE Efficiency: A Collaboration-Constrained Routing (C2R) Strategy for Better Expert Parallelism Design [36.35520569052556]
Mixture-of-Experts (MoE) has successfully scaled up models while maintaining nearly constant computing costs. We propose a novel collaboration-constrained routing (C2R) strategy to encourage more specialized expert groups. We achieve an average performance improvement of 0.51% and 0.33% on ten downstream NLP benchmarks.
arXiv Detail & Related papers (2025-04-02T03:51:59Z)
- Symbolic Mixture-of-Experts: Adaptive Skill-based Routing for Heterogeneous Reasoning [76.10639521319382]
We propose Symbolic-MoE, a symbolic, text-based, and gradient-free Mixture-of-Experts framework. We show Symbolic-MoE beats strong LLMs like GPT4o-mini, as well as multi-agent approaches, with an absolute avg. gain of 8.15% over the best multi-agent baseline.
arXiv Detail & Related papers (2025-03-07T18:03:13Z)
- MoE++: Accelerating Mixture-of-Experts Methods with Zero-Computation Experts [63.67734699877724]
MoE++ is a general and heterogeneous MoE framework that integrates both Feed-Forward Network(FFN) and zero-computation experts.
MoE++ achieves better performance while delivering 1.1-2.1x expert forward throughput compared to a vanilla MoE model of the same size.
arXiv Detail & Related papers (2024-10-09T18:01:27Z)
- Expert-Token Resonance MoE: Bidirectional Routing with Efficiency Affinity-Driven Active Selection [19.365009652356793]
Expert-Token Resonance (ETR) is a theoretically-grounded bidirectional routing mechanism that reimagines expert-token interactions. ETR achieves 5.4%-46.6% improvements in end-to-end training efficiency compared to baseline MoE implementations.
arXiv Detail & Related papers (2024-05-24T02:50:44Z)
- Harder Tasks Need More Experts: Dynamic Routing in MoE Models [58.18526590138739]
We introduce a novel dynamic expert selection framework for Mixture of Experts (MoE) models.
Our method dynamically selects experts based on the confidence level in expert selection for each input.
arXiv Detail & Related papers (2024-03-12T13:41:15Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparsely Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts, and utilizes a gated routing network to make experts conditionally activated.
However, as the number of experts grows, MoE with outrageous parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
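The gated, conditionally activated routing that several of the papers above build on (e.g. the MoEC summary) is, in its simplest form, top-k gating: keep each token's k largest router logits, renormalize them, and zero out the rest. A minimal NumPy sketch, purely illustrative (the function name and the per-token loop are assumptions, not any paper's implementation):

```python
import numpy as np

def topk_gate(logits, k):
    """Top-k gating: for each token, softmax over the k largest router
    logits and zero out all other experts, so only k experts activate."""
    n_tokens, n_experts = logits.shape
    gates = np.zeros_like(logits, dtype=float)
    topk_idx = np.argsort(logits, axis=1)[:, -k:]  # indices of the k best experts
    for t in range(n_tokens):
        chosen = logits[t, topk_idx[t]]
        w = np.exp(chosen - chosen.max())          # numerically stable softmax
        gates[t, topk_idx[t]] = w / w.sum()
    return gates

# One token over 3 experts with k=2: expert 0 is dropped,
# experts 1 and 2 share the renormalized weight.
print(topk_gate(np.array([[1.0, 2.0, 3.0]]), k=2))
```

Each token's output is then a weighted sum of only its k active experts, which is what keeps per-token compute roughly constant as the number of experts grows.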
This list is automatically generated from the titles and abstracts of the papers in this site.
This site does not guarantee the quality of the listed information and is not responsible for any consequences of its use.