The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models
- URL: http://arxiv.org/abs/2601.03425v1
- Date: Tue, 06 Jan 2026 21:29:45 GMT
- Title: The Illusion of Specialization: Unveiling the Domain-Invariant "Standing Committee" in Mixture-of-Experts Models
- Authors: Yan Wang, Yitao Xu, Nanhan Shen, Jinyan Su, Jimin Huang, Zining Zhu
- Abstract summary: Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. We introduce COMMITTEEAUDIT, a framework that analyzes routing behavior at the level of expert groups rather than individual experts. We find that Standing Committees consistently capture the majority of routing mass across domains, layers, and routing budgets.
- Score: 18.428606280260187
- License: http://creativecommons.org/licenses/by/4.0/
- Abstract: Mixture of Experts models are widely assumed to achieve domain specialization through sparse routing. In this work, we question this assumption by introducing COMMITTEEAUDIT, a post hoc framework that analyzes routing behavior at the level of expert groups rather than individual experts. Across three representative models and the MMLU benchmark, we uncover a domain-invariant Standing Committee. This is a compact coalition of routed experts that consistently captures the majority of routing mass across domains, layers, and routing budgets, even when architectures already include shared experts. Qualitative analysis further shows that Standing Committees anchor reasoning structure and syntax, while peripheral experts handle domain-specific knowledge. These findings reveal a strong structural bias toward centralized computation, suggesting that specialization in Mixture of Experts models is far less pervasive than commonly believed. This inherent bias also indicates that current training objectives, such as load-balancing losses that enforce uniform expert utilization, may be working against the model's natural optimization path, thereby limiting training efficiency and performance.
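The key quantity behind this claim is routing mass aggregated over groups of experts. The PyTorch sketch below shows one plausible way to audit a layer for a "standing committee" from logged router probabilities; the greedy coverage criterion and the function name are illustrative assumptions, not the paper's released implementation.

```python
import torch

def committee_audit(gate_probs: torch.Tensor, coverage: float = 0.5):
    """Find the smallest expert coalition capturing `coverage` of routing mass.

    gate_probs: (num_tokens, num_experts) router probabilities logged from
    one MoE layer. Returns committee indices and the mass share they capture.
    """
    share = gate_probs.sum(dim=0)
    share = share / share.sum()                     # mass share per expert
    vals, order = share.sort(descending=True)
    cum = vals.cumsum(dim=0)
    size = int((cum < coverage).sum().item()) + 1   # smallest prefix reaching coverage
    return order[:size].tolist(), cum[size - 1].item()

# Toy example: 1000 tokens routed over 16 experts with a skewed router.
torch.manual_seed(0)
probs = torch.softmax(torch.randn(1000, 16) * 2 + torch.linspace(2, -2, 16), dim=-1)
committee, captured = committee_audit(probs, coverage=0.5)
print(f"{len(committee)} experts capture {captured:.0%} of routing mass")
```

A domain-invariant committee, in the paper's sense, would then be a coalition that stays near-identical when the audit is repeated per MMLU domain.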
Related papers
- Disentangling Causal Importance from Emergent Structure in Multi-Expert Orchestration [16.543409874497733]
We introduce INFORM, an interpretability analysis that treats orchestration as an explicit, analyzable computation. We evaluate an orchestrator on GSM8K, HumanEval, and MMLU using a homogeneous consortium of ten instruction-tuned experts. We reveal a divergence between relational importance, captured by routing mass and interaction topology, and intrinsic importance, measured via gradient-based causal attribution.
arXiv Detail & Related papers (2026-02-04T07:41:32Z)
- SAME: Stabilized Mixture-of-Experts for Multimodal Continual Instruction Tuning [83.66308307152808]
We propose StAbilized Mixture-of-Experts (SAME) for Multimodal Continual Instruction Tuning (MCIT). SAME stabilizes expert selection by decomposing routing dynamics into subspaces and updating only task-relevant directions. It also introduces adaptive expert activation to freeze selected experts during training, reducing redundancy and cross-task interference.
arXiv Detail & Related papers (2026-02-02T11:47:06Z)
- Enhancing CTR Prediction with De-correlated Expert Networks [45.50697497028273]
We propose a De-Correlated MoE (D-MoE) framework, which introduces a Cross-Expert De-Correlation loss to minimize expert correlations. We show that D-MoE achieves a significant 1.19% Gross Merchandise Volume (GMV) lift compared to the Multi-Embedding MoE baseline.
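The abstract does not give the loss in closed form; a minimal PyTorch sketch of one plausible reading, penalizing off-diagonal correlation between expert outputs, follows. The `decorrelation_loss` helper is an assumption for exposition, and D-MoE's exact formulation may differ.

```python
import torch

def decorrelation_loss(expert_outputs: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise correlation between expert outputs.

    expert_outputs: (num_experts, batch, dim) stacked expert activations.
    Returns the mean squared off-diagonal entry of the expert-by-expert
    correlation matrix, which is zero when experts are uncorrelated.
    """
    E = expert_outputs.size(0)
    flat = expert_outputs.reshape(E, -1)                   # (E, batch*dim)
    flat = flat - flat.mean(dim=1, keepdim=True)           # center each expert
    flat = flat / (flat.norm(dim=1, keepdim=True) + 1e-8)  # unit norm
    corr = flat @ flat.T                                   # (E, E) correlations
    off_diag = corr - torch.eye(E)
    return off_diag.pow(2).mean()

# Example: 4 experts, batch of 8, hidden size 16.
outs = torch.randn(4, 8, 16)
print(decorrelation_loss(outs))  # small positive scalar for random outputs
```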
arXiv Detail & Related papers (2025-05-23T14:04:38Z)
- Domain-Specific Pruning of Large Mixture-of-Experts Models with Few-shot Demonstrations [48.890534958441016]
This study investigates domain specialization and expert redundancy in large-scale MoE models. We propose a simple yet effective pruning framework, EASY-EP, to identify and retain only the most relevant experts. Experiments on DeepSeek-R1 and DeepSeek-V3-0324 show that our method can achieve performance comparable to the full model and $2.99\times$ throughput under the same memory budget with only half the experts.
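As a rough illustration of expert pruning from few-shot demonstrations, the sketch below keeps the experts that receive the most routing mass on demonstration tokens; treating routing mass as the relevance criterion is an assumption here, not necessarily EASY-EP's actual scoring rule.

```python
import torch

def prune_experts(demo_gate_probs: torch.Tensor, keep_ratio: float = 0.5):
    """Rank experts by routing mass on demonstration tokens; keep the top share.

    demo_gate_probs: (num_demo_tokens, num_experts) router probabilities
    collected while running the few-shot demonstrations through the model.
    """
    relevance = demo_gate_probs.sum(dim=0)             # routing mass per expert
    k = max(1, int(keep_ratio * relevance.numel()))
    kept = relevance.topk(k).indices                   # most-used experts
    return kept.sort().values.tolist()

demo = torch.softmax(torch.randn(256, 8) * 3, dim=-1)  # 256 demo tokens, 8 experts
print(prune_experts(demo, keep_ratio=0.5))             # indices of retained experts
```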
arXiv Detail & Related papers (2025-04-09T11:34:06Z)
- Unified Sparse Mixture of Experts [14.774596844618396]
Sparse Mixture of Experts (SMoE) models scale model capacity while maintaining constant computational overhead. This paper proposes a Unified Sparse Mixture of Experts (USMoE) framework that addresses limitations of existing SMoE designs.
arXiv Detail & Related papers (2025-03-29T07:15:12Z)
- Convergence Rates for Softmax Gating Mixture of Experts [78.3687645289918]
Mixture of experts (MoE) has emerged as an effective framework to advance the efficiency and scalability of machine learning models. Central to the success of MoE is an adaptive softmax gating mechanism that determines the relevance of each expert to a given input and dynamically assigns experts their respective weights. We perform a convergence analysis of parameter estimation and expert estimation under MoE equipped with the standard softmax gating or its variants, including a dense-to-sparse gating and a hierarchical softmax gating.
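For readers unfamiliar with the gating variants analyzed here, a minimal PyTorch sketch of standard softmax gating and a dense-to-sparse (top-k) variant follows; the helper name and the renormalization details are illustrative, not this paper's notation.

```python
import torch
import torch.nn.functional as F

def softmax_gate(x, W_g, top_k=None):
    """Standard softmax gating, optionally sparsified to top-k experts.

    x: (batch, dim) inputs; W_g: (dim, num_experts) gating weights.
    Returns (batch, num_experts) expert weights summing to 1 per row.
    """
    logits = x @ W_g
    if top_k is None:                           # dense softmax gating
        return F.softmax(logits, dim=-1)
    # Dense-to-sparse variant: keep top-k logits, zero out the rest.
    vals, idx = logits.topk(top_k, dim=-1)
    sparse = torch.full_like(logits, float("-inf"))
    sparse.scatter_(-1, idx, vals)
    return F.softmax(sparse, dim=-1)

x = torch.randn(2, 16)
W = torch.randn(16, 8)
print(softmax_gate(x, W).sum(-1))               # dense: rows sum to 1
print(softmax_gate(x, W, top_k=2))              # sparse: 2 nonzeros per row
```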
arXiv Detail & Related papers (2025-03-05T06:11:24Z)
- Adaptive Conditional Expert Selection Network for Multi-domain Recommendation [10.418133538132635]
Mixture-of-Experts (MoE) has recently become the de facto standard in multi-domain recommendation (MDR).
CESAA consists of a Conditional Expert Selection (CES) module and an Adaptive Expert Aggregation (AEA) module.
AEA utilizes a mutual information loss to strengthen the correlations between experts and specific domains and to significantly improve the distinction between experts.
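One way to make the mutual information objective concrete is the sketch below, which estimates empirical MI between routing probabilities and domain labels; training would maximize it (i.e., add its negative to the loss). The estimator and names are assumptions for illustration, not CESAA's implementation.

```python
import torch

def expert_domain_mi(gate_probs, domain_ids, num_domains):
    """Empirical mutual information between expert choice and domain label.

    gate_probs: (batch, num_experts) routing probabilities;
    domain_ids: (batch,) integer domain labels.
    """
    batch, _ = gate_probs.shape
    onehot = torch.nn.functional.one_hot(domain_ids, num_domains).float()
    p_joint = onehot.T @ gate_probs / batch           # (domains, experts)
    p_d = p_joint.sum(dim=1, keepdim=True)            # domain marginal
    p_e = p_joint.sum(dim=0, keepdim=True)            # expert marginal
    ratio = p_joint / (p_d @ p_e + 1e-8)
    return (p_joint * torch.log(ratio + 1e-8)).sum()

gates = torch.softmax(torch.randn(32, 4), dim=-1)
domains = torch.randint(0, 3, (32,))
print(expert_domain_mi(gates, domains, num_domains=3))  # near 0 for random data
```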
arXiv Detail & Related papers (2024-11-11T09:39:31Z)
- Generalization Error Analysis for Sparse Mixture-of-Experts: A Preliminary Study [65.11303133775857]
Mixture-of-Experts (MoE) computation amalgamates predictions from several specialized sub-models (referred to as experts).
Sparse MoE selectively engages only a limited number of experts, or even just one, significantly reducing overhead while empirically preserving, and sometimes even enhancing, performance.
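A toy PyTorch sketch of the computation described here, routing each token to its top-k experts and combining their weighted outputs, follows; it is a didactic reference point under assumed names and shapes, not an efficient implementation.

```python
import torch
import torch.nn as nn

class SparseMoE(nn.Module):
    """Toy sparse MoE layer: each token is processed by its top-k experts only."""

    def __init__(self, dim: int, num_experts: int, top_k: int = 1):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(dim, dim) for _ in range(num_experts))
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x: torch.Tensor) -> torch.Tensor:    # x: (batch, dim)
        weights = torch.softmax(self.gate(x), dim=-1)
        vals, idx = weights.topk(self.top_k, dim=-1)        # chosen experts per token
        vals = vals / vals.sum(dim=-1, keepdim=True)        # renormalize over top-k
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                      # dispatch and combine
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                    # tokens sent to expert e
                if mask.any():
                    out[mask] += vals[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = SparseMoE(dim=16, num_experts=8, top_k=2)
print(layer(torch.randn(4, 16)).shape)                      # torch.Size([4, 16])
```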
arXiv Detail & Related papers (2024-03-26T05:48:02Z)
- MoEC: Mixture of Expert Clusters [93.63738535295866]
Sparse Mixture of Experts (MoE) has received great interest due to its promising scaling capability with affordable computational overhead.
MoE converts dense layers into sparse experts and utilizes a gated routing network to activate experts conditionally.
However, as the number of experts grows, MoE with an outrageous number of parameters suffers from overfitting and sparse data allocation.
arXiv Detail & Related papers (2022-07-19T06:09:55Z)
- Sparsely-gated Mixture-of-Expert Layers for CNN Interpretability [3.021134753248103]
Sparsely-gated Mixture-of-Expert (MoE) layers have been successfully applied to scale large transformers.
In this work, we apply sparse MoE layers to CNNs for computer vision tasks and analyze the resulting effect on model interpretability.
arXiv Detail & Related papers (2022-04-22T09:40:23Z)
- On the Representation Collapse of Sparse Mixture of Experts [102.83396489230375]
Sparse mixture of experts provides larger model capacity while requiring a constant computational overhead.
It employs the routing mechanism to distribute input tokens to the best-matched experts according to their hidden representations.
However, learning such a routing mechanism encourages token clustering around expert centroids, implying a trend toward representation collapse.
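As a hedged illustration, the collapse tendency can be probed by measuring how tightly token representations cluster around their assigned expert's centroid; the metric and names below are assumptions for exposition, not the paper's diagnostic.

```python
import torch

def collapse_score(hidden, expert_ids, num_experts):
    """Average cosine similarity of token representations to their expert centroid.

    A score near 1 means tokens cluster tightly around expert centroids,
    the trend toward representation collapse described above.
    """
    hidden = torch.nn.functional.normalize(hidden, dim=-1)
    sims = []
    for e in range(num_experts):
        toks = hidden[expert_ids == e]
        if len(toks) < 2:
            continue                                 # skip near-empty experts
        centroid = torch.nn.functional.normalize(toks.mean(0), dim=-1)
        sims.append((toks @ centroid).mean())
    return torch.stack(sims).mean()

h = torch.randn(512, 32)                             # token hidden states
ids = torch.randint(0, 8, (512,))                    # expert assignments
print(collapse_score(h, ids, num_experts=8))         # near 0 for random data
```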
arXiv Detail & Related papers (2022-04-20T01:40:19Z)