Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms
- URL: http://arxiv.org/abs/2509.23933v1
- Date: Sun, 28 Sep 2025 15:13:38 GMT
- Title: Beyond Benchmarks: Understanding Mixture-of-Experts Models through Internal Mechanisms
- Authors: Jiahao Ying, Mingbao Lin, Qianru Sun, Yixin Cao
- Abstract summary: Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. We use an internal metric to investigate the mechanisms of the MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. We uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity.
- Score: 55.1784306456972
- License: http://arxiv.org/licenses/nonexclusive-distrib/1.0/
- Abstract: Mixture-of-Experts (MoE) architectures have emerged as a promising direction, offering efficiency and scalability by activating only a subset of parameters during inference. However, current research remains largely performance-centric, with limited understanding of their internal mechanisms, thereby constraining broader progress. In this work, we use an internal metric (MUI) to investigate the mechanisms of the MoE architecture by explicitly incorporating routing mechanisms and analyzing expert-level behaviors. Through systematic analyses of a wide range of publicly available MoE models, we uncover several findings: (1) neuron utilization decreases as models evolve, reflecting stronger generalization; (2) training exhibits a dynamic trajectory, where benchmark performance alone provides limited signal while MUI reveals deeper insights; (3) task completion emerges from collaborative contributions of multiple experts, with shared experts driving concentration; and (4) activation patterns at the neuron level provide a fine-grained proxy for data diversity. Together, these results demonstrate the potential of MUI as a complementary indicator to benchmark performance, offering new insights into the capacity, dynamics, and specialization of MoE models. Our project can be found at https://yingjiahao14.github.io/MoE-MUI/.
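As a rough illustration of the kind of internal metric involved, the sketch below scores neuron utilization as the fraction of FFN neurons that fire above a threshold while the model works through a task set; the threshold rule, tensor shapes, and function name are assumptions for illustration, not the paper's exact MUI definition.

```python
import torch

def neuron_utilization(activations: torch.Tensor, threshold: float = 0.0) -> float:
    """Fraction of FFN neurons that fire (exceed `threshold`) on at least one token.

    activations: (num_tokens, num_neurons) post-nonlinearity FFN activations
    collected while the model answers the evaluation prompts for one task.
    """
    fired = (activations.abs() > threshold).any(dim=0)  # which neurons ever fired
    return fired.float().mean().item()

# Under the paper's reading, a model that completes the task while touching
# fewer neurons (lower utilization) is exhibiting stronger generalization.
acts = torch.randn(1024, 4096)  # hypothetical activations from one MoE FFN layer
print(f"utilization: {neuron_utilization(acts, threshold=1.0):.3f}")
```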
Related papers
- ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns [68.61814799047956]
Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. We introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations.
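A minimal sketch of what a training-free, activation-pattern-based partition of FFN neurons into a shared expert and routed experts could look like; the per-domain activation profiles, the top-fraction rule for shared neurons, and the dominant-domain grouping are illustrative assumptions rather than ExpertWeaver's actual procedure.

```python
import torch

def partition_neurons(profiles: torch.Tensor, shared_frac: float = 0.1):
    """profiles: (num_neurons, num_domains) -- how often each GLU neuron is active
    on data from each domain, collected offline without any training.

    Returns indices for one shared expert (broadly active neurons) and a list of
    routed experts (remaining neurons grouped by their most-active domain)."""
    overall = profiles.mean(dim=1)                 # average activation rate per neuron
    num_shared = int(shared_frac * profiles.size(0))
    shared = overall.topk(num_shared).indices      # broadly active -> shared expert

    mask = torch.ones(profiles.size(0), dtype=torch.bool)
    mask[shared] = False
    routed_ids = mask.nonzero(as_tuple=True)[0]
    dominant = profiles[routed_ids].argmax(dim=1)  # most-active domain per neuron
    experts = [routed_ids[dominant == d] for d in range(profiles.size(1))]
    return shared, experts

shared, experts = partition_neurons(torch.rand(11008, 8))  # e.g. one LLaMA-style FFN layer
```

In this toy version the number of routed experts simply equals the number of domains; a real system would also need layer-adaptive configurations, as the abstract notes.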
arXiv Detail & Related papers (2026-02-17T11:50:58Z)
- DIML: Differentiable Inverse Mechanism Learning from Behaviors of Multi-Agent Learning Trajectories [7.764532811300023]
We study inverse mechanism learning: recovering an unknown incentive-generating mechanism from observed strategic interaction traces. Unlike inverse game theory and multi-agent inverse reinforcement learning, our target includes unstructured mechanisms. We propose DIML, a likelihood-based framework that differentiates through a model of multi-agent learning dynamics.
arXiv Detail & Related papers (2026-01-25T03:49:25Z)
- Efficient Training of Diffusion Mixture-of-Experts Models: A Practical Recipe [51.26601054313749]
Recent efforts on Diffusion MoE models have primarily focused on developing more sophisticated routing mechanisms. Inspired by the MoE design paradigms established in large language models (LLMs), we identify a set of crucial architectural factors for building effective Diffusion MoE models. We present novel architectures that can be efficiently applied to both latent and pixel-space diffusion frameworks.
arXiv Detail & Related papers (2025-12-01T03:52:31Z)
- On Linear Mode Connectivity of Mixture-of-Experts Architectures [1.6747713135100666]
We investigate the phenomenon of linear mode connectivity (LMC) in neural networks. LMC is a notable phenomenon in the loss landscapes of neural networks, wherein independently trained models have been found to be connected, up to symmetries of the architecture.
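For context, linear mode connectivity is conventionally tested by interpolating the weights of two independently trained solutions and measuring the loss barrier along the segment; a standard formulation (not specific to this paper) is:

```latex
\theta(\alpha) = (1-\alpha)\,\theta_1 + \alpha\,\theta_2, \qquad \alpha \in [0, 1],
\qquad
B(\theta_1, \theta_2) = \max_{\alpha \in [0,1]} \mathcal{L}\bigl(\theta(\alpha)\bigr)
  - \tfrac{1}{2}\bigl[\mathcal{L}(\theta_1) + \mathcal{L}(\theta_2)\bigr].
```

Two solutions are considered linearly mode connected when the barrier B is (near) zero, typically after aligning the models under the symmetries mentioned above (e.g. permuting experts or neurons).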
arXiv Detail & Related papers (2025-09-14T16:51:41Z)
- Foundation Model for Skeleton-Based Human Action Understanding [56.89025287217221]
This paper presents a Unified Skeleton-based Dense Representation Learning (USDRL) framework. USDRL consists of a Transformer-based Dense Spatio-Temporal Encoder (DSTE), Multi-Grained Feature Decorrelation (MG-FD), and Multi-Perspective Consistency Training (MPCT).
arXiv Detail & Related papers (2025-08-18T02:42:16Z)
- MoIIE: Mixture of Intra- and Inter-Modality Experts for Large Vision Language Models [52.876185634349575]
We propose to incorporate Mixture of Intra- and Inter-Modality Experts (MoIIE) into Large Vision-Language Models (LVLMs). For each token, expert routing is guided by its modality, directing tokens to their respective intra-modality experts as well as a shared pool of inter-modality experts. Our MoIIE models with 5.5B and 11.3B activated parameters match or even surpass the performance of existing advanced open-source MoE-LLM-based multi-modal models.
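A toy sketch of the routing rule as the summary describes it: each token may only use experts of its own modality plus a shared pool of inter-modality experts; the dimensions, top-k value, module names, and per-token loop are assumptions made for brevity.

```python
import torch
import torch.nn as nn

class ModalityRoutedMoE(nn.Module):
    """Toy MoE layer with per-modality expert groups plus shared inter-modality experts."""

    def __init__(self, d_model=512, d_ff=1024, intra_per_modality=4, shared=2, top_k=2):
        super().__init__()
        ffn = lambda: nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.text_experts = nn.ModuleList(ffn() for _ in range(intra_per_modality))
        self.image_experts = nn.ModuleList(ffn() for _ in range(intra_per_modality))
        self.shared_experts = nn.ModuleList(ffn() for _ in range(shared))
        self.router = nn.Linear(d_model, intra_per_modality + shared)
        self.top_k = top_k

    def forward(self, x, is_image):  # x: (tokens, d_model); is_image: (tokens,) bool
        out = torch.zeros_like(x)
        for t in range(x.size(0)):
            intra = self.image_experts if is_image[t] else self.text_experts
            pool = list(intra) + list(self.shared_experts)   # candidates for this token
            weights, idx = self.router(x[t]).softmax(dim=-1).topk(self.top_k)
            out[t] = sum(w * pool[int(i)](x[t]) for w, i in zip(weights, idx))
        return out
```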
arXiv Detail & Related papers (2025-08-13T13:00:05Z)
- Mixture of Experts in Large Language Models [3.1494372222592224]
The MoE architecture significantly enhances model performance while maintaining minimal computational overhead. Our analysis identifies key advantages of MoE, including superior model capacity, improved task-specific performance, and the ability to scale model capacity efficiently. This review outlines current research limitations, open challenges, and promising future directions, providing a foundation for continued innovation in MoE architecture and its applications.
arXiv Detail & Related papers (2025-07-15T10:36:43Z)
- Unveiling Effective In-Context Configurations for Image Captioning: An External & Internal Analysis [28.52057785196361]
In Natural Language Processing (NLP), numerous studies have demonstrated the effectiveness of In-Context Learning (ICL). Inspired by the success of Large Language Models (LLMs), researchers have developed Large Multimodal Models (LMMs) with ICL capabilities. This paper conducts a comprehensive external and internal investigation of multimodal in-context learning on the image captioning task.
arXiv Detail & Related papers (2025-07-08T08:07:57Z)
- CoMoE: Contrastive Representation for Mixture-of-Experts in Parameter-Efficient Fine-tuning [10.215751315734018]
We propose Contrastive Representation for MoE (CoMoE) to promote modularization and specialization in MoE. Experiments on several benchmarks and in multi-task settings demonstrate that CoMoE can consistently enhance MoE's capacity and promote modularization among the experts.
arXiv Detail & Related papers (2025-05-23T06:58:44Z)
- On DeepSeekMoE: Statistical Benefits of Shared Experts and Normalized Sigmoid Gating [75.29576838162714]
DeepSeekMoE stands out because of two unique features: the deployment of a shared expert strategy and a normalized sigmoid gating mechanism. We perform a convergence analysis of the expert estimation task to highlight the gains in sample efficiency for both the shared expert strategy and the normalized sigmoid gating.
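For reference, a normalized sigmoid gate scores each expert independently with a sigmoid and then renormalizes the selected scores to sum to one, in contrast to a softmax gate where all experts compete in a single normalization; the top-k ordering below is an assumption about details the actual DeepSeekMoE router may handle differently.

```python
import torch

def normalized_sigmoid_gate(router_logits: torch.Tensor, top_k: int = 2):
    """router_logits: (num_experts,) affinity scores for a single token."""
    scores = torch.sigmoid(router_logits)   # independent per-expert scores in (0, 1)
    values, experts = scores.topk(top_k)    # keep the k highest-affinity experts
    gates = values / values.sum()           # renormalize so the kept gates sum to 1
    return experts, gates

experts, gates = normalized_sigmoid_gate(torch.randn(16), top_k=2)
```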
arXiv Detail & Related papers (2025-05-16T04:58:18Z)
- Quadratic Gating Mixture of Experts: Statistical Insights into Self-Attention [28.17124843417577]
Mixture of Experts (MoE) models are well known for effectively scaling model capacity while keeping computational overhead low. We establish a rigorous relation between MoE and the self-attention mechanism, showing that each row of a self-attention matrix can be written as a quadratic gating mixture of linear experts. We propose a novel active-attention mechanism in which we apply a non-linear activation function to the value matrix in the formula of self-attention.
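Written out, the claimed correspondence reads each attention output row as a gated mixture whose gate logits are quadratic (bilinear) in the inputs and whose experts are the linear value projections; schematically, in notation of our own choosing rather than the paper's:

```latex
\mathrm{Attn}(X)_i
  = \sum_{j=1}^{n}
    \underbrace{\operatorname{softmax}_j\!\left(\frac{x_i^{\top} W_Q W_K^{\top} x_j}{\sqrt{d_k}}\right)}_{\text{gate: quadratic in } (x_i,\, x_j)}
    \;\underbrace{W_V^{\top} x_j}_{\text{linear expert } j}.
```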
arXiv Detail & Related papers (2024-10-15T03:06:37Z)
- Multi-Head Mixture-of-Experts [100.60556163597946]
We propose Multi-Head Mixture-of-Experts (MH-MoE), which employs a multi-head mechanism to split each token into multiple sub-tokens.
MH-MoE is straightforward to implement and decouples from other SMoE optimization methods, making it easy to integrate with other SMoE models for enhanced performance.
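A compact sketch of the multi-head splitting idea: each token's hidden vector is cut into sub-tokens, every sub-token is routed and processed independently, and the results are merged back; the top-1 routing, expert shape, and merge projection are assumptions made to keep the example short, not MH-MoE's actual design.

```python
import torch
import torch.nn as nn

class MultiHeadMoESketch(nn.Module):
    def __init__(self, d_model=512, num_heads=4, num_experts=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_sub = num_heads, d_model // num_heads
        self.experts = nn.ModuleList(nn.Linear(self.d_sub, self.d_sub) for _ in range(num_experts))
        self.router = nn.Linear(self.d_sub, num_experts)
        self.merge = nn.Linear(d_model, d_model)       # re-assemble the processed sub-tokens

    def forward(self, x):                              # x: (tokens, d_model)
        subs = x.view(-1, self.h, self.d_sub)          # split each token into h sub-tokens
        flat = subs.reshape(-1, self.d_sub)            # route every sub-token on its own
        expert_ids = self.router(flat).argmax(dim=-1)  # top-1 routing for brevity
        routed = torch.stack([self.experts[int(e)](v) for v, e in zip(flat, expert_ids)])
        return self.merge(routed.view(-1, self.h * self.d_sub))

y = MultiHeadMoESketch()(torch.randn(10, 512))         # -> (10, 512)
```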
arXiv Detail & Related papers (2024-04-23T13:47:09Z)
- T-REX: Mixture-of-Rank-One-Experts with Semantic-aware Intuition for Multi-task Large Language Model Finetuning [31.276142111455847]
Large language models (LLMs) encounter significant adaptation challenges in diverse multitask finetuning. We design a novel framework, mixTure-of-Rank-onE-eXperts (T-REX). Rank-1 experts enable a mix-and-match mechanism to quadratically expand the vector subspace of experts with linear parameter overheads, achieving approximate error reduction with optimal ...
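A small sketch of how rank-1 experts admit a mix-and-match expansion: storing m left vectors and n right vectors (linear parameter overhead) yields m·n distinct rank-1 expert updates by pairing them; the shapes and names below are illustrative assumptions, not T-REX's actual parameterization.

```python
import torch

d_in, d_out, m, n = 64, 64, 4, 4
U = torch.randn(m, d_out)   # m "left" vectors
V = torch.randn(n, d_in)    # n "right" vectors -- only m + n vectors stored

def rank_one_expert(x: torch.Tensor, i: int, j: int) -> torch.Tensor:
    """Apply the composed rank-1 expert (U[i] V[j]^T) to x without materializing it."""
    return U[i] * (V[j] @ x)   # (V[j] . x) is a scalar that scales U[i]

x = torch.randn(d_in)
y = rank_one_expert(x, 2, 3)   # one of m * n = 16 possible expert combinations
```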
arXiv Detail & Related papers (2024-04-13T12:14:58Z)